INTL JOURNAL OF ELECTRONICS AND TELECOMMUNICATIONS, 2011, VOL. 57, NO. 3
Manuscript received June 19, 2011; revised September 2011.

Document Clustering Concepts, Metrics and Algorithms

Tomasz Tarczynski

Abstract—Document clustering, which is also referred to as text clustering, is a technique of unsupervised document organisation. Text clustering is used to group documents into subsets that consist of texts that are similar to each other. These subsets are called clusters. Document clustering algorithms are widely used in web search engines to produce results relevant to a query. An example of practical use of those techniques are the Yahoo! hierarchies of documents [1]. Another application of document clustering is browsing, which is defined as a search session without a well-specified goal. Browsing techniques rely heavily on document clustering. In this article we examine the most important concepts related to document clustering. Besides the algorithms, we present a comprehensive discussion of the representation of documents, the calculation of similarity between documents, and the evaluation of cluster quality.

Keywords—Document clustering, text mining, k-means, hierarchical clustering, vector space model.

I. INTRODUCTION

DOCUMENT clustering is a special version of the data clustering problem. Data clustering is defined as the organization of a set of objects into disjoint subsets called clusters (see [2]). The assignment should be made in such a way that objects within a cluster are similar to each other. In text clustering, instead of general objects, we use documents.

Clustering algorithms are often used in web search engines to automatically group web pages into categories which can be browsed easily by a user. The main purpose of using document clustering is to find documents relevant to a given query. To obtain such results a query must be formulated first, and it is not always easy to define one. As an example of a hard-to-define query, consider how to formulate a query if a user wants to retrieve articles describing the most important recent events in the world. Defining the query as "recent most important events" would probably not return anything useful; even an article with a similar title would not guarantee success, because that article could have been written 10 years ago. In such a case the user does not know exactly what he is looking for. This concept is called browsing. More formally, it is defined as a search session without a well-specified goal (see [3]). This goal is usually defined during the browsing process, and the user's purpose might also change during it. Browsing uses text clustering as the main subroutine of the whole process. Those clustering algorithms usually differ a bit from the other ones because they must meet additional requirements.

Nowadays document clustering is becoming even more important, mostly because the databases containing the documents are growing rapidly. This growth causes two kinds of problems. The first of them is the time needed to perform a search: the time complexity of most clustering algorithms depends linearly or quadratically on the number of documents. The second issue is connected with the accuracy of results: an increase in the number of documents increases the probability that a document will be assigned to the wrong cluster.

Fig. 1. An illustration of clustering results for objects that can be described using two attributes. Each of the ellipses represents a different cluster containing similar objects.

This paper is divided into 6 sections. Section 2 presents the definitions and notations that are used in this paper, while Section 3 presents different representations of documents. Section 4 describes the concept of similarity and the evaluation of cluster quality. Section 5, which is the main part of this paper, covers different algorithms used for clustering. Section 6 presents conclusions.

T. Tarczynski is with the Department of Applied Informatics, Warsaw University of Life Sciences, ul. Nowoursynowska 159, Warsaw, Poland (e-mail: tomek.tarczynski@gmail.com).

II. DEFINITIONS AND NOTATIONS

The following conventions are used throughout the document. Capital letters are used to denote sets, while lowercase letters are used for vectors and scalars. Let D be the set of all documents that are to be clustered. It is assumed that this set is always finite, with cardinality n. Depending on the context, d_j can denote the j-th document or the vector that represents the j-th document (the representation of documents is explained in the next section). C is the set of clusters, while C_i refers to the i-th cluster. The cardinality of the set C is equal to k. Let T be the set of all possible words that can occur in

documents. This set is also finite, and its cardinality is denoted by t. It is assumed that the size of this set does not depend on the number of documents. Most of the time this set is referred to as the dictionary. The i-th word in the dictionary is denoted by t_i. The number of occurrences of the word t_i in the document d_j is denoted by n_{ij}, while n_j is the total number of words in the j-th document. The symbol |x| refers to the length of the vector x, and |A| denotes the cardinality of the set A. A set of clusters produced during a clustering process is called a clustering result.

III. REPRESENTATION OF A DOCUMENT

Computers process numbers much better than documents, and therefore documents are usually represented using numbers. Many different models that represent documents have been proposed. The most widespread representation of documents is called the Vector Space Model (VSM), which was presented by Salton, Wong and Yang in [4]. In this model the documents are represented by nonnegative sparse vectors in a t-dimensional space. Such vectors will be denoted by d_j, where j denotes the j-th document in the set:

    d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})    (1)

where w_{ij} is the weight of the i-th term in the j-th document. Each term usually refers to a single word from the dictionary, but some algorithms, like Phrase-Intersection clustering, use phrases instead of single words. In such a case the dictionary would consist of phrases and t would denote the number of possible phrases.

There are several ways of calculating those weights, but typically the term frequency-inverse document frequency (tf-idf) method is used (see [5]). In this method each weight is a normalized product of two components: term frequency and inverse document frequency. Term frequency is calculated as a normalized number of occurrences of the term in the document. The term frequency of the i-th term in the j-th document, T_{ij}, is defined as:

    T_{ij} = \frac{n_{ij}}{n_j}    (2)

Term frequency represents how often a word is used in a document; the more often, the higher the weight.
To distinguish common words, which do not say anything useful in the context of document clustering, from the important words, inverse document frequency is used. The more documents contain a particular word, the lower the value of this term; the number of occurrences of the word within a single document is not relevant, only whether the word occurred or not. This term is calculated as the logarithm of the inverse of the fraction of documents in which the word occurred. For the i-th word, the inverse document frequency, I_i, is defined as:

    I_i = \log \frac{n}{|\{j : t_i \in d_j\}|}    (3)

There is a common problem associated with such equations, namely division by 0. To overcome this problem two approaches are possible. The idea of the first one is to add 1 to the denominator, but this changes the results. The second one, which is more common, is to assume that words that do not occur in any document are removed. Inverse document frequency solves the problem of common words, which should not have any influence on the clustering process. If a word occurs in all documents then the numerator and denominator are equal and the value of the whole term is 0. This removes the influence of words such as: a, the, so, are, etc. According to [4], a model without the idf component has lower precision and recall (these terms are explained in the next section) by 14% on average. The improvement was measured over the standard term frequency weighting. More about the VSM can be found in [6] and [7].

IV. DOCUMENT SIMILARITY AND EVALUATION OF CLUSTER QUALITY

The first part of this section discusses document similarity measures. The presented similarity measures assume that documents are represented using the VSM. Similarity measures are one of the key parts of almost every clustering algorithm. They can be used not only to determine the similarity between two documents, but also to calculate the similarity between a document and a cluster, and between two clusters. The most popular document similarity measures, according to [8], are Cosine, Dice and Jaccard. They are all based on the dot product; the only difference is the normalization factor.
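The tf-idf weighting and the dot-product-based similarity measures described above can be sketched as follows. This is a minimal illustration, not code from the paper; the function names, and the toy corpus, are my own.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf weights: tf = n_ij / n_j (Eq. 2), idf = log(n / df_i) (Eq. 3)."""
    n = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = sorted({w for c in counts for w in c})
    # Words with document frequency 0 never enter the vocabulary,
    # which is the second (more common) fix for division by zero.
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    vecs = []
    for c in counts:
        total = sum(c.values())
        vecs.append([(c[w] / total) * math.log(n / df[w]) for w in vocab])
    return vecs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dice(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 2 * dot / (sum(a * a for a in u) + sum(b * b for b in v))

def jaccard(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) + sum(b * b for b in v) - dot)

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["cherry", "apple"]]
vecs = tfidf_vectors(docs)
```

Note that a word occurring in all n documents gets idf = log(n/n) = 0 and therefore weight 0, matching the remark about common words above.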
It is worth noticing that if the vectors are normalized then the Cosine and Dice measures produce the same result.

    Sim_c(d_j, d_q) = \frac{\sum_i w_{ij} w_{iq}}{\sqrt{\sum_i w_{ij}^2} \sqrt{\sum_i w_{iq}^2}}    (4)

    Sim_d(d_j, d_q) = \frac{2 \sum_i w_{ij} w_{iq}}{\sum_i w_{ij}^2 + \sum_i w_{iq}^2}    (5)

    Sim_j(d_j, d_q) = \frac{\sum_i w_{ij} w_{iq}}{\sum_i w_{ij}^2 + \sum_i w_{iq}^2 - \sum_i w_{ij} w_{iq}}    (6)

It is common to calculate document dissimilarity (distance) instead of similarity. This is a slightly different approach to the same problem: instead of saying how similar the documents are, we say how different they are. Usually the Euclidean distance is used in this approach (see [9]). Other metrics, like the Manhattan distance or the Minkowski distance, can also be used, but they are not as common here as they are in general data clustering.

Document similarity measures can be extended to measure the resemblance of all documents in a cluster. The whole group of such measures is called intra-cluster similarity measures (see [10]). The most widely used intra-cluster similarity measure is group average similarity. It is defined as the average pairwise similarity of documents in the cluster, and it is given by:

    d(C_i) = \frac{1}{|C_i|^2} \sum_{d_j, d_q \in C_i} sim(d_j, d_q)    (7)

If we use the cosine as the document similarity measure then, somewhat surprisingly, this formula can be significantly simplified. To simplify this formula we need to introduce the concept

of a centroid. The centroid of a cluster is a vector that is used to represent the whole cluster. It is defined as the average of all vectors that represent documents in the cluster; in this paper we denote the centroid of the i-th cluster by c_i:

    c_i = \frac{1}{|C_i|} \sum_{d_j \in C_i} d_j    (8)

The centroid does not have to have normalized length; in fact it almost never does. It can be easily proved that the intra-cluster similarity is equal to the squared length of the centroid vector:

    d(C_i) = \|c_i\|^2    (9)

The calculation of the similarity between two clusters can be simplified to the calculation of the similarity of their centroids. Another approach was presented in [11], where Olson introduced new cluster similarity metrics: single link, average link, complete link and median. They are defined as the minimum distance between documents in two clusters, the average distance, the maximum distance and the median of distances, respectively (see also [3]).

The last part of this section is devoted to external quality measures. In contrast to the previously mentioned measures, these need external knowledge about the real classes. This assumption prevents using them in a clustering algorithm as a criterion function. These measures reflect the similarity between the real classes and the ones obtained in the clustering process. One of the most fundamental measures is the entropy, which is based on the thermodynamical entropy. The entropy of a single cluster is defined as:

    E_j = -\sum_i p_{ij} \ln(p_{ij})    (10)

where p_{ij} is the probability that a document which is in the cluster j belongs to the class i. Here we assume that 0 \ln(0) = 0. To obtain the quality of all clusters, the weighted sum of the entropies of all clusters must be calculated, where the weight of each cluster is its size:

    E = \frac{1}{n} \sum_j n_j E_j    (11)

The F-measure, which was first introduced by C. J. van Rijsbergen in [12], is a different external quality measure. To understand this measure, two concepts must be introduced. The first of them is precision.
Precision(i, j) can be interpreted as the probability that a randomly chosen document from the j-th cluster belongs to the i-th class. The second concept is called recall, and it can be interpreted as the probability that a randomly chosen document from the i-th class belongs to the j-th cluster. The F-measure of the cluster j and class i is calculated as the harmonic mean of precision and recall, which is given by:

    F(i, j) = \frac{2 \cdot Recall(i, j) \cdot Precision(i, j)}{Precision(i, j) + Recall(i, j)}    (12)

The F-measure of the whole clustering result is given in the following form:

    F = \frac{1}{n} \sum_i n_i \max_j F(i, j)    (13)

Fig. 2. An illustration of recall and precision. Black dots represent documents that are in one category and the white ones represent documents that are in another category. Dots that are inside the ellipse correspond to the documents in one cluster.

More about the F-measure and other quality measures in information retrieval can be found in [13].

V. CLUSTERING ALGORITHMS

This section describes the most important clustering algorithms. Each of the presented algorithms fits into one of two categories: hierarchical clustering and non-hierarchical clustering (see [14]). Non-hierarchical clustering algorithms are often called flat methods or partitional methods. Hierarchical methods, in contrast to the partitional methods, do not create a single clustering result but a whole hierarchy of clusterings. There are two basic approaches to hierarchical clustering. The first of them is hierarchical agglomerative clustering (HAC) and the second one is hierarchical divisive clustering (HDC).

A. Hierarchical Agglomerative Algorithms

Hierarchical methods are often believed to produce better clustering results (see [15]), but their main disadvantage is the complexity. The lower bound on time complexity in hierarchical clustering is O(n^2). The results obtained by a hierarchical clustering algorithm can be viewed at different levels.
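The external quality measures just defined can be sketched as follows. This is my own minimal illustration, not code from the paper; `clusters` lists the document ids in each cluster, `labels` gives the true class of each document, and both names and the toy data are assumptions.

```python
import math
from collections import Counter

def entropy_quality(clusters, labels):
    """Weighted cluster entropy (Eqs. 10-11): 0 for a perfect clustering, lower is better."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = Counter(labels[d] for d in c)
        e = -sum((m / len(c)) * math.log(m / len(c)) for m in counts.values())
        total += (len(c) / n) * e
    return total

def f_measure(clusters, labels):
    """Overall F (Eqs. 12-13): size-weighted best harmonic mean over clusters."""
    n = sum(len(c) for c in clusters)
    classes = Counter(labels.values())
    total = 0.0
    for cls, n_i in classes.items():
        best = 0.0
        for c in clusters:
            hits = sum(1 for d in c if labels[d] == cls)
            if hits == 0:
                continue
            precision = hits / len(c)   # P(doc from cluster is in class)
            recall = hits / n_i         # P(doc from class is in cluster)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total

clusters = [["d1", "d2"], ["d3", "d4"]]
labels = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
```

For this perfect clustering the entropy is 0 and the F-measure is 1; mixing the classes across clusters moves both measures away from their optima.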
Each of the levels can be seen as a separate clustering result with a different number of clusters, although those levels are strongly connected, because the result on each level is based on the result of the previous level. This hierarchy of results is usually presented graphically using a tree called a dendrogram. It is a tree with a single node at the top and n nodes at the bottom. Each of the nodes at the bottom represents a cluster with a single document in it.

At the beginning of the agglomerative approach each document is in its own cluster, so on the first level the number of clusters is equal to n. In each iteration the two most similar clusters are merged. This process continues until only one cluster is left. That strategy is also called the bottom-up approach. An exhaustive survey of agglomerative hierarchical methods can be found in [14]. More about the efficiency of HAC can be found in [16]. The traditional agglomerative algorithm can be summarized as follows:

1) Compute the similarity matrix A, whose entry a_{ij} is the similarity between the i-th and j-th cluster.
2) Merge the two most similar clusters.

3) Update the similarity matrix (only the row and column that represent the new cluster must be updated).
4) Repeat steps 2 and 3 until all documents are in one cluster.

Fig. 3. A sample dendrogram. Looking bottom-up, in the first iteration documents D1 and D2 were merged into one cluster; in the second, documents D3 and D4 were grouped into one cluster. In the third iteration the cluster containing D1 and D2 was merged with the cluster containing D3 and D4, and finally in the last iteration all documents were put into one cluster.

B. Bisecting K-means

Bisecting K-means is a hierarchical divisive algorithm, which was introduced in [15]. Bisecting K-means should not be confused with the K-means algorithm, which is a partitional algorithm. The divisive strategy is in some sense the reverse of the agglomerative technique. At the beginning of the algorithm there is only one cluster, which contains all documents. In each iteration a chosen cluster is divided into two clusters. The process stops when each cluster contains only a single document. The bisecting K-means is described below:

1) Choose a cluster to split.
2) Apply K-means clustering with k = 2 to this cluster (the bisecting step).
3) Repeat the bisecting step ITER times and choose the split that produces the clustering with the highest overall similarity.
4) Repeat all steps until the desired number of clusters is obtained.

Empirical research in [15] shows that ITER = 5 is a good compromise between quality and time complexity. Increasing the ITER parameter would increase both the quality and the time complexity. This algorithm is very flexible, because steps 1 and 2 can be implemented in many different ways. The most intuitive way to choose a cluster in step 1 is to choose the largest cluster; that was the method used in [15]. The other possible approaches mentioned in [15] are to take the cluster with the least overall similarity, or to use a criterion based on both size and overall similarity. The bisecting step is even more adjustable, since any nondeterministic partitional clustering algorithm can be used in it, although taking K-means seems reasonable because it is a simple and effective algorithm. The K-means algorithm itself can also be adjusted in many ways, which will be described in the next subsection.

Another huge advantage of this algorithm is that it can be seen as either a flat clustering or a hierarchical clustering method. Using it as a partitional clustering might even produce better results, because the obtained results can be improved by running K-means on the desired number of clusters. In this case the bisecting K-means would just lead to finding the centroids for the clusters.

C. K-means

K-means clustering (see [17], [18]) is the most widely used partitional clustering technique. Its popularity is due to its simplicity and good results. The general description of the K-means algorithm is as follows:

1) Choose k, the number of clusters.
2) Choose k points as the centers of the clusters.
3) Assign each document to the nearest cluster.
4) If the stop criterion is met, then stop.
5) Recalculate the centers of the clusters.
6) Go to step 3.

Choosing k in the first step depends strongly on the dataset and the motivations behind the clustering, although some systematic approaches for choosing the number of clusters are given in [19]. The points in the second step are usually chosen at random for simplicity, but modifications such as k-means++ [20] might be used. A comprehensive survey of different methods for initializing K-means clustering can be found in [21], where the authors compare 11 different initialization methods. From those experiments it might be concluded that the best initialization methods are the ones proposed in [22], [23] and [24]. In order to assign a document to the nearest cluster, any of the similarity measures mentioned in Section IV can be used. The stop criterion is usually defined as: stop when no documents have been moved from one cluster to another. In the standard K-means algorithm the center of a cluster is calculated as the centroid of the documents in that cluster, but many modifications to this step have been proposed.
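The basic K-means loop described above can be sketched as below. This is a minimal Euclidean sketch with random initialization and the "no document moved" stop criterion, not a reference implementation; all names and the toy points are my own.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: random initial centers, assign to nearest, recompute centroids."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: k random starting centers
    assignment = None
    for _ in range(max_iter):
        new_assignment = [                   # step 3: nearest center per point
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:     # step 4: stop when nothing moved
            break
        assignment = new_assignment
        for c in range(k):                   # step 5: recompute centroids
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep the old center if a cluster empties
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centers

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
assignment, centers = kmeans(points, 2)
```

On this toy data the two points near the origin end up in one cluster and the two distant points in the other, regardless of which initial centers are drawn.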
One of them is the k-medians algorithm, which in this step calculates the median instead of the mean; this median does not have to be an actual document, because the median of each term is calculated separately. Another modification was presented as the k-medoid algorithm, where one of the documents is chosen to be the center of a cluster. This document is usually chosen as the document with the lowest pairwise dissimilarity.

D. Algorithms Based on Criterion Functions

Many partitional clustering algorithms are based on the optimization of a global criterion function. In some of these algorithms the criterion function might be exchanged easily. Examples of such algorithms are CobWeb [25] and AutoClass [26]. Such algorithms consist of two main components: the criterion function used for the optimization, and the optimization algorithm itself. Many different clustering algorithms can easily be created if a list of criterion functions and optimization algorithms is available, because every combination of an optimization algorithm and a criterion function creates a new clustering algorithm.

In this section we will focus mostly on the criterion functions instead of the algorithms, because according to [9] and [27] a greedy strategy produces results comparable to those obtained using more complicated algorithms. These more complicated algorithms include concepts like iterative schemes based on hill-climbing methods, spectral-based optimizers and so on. The reader interested in these algorithms is referred to [28], [29], [30], [31], [32] and [33]. A great survey of criterion functions can be found in [9]; the criterion functions tested in [9] are presented below.

    f_1: maximize \sum_{r=1}^{k} n_r \left( \frac{1}{n_r^2} \sum_{d_i, d_j \in C_r} \cos(d_i, d_j) \right)    (14)

    f_2: maximize \sum_{r=1}^{k} \sum_{d_i \in C_r} \cos(d_i, c_r)    (15)

    f_3: minimize \sum_{r=1}^{k} n_r \cos(c_r, c)    (16)

    f_4: maximize \frac{\sum_{r=1}^{k} n_r \left( \frac{1}{n_r^2} \sum_{d_i, d_j \in C_r} \cos(d_i, d_j) \right)}{\sum_{r=1}^{k} n_r \cos(c_r, c)}    (17)

    f_5: maximize \frac{\sum_{r=1}^{k} \sum_{d_i \in C_r} \cos(d_i, c_r)}{\sum_{r=1}^{k} n_r \cos(c_r, c)}    (18)

    f_6: minimize \sum_{r=1}^{k} \frac{cut(S_r, S - S_r)}{\sum_{d_i, d_j \in S_r} \cos(d_i, d_j)}    (19)

    f_7: minimize \sum_{r=1}^{k} \frac{cut(V_r, V - V_r)}{W(V_r)}    (20)

where c denotes the centroid of all documents, V_r is the set of vertices assigned to the r-th cluster, and W(V_r) is the sum of the weights of the adjacency lists of V_r. The idea behind each of these functions is as follows:

f_1: the weighted sum of the average pairwise similarities between the documents in each cluster; the weights are normalized by the size of the cluster.
f_2: the sum of the cosines between each document and the centroid of its cluster.
f_3: the sum of the cosines between each cluster centroid and the centroid of all documents.
f_4: the quotient of f_1 and f_3.
f_5: the quotient of f_2 and f_3.
f_6: the edge-cut of each partition scaled by the sum of the cluster's internal edges.
f_7: the normalized edge-cut of the partitioning.

The entropy was used as the measure of quality of the results. Extensive experiments made on 15 different datasets showed that none of these functions is superior to the others, although some functions performed better.
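As a concrete illustration of one of these criteria, the f_2 function (the sum of cosines between each document and its cluster centroid) can be evaluated as below. This is my own sketch, not code from [9]; the names and the toy clustering are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def f2(clusters):
    """Criterion f_2 (Eq. 15): sum over clusters of cos(document, cluster centroid)."""
    total = 0.0
    for docs in clusters:
        centroid = [sum(x) / len(docs) for x in zip(*docs)]
        total += sum(cosine(d, centroid) for d in docs)
    return total

# Two tight clusters: every document is nearly aligned with its centroid,
# so f_2 approaches the number of documents (its maximum).
clusters = [[(1.0, 0.0), (0.9, 0.1)], [(0.0, 1.0), (0.1, 0.9)]]
```

A greedy optimizer would move documents between clusters as long as the move increases f_2; mixing the two directions in one cluster visibly lowers the score.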
In almost all tests f_2, f_4 and f_5 outperformed all other functions, and the differences in quality between those three were not significant. It is interesting that some functions which seem to be very similar, like f_1 and f_2, produced very different results; the results varied by up to twenty percent.

E. Word-Intersection Clustering

Oren Zamir in [34] pointed out that reducing the number of returned documents does not necessarily make browsing easier for the user. He claims that efficient document clustering might be used as a good method for navigating through a large collection of documents; furthermore, it might find patterns that would normally be missed. Zamir also created a six-point list of requirements for a clustering algorithm in the context of document browsing:

1) Ease of browsing: results must be presented in such a way that the user can immediately decide whether some content is useful or not.
2) Speed: the clustering should be fast enough to create results within seconds.
3) Scalability: the method should be able to cluster large sets of documents.
4) No preprocessing: the algorithm should not rely on preprocessing, because it is applied to a dynamically generated set of documents.
5) Client-side execution: the system should be able to run the clustering algorithm on the client side. This is motivated by the fact that the server might not have enough computational power to produce a clustering for each user.
6) Snippet-capable: the algorithm should not require whole documents, but only small snippets of them. Given client-side execution this requirement is necessary, because transferring whole documents could be more time consuming than the algorithm itself.

Word-Intersection Clustering is a hierarchical agglomerative algorithm with a few new concepts. The first of them is that clusters are characterized by the set of words that occur in every document in the cluster. This automatically solves the problem of cluster description, because those common words can be used as the description of a cluster.
The number of words that are shared by all the documents in the cluster is called the cohesion, and it is denoted by h(C_i). Another new concept is the score function s(C_i), which expresses the decreasing marginal significance of the cohesion:

    s(C_i) = |C_i| \cdot \frac{1 - \exp(-\beta h(C_i))}{1 + \exp(-\beta h(C_i))}    (21)

The most important new feature of this algorithm is the Global Quality Function, which quantifies the quality of a clustering. It is defined as:

    GQF(C) = \frac{f(C)}{g(|C|)} \sum_{C_i \in C} s(C_i)    (22)

where f(C) is a function that is proportional to the normalized number of clustered documents. A document is considered to be clustered when the size of the cluster that

contains it is greater than 1, while g(|C|) is an increasing function of the number of clusters. The algorithm itself can be summarized as:

1) Put each document in its own cluster.
2) If there are no two clusters whose merge would increase GQF, halt.
3) Merge the two clusters whose merge increases GQF the most.
4) Go to step 2.

The experiments made in [34] showed that this algorithm is faster than the classical hierarchical agglomerative algorithm with the cosine similarity measure, which is referred to in that paper as COS-GAVG. In addition, the quality of the results obtained by Word-Intersection Clustering was significantly better than the results produced by COS-GAVG. These results show that this algorithm is very promising.

VI. CONCLUSIONS

In this article we have presented the most important concepts related to document clustering. Our survey confirmed that there is no superior text clustering algorithm. Document clustering algorithms vary in both complexity and quality of results. Hierarchical algorithms are in general considered to perform better than flat methods, but on the other hand partitional algorithms are much faster. It seems that the most promising algorithms are hybrid methods like bisecting k-means; this algorithm outperformed partitional clustering algorithms without a significant increase in complexity. The clustering algorithm should be chosen depending on the available time: hierarchical algorithms are well suited for precomputation, while partitional algorithms are good for online computation.

The complexity of clustering algorithms will play a crucial role in the future, due to the continuous increase in the number of documents. On the other hand, the speed of computers will increase more slowly, because Moore's law (which states that the number of transistors that can be placed on integrated circuits, without any additional costs, doubles every two years) is considered to be broken. These premises lead to the conclusion that parallel and distributed clustering algorithms should receive more attention in the future.
Many parallel data clustering algorithms have already been designed (see [35], [11], [36]), but not much has been done in the field of parallel document clustering.

Besides the complexity, another potential problem is associated with higher demands on quality. These demands might lead to the development of a more linguistic approach to the document clustering problem. Present algorithms do not take into account the order of the words in a sentence. In some cases the order of words might be relevant in the process of clustering. This task is extremely hard to implement, because each language has its own grammar rules, which determine whether the order in a particular case is important or not.

REFERENCES

[1] Y. Labrou and T. Finin, "Yahoo! as an ontology: using Yahoo! categories to describe documents," in Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM '99), New York, NY, USA: ACM, 1999.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Comput. Surv., vol. 31.
[3] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey, "Scatter/gather: a cluster-based approach to browsing large document collections," in Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '92), New York, NY, USA, 1992.
[4] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Commun. ACM, vol. 18.
[5] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Cornell University, Ithaca, NY, USA, Tech. Rep.
[6] S. K. M. Wong, W. Ziarko, V. V. Raghavan, and P. C. N. Wong, "On modeling of information retrieval concepts in vector spaces," ACM Trans. Database Syst., vol. 12.
[7] X. Tai, M. Sasaki, Y.
Tanaka, and K. Kita, "Improvement of vector space information retrieval model based on supervised learning," in Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages (IRAL '00), New York, NY, USA, 2000.
[8] G. Salton, Ed., Automatic Text Processing. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
[9] Y. Zhao and G. Karypis, "Empirical and theoretical comparisons of selected criterion functions for document clustering," Mach. Learn., vol. 55.
[10] H. J. Zeng, Q. C. He, Z. Chen, W. Y. Ma, and J. Ma, "Learning to cluster web search results," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '04), New York, NY, USA, 2004.
[11] C. F. Olson, "Parallel algorithms for hierarchical clustering," Parallel Comput., vol. 21.
[12] C. J. van Rijsbergen, Information Retrieval, 2nd ed. Newton, MA, USA: Butterworth-Heinemann.
[13] J. Makhoul, F. Kubala, R. Schwartz, and R. Weischedel, "Performance measures for information extraction," in Proceedings of DARPA Broadcast News Workshop, 1999.
[14] A. El-Hamdouchi and P. Willett, "Comparison of hierarchic agglomerative clustering methods for document retrieval," The Computer Journal, vol. 32.
[15] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," 2000.
[16] W. H. Day and H. Edelsbrunner, "Efficient algorithms for agglomerative hierarchical clustering methods," Journal of Classification, vol. 1, pp. 7-24.
[17] G. A. Wilkin and X. Huang, "A practical comparison of two k-means clustering algorithms," BMC Bioinformatics, vol. 9, p. S19.
[18] J. Wu, H. Xiong, and J. Chen, "Adapting the right measures for k-means clustering," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), New York, NY, USA, 2009.
[19] M. Chiang and B.
Mirkin, "Experiments for the number of clusters in k-means," in Progress in Artificial Intelligence, ser. Lecture Notes in Computer Science, J. Neves, M. Santos, and J. Machado, Eds. Springer Berlin / Heidelberg, 2007, vol. 4874.
[20] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), Philadelphia, PA, USA, 2007.
[21] R. Maitra, A. Peterson, and A. Ghosh, "A systematic evaluation of different methods for initializing the k-means clustering algorithm," IEEE Transactions on Knowledge and Data Engineering, 2010.

[22] G. W. Milligan and P. D. Isaac, "The validation of four ultrametric clustering algorithms," Pattern Recognition, vol. 12, no. 2.
[23] P. S. Bradley and U. M. Fayyad, "Refining initial points for k-means clustering," in Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), San Francisco, CA, USA, 1998.
[24] B. Mirkin, Clustering for Data Mining: A Data Recovery Approach. Chapman and Hall/CRC.
[25] D. H. Fisher, "Knowledge acquisition via incremental conceptual clustering," Mach. Learn., vol. 2.
[26] P. Cheeseman and J. Stutz, "Bayesian classification (AutoClass): theory and results," in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Menlo Park, CA, USA: American Association for Artificial Intelligence, 1996.
[27] S. Savaresi, D. L. Boley, S. Bittanti, and G. Gazzaniga, "Choosing the cluster to split in bisecting divisive clustering algorithms," in Second SIAM International Conference on Data Mining.
[28] M. Meila and D. Heckerman, "An experimental comparison of model-based clustering methods," Mach. Learn., vol. 42, pp. 9-29.
[29] G. Karypis, E. Han, and V. Kumar, "Chameleon: hierarchical clustering using dynamic modeling," Computer, vol. 32.
[30] D. Boley, "Principal direction divisive partitioning," Data Min. Knowl. Discov., vol. 2.
[31] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, "Bipartite graph partitioning and data clustering," in Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM '01), New York, NY, USA, 2001.
[32] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, "Spectral relaxation for k-means clustering," in Advances in Neural Information Processing Systems. MIT Press, 2001.
[33] I. S. Dhillon and D. S.
Modha, Concept decompostons for large sparse text data usng clusterng, Mach. Learn., vol. 42, pp , January [Onlne]. Avalable: [34] O. Zamr, O. Etzon, O. Madan, and R. M. Karp, Fast and ntutve clusterng of web documents, n In Proceedngs of the 3rd Internatonal Conference on Knowledge Dscovery and Data Mnng, 1997, pp [35] M. Dash, S. Petrutu, and P. Scheuermann, Effcent parallel herarchcal clusterng, n Internatonal Europar Conference, [36] Y. Song, W. Chen, H. Ba, C. Ln, and E. Chang, Parallel spectral clusterng, Machne Learnng and Knowledge Dscovery n Databases, pp , [Onlne]. Avalable: 25 [37] Y. Lu, J. Mostafa, and W. Ke, A fast onlne clusterng algorthm for scatter/gather browsng, [38] D. R. Cuttng, D. R. Karger, and J. O. Pedersen, Constant nteractontme scatter/gather browsng of very large document collectons, n Proceedngs of the 16th annual nternatonal ACM SIGIR conference on Research and development n nformaton retreval, ser. SIGIR 93, New York, NY, USA, 1993, pp [Onlne]. Avalable:

Machine Learning: Algorithms and Applications

Unsupervised Learning and Clustering

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Hierarchical clustering for gene expression data analysis

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

Unsupervised Learning

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Machine Learning. Topic 6: Clustering

Support Vector Machines

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

K-means and Hierarchical Clustering

Web Mining: Clustering Web Documents A Preliminary Review

Cluster Analysis of Electrical Behavior

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Document Representation and Clustering with WordNet Based Similarity Rough Set Model

The Research of Support Vector Machine in Agricultural Data Classification

Query Clustering Using a Hybrid Query Similarity Measure

UB at GeoCLEF 2006

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

Mathematics 256 a course in differential equations for engineering students

Intelligent Information Acquisition for Improved Clustering

A Binarization Algorithm specialized on Document Images and Photos

An Optimal Algorithm for Prufer Codes

A Webpage Similarity Measure for Web Sessions Clustering Using Sequence Alignment

Performance Evaluation of Information Retrieval Systems

Subspace clustering

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and Accessibility (Gravity with Competition) Indices

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

Subsets and the Boolean operations on sets

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

Unsupervised Learning and Clustering

Classifier Selection Based on Data Complexity Measures

The Greedy Method

Collaboratively Regularized Nearest Points for Set Based Recognition

Self-Organizing Maps (SOM)

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION

Compiler Design (Spring 2014): Register Allocation Sample Exercises and Solutions, Prof. Pedro C. Diniz

A Deflected Grid-based Algorithm for Clustering Analysis

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides; some material adapted from Hennessy & Patterson / 2003 Elsevier

Load Balancing for Hex-Cell Interconnection Network

Machine Learning 9. week

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS

Edge Detection in Noisy Images Using the Support Vector Machines

Detection of an Object by using Principal Component Analysis

Concurrent Apriori Data Mining Algorithms

Parallel matrix-vector multiplication

COSC 320 Advanced Data Structures and Algorithms: Course Introduction

LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals

CHAPTER 2 DECOMPOSITION OF GRAPHS

Advanced in Control Engineering and Information Science

Reducing Frame Rate for Object Tracking

User Authentication Based On Behavioral Mouse Dynamics Biometrics

Meta-heuristics for Multidimensional Knapsack Problems

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005)

CS 534: Computer Vision Model Fitting

HCMX: AN EFFICIENT HYBRID CLUSTERING APPROACH FOR MULTI-VERSION XML DOCUMENTS

Classic Term Weighting Technique for Mining Web Content Outliers

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance

Optimizing Document Scoring for Query Retrieval

Solving two-person zero-sum game by Matlab

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

Fuzzy C-Means Initialized by Fixed Threshold Clustering for Improving Image Retrieval

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

Assignment #2, Farrukh Jabeen, Algorithms 510 (due June 15, 2009)

Discriminative classifiers for image recognition

Information Retrieval

A Modified Fuzzy ART for Soft Document Clustering

Module Management Tool in Software Development Organizations

BRDPHHC: A Balance RDF Data Partitioning Algorithm based on Hybrid Hierarchical Clustering

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval

Smoothing Spline ANOVA for variable screening

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Complex Numbers

Analyzing Popular Clustering Algorithms from Different Viewpoints

Hierarchical agglomerative Cluster Analysis

BioTechnology. An Indian Journal

A Two-Stage Algorithm for Data Clustering

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach

A fast algorithm for color image segmentation

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems

Graph-based Clustering

Hermite Splines in Lie Groups as Products of Geodesics

Lecture 5: Multilayer Perceptrons

More information

Querying by sketch geographical databases. Yu Han 1, a *

Querying by sketch geographical databases. Yu Han 1, a * 4th Internatonal Conference on Sensors, Measurement and Intellgent Materals (ICSMIM 2015) Queryng by sketch geographcal databases Yu Han 1, a * 1 Department of Basc Courses, Shenyang Insttute of Artllery,

More information

Optimization Methods: Integer Programming Integer Linear Programming 1. Module 7 Lecture Notes 1. Integer Linear Programming

Optimization Methods: Integer Programming Integer Linear Programming 1. Module 7 Lecture Notes 1. Integer Linear Programming Optzaton Methods: Integer Prograng Integer Lnear Prograng Module Lecture Notes Integer Lnear Prograng Introducton In all the prevous lectures n lnear prograng dscussed so far, the desgn varables consdered

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information

Non-Split Restrained Dominating Set of an Interval Graph Using an Algorithm

Non-Split Restrained Dominating Set of an Interval Graph Using an Algorithm Internatonal Journal of Advancements n Research & Technology, Volume, Issue, July- ISS - on-splt Restraned Domnatng Set of an Interval Graph Usng an Algorthm ABSTRACT Dr.A.Sudhakaraah *, E. Gnana Deepka,

More information

Recommender System based on Higherorder Logic Data Representation

Recommender System based on Higherorder Logic Data Representation Recommender System based on Hgherorder Logc Data Representaton Lnna L Bngru Yang Zhuo Chen School of Informaton Engneerng, Unversty Of Scence and Technology Bejng Abstract Collaboratve Flterng help users

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines (IJCSIS) Internatonal Journal of Computer Scence and Informaton Securty, Herarchcal Web Page Classfcaton Based on a Topc Model and Neghborng Pages Integraton Wongkot Srura Phayung Meesad Choochart Haruechayasak

More information

KOHONEN'S SELF ORGANIZING NETWORKS WITH "CONSCIENCE"

KOHONEN'S SELF ORGANIZING NETWORKS WITH CONSCIENCE Kohonen's Self Organzng Maps and ther use n Interpretaton, Dr. M. Turhan (Tury) Taner, Rock Sold Images Page: 1 KOHONEN'S SELF ORGANIZING NETWORKS WITH "CONSCIENCE" By: Dr. M. Turhan (Tury) Taner, Rock

More information