
MOHAMMAD REZAEI

Clustering validation

Publications of the University of Eastern Finland
Dissertations in Forestry and Natural Sciences No 215

Academic Dissertation

To be presented by permission of the Faculty of Science and Forestry for public examination in Louhela auditorium in Science Park Building at the University of Eastern Finland, Joensuu, in June 2016, at 12 o'clock noon.

School of Computing

Grano Oy
Joensuu, 2016
Editor: Dr. Pertti Pasanen
Distribution: University of Eastern Finland Library / Sales of publications
P.O. Box 107, FI-80101 Joensuu, Finland
tel. +358-50-3058396
http://www.uef.fi/kirjasto

ISBN: 978-952-61-2144-4 (printed)
ISSN-L: 1798-5668
ISSN: 1798-5668
ISBN: 978-952-61-2145-1 (pdf)
ISSN-L: 1798-5668
ISSN: 1798-5676

Author:
Mohammad Rezaei
University of Eastern Finland, School of Computing
P.O. Box 111, 80101 JOENSUU, FINLAND
email: rezaei@cs.uef.fi

Supervisor:
Professor Pasi Fränti, PhD
University of Eastern Finland, School of Computing
P.O. Box 111, 80101 JOENSUU, FINLAND
email: franti@cs.uef.fi

Reviewers:
Professor Ana Luisa N. Fred, PhD
Instituto Superior Técnico, Torre Norte, Instituto de Telecomunicações
Av. Rovisco Pais 1, 1049-001 Lisbon, PORTUGAL
email: afred@lx.it.pt

Professor James Bailey, PhD
University of Melbourne, Department of Computing and Information Systems
Victoria 3010, AUSTRALIA
email: bailey@unimelb.edu.au

Opponent:
Professor Ioan Tabus, PhD
Tampere University of Technology, Department of Signal Processing
P.O. Box 527, 33101 Tampere, FINLAND
email: ioan.tabus@tut.fi

ABSTRACT

Cluster analysis, or clustering, is one of the most fundamental and essential data mining tasks with broad applications. It aims at finding a structure in a set of unlabeled data, producing clusters so that objects in one cluster are similar in some way and different from objects in other clusters. The basic elements of clustering include the proximity measure between objects, the cost function, the algorithm, and cluster validation. There is a close relationship between these elements. Although there has been extensive research on clustering methods and their applications, less attention has been paid to the relationships between the basic elements.

This thesis first provides an overview of the basic elements of cluster analysis. It then focuses on cluster validity, as four publications are devoted to this element. Chapter 1 sketches the clustering procedure and provides definitions of the basic components. Chapter 2 reviews popular proximity measures for different types of data. A novel similarity measure for comparing two groups of words is introduced, which is used in the clustering of items characterized by a set of keywords. Chapter 3 presents basic clustering algorithms and Chapter 4 analyzes cost functions. A clustering algorithm is expected to optimize a given cost function. However, in many cases the cost function is unknown and hidden within the algorithm, making the evaluation of clustering results and the analysis of the algorithms difficult.

Numerous clustering algorithms have been developed for different application fields. Different algorithms, or even one algorithm with different parameters, can give different results for the same data set. The best clustering can be selected based on the cost function if the number of clusters is fixed and the cost function has been defined; otherwise cluster validity indices, internal and external, are used. Chapter 5 reviews several popular internal indices. We study the problem of determining the number of clusters in a data set using these indices, and we propose a new internal index for finding the number of clusters in hierarchical clustering of words.

External validity indices are studied in Chapter 6, and two new external indices, the centroid index and the pair sets index, are introduced. We present a novel experimental setup based on generated partitions to evaluate external indices. We also study whether external indices are applicable to the problem of determining the number of clusters. The conclusion is that external indices can be used for the problem, but only in theory and in controlled environments where the type of data is well known and no surprises appear. In practice, this is rarely the case.

AMS classification: 62H30, 91C20
Universal Decimal Classification: 004.05.4, 303.7.4, 519.237.8
Library of Congress Subject Headings: Data mining; cluster analysis; algorithms
Yleinen suomalainen asiasanasto: tiedonlouhinta; klusterianalyysi; validointi; algoritmit

Preface

This PhD dissertation contains the results of research completed at the School of Computing of the University of Eastern Finland during the years 2011-2016. Many individuals have helped me both directly and indirectly in my research and in writing this thesis.

I would like to express my sincere gratitude to my supervisor, Professor Pasi Fränti, for giving me the chance to study in the PhD program and for his support with research throughout the years. I would never have finished this dissertation without his help and guidance. I would also like to thank my colleagues who helped me during my PhD study, especially Dr. Qinpei Zhao. I am thankful to Professor Ana Luisa N. Fred and Professor James Bailey, the reviewers of the thesis, for their feedback and comments.

I extend my heartfelt gratitude to my father and mother, my first teachers. Thank you so much for your help and support. I would like to express my deepest love and gratitude to my wife and my sons.

This research has been supported by the MOPSI and MOPIS projects, SCITECO and LUMET grants from the University of Eastern Finland, and the Nokia Foundation.

Joensuu, May 9, 2016
Mohammad Rezaei

LIST OF ORIGINAL PUBLICATIONS

P1  P. Fränti, M. Rezaei, Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 47(9), pp. 3034-3045, 2014.

P2  M. Rezaei, P. Fränti, "Set matching measures for external cluster validity", IEEE Transactions on Knowledge and Data Engineering, 2016 (accepted).

P3  M. Rezaei, P. Fränti, "Can number of clusters be solved by external validity index?", 2016 (submitted).

P4  Q. Zhao, M. Rezaei, P. Fränti, "Keyword clustering for automatic categorization", International Conference on Pattern Recognition (ICPR), pp. 845-848, 2012.

P5  M. Rezaei, P. Fränti, "Matching similarity for keyword-based clustering", Joint IAPR International Workshops, SSPR & SPR 2014 (S+SSPR), Joensuu, pp. 193-201, 2014.

Throughout the thesis, these papers will be referred to by [P1]-[P5]. These papers are included at the end of this thesis by the permission of their copyright holders.

AUTHOR'S CONTRIBUTION

The idea of the paper [P1] originates from Prof. Pasi Fränti. The author contributed by refining the definition of the centroid index and extending it to the corresponding point-level index. The principal ideas of the other papers originate from the author.

The implementations for the papers [P2], [P3], and [P5] were performed completely by the author. The author implemented the point-level index in [P1]. The implementation of the idea in [P4] was done by the author, except for the libraries and similarity measures using WordNet. The author performed all experiments for [P2]-[P5] and part of the experiments for [P1].

[P1] was written by Prof. Pasi Fränti and [P4] by Dr. Qinpei Zhao. The author helped to refine the text and provided materials for some sections of the papers. The author has written the papers [P2], [P3], and [P5].

List of symbols

N      number of data objects
X      data object as vector
x_i    i-th data object
P_i    i-th cluster of clustering solution P
K      number of clusters
c_i    centroid of the i-th cluster
n_i    number of objects in the i-th cluster
x̄      average of all data objects
D      dimension of data

Contents

1 Introduction
2 Proximity measures
  2.1 Elementary data types
  2.2 Numerical distances
  2.3 Non-numerical distances
  2.4 Semantic similarity between words
  2.5 Semantic similarity between groups of words
3 Clustering algorithms
  3.1 K-means
  3.2 Random swap
  3.3 Agglomerative clustering
  3.4 DBSCAN
4 Cost functions
  4.1 Total squared error (TSE)
  4.2 All pairwise distances (APD)
  4.3 Spanning tree (ST)
  4.4 K-nearest neighbor connectivity
  4.5 Linkage criteria
5 Internal validity indices
  5.1 Internal indices
  5.2 Sum of squares within clusters (SSW)
  5.3 Sum of squares between clusters (SSB)
  5.4 Calinski-Harabasz index (CH)
  5.5 Silhouette coefficient (SC)
  5.6 Dunn family of indices
  5.7 Solving number of clusters

6 External validity indices
  6.1 Desired properties
  6.2 Pair-counting indices
  6.3 Information-theoretic indices
  6.4 Set matching indices
  6.5 Experimental setup for evaluation
  6.6 Solving the number of clusters
7 Summary of contributions
8 Conclusions
References
Appendix: Original publications

1 Introduction

Clustering is the division of data objects into groups or clusters such that objects in the same group are more similar than objects in different groups. Clustering plays an important role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, customer relationship management (CRM), marketing, medical diagnostics, computational biology, and visualization [1].

Figure 1.1: Basic components of cluster analysis: data, clustering method (proximity measure [P5], clustering criterion, clustering algorithm), clusters, cluster validation (internal index [P4], external index [P1][P2]; clustering tendency and number of clusters [P3][P4]), results interpretation, and knowledge.

Figure 1.1 shows the components of cluster analysis. Data is represented in terms of features that form d-dimensional feature vectors. Feature extraction and selection from original entities must be performed so that the features provide as much distinction as possible between different entities concerning the task of interest. This is performed by an expert in the field.

For example, the extraction of features from a speech signal to distinguish between different people is performed by an expert in the speech processing field [2]. Moreover, extracted features may need preprocessing, such as dimensionality reduction and normalization of the features, so that all features have the same scale and contribute equally. Next, the assumption is made that the features have already been extracted and the required preprocessing has been performed.

The basic components of cluster analysis are the following:

1. Proximity measure
2. Clustering criterion
3. Clustering algorithm
4. Cluster validation
5. Results interpretation

Similarity or dissimilarity (distance) measure between two data objects is a basic requirement for clustering, and it is chosen based on the problem at hand. For example, suppose that the problem concerns a time analysis of travelling in a city. Using the Euclidean distance between two places is not accurate, because one cannot typically travel through buildings. We study several proximity measures in Chapter 2, including a new similarity between two groups of words.

Clustering criterion determines the type of clusters that are expected. The criterion is expressed as a cost (or objective) function, or some other rules. For example, for the same data set, one criterion leads to hyperspherical clusters, whereas another leads to elongated clusters [2]. The cost function is hidden in many existing clustering approaches; however, the function can be determined through further analysis. We study several cost functions in Chapter 4.

Clustering algorithm is the procedure that groups data in order to optimize the clustering criterion. Numerous clustering algorithms have been developed for different fields. Good algorithms find a clustering close to the optimum efficiently. In Chapter 3, we review basic clustering algorithms. Different clustering algorithms, and even one algorithm with different parameters and initial assumptions, can produce different clusterings for the same data set.

For a fixed number of clusters, different results can be evaluated based on the clustering criterion if available. In the general case, cluster validation techniques are used to evaluate the results of a clustering algorithm [3], and to decide which clustering best fits the data. Cluster validation is performed using cluster validity indices, which are divided into two groups: internal index and external index [P2]. Internal indices measure the quality of a clustering solution using only the underlying data [4], [5]. External indices compare two clustering solutions of the same dataset. They might compare a clustering with ground truth to evaluate a clustering algorithm. Both internal and external indices are used for determining the number of clusters. We study cluster validity indices in Chapters 5 and 6.

The goal of clustering is to provide meaningful insights into the data in order to develop a better understanding of the data. Therefore, in many cases, the expert in the application field is encouraged to interpret the resulting partitions and integrate the results with other experimental evidence and analysis in order to draw the right conclusions.


2 Proximity measures

A data object represents an entity and is described by attributes or features with a certain type, such as a number or a word. Attributes are often represented by a multidimensional vector [6]. The type of attributes is one of the factors that determines how to measure the similarity between two objects. Other factors are related to the problem at hand. For example, the similarity of two words for some applications is measured by considering the letters in the words. However, for other applications, this does not provide good results, and the semantic similarity between two words is required.

A dissimilarity or similarity measure can be effective without being a metric [7], but sometimes metric requirements are desirable. A dissimilarity metric must satisfy the following conditions [7]:

1. Non-negativity: D(x_1, x_2) ≥ 0
2. Symmetry: D(x_1, x_2) = D(x_2, x_1)
3. Reflexivity: D(x_1, x_2) = 0 if and only if x_1 = x_2
4. Triangular inequality: D(x_1, x_2) + D(x_2, x_k) ≥ D(x_1, x_k)

A similarity metric satisfies the following:

1. Limited range: S(x_1, x_2) ≤ S_0
2. Symmetry: S(x_1, x_2) = S(x_2, x_1)
3. Reflexivity: S(x_1, x_2) = S_0 if and only if x_1 = x_2
4. Triangular inequality: S(x_1, x_2) S(x_2, x_k) ≤ (S(x_1, x_2) + S(x_2, x_k)) S(x_1, x_k)

2.1 ELEMENTARY DATA TYPES

Numeric: Numeric data are classified in two groups: interval and ratio. For interval data, such as time and temperature, the interval between each consecutive point of measurement is equal to every other. They do not have a meaningful zero point. For example, 00:00 am is not the absence of time. The difference between 10:15 and 10:30 has exactly the same value as the difference between 18:00 and 18:15. In ratio data, such as the number of people in line, a value of zero indicates an absence of whatever is measured. Another classification for numeric data includes discrete data and continuous data.

Categorical: Every object belongs to one of a limited number of possible categories, states, or names. Categorical data are classified into two groups: nominal and ordinal. Categories in nominal data, such as marriage status (married, widow, single), are not ordered. Binary data can be considered as nominal data with only two states: 0 and 1. On the other hand, categories in ordinal data, such as degree of pain (severe, moderate, mild, none), are ordered.

2.2 NUMERICAL DISTANCES

Euclidean distance

Euclidean distance is the most common metric used for numerical vector objects. For two d-dimensional objects x_1 and x_2, the Euclidean distance is calculated as follows:

d(x_1, x_2) = \left( \sum_{l=1}^{d} |x_{1l} - x_{2l}|^2 \right)^{1/2}    (2.1)

Centroid-based clustering algorithms, such as K-means, that use Euclidean distance tend to provide hyperspherical clusters [6]. Euclidean distance is a special case (p=2) of a more general metric called the Minkowski distance:

d(x_1, x_2) = \left( \sum_{l=1}^{d} |x_{1l} - x_{2l}|^p \right)^{1/p}    (2.2)

Another popular special case of the Minkowski distance is the Manhattan or city-block distance, where p=1, see Figure 2.1:

d(x_1, x_2) = \sum_{l=1}^{d} |x_{1l} - x_{2l}|    (2.3)

A clustering algorithm that uses the Manhattan distance tends to build hyper-rectangular clusters [6].

Figure 2.1: Euclidean and Manhattan distances between the example points x_1 = (1, 8) and x_2 = (6, 3) (http://cs.uef.fi/pages/franti/cluster/notes.html).

Mahalanobis distance

All the objects in a cluster affect the Mahalanobis distance between two objects through the within-group covariance matrix S. Clustering algorithms that use this distance tend to build hyper-ellipsoidal clusters.

d^2(x_1, x_2) = (x_1 - x_2)^T S^{-1} (x_1 - x_2)    (2.4)

The within-group covariance matrix for uncorrelated features becomes an identity matrix and, therefore, the Mahalanobis distance simplifies to the Euclidean distance [6].
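To make the formulas concrete, here is a minimal Python/NumPy sketch (illustrative code, not from the thesis) of the Minkowski family and the squared Mahalanobis distance; with p=2 and p=1 it reproduces the Euclidean (about 7.07) and Manhattan (10) distances of the example in Figure 2.1.

    import numpy as np

    def minkowski(x1, x2, p=2):
        """Minkowski distance (eq. 2.2); p=2 gives Euclidean, p=1 Manhattan."""
        return float(np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p))

    def mahalanobis_sq(x1, x2, S):
        """Squared Mahalanobis distance (eq. 2.4) with covariance matrix S."""
        diff = x1 - x2
        return float(diff @ np.linalg.inv(S) @ diff)

    x1, x2 = np.array([1.0, 8.0]), np.array([6.0, 3.0])
    print(minkowski(x1, x2, p=2))             # Euclidean: 7.07...
    print(minkowski(x1, x2, p=1))             # Manhattan: 10.0
    print(mahalanobis_sq(x1, x2, np.eye(2)))  # identity S: squared Euclidean = 50.0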

2.3 NON-NUMERICAL DISTANCES

Cosine similarity

Cosine similarity is the most popular metric used in document clustering and is based on the angle between the vectors of two objects:

s(X_1, X_2) = \frac{X_1 \cdot X_2}{||X_1|| \, ||X_2||}    (2.5)

The more similar two objects are, the more parallel they are in the feature space, and the greater the cosine value. The cosine value does not provide information on the magnitude of the difference.

Hamming distance

Hamming distance is used for comparing categorical data and strings of equal length. It counts the number of differing elements in two objects [8]:

d(x_1, x_2) = \sum_{l=1}^{d} d_l(x_1, x_2), where d_l(x_1, x_2) = 1 if x_{1l} ≠ x_{2l}, and 0 otherwise    (2.6)

The following are some examples:

Cables, Tablet: d = 2
010010, 011001: d = 3
(male, blond, blue, A), (female, blond, brown, A): d = 2

Gower similarity is a variant of the Hamming distance, which is normalized by the number of attributes and has been extended for mixed categorical and numerical data [9]. The simple form of Gower similarity for categorical data can be written as follows:

S(x_1, x_2) = \frac{1}{d} \sum_{l=1}^{d} S_l(x_1, x_2), where S_l(x_1, x_2) = 1 if x_{1l} = x_{2l}, and 0 otherwise    (2.7)

Edit distance

Levenshtein or edit distance measures the dissimilarity of two strings (e.g., words) by counting the minimum number of insertions, deletions, and substitutions required to transform one string into the other. Several variants exist. For example, longest common subsequence (LCS) allows only insertions and deletions [10]. We describe the edit distance by an example: the dissimilarity between kitten and sitting. Transforming kitten into sitting can be performed in three steps as follows:

1. Substitute k with s: sitten
2. Substitute e with i: sittin
3. Insert g at the end: sitting

Therefore, the edit distance between the two words is 3.
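The dynamic-programming form of the edit distance can be sketched as follows (illustrative Python, not code from the thesis); it returns 3 for the kitten/sitting example above.

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance: minimum insertions, deletions, substitutions."""
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                               # delete all of a[:i]
        for j in range(n + 1):
            d[0][j] = j                               # insert all of b[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution (or match)
        return d[m][n]

    print(edit_distance("kitten", "sitting"))  # 3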

2.4 SEMANTIC SIMILARITY BETWEEN WORDS

Semantic similarity between two words is measured according to their meaning rather than their syntactical representation. Measures for the semantic similarity of words can be categorized as corpus-based, search engine-based, knowledge-based, and hybrid.

Corpus-based measures such as point-wise mutual information (PMI) [11] and latent semantic analysis (LSA) [12] define the similarity based on large corpora and term co-occurrence. The number of occurrences and co-occurrences of two words in a large number of documents is used to approximate their similarity. A high similarity is achieved when the number of co-occurrences is only slightly lower than the number of occurrences of each word.

Search engine-based measures such as Google distance are based on web counts and snippets from the results of a search engine [13] [14]. Flickr distance first searches for two target words separately through image tags and then uses image content to calculate the distance between the two words [15].

Knowledge-based measures use lexical databases such as WordNet [16] or CYC [6]. These databases can be considered computational formats of large amounts of human knowledge. The knowledge extraction process is time consuming, and the database depends on human judgment. Moreover, it does not scale easily to new words, fields, and languages [17] [18]. WordNet is a taxonomy that requires a procedure to derive a similarity score between words. Despite its limitations, it has been successfully used for clustering [P4]. Figure 2.2 illustrates a small part of the WordNet hierarchy, where mammal is the least common subsumer of wolf and hunting dog. The depth of a word is the number of links between it and the root word in WordNet. As an example, the Wu and Palmer measure [19] is defined as follows:

S(w_1, w_2) = \frac{2 \, depth(LCS(w_1, w_2))}{depth(w_1) + depth(w_2)}    (2.8)

where LCS is the least common subsumer of the words w_1 and w_2.

Figure 2.2: Part of the WordNet taxonomy (animal; fish, mammal, reptile, amphibian; wolf, horse, dog, cat; hunting dog; dachshund, terrier).

Jiang-Conrath [6] is a hybrid of corpus-based and knowledge-based methods in that it extracts the information content of two words and their least common subsumer in a corpus. Methods based on Wikipedia or similar websites are also hybrid in the sense that they use organized corpora with links between documents [20].
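As a hedged illustration, NLTK's WordNet interface implements the Wu and Palmer measure of eq. (2.8) directly; the snippet below assumes NLTK and its WordNet corpus are installed (nltk.download('wordnet')), and the synsets are chosen only to mirror Figure 2.2.

    from nltk.corpus import wordnet as wn

    dog = wn.synset('dog.n.01')
    wolf = wn.synset('wolf.n.01')
    # wup_similarity computes 2*depth(LCS) / (depth(w1) + depth(w2))
    print(dog.wup_similarity(wolf))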

2.5 SEMANTIC SIMILARITY BETWEEN GROUPS OF WORDS

The semantic clustering of objects such as documents, web sites, and movies based on their keywords requires a similarity measure between two sets of keywords. Existing measures include minimum, maximum, and average similarity. Consider the bipartite graph in Figure 2.3, where the similarity between every two words is written on their corresponding link. The minimum and maximum measures are based on the links with the minimum (0.02) and maximum (0.84) values. The average measure considers all the links and calculates the average value (0.57). These measures have fundamental limitations in providing a reasonable similarity value between two sets of words [P5]. For example, the minimum and average measures give a lower value than 1.00 for two sets with the same words. The maximum measure gives 1.00 for two different sets which have only one common word.

Figure 2.3: Minimum and maximum similarities between two location-based services (Hyve: restaurant, cafeteria, cafe; Sampon lomamökit: sauna, holiday, cottage) are derived by considering the two keywords with the minimum and maximum similarities.

In [P5], we present a new measure based on matching the words of two groups, assuming that a similarity measure between two individual words is available. The proposed matching similarity measure is based on a greedy pairing algorithm which first finds the two most similar words across the sets, and then iteratively matches the next most similar words. Finally, the remaining non-paired keywords (of the object with more keywords) are just matched with the most similar words in the other object. Figure 2.4 illustrates the matching process between two sample objects.

Figure 2.4: Matching between the words of two objects (Vesileppis: restaurant, gym, skiing, dance, spa; Tahko Spa: restaurant, gym, spa; word similarities 1.00, 1.00, 0.30, 0.67, 1.00 give S = 0.79).

Consider two objects with N_1 and N_2 keywords so that N_2 > N_1. We define the normalized similarity between the two objects as follows:

S = \frac{1}{N_2} \sum_{i=1}^{N_2} S(w_i, w_{p(i)})    (2.9)

where S(w_i, w_j) measures the similarity between two words, and p(i) provides the matched word for w_i in the other object. The proposed measure eliminates the disadvantages of the minimum, maximum, and average similarity measures.
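A minimal sketch of the greedy pairing behind the matching similarity is given below (illustrative Python; the word similarities are the toy values of Figure 2.4, and word_sim stands for any assumed word-level measure). It reproduces S of about 0.79 for the example.

    def matching_similarity(A, B, word_sim):
        """Greedy matching similarity of [P5], normalized by the larger set (eq. 2.9)."""
        if len(A) < len(B):
            A, B = B, A                      # make A the larger set
        pairs = sorted(((word_sim(a, b), a, b) for a in A for b in B), reverse=True)
        match, used_a, used_b = {}, set(), set()
        for s, a, b in pairs:                # pair the most similar words first
            if a not in used_a and b not in used_b:
                match[a] = s
                used_a.add(a); used_b.add(b)
        for a in A:                          # leftover words: best available match
            if a not in match:
                match[a] = max(word_sim(a, b) for b in B)
        return sum(match.values()) / len(A)

    sim = {('restaurant', 'restaurant'): 1.0, ('gym', 'gym'): 1.0,
           ('spa', 'spa'): 1.0, ('skiing', 'gym'): 0.30, ('dance', 'gym'): 0.67}
    word_sim = lambda a, b: sim.get((a, b), sim.get((b, a), 0.0))
    print(matching_similarity(['restaurant', 'gym', 'skiing', 'dance', 'spa'],
                              ['restaurant', 'gym', 'spa'], word_sim))  # ~0.79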

3 Clustering algorithms

3.1 K-MEANS

K-means is a partitional clustering algorithm that aims at minimizing the total squared error (TSE). To cluster N data objects into K clusters, K centroids are initially selected in some way, for example, through randomly chosen data objects. Two steps of the algorithm are then iteratively performed, assignment and update, for a fixed number of iterations or until convergence. In the first step, objects are assigned to their nearest centroid. In the second step, new centroids are calculated by averaging the objects in each cluster [21]. The time complexity is O(IKN), where I is the number of iterations [22].

K-means suffers from several drawbacks [6]. The main drawback is that the result is highly dependent on the initial selection of centroids. Different centroids lead to different local optima that may be very far away from the global one. Consequently, many variants of K-means have been proposed to tackle the obstacles. For example, several techniques such as K-means++ [23] have been proposed for the better selection of initial centroids. Iterative methods such as genetic algorithms [24] and random swap [25] improve results by modifying the centroids.
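The two alternating steps can be sketched as follows (illustrative NumPy code, not the implementation used in the thesis); the function starts from given initial centroids and returns the total squared error (TSE).

    import numpy as np

    def kmeans(X, centroids, iters=100):
        """Basic K-means: alternate assignment and update until convergence."""
        centroids = centroids.copy()
        for _ in range(iters):
            # assignment step: each object goes to its nearest centroid
            labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
            # update step: new centroid = mean of the objects in the cluster
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(len(centroids))])
            if np.allclose(new, centroids):
                break
            centroids = new
        tse = float(((X - centroids[labels]) ** 2).sum())
        return centroids, labels, tse

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))
    init = X[rng.choice(len(X), 15, replace=False)]   # random initial centroids
    centroids, labels, tse = kmeans(X, init)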

3.2 RANDOM SWAP

The randomized local search or random swap algorithm [25] selects one of the centroids in a given clustering randomly and moves it to another location. K-means is then applied to fine-tune the clustering result. The process is repeated for a given number of iterations chosen as an input parameter. In each iteration, the new resulting clustering is accepted if it improves TSE, and is then used for the next iteration. With a large number of iterations, typically 5,000, the method usually provides good results. This trial-and-error approach is simple to implement and very effective in practice.
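A sketch of this trial-and-error loop, reusing the kmeans function from the previous sketch (again illustrative; a shortened fine-tuning of two K-means iterations is used here):

    def random_swap(X, k, swaps=5000, rng=np.random.default_rng(0)):
        """Random swap: relocate one centroid, fine-tune, keep if TSE improves."""
        centroids, labels, tse = kmeans(X, X[rng.choice(len(X), k, replace=False)])
        for _ in range(swaps):
            trial = centroids.copy()
            trial[rng.integers(k)] = X[rng.integers(len(X))]  # random relocation
            c, l, t = kmeans(X, trial, iters=2)               # short fine-tuning
            if t < tse:                                       # accept improvements only
                centroids, labels, tse = c, l, t
        return centroids, labels, tse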

3.3 AGGLOMERATIVE CLUSTERING

Agglomerative clustering is a bottom-up approach in which each object is initially considered as its own cluster. Two clusters are then iteratively merged based on a criterion [26]. Several criteria have been proposed for selecting the next two clusters to be merged, such as single-linkage, average-linkage, complete-linkage, centroid-linkage, and Ward's method [27]. Classical agglomerative clustering using any of these criteria is not appropriate for large-scale data sets due to the quadratic computational complexity in both execution time and storage space. The time complexity of the basic agglomerative clustering is O(N^3). The fast algorithm introduced in [28] employs a nearest neighbor table that uses only O(N) memory and reduces the time complexity to O(τN^2), where τ << N. Even this algorithm can still be too slow for real-time applications. In [26], an algorithm based on a k-nearest neighbor graph is proposed to improve the speed close to O(N log N) with a slight decrease in accuracy. However, graph creation is the bottleneck of the algorithm and should be solved; otherwise, this step dominates the time complexity.

Agglomerative clustering is sensitive to noise and outliers. It does not consider an object after it is assigned to a cluster, and therefore, previous misclassifications cannot be corrected afterwards [6].

3.4 DBSCAN

Density Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm which aims at finding arbitrarily shaped clusters and eliminating noise. It creates clusters from the points whose neighborhood within a given radius (eps) contains a minimum number (minpt) of other points [29]. Using every such point, the algorithm grows a cluster by joining other points that are close to the cluster. The results are independent of the order of processing the objects.

Three types of points are defined, see Figure 3.1. Core points contain at least minpt (5 in this example) points in their eps neighborhood. Border points do not contain enough points in their neighborhood, but they fall in the neighborhood of some core points. Other points are considered noise or outliers. A point x_2 is directly density reachable from x_1 if x_1 is a core point and x_2 is in its eps neighborhood. A point x_2 is density reachable from a core point x_1 if a chain of points from x_1 to x_2 exists so that each point is directly density reachable from the previous point. The concept of density connectivity is also defined to describe the relations between border points that belong to the same cluster but are not density reachable from each other. Two points are density connected if they are density reachable from a common core point. A cluster is built from a core point and its neighboring objects within eps distance, and it grows using the concepts of density reachability and density connectivity. Two conditions should hold:

1. If x_1 is in cluster C, and x_2 is density reachable from x_1, then x_2 also belongs to cluster C.
2. If x_1 and x_2 belong to cluster C, then x_1 and x_2 are density connected.

The results are highly dependent on the input parameters eps and minpt. Finding appropriate parameters for a data set is not trivial, and the problem becomes more complicated when different parts of the data require different parameters. Several methods such as Ordering Points To Identify the Clustering Structure (OPTICS) [30] have been proposed to address this problem. The time complexity of the original DBSCAN is O(N^2), but efforts [31] [32] have been made to reduce it close to O(N).

Figure 3.1: Three types of points (core, border, and noise/outlier) are defined in the DBSCAN algorithm; two clusters are identified in this example, where minpt = 5.
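The cluster-growing process can be sketched as follows (an illustrative O(N^2) Python version using a full distance matrix; the indexing structures of practical implementations are omitted, and unassigned points keep the noise label -1):

    import numpy as np

    def dbscan(X, eps, minpt):
        """Grow clusters from core points; density-reachable points join them."""
        n = len(X)
        dist = np.linalg.norm(X[:, None] - X[None], axis=-1)
        neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
        labels, cluster = np.full(n, -1), 0
        for i in range(n):
            if labels[i] != -1 or len(neighbors[i]) < minpt:
                continue                     # already clustered, or not core
            labels[i], stack = cluster, list(neighbors[i])
            while stack:                     # expand via density reachability
                j = stack.pop()
                if labels[j] == -1:
                    labels[j] = cluster      # border or core point joins
                    if len(neighbors[j]) >= minpt:
                        stack.extend(neighbors[j])  # core point: keep growing
            cluster += 1
        return labels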

4 Cost functions

An objective function or cost function measures the error of a clustering. The optimal clustering is achieved by minimizing the cost function. However, not all clustering algorithms are based on minimizing an explicit cost function; some include the cost function hidden within the algorithm. This makes the evaluation of clustering results and the analysis of the algorithms difficult. For example, DBSCAN produces a clustering heuristically with two given input parameters. Different parameter values result in different clusterings, and no objective function has been reported to decide which clustering is the best. There is, however, a cost function, but it may be hidden. This chapter addresses several cost functions that are used in existing clustering methods.

4.1 TOTAL SQUARED ERROR (TSE)

Total squared error (TSE) is the objective function for most centroid-based clustering algorithms such as K-means; it is the sum of variances in the individual clusters. Given data objects x_i, i=1..N, centroids c_j, j=1..k, and data labels l_i, i=1..N, l_i = 1..k, TSE is defined as [6]:

TSE = \sum_{i=1}^{N} || x_i - c_{l_i} ||^2    (4.1)

Mean squared error (MSE) equals TSE normalized by the total number of objects; there is no difference between minimizing MSE and TSE:

MSE = \frac{1}{N} \sum_{i=1}^{N} || x_i - c_{l_i} ||^2    (4.2)

For a fixed number of clusters k, the best clustering is the one that provides the minimum TSE. However, when the number of clusters varies, the clustering that best fits the data cannot be concluded merely based on TSE, because increasing k will always provide a smaller TSE. This would lead all points into their own clusters. The TSE in equation (4.1) can be used only for data for which the centroid of a cluster can be calculated by averaging the objects in the cluster.

4.2 ALL PAIRWISE DISTANCES (APD)

This cost function considers all pairwise distances (APD) between the objects in a cluster. The centroid is not needed; therefore, APD can be used for any type of data if the distance between every two objects is available. The criterion is defined as:

APD_j = \sum_{x_1, x_2 \in C_j} || x_1 - x_2 ||^2    (4.3)

It can be shown for the Euclidean distance that [33]:

APD = APD_1 + APD_2 + ... + APD_k = n_1 TSE_1 + n_2 TSE_2 + ... + n_k TSE_k    (4.4)

where APD_j, n_j, and TSE_j are the sum of all pairwise distances, the number of objects, and the total squared error in cluster j, respectively. It is shown in [34] that applying all pairwise distances as the clustering criterion leads to more balanced clusters than TSE.

TSE can be calculated for non-numeric data without having centroids as follows. The sum of all pairwise distances is calculated for each cluster, and the result is divided by the number of objects in the cluster, giving the total squared error TSE_j. Summing up the total squared errors of all clusters results in TSE.
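The relation (4.4) is easy to verify numerically. The following illustrative snippet checks, for one random cluster, that the sum of all pairwise squared Euclidean distances equals n_j times TSE_j:

    import numpy as np

    rng = np.random.default_rng(0)
    C = rng.normal(size=(50, 3))                 # one cluster of 50 objects
    apd = sum(((C[i] - C[j]) ** 2).sum()         # all pairwise squared distances
              for i in range(len(C)) for j in range(i + 1, len(C)))
    tse = ((C - C.mean(axis=0)) ** 2).sum()      # squared error to the centroid
    print(np.isclose(apd, len(C) * tse))         # True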

4.3 SPANNING TREE (ST)

This cost function is the sum of the costs of the spanning trees (ST) of the individual clusters. The optimal solution for the cost function is achieved from the minimum spanning tree (MST) of the data objects. Given the MST in Figure 4.1 (left), we can get three clusters by cutting the two largest links. This cost function is suitable for detecting well separated, arbitrarily shaped clusters. However, it fails in real-life data sets with noise, see Figure 4.1 (right).

Figure 4.1: Spanning trees of clusters are used to derive the cost function.

4.4 K-NEAREST NEIGHBOR CONNECTIVITY

This cost function measures connectedness by counting the number of the k nearest neighbors of each object that are placed in a different cluster than the object [35]. It is calculated as:

CONN = \sum_{x} \sum_{i=1}^{k} w_i(x), where w_i(x) = 1/i if nn_i(x) \notin P_{l(x)}, and 0 otherwise    (4.5)

where nn_i(x) is the i-th nearest neighbor of x, and P_{l(x)} represents the cluster that x belongs to. The number of neighbors k is an input parameter. The cost function should be minimized. The optimal case is when all k nearest neighbors of every object are located in the same cluster as the object. The impact of the first neighbor on the cost function is the highest, and it decreases for the next neighbors by the factor 1/i, i=1..k. The 5 nearest neighbors of one object are depicted in Figure 4.2, of which the fourth and fifth neighbors are from the other cluster. The error is calculated as 1/4 + 1/5 = 0.45. Summing up the errors for all the points gives the value of the cost function.

Figure 4.2: Five nearest neighbors are considered to calculate the cost function. For the selected point, two neighbors are located in the other cluster.
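A direct sketch of eq. (4.5) follows (illustrative Python; a full distance matrix is used for simplicity). For a point whose fourth and fifth neighbors lie in another cluster it adds 1/4 + 1/5 = 0.45, as in Figure 4.2.

    import numpy as np

    def knn_connectivity(X, labels, k=5):
        """Sum of 1/i over the i-th nearest neighbors placed in another cluster."""
        dist = np.linalg.norm(X[:, None] - X[None], axis=-1)
        np.fill_diagonal(dist, np.inf)           # a point is not its own neighbor
        cost = 0.0
        for x in range(len(X)):
            nn = np.argsort(dist[x])[:k]         # the k nearest neighbors of x
            cost += sum(1.0 / i for i, j in enumerate(nn, start=1)
                        if labels[j] != labels[x])
        return cost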

4.5 LINKAGE CRITERIA

In agglomerative clustering, a global cost function has not been defined in the literature. Instead, a merge cost is defined, which aims at optimizing the clustering locally. Several criteria such as single-link and complete-link are used for merging two clusters, see Figure 4.3. We reveal the global cost function through analyzing the local ones.

The single-link criterion is the distance between the two most similar objects in two clusters. The goal of single-link is to find clusters with the highest connectivity. Two objects in a cluster can be far away from each other but connected through other points in the cluster. The cost function is the sum of the costs of the spanning trees of the individual clusters. Single-link can be related to Kruskal's algorithm, which is known to be optimal for MST. It can be shown that k clusters correspond to the MST forest of k trees.

The complete-link criterion is the distance between the two most dissimilar objects in two clusters. Complete-link aims at finding homogeneous clusters so that the maximum distance between the objects in each cluster is minimized. Once two clusters are merged, the resulting distance is the maximum distance over all clusters, which indicates the worst cluster. Given a clustering, the largest pairwise distance in each cluster is determined. The overall cost function is the maximum of the largest distances from all clusters. We call this cost function MAX-MAX. Agglomerative clustering using the complete-link criterion does not guarantee the optimal solution for the MAX-MAX cost, see Figure 4.4.

The average-link criterion selects the two clusters for which the average distance between all pairs of objects in them is minimum. The corresponding cost function is therefore all pairwise distances. The centroid-link criterion is the distance between the centroids of two clusters. It can be used only for data from which the centroids of clusters can be derived. Ward's criterion selects the clusters to be merged that result in a minimum increase in TSE [36]. The increase of TSE resulting from merging clusters i and j is calculated as:

\Delta TSE_{i,j} = \frac{n_i n_j}{n_i + n_j} || c_i - c_j ||^2    (4.6)

where c_i and c_j are the centroids, and n_i and n_j are the numbers of objects in the two clusters.

Figure 4.3: Distance between two clusters (single-link, complete-link, average-link).

Figure 4.4: Complete-link agglomerative clustering (left) results in a higher value of the cost function MAX-MAX compared to the random swap algorithm (right). The numbers show the order of merges.
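The linkage criteria of this section can be summarized in one illustrative sketch of the basic agglomerative procedure of Section 3.3 (naive recomputation of cluster distances, O(N^3) overall; the names and structure are ours, not the thesis code):

    import numpy as np

    def cluster_distance(A, B, linkage="single"):
        d = np.linalg.norm(A[:, None] - B[None], axis=-1)
        if linkage == "single":   return d.min()      # closest pair of objects
        if linkage == "complete": return d.max()      # most distant pair
        if linkage == "average":  return d.mean()     # average over all pairs
        if linkage == "ward":                         # TSE increase, eq. (4.6)
            n1, n2 = len(A), len(B)
            return n1 * n2 / (n1 + n2) * ((A.mean(0) - B.mean(0)) ** 2).sum()

    def agglomerate(X, k, linkage="single"):
        clusters = [X[i:i + 1] for i in range(len(X))]   # each object alone
        while len(clusters) > k:
            pairs = [(cluster_distance(clusters[i], clusters[j], linkage), i, j)
                     for i in range(len(clusters))
                     for j in range(i + 1, len(clusters))]
            _, i, j = min(pairs)                         # cheapest merge
            clusters[i] = np.vstack([clusters[i], clusters[j]])
            del clusters[j]
        return clusters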

5 Internal validity indices

Clustering is defined as an optimization problem in which the quality is evaluated directly from the optimization criterion. A straightforward criterion works with a fixed number of clusters k. Internal validity indices extend this to a variable k.

5.1 INTERNAL INDICES

Internal indices use a clustering and the underlying data set to assess the quality of the clustering [37]. They are designed based on the goal of clustering: placing similar objects in the same cluster and dissimilar objects in different clusters. Accordingly, two concepts are defined: intra-cluster similarity and inter-cluster similarity. Intra-cluster similarity (e.g. compactness, connectedness, and homogeneity) measures the similarity of the objects within a cluster, and inter-cluster similarity or separation measures how distant individual clusters (or their objects) are.

Compactness is suitable for clustering algorithms that tend to provide spherical clusters. Examples include centroid-based clustering algorithms such as K-means, and average-link agglomerative clustering. Connectedness is suitable for density-based algorithms such as DBSCAN [37]. Several variants of compactness and connectedness exist. The average of pairwise intra-cluster distances and the average of centroid-based similarities are representatives of compactness. A popular measure of connectedness is k-nearest neighbor connectivity, which counts violations of nearest neighbor relationships [37].

A good clustering of a data set is expected to provide well separated clusters [38]. Separation is defined in different ways. Three common methods are the distance between the closest objects, the most distant objects, and the centers of two clusters [39].

Several internal indices have been proposed that combine compactness and separation [3] [37] [39] [40] [41] [42]. Popular indices are listed in Table 5.1. Most of the indices have been invented for determining the number of clusters that fits the data.

Table 5.1: Selection of popular internal validity indices

SSW [43]: SSW = \sum_{i=1}^{N} || x_i - c_{l_i} ||^2

SSB [43]: SSB = \sum_{j=1}^{K} n_j || c_j - \bar{x} ||^2

Calinski-Harabasz [44]: CH = \frac{SSB / (K-1)}{SSW / (N-K)}

Ball&Hall [45]: SSW / K

Xu-index [46]: D \log_2 \sqrt{SSW / (D N^2)} + \log K

Dunn's index [47]: DI = \frac{\min_i \min_{j \neq i} d(c_i, c_j)}{\max_k diam(c_k)}, where d(c_i, c_j) = \min_{x \in c_i, x' \in c_j} ||x - x'|| and diam(c_k) = \max_{x, x' \in c_k} ||x - x'||

Davies&Bouldin [48]: DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} R_{ij}, where R_{ij} = \frac{MSE_i + MSE_j}{d(c_i, c_j)} and MSE_i = \frac{1}{n_i} \sum_{x \in c_i} || x - c_i ||^2

SC [49]: SC = \frac{1}{N} \sum_{p=1}^{N} \frac{b(x_p) - a(x_p)}{\max(a(x_p), b(x_p))}, where a(x_p) is the average distance of x_p to the other objects of its cluster and b(x_p) = \min_{q' \neq q} \frac{1}{n_{q'}} \sum_{x \in C_{q'}} d(x_p, x)

BIC [43]: BIC = L(K) - \frac{1}{2} K (D+1) \log N

Xie-Beni [50]: XB = \frac{\sum_{k=1}^{K} \sum_{i=1}^{N} u_{ik} || x_i - c_k ||^2}{N \min_{s \neq t} || c_s - c_t ||^2}

WB [51]: WB = K \cdot SSW / SSB

5.2 SUM OF SQUARES WITHIN CLUSTERS (SSW)

Sum of squares within clusters (SSW) [43], or within-cluster variance, is equal to the TSE, see Figure 5.1. The index can only be used for numerical data because it requires the centroids of clusters. SSW measures the compactness of clusters and is suitable for centroid-based clustering, where hyperspherical clusters are desired. The value of SSW always decreases as the number of clusters increases.

Figure 5.1: Illustration of the sum of squares within clusters.

5.3 SUM OF SQUARES BETWEEN CLUSTERS (SSB)

The sum of squares between clusters (SSB) [43] measures the degree of separation between clusters by calculating the between-cluster variance. The separation between clusters is determined according to the distances of the centroids to the mean vector of all objects, see Figure 5.2. The factor n_j in the formula presented in Table 5.1 indicates that a bigger cluster has more impact on the index. This criterion requires the centroids or prototypes of the clusters and all the data. Increasing the number of clusters usually results in a larger SSB value.

Figure 5.2: Illustration of the sum of squares between clusters.

5.4 CALINSKI-HARABASZ INDEX (CH)

The Calinski-Harabasz (CH) [44] index uses the ratio of separation and compactness to provide the best possible separation and compactness simultaneously. A maximum of the index value indicates the best clustering, with a high separation and a low error in compactness. A higher number of clusters for a data set provides a higher SSB and a lower SSW. However, the decrease in SSW is larger than that of SSB; therefore, the penalty factor (K-1) prevents concluding a higher number of clusters than the correct one. The term N-K is considered to support cases in which the number of clusters is comparable to the total number of objects. However, usually N is much higher than K, and the term can be shortened to N. This index, similar to SSB and SSW, is limited to numerical data with hyperspherical clusters.
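A short illustrative computation of SSW, SSB, and CH for a labeled numerical data set (the names are ours, not the thesis code):

    import numpy as np

    def ch_index(X, labels):
        """Calinski-Harabasz: (SSB/(K-1)) / (SSW/(N-K)); higher is better."""
        K, N, mean = labels.max() + 1, len(X), X.mean(axis=0)
        ssw = ssb = 0.0
        for j in range(K):
            Cj = X[labels == j]
            ssw += ((Cj - Cj.mean(axis=0)) ** 2).sum()              # compactness
            ssb += len(Cj) * ((Cj.mean(axis=0) - mean) ** 2).sum()  # separation
        return (ssb / (K - 1)) / (ssw / (N - K))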

5.5 SILHOUETTE COEFFICIENT (SC)

The silhouette coefficient (SC) [49] measures how well each object is placed in its cluster and separated from the objects in other clusters. The average dissimilarity of each object x with all objects in the same cluster is calculated as a(x), which indicates how well x is assigned to its cluster. The lowest average dissimilarity of x to the other clusters is calculated as b(x).

SC = \frac{1}{N} \sum_{p=1}^{N} \frac{b(x_p) - a(x_p)}{\max(a(x_p), b(x_p))}    (5.1)

The dissimilarity between two objects is sufficient for calculating the index. Therefore, SC can be used for any type of data and any clustering structure.
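Because only pairwise dissimilarities are needed, SC can be sketched from a distance matrix alone (illustrative Python; D is any N-by-N dissimilarity matrix):

    import numpy as np

    def silhouette(D, labels):
        """Mean silhouette value (eq. 5.1) from a pairwise distance matrix."""
        N, ks = len(D), np.unique(labels)
        s = np.empty(N)
        for i in range(N):
            own = labels == labels[i]
            own[i] = False                   # exclude the object itself
            a = D[i, own].mean() if own.any() else 0.0
            b = min(D[i, labels == k].mean() for k in ks if k != labels[i])
            s[i] = (b - a) / max(a, b)
        return float(s.mean())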

5.6 DUNN FAMILY OF INDICES

The Dunn index [47] is defined as follows:

DI = \frac{\min_{1 \le i \le K} \min_{j \neq i} d(c_i, c_j)}{\max_{1 \le k \le K} diam(c_k)}    (5.2)

where d(c_i, c_j) is the dissimilarity between two clusters and diam(c_k) = max d(x_i, x_j), x_i, x_j ∈ c_k, is the diameter of cluster c_k. The numerator of the equation is a measure of separation, the distance between the two closest clusters. The diameter of a cluster shows the dispersion (the opposite of compactness) of the cluster; the cluster with the maximum diameter is considered. A larger value of the index indicates a better clustering of a data set, with more compact and well separated clusters.

The Dunn index is sensitive to noise and has a high time complexity [5]. Three related indices have been introduced in [5] based on the Dunn index to alleviate these limitations. They are called Dunn-like indices.

5.7 SOLVING NUMBER OF CLUSTERS

To determine the number of clusters, clustering is applied to the data set for a range of k in [K_min, K_max], and the validity index values are calculated. The best number of clusters k* is selected according to the extremum of the validity index. Figure 5.3 shows data set S1 with 15 clusters and the normalized values of SSW and SSB. The random swap clustering algorithm [25] is applied while the number of clusters is varied in the range [2, 25].

Figure 5.3: Data set S1 (left), and the measured values of SSW and SSB (right).

The error in compactness measured by SSW decreases, and the separation measured by SSB increases, as the number of clusters increases. However, the decreasing and increasing rates reduce significantly after k=15, a knee point that indicates the correct number of clusters. Several methods for detecting the knee point have been summarized in [43], but none of them work in all cases.
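The selection loop of this section can be sketched as follows, reusing the kmeans and ch_index functions from the earlier sketches (illustrative; here a single index, CH, is maximized instead of inspecting the SSW/SSB knee):

    import numpy as np

    def solve_number_of_clusters(X, kmin=2, kmax=25, rng=np.random.default_rng(0)):
        """Cluster for each k and return the k maximizing the CH index."""
        scores = {}
        for k in range(kmin, kmax + 1):
            init = X[rng.choice(len(X), k, replace=False)]
            _, labels, _ = kmeans(X, init)
            scores[k] = ch_index(X, labels)
        return max(scores, key=scores.get), scores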

It would be easier to use a validity index that provides a clear minimum or maximum value at the correct number of clusters. For example, CH [44] provides a maximum by considering both SSW and SSB, and also a penalty factor on the number of clusters k, see Figure 5.4.

Figure 5.4: Determining the number of clusters for the data set S1 using the CH index.

Most of the existing internal indices require the prototypes of the clusters, but these are not always easy to calculate, such as in a clustering of words based on their semantic similarity. In [P4], we introduce a new internal index to be used for determining the number of clusters in a hierarchical clustering of words. To find out which level of the hierarchy provides the best categorization of the data, an internal index needs to evaluate the compactness within clusters and the separation between clusters at each level. We define the proposed index as the ratio of compactness and separation:

SC(k) = \frac{C(k)}{S(k)}    (5.3)

C(k) = \max_{t=1..k} \max \{ JC(w_i, w_j) : w_i, w_j \in c_t \} + I/N    (5.4)

S(k) = \frac{ \sum_{t=1}^{k} \sum_{s=t+1}^{k} \min \{ JC(w_i, w_j) : w_i \in c_t, w_j \in c_s \} }{ k(k-1)/2 }    (5.5)

where w_i is the i-th keyword, c_t is cluster t at the level of the hierarchy where the number of clusters is k, JC is the Jiang & Conrath function that measures the distance of two words, I is the number of clusters with only one word, and N is the total number of words.

Compactness measures the maximum pairwise distance in each cluster and takes the maximum value among all clusters. Compactness for clusters with a single object cannot be considered zero, because the clustering in which each object is in its own cluster would then result in the best compactness. To avoid this, we add the factor I/N to the compactness equation. At the beginning of clustering, when each object belongs to its own cluster, the compactness equals 1 because I=N.

Separation measures the minimum distance between the words of every two clusters and sums up the values. Normalization by k(k-1)/2 provides a value on the same scale as compactness. A good clustering provides a small distance value for compactness and a large distance value for separation. Therefore, the level of the hierarchy with k clusters that results in the minimum SC is selected as the best level.

6 External validity indices

External validity indices measure how well the results of a clustering match the ground truth (if available) or another clustering [53] [P2]. They are the criteria for testing and evaluating clustering results and for the analysis of clustering tendency in a data set. Some authors define an external index for comparing a clustering with ground truth [4] [37] and define a relative index for comparing two clusterings of a data set [3] [5]. However, many others classify both as external indices. External indices have been used in ensemble clustering [40] [54] [55] [56], genetic algorithms [57], and evaluating the stability of k-means [55].

In this section, we first introduce several properties for a validity index based on which its performance can be evaluated. We then provide a review of the external indices in three categories: pair-counting, information-theoretic, and set-matching, see Table 6.1 [P2]. Finally, we describe our new setup of experiments for evaluating the external indices.

Given two partitions P = {P_1, P_2, ..., P_K} of K clusters and G = {G_1, G_2, ..., G_K'} of K' clusters, an external validity index measures the similarity between P and G. Most external indices are derived using the values in the contingency table of P and G, see Table 6.2. The table is a K x K' matrix where n_ij is the number of objects that are both in clusters P_i and G_j: n_ij = |P_i ∩ G_j|; n_i and m_j are the sizes of clusters P_i and G_j, respectively.

Table 6.1: External validity indices

Pair-counting measures:
- Rand index [58]: RI = \frac{a + d}{N(N-1)/2}
- Adjusted Rand index [59]: ARI = \frac{RI - E(RI)}{1 - E(RI)}

Information-theoretic measures:
- Mutual information [60]: MI = \sum_{i=1}^{K} \sum_{j=1}^{K'} \frac{n_{ij}}{N} \log \frac{N n_{ij}}{n_i m_j}
- Normalized mutual information [60]: NMI = \frac{2 MI(P,G)}{H(P) + H(G)}, or NMI = \frac{MI(P,G)}{\sqrt{H(P) H(G)}}
- Normalized variation of information [61]: NVI = \frac{H(P) + H(G) - 2 MI(P,G)}{H(P) + H(G)}

Set-matching measures:
- F-measure [62]: FM = \frac{1}{N} \sum_{i=1}^{K} n_i \max_j \frac{2 n_{ij}}{n_i + m_j}
- Criterion H [63]: H = \frac{1}{N} \max_{\sigma} \sum_{i} n_{i \sigma(i)}, where σ is a one-to-one matching between the clusters of P and G
- Normalized Van Dongen [64]: NVD = \frac{2N - \sum_{i=1}^{K} \max_j n_{ij} - \sum_{j=1}^{K'} \max_i n_{ij}}{2N}
- Purity [5]: Purity = \frac{1}{N} \sum_{i=1}^{K} \max_j n_{ij}
- Centroid index [P1]: CI(P, G) = \sum_{j=1}^{K'} orphan(G_j); CI2(P, G) = \max( CI(P, G), CI(G, P) )
- Centroid similarity index [P1]: CSI = \frac{ \sum_{i=1}^{K} n_{i, j(i)} + \sum_{j=1}^{K'} n_{i(j), j} }{2N}, where i(j) and j(i) are the indices of the matched clusters
- Centroid ratio [65]: CR = \frac{1}{K} \sum_{i=1}^{K} \gamma_i, where γ_i = 0 for an unstable centroid pair and 1 for a stable pair
- Pair sets index [P2]: PSI = \frac{S - E(S)}{\max(K, K') - E(S)} if S ≥ E(S), and 0 otherwise, where S = \sum_{i=1}^{\min(K,K')} \frac{n_{i, j(i)}}{\max(n_i, m_{j(i)})} over the paired clusters and E(S) is its expected value for random partitions

Table 6.2: Contingency table for two partitions P and G

         G_1    G_2    ...   G_K'   sum
  P_1    n_11   n_12   ...   n_1K'  n_1
  P_2    n_21   n_22   ...   n_2K'  n_2
  ...
  P_K    n_K1   n_K2   ...   n_KK'  n_K
  sum    m_1    m_2    ...   m_K'   N

6.1 DESIRED PROPERTIES

An external validity index needs to satisfy several properties to be consistent and comparable for different data sets and clustering structures.

Normalization transforms the index to a fixed range, for example [0, 1], which makes comparison easier for data sets of different size and structure. Normalization is the most commonly agreed property in the clustering community [66], and is usually performed as:

n(I_d)(P, G) = \frac{I_d - \min(I_d)}{\max(I_d) - \min(I_d)}    (6.1)

where min(I_d) and max(I_d) are the minimum and maximum values of I_d.

Index values are expected to be constant when different random clusterings are compared with a ground truth [59]. A random partition is created by selecting a random number of clusters of random size. The similarity between the random partition and the ground truth originates merely by chance. Take the example of the Rand index: the value of the index for two random partitions is not a constant, and it lies in a narrower range of [0.5, 1] instead of [0, 1]. By correction for chance, or adjustment, the expected value of an index E(I) is transformed to zero (similarity) or one (dissimilarity) [59] [67]. Adjustment and normalization can be performed jointly as follows:

Dissimilarity: I_d^{ad}(P_1, P_2) = \frac{I_d - \min(I_d)}{E(I_d) - \min(I_d)};  Similarity: I_s^{ad}(P_1, P_2) = \frac{I_s - E(I_s)}{\max(I_s) - E(I_s)}    (6.2)

where the minimum (similarity) or maximum (dissimilarity) is replaced by the expected value E(I).

The metric property has also been considered. Although a similarity/dissimilarity measure can be effective without being a metric [7], it is sometimes preferred. Considering a dissimilarity index I_d and clusterings P_1, P_2 and P_3, the metric properties require [68]:

1. Non-negativity: I_d(P_1, P_2) ≥ 0
2. Reflexivity: I_d(P_1, P_2) = 0 if and only if P_1 = P_2
3. Symmetry: I_d(P_1, P_2) = I_d(P_2, P_1)
4. Triangular inequality: I_d(P_1, P_2) + I_d(P_2, P_3) ≥ I_d(P_1, P_3)

A similarity metric satisfies the following:

1. Limited range: I_s(P_1, P_2) ≤ I_0 < ∞
2. Reflexivity: I_s(P_1, P_2) = I_0 if and only if P_1 = P_2
3. Symmetry: I_s(P_1, P_2) = I_s(P_2, P_1)
4. Triangular inequality: I_s(P_1, P_2) I_s(P_2, P_3) ≤ ( I_s(P_1, P_2) + I_s(P_2, P_3) ) I_s(P_1, P_3)

The triangular inequality for a similarity index I_s is derived here according to the corresponding inequality for a dissimilarity index defined as c/I_s (c > 0). However, other forms of the inequality are possible by defining other dissimilarities, such as max(I_s) - I_s. It is trivial to show that if c/I_s (or max(I_s) - I_s) is a dissimilarity metric, then I_s is a similarity metric as well. Hence, the metric properties for a similarity index can be checked for its corresponding dissimilarity [P2].

Cluster size imbalance signifies that a data set can include clusters with large differences in their sizes. Some researchers argue that larger clusters are more important than smaller clusters, but we assume that each cluster has the same importance independent of its size. Invariance with respect to the size of clusters is therefore another desired property of an index. The size of a data set should not affect the index either [P2].

An index should be independent of the number of clusters. Some indices, such as the Rand index (RI) [58], give higher similarity when the number of clusters grows [68]. An index should also be applicable for comparing two clusterings with different numbers of clusters.

Monotonicity is another required property. This property states that the similarity of two clusterings monotonically decreases as their difference increases [P2].

Once these desired properties are met, index values for different data sets are on the same scale and comparable. For instance, if an index gives 90% and 70% similarities, 90% should represent the higher similarity. However, this is true only if the index is independent of the data set and its clustering structure [P2].

6.2 PAIR-COUNTING INDICES

Pair-counting measures count the pairs of points on which two clusterings agree or disagree. For instance, if two objects in one cluster in the first partition are also placed in the same cluster in the second partition, then this is considered an agreement. Most existing external validity indices are classified in this group [P2]. Four values are defined: a represents the number of pairs that are in the same cluster both in P and G; b represents the number of pairs that are in the same cluster in P but in different clusters in G; c represents the number of pairs that are in different clusters in P but in the same cluster in G; d represents the number of pairs that are in different clusters both in P and G. The values a and d count agreements, while b and c count disagreements. Examples of each case are illustrated in Figure 6.1. The values of a, b, c, and d can be calculated from the contingency table [59] as follows:

a = \frac{1}{2} \sum_{i=1}^{K} \sum_{j=1}^{K'} n_{ij} ( n_{ij} - 1 )

b = \frac{1}{2} \left( \sum_{i=1}^{K} n_i^2 - \sum_{i=1}^{K} \sum_{j=1}^{K'} n_{ij}^2 \right)
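The following illustrative snippet computes a, b, c, and d from the contingency table of two label vectors and evaluates the Rand index of Table 6.1:

    import numpy as np

    def pair_counts(labels_p, labels_g):
        """Pair counts a, b, c, d from the contingency table n_ij."""
        n = np.zeros((labels_p.max() + 1, labels_g.max() + 1))
        for i, j in zip(labels_p, labels_g):
            n[i, j] += 1
        N = n.sum()
        a = (n * (n - 1) / 2).sum()                        # same cluster in both
        b = (n.sum(1) * (n.sum(1) - 1) / 2).sum() - a      # same in P only
        c = (n.sum(0) * (n.sum(0) - 1) / 2).sum() - a      # same in G only
        d = N * (N - 1) / 2 - a - b - c                    # different in both
        return a, b, c, d

    P = np.array([0, 0, 0, 1, 1, 1])
    G = np.array([0, 0, 1, 1, 2, 2])
    a, b, c, d = pair_counts(P, G)
    print((a + d) / (a + b + c + d))                       # Rand index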