Using internal evaluation measures to validate the quality of diverse stream clustering algorithms

Vietnam J Comput Sci (2017) 4:171-183
DOI 10.1007/s40595-016-0086-9

REGULAR PAPER

Using internal evaluation measures to validate the quality of diverse stream clustering algorithms

Marwan Hassani (1), Thomas Seidl (2)

Received: 23 December 2015 / Accepted: 30 September 2016 / Published online: 14 October 2016
© The Author(s) 2016. This article is published with open access at Springerlink.com

Abstract  Measuring the quality of a clustering algorithm has shown to be as important as the algorithm itself. It is a crucial part of choosing the clustering algorithm that performs best for an input data set. Streaming input data have many features that make them much more challenging than static ones: they are endless, varying and emerging with high speeds. This raised new challenges for the clustering algorithms as well as for their evaluation measures. Up till now, external evaluation measures were exclusively used for validating stream clustering algorithms. While external validation requires a ground truth which is not provided in most applications, particularly in the streaming case, internal clustering validation is efficient and realistic. In this article, we analyze the properties and performances of eleven internal clustering measures. In particular, we apply these measures to carefully synthesized stream scenarios to reveal how they react to clusterings on evolving data streams, using both k-means-based and density-based clustering algorithms. A series of experimental results show that, different from the case with static data, the Calinski-Harabasz index performs the best in coping with common aspects and errors of stream clustering for k-means-based algorithms, while the revised validity index performs the best for density-based ones.
Keywords  Stream clustering · Internal evaluation measures · Clustering validation · MOA

Marwan Hassani, m.hassani@tue.nl
Thomas Seidl, seidl@dbs.ifi.lmu.de

1 Architecture of Information Systems Group, Eindhoven University of Technology, Eindhoven, The Netherlands
2 Database Systems Group, LMU Munich, Munich, Germany

1 Introduction

Clustering of data objects is a well-established data mining task that aims at grouping these objects. The grouping is made such that similar objects are aggregated together in the same group (or cluster) while dissimilar ones are grouped in different clusters. In this context, the definition of similarity, and thus the final clustering, is highly dependent on the applied distance function between the data objects. Different from classification, clustering does not use a subset of the data objects with known class labels to learn a classification model. As a completely unsupervised task, clustering calculates the similarity between objects without having any information about their correct distribution (also known as the ground truth). The latter fact motivated the research in the field of clustering validation notably more than the field of classification evaluation. It has even been stated that clustering validation is regarded as important as the clustering itself [32]. There are two types of clustering validation [31]. External validation compares the clustering result to a reference result which is considered as the ground truth. If the result is somehow similar to the reference, we regard this final output as a good clustering. This validation is straightforward when the similarity between two clusterings has been well defined; however, it has the fundamental caveat that the reference result is not provided in most real applications. Therefore, external evaluation is largely used for synthetic data and mostly for tuning clustering algorithms. Internal validation is the other type of clustering evaluation, where the clustering is evaluated using only the result itself, i.e., the structure of the found clusters and their relations to each other.
This is much more realistic and efficient in many real-world scenarios, as it does not refer to any assumed reference from outside, which is not always feasible to obtain. Particularly, with the huge increase of data size and dimensionality as in recent applications with streaming data outputs, one can hardly claim that a complete knowledge of the ground truth is available or always valid. Obviously, clustering evaluation is a stand-alone process that is not included within the clustering task. It is usually performed after the final clustering output is generated. However, internal evaluation methods have been used in the validation phase within some clustering algorithms like k-means [29], k-medoids [26], EM [8] and k-center [19]. Stream clustering deals with evolving input objects where the distribution, the density and the labels of objects are continuously changing [16]. Whether it is high-dimensional stream clustering [14,24], hierarchical stream clustering [15,23] or sensor data clustering [19-21], evaluating the clustering output using external evaluation measures (like SubCMM [17,18]) requires a ground truth that is very difficult to obtain in the above-mentioned scenarios. For the previous reasons, we focus in this article on internal clustering validation and study its usability for drifting streaming data. To fairly discuss the ability of internal measures to validate the quality of different types of stream clustering algorithms, we expand the study to cover both a k-means-based stream clustering algorithm [1] and a density-based one [6]. This is mainly motivated by the fact that those algorithms are good representatives of the two main categories of stream clustering algorithms.

The remainder of this article is organized as follows: Sect. 2 examines some popular criteria for deciding whether found clusters are valid, and the general procedure we use in this article to evaluate stream clustering. In Sect. 3, we list eleven of the most widely used internal evaluation measures and shortly show how they are actually exploited in clustering evaluation. In Sect. 4, we introduce a set of thorough experiments on different kinds of data streams with different errors to show the behaviors of these internal measures in practice with a k-means-based stream clustering algorithm. In addition, we investigate more concretely how the internal measures react to stream-specific properties of the data. To do this, several common error scenarios in stream clusterings are simulated and also evaluated with internal clustering validation. In Sect. 5, the internal evaluation measures are again used to validate a density-based stream clustering. This is done by first extracting a ground truth of the clustering quality using external evaluation measures and then checking which of the internal measures has the highest correlation with that ground truth. Finally, in Sect. 6, we summarize the contents of this article. This article further discusses the initial technical results introduced in [22] and extends them by elaborating the algorithmic description in Sect. 2, enriching the results in Sect. 4 and introducing Sect. 5 completely.

2 Internal clustering validation

In this section, we describe our concept of internal clustering validation and how it is realized for existing internal validation measures. Additionally, we will show an abstract procedure for making use of these measures in streaming environments in practice.

2.1 Validation criteria

Contrary to external validation, internal clustering validation is based only on the intrinsic information of the data. Since we can only refer to the input dataset itself, internal validation needs assumptions about a good structure of found clusters, which are normally given by the reference result in external validation. Two main concepts, compactness and separation, are the most popular ones. Most other concepts are actually just combinations or variations of these two [34]. Compactness measures how closely data points are grouped in a cluster. Grouped points in the cluster are supposed to be related to each other, by sharing a common feature which reflects a meaningful pattern in practice.
Compactness is normally based on distances between in-cluster points. A very popular way of calculating the compactness is through variance, i.e., the average distance to the mean, which estimates how tightly the objects are bonded together around their mean as the cluster center. A small variance indicates a high compactness (cf. Fig. 1). Quantitatively, one way of calculating the compactness using the average distance is explained in Eq. 1. Separation measures how different the found clusters are from each other. Users of clustering algorithms are not interested in similar or vague patterns where clusters are not well separated (cf. Fig. 2). A distinct cluster that is far from the others corresponds to a unique pattern. Similar to the compactness, the distances between objects are widely used to measure separation, e.g., pairwise distances between cluster centers, or pairwise minimum distances between objects in different clusters. Separation is an inter-cluster criterion in the sense of the relation between clusters. An example of how to quantitatively calculate the separation using the average distance is explained in Eq. 2.

Fig. 1  Clusters on the left have better compactness than the ones on the right
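As an illustrative sketch of the two criteria (simple average-distance estimators assumed here, not the exact formulas of Eqs. 1 and 2), compactness can be taken as the average distance of points to their cluster mean, and separation as the average pairwise distance between cluster centers:

```python
import numpy as np

def compactness(clusters):
    """Mean distance of points to their own cluster center (smaller = more compact)."""
    per_cluster = [np.linalg.norm(pts - pts.mean(axis=0), axis=1).mean()
                   for pts in clusters]
    return float(np.mean(per_cluster))

def separation(clusters):
    """Mean pairwise distance between cluster centers (larger = better separated)."""
    centers = [pts.mean(axis=0) for pts in clusters]
    pair = [np.linalg.norm(centers[i] - centers[j])
            for i in range(len(centers)) for j in range(i + 1, len(centers))]
    return float(np.mean(pair))

# two tight 2-D clusters whose centers are 5*sqrt(2) apart
a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
b = a + 5.0
```

On this toy input, the compactness value is two orders of magnitude smaller than the separation value, matching the intuition of Figs. 1 and 2.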

Fig. 2  Clusters on the left have better separation than the ones on the right

2.2 General procedure

Using a carefully generated synthetic data set, where we know the underlying partitioning and the distribution of the data, we apply the internal validation measures using different parameters of the clustering algorithms. The target is now to observe which of the evaluation measures reaches its best value when setting the parameters of the selected clustering algorithm to best reflect the distribution of the data set. We collect the values of the internal measures for each batch, and finally average the values over all batches. An abstract procedure of this process is listed in Algorithm 1. The algorithm covers both the case of a k-means-based algorithm and that of a DBSCAN-based algorithm.

Algorithm 1: InternalValidationProcedure()
  Prepare the current stream batch from the dataset;
  Initialize the clustering algorithm (a k-means-based or a DBSCAN-based one);
  Initialize a set T of all combinations of meaningful ranges for each parameter;
  foreach parameter setting ps in T do
    Run the selected clustering algorithm with the parameter setting ps;
    foreach batch in the stream do
      Compute the corresponding internal validation index of the clustering output;
    end
    Average the clustering quality of the validation index over all batches from the stream;
  end
  if the current algorithm is k-means-based then
    Check which index reaches its best values with the correct number of generated clusters k in the data set;
  else
    Check which parameter setting ps* in T causes the best values of the external evaluation measures for the current DBSCAN-based algorithm;
    Check which internal index has the highest correlation with the external measures w.r.t. ps*;
  end

3 Considered existing internal evaluation measures

In this section, we briefly review the eleven internal clustering measures most used in recent works. For each measure, one can easily figure out which design criteria are chosen and how they are realized in mathematical form.
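Before turning to the individual measures, note that the outer loops of Algorithm 1 (Sect. 2.2) amount to a grid search whose score is a batch-averaged internal index. A minimal Python sketch of that mechanic, assuming hypothetical `run_clustering` and `index_value` callables (the toy stand-ins below are not part of the paper):

```python
from itertools import product
from statistics import mean

def internal_validation(batches, param_grid, run_clustering, index_value):
    """Grid search over parameter settings: cluster every batch with each
    setting and average the internal index over all batches (Algorithm 1 sketch)."""
    results = {}
    for combo in product(*param_grid.values()):
        ps = dict(zip(param_grid.keys(), combo))
        scores = [index_value(run_clustering(batch, **ps)) for batch in batches]
        results[combo] = mean(scores)
    return results

# toy stand-ins: the "clusterer" just records k, the "index" rewards k == 3
run = lambda batch, k: (batch, k)
idx = lambda clustering: -abs(clustering[1] - 3)
avg = internal_validation([[1], [2]], {"k": [2, 3, 4]}, run, idx)
best_k = max(avg, key=avg.get)[0]
```

The final check of Algorithm 1 then reduces to asking whether `best_k` coincides with the true number of generated clusters.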
We will first introduce important notations used in the formulas of these measures: D is the input dataset, n is the number of points in D, g is the center of the whole dataset D, P is the number of dimensions of D, NC is the number of clusters, C_i is the i-th cluster, n_i is the number of data points in C_i, c_i is the center of cluster C_i, σ(C_i) is the variance vector of C_i, and d(x, y) is the distance between points x and y. For convenience, we will assign an abbreviation to each measure and use it throughout the rest of this article. First, some measures are designed to evaluate only one of compactness or separation. The simplest one is the Root-mean-square standard deviation (RMSSTD):

RMSSTD = \left( \frac{\sum_{i} \sum_{x \in C_i} \| x - c_i \|^2}{P \sum_{i} (n_i - 1)} \right)^{1/2}    (1)

This measure is the square root of the pooled sample variance of all the attributes; it measures only the compactness of the found clusters [10]. Another measure, which considers only the separation between clusters, is the R-squared (RS) [10]:

RS = \frac{\sum_{x \in D} \| x - g \|^2 - \sum_{i} \sum_{x \in C_i} \| x - c_i \|^2}{\sum_{x \in D} \| x - g \|^2}    (2)

RS is the complement of the ratio of the sum of squared distances between objects in different clusters to the total sum of squares. It is an intuitive and simple formulation for measuring the differences between clusters. Another measure considering only separation is the Modified Hubert Γ statistic (Γ) [25]:

\Gamma = \frac{2}{n(n-1)} \sum_{i \neq j,\; i,j \in \{1,\dots,NC\}} \; \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)\, d(c_i, c_j)    (3)

Γ calculates the average weighted pairwise distance between data points belonging to different clusters, by multiplying each distance by the distance between the centers of their clusters. The following measures are designed to reflect both compactness and separation at the same time. Naturally, considering only one of the two criteria is not enough to evaluate complex clusterings. We will first introduce the Calinski-Harabasz index (CH) [5]:

CH = \frac{\sum_{i} n_i\, d^2(c_i, g) / (NC - 1)}{\sum_{i} \sum_{x \in C_i} d^2(x, c_i) / (n - NC)}    (4)

CH measures the two criteria simultaneously with the help of the average between- and within-cluster sums of squares. The numerator reflects the degree of separation in the way of how much the cluster centers are spread, and the denominator corresponds to compactness, reflecting how closely the in-cluster objects are gathered around the cluster center. The following two measures also share this type of formulation, i.e., numerator-separation/denominator-compactness. First, the I index (I) [30]:

I = \left( \frac{1}{NC} \cdot \frac{\sum_{x \in D} d(x, g)}{\sum_{i} \sum_{x \in C_i} d(x, c_i)} \cdot \max_{i,j} d(c_i, c_j) \right)^{P}    (5)

To measure separation, I adopts the maximum distance between cluster centers. For compactness, the distance from a data point to its cluster center is used, as in CH. Another famous measure is the Dunn index (D) [9]:

D = \min_{i} \left\{ \min_{j} \left( \frac{\min_{x \in C_i,\, y \in C_j} d(x, y)}{\max_{k} \left( \max_{x, y \in C_k} d(x, y) \right)} \right) \right\}    (6)

D uses the minimum pairwise distance between points in different clusters as the inter-cluster separation and the maximum diameter among all clusters as the intra-cluster compactness. As mentioned above, CH, I, and D follow the form (Separation)/(Compactness), though they use different distances and different weights for the two factors. The optimal cluster number can be achieved by maximizing these three indices. Another commonly used measure is the Silhouette index (S) [33]:

S = \frac{1}{NC} \sum_{i} \left\{ \frac{1}{n_i} \sum_{x \in C_i} \frac{b(x) - a(x)}{\max[b(x), a(x)]} \right\}    (7)

where a(x) = \frac{1}{n_i - 1} \sum_{y \in C_i,\, y \neq x} d(x, y) and b(x) = \min_{j \neq i} \left[ \frac{1}{n_j} \sum_{y \in C_j} d(x, y) \right]. S does not take c_i or g into account and uses the pairwise distance between all the objects in a cluster for estimating compactness (a(x)). Here, b(x) measures the separation with the average distance of objects to the alternative cluster, i.e., the second closest cluster.
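As an illustrative sketch (not code from the paper), S can be computed directly on a small labeled dataset; it assumes every cluster holds at least two points so that a(x) is defined:

```python
import numpy as np

def silhouette(X, labels):
    """Silhouette index as in Eq. 7: per-cluster mean of (b(x)-a(x))/max(a(x),b(x)),
    then averaged over clusters. Assumes every cluster has >= 2 points."""
    ks = np.unique(labels)
    per_cluster = []
    for k in ks:
        same = X[labels == k]
        vals = []
        for x in same:
            a = np.linalg.norm(same - x, axis=1).sum() / (len(same) - 1)  # a(x): mean intra-cluster distance
            b = min(np.linalg.norm(X[labels == j] - x, axis=1).mean()     # b(x): mean distance to the
                    for j in ks if j != k)                                # closest other cluster
            vals.append((b - a) / max(a, b))
        per_cluster.append(float(np.mean(vals)))
    return float(np.mean(per_cluster))

# two tight, well-separated clusters should score close to 1
X = np.vstack([np.random.default_rng(0).normal(0, 0.05, (20, 2)),
               np.random.default_rng(1).normal(5, 0.05, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
```

With compact, distant clusters, a(x) is tiny relative to b(x), so the index approaches its maximum of 1.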
The Davies-Bouldin index (DB) [7] is an old but still widely used internal validation measure:

DB = \frac{1}{NC} \sum_{i} \max_{j \neq i} \left\{ \frac{\frac{1}{n_i} \sum_{x \in C_i} d(x, c_i) + \frac{1}{n_j} \sum_{x \in C_j} d(x, c_j)}{d(c_i, c_j)} \right\}    (8)

DB uses the intra-cluster variance and the inter-cluster center distance to find the worst partner cluster, i.e., the closest most-scattered one, for each cluster. Thus, minimizing DB gives us the optimal number of clusters. The Xie-Beni index (XB) [35] is defined as:

XB = \frac{\sum_{i} \sum_{x \in C_i} d^2(x, c_i)}{n \cdot \min_{i \neq j} d^2(c_i, c_j)}    (9)

Apparently, the smaller the value of XB, the better the clustering quality. Along with DB, XB has the form (Compactness)/(Separation), which is the opposite of CH, I, and D. Therefore, it reaches the optimum clustering by being minimized. It defines the inter-cluster separation as the minimum squared distance between cluster centers, and the intra-cluster compactness as the mean squared distance between each data object and its cluster center. In the following, we present more recent clustering validation measures. The SD validity index (SD) [12]:

SD = NC_{max} \cdot Scat(NC) + Dis(NC)    (10)

where NC_{max} is the maximum number of possible clusters, Scat(NC) = \frac{1}{NC} \sum_{i} \frac{\|\sigma(C_i)\|}{\|\sigma(D)\|}, and Dis(NC) = \frac{\max_{i,j} d(c_i, c_j)}{\min_{i,j} d(c_i, c_j)} \sum_{i} \left( \sum_{j} d(c_i, c_j) \right)^{-1}.

SD is composed of two terms: Scat(NC) stands for the scattering within clusters, and Dis(NC) stands for the dispersion between clusters. Like DB and XB, SD measures the compactness with the variance of clustered objects and the separation with the distance between cluster centers, but uses them in a different way. The smaller the value of SD, the better. A revised version of SD is S_Dbw [11]:

S\_Dbw = Scat(NC) + Dens\_bw(NC)    (11)

where

Dens\_bw(NC) = \frac{1}{NC(NC-1)} \sum_{i} \sum_{j \neq i} \frac{\sum_{x \in C_i \cup C_j} f(x, u_{ij})}{\max\left( \sum_{x \in C_i} f(x, c_i),\; \sum_{x \in C_j} f(x, c_j) \right)}, \qquad f(x, y) = \begin{cases} 0 & \text{if } d(x, y) > \tau, \\ 1 & \text{otherwise,} \end{cases}

u_{ij} is the middle point of c_i and c_j, τ is a threshold to determine the neighbors, approximated by the average standard deviation of the cluster centers: \tau = \frac{1}{NC} \sum_{i=1}^{NC} \|\sigma(C_i)\|, and Scat(NC) is the same as that of SD. S_Dbw takes the density into account to measure the separation between clusters. It assumes that, for a good clustering, for each pair of cluster centers at least one of their densities should be larger than the density of their midpoint. Both SD and S_Dbw indicate the optimal clustering when they are minimized.
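To make the two families concrete, here is an illustrative sketch (not from the paper) computing CH, which is maximized, and XB, which is minimized, on a crisp labeling; the CH numerator follows the standard Calinski-Harabasz definition with cluster-size weights:

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz (Eq. 4): between-cluster spread over within-cluster
    spread; larger is better."""
    g = X.mean(axis=0)                       # center of the whole dataset
    ks = np.unique(labels)
    n, nc = len(X), len(ks)
    parts = [X[labels == k] for k in ks]
    between = sum(len(p) * np.sum((p.mean(axis=0) - g) ** 2) for p in parts)
    within = sum(np.sum((p - p.mean(axis=0)) ** 2) for p in parts)
    return (between / (nc - 1)) / (within / (n - nc))

def xb_index(X, labels):
    """Xie-Beni (Eq. 9): within-cluster squared error over the minimum squared
    center distance; smaller is better."""
    ks = np.unique(labels)
    parts = [X[labels == k] for k in ks]
    centers = np.array([p.mean(axis=0) for p in parts])
    within = sum(np.sum((p - c) ** 2) for p, c in zip(parts, centers))
    min_sep = min(np.sum((centers[i] - centers[j]) ** 2)
                  for i in range(len(ks)) for j in range(i + 1, len(ks)))
    return within / (len(X) * min_sep)

# a good 2-cluster labeling vs. a bad (interleaved) one on the same data
X = np.array([[0, 0], [0.1, 0], [0, 0.1],
              [5, 5], [5.1, 5], [5, 5.1]], dtype=float)
good = np.array([0, 0, 0, 1, 1, 1])
bad = np.array([0, 1, 0, 1, 0, 1])
```

On this data the correct labeling gives a much larger CH and a much smaller XB than the interleaved one, illustrating the opposite optimization directions of the two families.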

4 Internal validation of stream clusterings

In this section, we evaluate the results of stream clustering algorithms with internal validation measures.

4.1 Robustness to conventional clustering aspects

The results of using internal evaluation measures for clustering static data with simple errors in [28] prove that the performance of the internal measures is affected by various aspects of the input data, i.e., noise, density of clusters, skewness, and subclusters. Each of the eleven discussed evaluation measures reacts differently to those aspects. We perform more complex experiments than the ones in [28], this time on stream clusterings, to see how the internal measures behave on real-time continuous data. We run the CluStream [1] clustering algorithm with different parameters, choose the optimal number of clusters according to the evaluation results, and compare it to the true number of clusters. According to [10], RMSSTD, RS and Γ have the property of monotonicity, and their curves will have either an upwards or a downwards tendency towards the optimum when we monotonically increase (or decrease) the number of clusters (or the parameter at hand). The optimal value for each of these measures is at the shift point of its curve, which is also known as the elbow. Streaming data usually has complex properties that occur at the same time. The experiments in [28], however, are limited to very simple toy datasets reflecting only one clustering aspect at a time. To make it more realistic, we use a data stream reflecting five conventional clustering aspects at the same time.

4.1.1 Experimental settings

To simulate streaming scenarios, we use the MOA (Massive Online Analysis) [4] framework. We have chosen the RandomRBFGenerator, which emits data instances continuously from a set of circular clusters, as the input stream generator (cf. Fig. 3). In this stream, we can specify the size, density, and moving speed of the instance-generating clusters, from which we can simulate the skewness, the different densities, and the subcluster aspect.
We set the parameters as follows: number of generating clusters = 5, radius of clusters = 0.11, their dimensionality = 10, varying range of cluster radius = 0.07, varying range of cluster density = 1, cluster speed = 0.01 per 200 points, noise level = 0.1, and noise does not appear inside clusters. The parameters which are not mentioned are not directly related to this experiment and are set to the default values of MOA. For the clustering algorithm, we have chosen CluStream [1] with k-means as its macro-clusterer. We vary the parameter k from 2 to 9, where the optimal number of clusters is 5. We set the evaluation frequency to 1000 points and run our stream generator till 30,000 points, which gives 30 evaluation results.

Fig. 3  A screenshot of Dimensions 1 and 2 of the synthetic data stream used in the experiment. Colored points represent the incoming instances, and the colors are faded out as the processing time passes. Ground truth cluster boundaries are drawn as black circles. Gray circles indicate the former state, expressing that the clusters are moving. Black (faded out to gray) points represent noise points (color figure online)

4.1.2 Results

Table 1 contains the mean values of the 30 evaluation results which we obtained over the whole streaming interval. It shows that RMSSTD, RS, CH, I, and S_Dbw correctly reach their optimal number of clusters, while the others do not. According to the results in [28], the optimal value of each of RMSSTD, RS, and Γ is difficult to determine. For this reason, we do not accept their results even if some of them show a good performance. In the static case in [28], CH and I were unable to find the right optimal number of clusters. CH is shown to be vulnerable to noise, since the noise inclusion (in cases when k < 5) makes the found clusters larger and less compact. However, in the streaming case, most clustering algorithms follow the online-offline-phases model.
The online phase removes a lot of noise when summarizing the data into microclusters, and the offline phase (k-means in the case of CluStream [1]) deals only with these cleaned summaries. Of course, there will always be a chance to get a summary that is completely formed of noise points, but those will have less impact on the final clustering than in the static case. Thus, since not all the noise points are integrated into the clusters, the amount of cluster expansion is a bit smaller than in the static case.

Table 1  Evaluation results of internal validation on the stream clusterings

k   RMSSTD  RS      Γ       CH    I       D       S       DB      XB      SD       S_Dbw
2   0.0998  0.7992  0.3522  3196  0.4980  0.1921  0.6535  0.5528  0.1065  4.5086   0.2284
3   0.0763  0.8593  0.3724  3619  0.5564  0.2208  0.6003  0.5782  0.1561  6.1623   0.1601
4   0.0621  0.9117  0.3834  3860  0.5840  0.0936  0.6143  0.5531  0.0932  6.7154   0.1251
5   0.0538  0.9330  0.3967  4157  0.6134  0.0669  0.5855  0.5656  0.1143  8.6382   0.1087
6   0.0528  0.9355  0.4007  3510  0.4945  0.0309  0.5200  0.6360  0.1845  11.1729  0.1319
7   0.0481  0.9464  0.4002  3435  0.4697  0.0042  0.4861  0.6610  0.2580  14.7443  0.1192
8   0.0463  0.9512  0.4007  3095  0.4001  0.0099  0.4617  0.6853  0.2977  16.8715  0.1338
9   0.0430  0.9580  0.4026  3154  0.3943  0.0000  0.4544  0.6913  0.3085  19.5362  0.1355

The best obtained values for each measure (not necessarily the maximum or the minimum) are in bold. The best values for RMSSTD, RS and Γ are selected as the first elbow in their monotonically increasing or decreasing curves (according to [10])

Therefore, the effect of noise on CH is smaller in the streaming case than in the static one. In the static case, I was slightly affected by the different densities of the clusters, and the reasons were not well revealed. Therefore, it is not surprising that I performs well when we take the average of its evaluation results over the whole streaming interval. The evaluation with D did not yield a very useful output, since it gives unconditional zero values at most evaluation points (before they are averaged as in Table 1). This is because the numerator of Eq. 6 can be zero when at least one pair of x and y happens to be equal, i.e., the distance between x and y is zero. This case arises when C_i and C_j overlap and the pair (x, y) is selected from the overlapped region. Streaming data has a high possibility of having overlapping clusters, and so does the input of this experiment. This drives D to produce zero, making it an unstable measure in streaming environments.
Similar to the static data case, S, DB, XB, and SD perform badly in the streaming settings. The main reason also lies in the overlapping of clusters. Overlapping clusters are the extreme case of the subcluster aspect in the experiments on the static case discussed in [28].

4.2 Sensitivity to common errors of stream clustering

In this section, we perform a more detailed study of the behaviors of internal measures in streaming environments. The previous experiment is more or less a general test on a single data stream, so here we use the internal clustering indices on a series of elaborately designed experiments which well reflect stream clustering scenarios. The MOA framework has an interesting tool called ClusterGenerator, which can produce a found clustering by manipulating ground truth clusters with a certain error level. It can simulate different kinds of error types and even combine them to construct complicated clustering error scenarios. It is very useful since we can test the sensitivity of evaluation measures to specific errors [27]. Evaluating a variation of the ground truth seems a bit awkward in the sense of internal validation, since it actually refers to the predefined result. However, this kind of experiment is absolutely meaningful, because we can watch the reactions of internal measures to some errors of interest. The authors of [27] used this tool to show the behavior of internal measures, e.g., S, Sum of Squares (SSQ), and the C-index. The error types exploited in [27] are limited, however, and those measures are not of our interest here, as they already proved to be bad in the previous experiments.

4.2.1 Experimental settings

Due to the drifting nature of streaming data, certain errors commonly appear for stream clustering algorithms. These errors are reflected by a wrong grouping of the drifting objects. The correct grouping of the objects is reflected in the original data set, where we assume that the real distribution of the objects (and thus the grouping) is previously known. This already-known assignment of the drifting objects to their correct clusters is called the ground truth.
The closer the output of a clusterng algorthm to ths ground truth, the better ts qualty. The above-mentoned errors are the devatons of the output of clusterng algorthms from the ground truth. A good valdaton measure should be able to evaluate the amounts of these errors correctly. In the case of nternal valdaton measures, ths should be possble even wthout accessng the ground truth. To obtan a controlled amount of ths error, a smulaton of a stream clusterng algorthm s embedded n the MOA framework [4]. Ths prevous explaned smulaton, called ClusterGenerator, allows the user to control the amount of devatons from the ground truth usng dfferent parameters. The ClusterGenerator has sx error types as ts parameters, and they effectvely reflect common errors of stream clusterngs. Radus ncrease and Radus decrease change the

Vietnam J Comput Sci (2017) 4:171–183

Fig. 4 Common errors of stream clusterings. A solid circle represents a true cluster, and a dashed circle indicates the corresponding found cluster (error). The cause of the error is the fast evolution of the stream in the direction of the arrows.

radius of clusters (Fig. 4a, b), which normally happens in stream clustering since data points keep fading in and out. In Fig. 4a, for instance, the ground truth is represented by the solid line, and the arrows represent the direction of the evolution of the data in the ground truth, where the cluster is shrinking. The dashed line represents, however, the output of the simulated clustering algorithm using the ClusterGenerator, which suffers from the Radius increase error. The same explanation applies to all other errors depicted in Fig. 4. Cluster add and Cluster remove change the number of found clusters; these errors are caused by grouping noise points or by falsely detecting meaningful patterns as a noise cloud. Cluster join merges two overlapping clusters into one cluster (Fig. 4c), which is a crucial error in streaming scenarios. Finally, Position offset changes the cluster position, which commonly happens due to the movement of clusters in data streams (Fig. 4d). We perform the experiments on all the above error types. We increase the level of one error at a time and evaluate its output with CH, I, and S_Dbw, which performed well in the previous experiment. For the input stream, we use the same stream settings as in the previous experiment (Sect. 4.1).

4.2.2 Results

In Fig. 5, the evaluation values are plotted on the y-axis against the corresponding error level on the x-axis. From Fig. 5a, we can see that the CH value decreases as the level of the Radius increase, Cluster add, Cluster join, and Position offset errors increases. CH correctly and consistently penalizes these four errors, since a smaller CH value corresponds to a worse clustering. However, it shows completely reversed curves for the Radius decrease and Cluster remove errors.
The reason for the wrong rewarding of the Radius decrease error is that reducing the size of clusters increases their compactness, and thus both CH and I increase. The Cluster remove error is a general detection problem for all internal measures, as they compare the clustering result only to itself. Apart from the Radius decrease and Cluster remove errors, CH has, in general, the best performance on streaming data compared to the other measures. We can see in Fig. 5b that using I results in a misinterpretation of the Radius decrease and Cluster remove error situations. The reason is similar to that of CH, since I also adopts the distance between objects within clusters as the intra-cluster compactness and the distance between cluster centers as the inter-cluster separation. In addition, I wrongly favors the Position offset error instead of penalizing it. If the boundaries of the found clusters are moved away from the truth, they often miss the data points, which produces a situation similar to Radius decrease, to which I is vulnerable. S_Dbw produces high values when it regards a clustering as a bad result, which is opposite to the previous two measures. In Fig. 5c, we can see that it correctly penalizes the three error types Radius increase, Cluster add, and Cluster join. For the Position offset error, one could say that the value is somewhat increasing, but the curve actually fluctuates too much. It also fails to penalize Cluster remove correctly. From these results, we conclude that among the discussed internal evaluation measures, CH is the best one at handling many common stream clustering errors. Even though S_Dbw performs very well on static data (cf. [28]) and on the streaming data in the previous experiments, we observed that it has a weak capability to capture the common errors of stream clustering.
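The wrong rewarding of the Radius decrease error can be reproduced on toy data: if the found clustering keeps only the points near each cluster centre (simulating shrunk found radii), within-cluster compactness rises and CH increases, even though the clustering is worse. The sketch below is a minimal plain-Python illustration under our own assumptions (two synthetic Gaussian clusters stand in for the paper's RBF stream; `shrink` is a hypothetical helper, not part of MOA):

```python
import random

def calinski_harabasz(points, labels):
    """Plain CH index: (B / (k-1)) / (W / (n-k)); higher means better."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    k = len(clusters)
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    B = W = 0.0
    for members in clusters.values():
        c = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        B += len(members) * sum((c[d] - overall[d]) ** 2 for d in range(dim))
        W += sum(sum((p[d] - c[d]) ** 2 for d in range(dim)) for p in members)
    return (B / (k - 1)) / (W / (n - k))

random.seed(0)
# two well-separated 2-d Gaussian clusters
data, labels = [], []
for cid, cx in [(0, 0.0), (1, 10.0)]:
    for _ in range(200):
        data.append((cx + random.gauss(0, 1), random.gauss(0, 1)))
        labels.append(cid)

ch_true = calinski_harabasz(data, labels)

def shrink(data, labels, keep=0.7):
    """Simulate the Radius decrease error: each found cluster retains only
    the fraction `keep` of points closest to its centre."""
    kept_d, kept_l = [], []
    for cid in set(labels):
        members = [p for p, l in zip(data, labels) if l == cid]
        c = [sum(p[d] for p in members) / len(members) for d in range(2)]
        members.sort(key=lambda p: (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2)
        for p in members[:int(keep * len(members))]:
            kept_d.append(p)
            kept_l.append(cid)
    return kept_d, kept_l

# CH rewards the erroneous (shrunk) clustering: ch_shrunk > ch_true
ch_shrunk = calinski_harabasz(*shrink(data, labels))
```

The same mechanism explains why I, which uses the same compactness notion, is fooled as well.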
5 Internal evaluation measures of density-based stream clustering algorithms

In this section, we evaluate the performance of internal stream clustering measures using a density-based stream clustering

Fig. 5 Experimental results for each error type. Evaluation values (y-axis) are plotted against each error level (x-axis). Some error curves are drawn on a secondary axis due to their range: a Radius decrease and Cluster remove, b Radius decrease, Cluster remove, and Position offset.

algorithm, namely DenStream [6]. We start by evaluating DenStream using external evaluation measures to obtain some kind of ground truth; then we compare the performance of the internal evaluation measures by how close they are to this ground truth. Similar to the previous section, we use the MOA (Massive Online Analysis) framework [4] for the evaluation. Again we have used the RandomRBFGenerator to create a 10-dimensional dataset of 30,000 objects forming 5 drifting clusters with different and varying densities and sizes. For DenStream [6] and MOA, we set the parameters as follows: the evaluation horizon = 1000, the outlier micro-cluster controlling threshold β = 0.15, the initial number of objects initPoints = 1000, the offline factor of ɛ compared to the online one = 2, the decaying factor λ = 0.25, and the processing speed of the evaluation = 100. The parameters which are not mentioned are not directly related to this experiment and are set to the defaults of MOA.
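The kind of stream produced by the RandomRBFGenerator — a fixed number of drifting Gaussian kernels with varying densities and sizes — can be approximated in a few lines. The sketch below is a simplified stand-in under our own assumptions, not MOA's actual implementation (function name, parameter ranges, and the drift model are ours):

```python
import random

def drifting_rbf_stream(n_points, dim=10, k=5, drift=0.001, seed=1):
    """Yield (point, cluster_id) pairs from k slowly drifting Gaussian kernels."""
    rng = random.Random(seed)
    centres = [[rng.uniform(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
    radii = [rng.uniform(0.05, 0.2) for _ in range(k)]    # varying cluster sizes
    weights = [rng.uniform(0.5, 1.5) for _ in range(k)]   # varying densities
    for _ in range(n_points):
        # pick a kernel with probability proportional to its weight
        cid = rng.choices(range(k), weights=weights)[0]
        point = [rng.gauss(c, radii[cid]) for c in centres[cid]]
        # smooth concept drift: every centre takes a small random step
        for c in centres:
            for d in range(dim):
                c[d] += rng.uniform(-drift, drift)
        yield point, cid

stream = list(drifting_rbf_stream(1000))
```

The `drift` step size controls how quickly the ground truth evolves, which is exactly what makes the errors of Fig. 4 common in practice.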

5.1 Deriving the ground truth using external evaluation measures

Internal evaluation measures do not benefit from the ground truth information provided in the form of cluster labels in our dataset. This was not a problem in the case of the k-means-based algorithm CluStream [1] discussed in Sect. 4.1, since the optimal parameter setting was simply k = 5, as we generated 5 clusters. In the case of the density-based stream clustering algorithm DenStream [6], this is not as straightforward. To obtain some kind of ground truth for a density-based stream clustering algorithm like DenStream, we used the results of some external evaluation measures to derive the parameter settings of the best and the worst clustering results. The following external evaluation measures were used. The first one is the F1 measure [3], a widely used external evaluation measure that harmonizes the precision and the recall of the clustering output. The second one is the purity measure, which is widely used [2,6,24] to evaluate the quality of a clustering. Intuitively, the purity can be seen as the pureness of the final clusters compared to the classes of the ground truth. The average purity is defined as follows:

$$\text{purity} = \frac{\sum_{i=1}^{NC} \frac{n_i^d}{n_i}}{NC} \qquad (12)$$

where NC represents the number of clusters, n_i^d denotes the number of objects with the dominant class label in cluster C_i, and n_i denotes the number of objects in cluster C_i. The third external evaluation measure is the number of clusters, which averages the previous numbers of clusters within the window H. Similarly, F1 and purity are computed over a certain predefined window H from the current time. This is done since the weights of the objects decay over time. Thus, the number of found clusters can be any real value, while F1 and purity can be any real value from 0 to 1.
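Eq. 12 can be computed directly from the per-cluster class labels. A minimal sketch (the toy clusters at the end are hypothetical, not from the paper's dataset):

```python
from collections import Counter

def average_purity(found_clusters):
    """Average purity (Eq. 12): mean over clusters of the fraction of
    objects carrying the dominant ground-truth class label."""
    total = 0.0
    for members in found_clusters:  # members = list of class labels in one cluster
        n_i = len(members)
        n_i_dominant = Counter(members).most_common(1)[0][1]
        total += n_i_dominant / n_i
    return total / len(found_clusters)

# toy example: first cluster is pure, second is 3-of-4 pure
p = average_purity([["a", "a", "a"], ["b", "b", "b", "a"]])  # (1.0 + 0.75) / 2 = 0.875
```

In the streaming setting this would be evaluated only over the objects inside the current window H, since older objects have decayed.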
5.1.1 Results

Table 2 contains the mean value of 5 evaluation results which we obtained over the whole streaming interval when considering the external evaluation measures F1, purity, and number of clusters for different settings of the μ and ɛ parameters of DenStream. The bold value in each column represents the best value of the index among the outputs of the used parameter settings: it is the highest value in the case of F1 and purity, and the value closest to 5 in the case of the number of clusters. The worst value in each column is underlined.

Table 2 Evaluation results of external validation on the stream clusterings

μ | ɛ    | F1     | Purity | Number of clusters
2 | 0.06 | 0.6454 | 1.00   | 4.7959
2 | 0.12 | 0.6250 | 0.9999 | 4.5918
2 | 0.18 | 0.5873 | 0.9095 | 4.0204
3 | 0.06 | 0.6220 | 1.00   | 4.6531
3 | 0.12 | 0.6139 | 1.00   | 4.5714
3 | 0.18 | 0.5857 | 0.9169 | 4.0816
4 | 0.06 | 0.6166 | 1.00   | 4.7143
4 | 0.12 | 0.6157 | 1.00   | 4.6735
4 | 0.18 | 0.5741 | 0.9233 | 4.1020

The best obtained values for each measure are in bold, the worst ones are underlined.

It can be seen from Table 2 that, among the selected 9 parameter settings, μ = 2 and ɛ = 0.18 results in the worst clustering output of DenStream over the current dataset, while μ = 2 and ɛ = 0.06 results in the best one. Figure 6 depicts the external evaluation measure values for these settings. Our task now is to find the internal evaluation measure that shows the highest correlation with this result.

5.2 The results of using internal evaluation measures for density-based stream clustering

Figures 7, 8 and 9 show the mean values of 5 evaluation results using all internal evaluation measures over the previous parameter settings. These results are summarized in Table 3, where RMSSTD, RS and Ɣ are directly excluded due to the subjective process of defining the first elbow in their monotonically increasing or decreasing curves (according to [10]). We obtained these results for the different selected parameter settings, and the final values are summarized from the measurements over the whole streaming interval.
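Picking the best setting per external measure from Table 2 is a simple argmax/argmin over its rows; the sketch below uses the table's values (purity ties at 1.00 are ignored, since F1 and the cluster count already discriminate):

```python
# Table 2 rows: (mu, epsilon, F1, purity, number_of_clusters)
table2 = [
    (2, 0.06, 0.6454, 1.00,   4.7959),
    (2, 0.12, 0.6250, 0.9999, 4.5918),
    (2, 0.18, 0.5873, 0.9095, 4.0204),
    (3, 0.06, 0.6220, 1.00,   4.6531),
    (3, 0.12, 0.6139, 1.00,   4.5714),
    (3, 0.18, 0.5857, 0.9169, 4.0816),
    (4, 0.06, 0.6166, 1.00,   4.7143),
    (4, 0.12, 0.6157, 1.00,   4.6735),
    (4, 0.18, 0.5741, 0.9233, 4.1020),
]

# highest F1 is best
best_f1 = max(table2, key=lambda r: r[2])
# the generator produced 5 clusters, so the best count is the one closest to 5
best_count = min(table2, key=lambda r: abs(r[4] - 5))
```

Both criteria agree on (μ = 2, ɛ = 0.06), which is why that setting serves as the "best" ground truth below.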
Table 3 shows that all the internal evaluation measures except for SD reach their worst values (underlined values) exactly at the setting (μ = 2 and ɛ = 0.18). This shows that the results of those internal measures are in line with those of the external ones w.r.t. punishing the worst setting. What is left now is to check which of those measures reaches its best value at the same setting where the external evaluation measures reach their best values (i.e., μ = 2 and ɛ = 0.06). It can be seen from Table 3 that none of the internal evaluation measures reaches its best value (in bold) at that parameter setting. We therefore calculate which of those internal evaluation measures has the highest (local) correlation between its

Fig. 6 External evaluation measures (y-axis) using different parameter settings of DenStream (x-axis).

Fig. 7 Performance of the internal evaluation measures RMSSTD, XB and S_Dbw (y-axis) using different parameter settings of DenStream (x-axis).

Fig. 8 Performance of the internal evaluation measures Ɣ, D and SD (y-axis) using different parameter settings of DenStream (x-axis).

best value V_best and the value calculated at the best ground-truth setting (μ = 2 and ɛ = 0.06), which we call V_truth. Let

$$V_{avg} = \frac{\sum_{s=1}^{9} V_s}{9} \qquad (13)$$

be the average of the values taken by each internal measure over each setting s of the 9 considered parameter settings. Our target is to find, out of the 7 winning internal measures in Table 3, the internal evaluation measure that achieves:

$$\min \left( \frac{\left| V_{best} - V_{truth} \right|}{\left| V_{best} - V_{avg} \right|} \right) \qquad (14)$$

Fig. 9 Performance of the internal evaluation measures RS and S (y-axis) using different parameter settings of DenStream (x-axis).

Table 3 Evaluation results of internal validation on the stream clusterings

μ | ɛ    | CH    | I      | D      | S      | DB     | XB     | SD     | S_Dbw
2 | 0.06 | 50911 | 58.711 | 4.2364 | 0.9441 | 0.1240 | 0.0032 | 2.6188 | 0.0041
2 | 0.12 | 54495 | 65.131 | 4.2261 | 0.9436 | 0.1297 | 0.0037 | 2.5212 | 0.0042
2 | 0.18 | 34451 | 37.131 | 2.7138 | 0.8753 | 0.2859 | 0.0557 | 2.3928 | 0.0749
3 | 0.06 | 66717 | 88.312 | 4.3258 | 0.9451 | 0.1038 | 0.0021 | 2.5936 | 0.0040
3 | 0.12 | 65504 | 84.805 | 4.4055 | 0.9447 | 0.1132 | 0.0027 | 2.5111 | 0.0041
3 | 0.18 | 40237 | 45.474 | 2.8965 | 0.8812 | 0.2640 | 0.0501 | 2.4014 | 0.0708
4 | 0.06 | 74205 | 102.57 | 4.4786 | 0.9462 | 0.0974 | 0.0019 | 2.6039 | 0.0038
4 | 0.12 | 72501 | 100.93 | 4.5413 | 0.9469 | 0.1059 | 0.0023 | 2.5455 | 0.0038
4 | 0.18 | 48553 | 60.397 | 3.3772 | 0.8873 | 0.2368 | 0.0382 | 2.3713 | 0.0648

The best obtained values (not necessarily the maximum or the minimum) are in bold.

Table 4 Testing the winning internal measures (i.e., those whose worst value in Table 3 matched the worst ground truth). The test aims to find which of those has the highest correlation between its best value V_best and its value at the best ground-truth setting V_truth.

Internal measure                        | CH      | I       | D       | S       | DB      | XB      | S_Dbw
|V_best − V_truth| / |V_best − V_avg|   | 1.30807 | 1.41513 | 0.48392 | 0.11849 | 0.40945 | 0.08125 | 0.01208

In other words, we are seeking the measure whose V_best has the smallest relative deviation from V_truth compared to its deviation from the mean. It should be noted that the simple tendency check in Eq. 14 is reliable: for a specific measure, minimizing the fraction in Eq. 14 implies that the numerator is considerably smaller than the denominator. Thus, the deviation of the measure from the ground truth V_truth is considerably smaller than the deviation from its own mean, which gives us some kind of guarantee that this correlation is strong enough.
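The selection in Table 4 then amounts to an argmin of the Eq. 14 ratio over the seven winning measures. Using the ratio values reported in Table 4:

```python
# Table 4: |V_best - V_truth| / |V_best - V_avg| per winning internal measure
ratios = {
    "CH": 1.30807, "I": 1.41513, "D": 0.48392, "S": 0.11849,
    "DB": 0.40945, "XB": 0.08125, "S_Dbw": 0.01208,
}

# the measure with the smallest ratio deviates least from the ground truth
winner = min(ratios, key=ratios.get)
```

A ratio below 1 also means the measure sits closer to the ground-truth value than to its own mean, so CH and I (ratios above 1) fail even this weaker criterion.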
As we are unable to find an internal measure whose V_best = V_truth, we perform this approximation to find the one with the closest tendency to make V_truth its V_best. Table 4 shows that S_Dbw has the highest correlation between its best value and the ground-truth value, as it has the smallest |V_best − V_truth| / |V_best − V_avg| value (highlighted in bold). This means that, among the tested internal evaluation measures, S_Dbw has shown the best results when considering the density-based stream clustering algorithm DenStream [6]. Similar to the static data case, CH, I, D, S, DB, and SD perform badly in these streaming settings. This differs from the k-means stream clustering case, where CH performed best. On the other hand, S_Dbw performed best, which is in line with the static case results reported in [28]. XB also worked well.

6 Conclusions and outlook

Evaluating clustering results is very important to the success of clustering tasks. In this article, we discussed the internal clustering validation scheme in both k-means-based and density-based stream clustering scenarios. This is much more efficient and easier to apply in the absence of any previous knowledge about the data than external validation. We explained the fundamental theories of internal validation measures together with examples. In the k-means-based case, we performed a set of clustering validation experiments that well reflect the properties of the streaming environment with five common clustering aspects at the same time. These aspects reflect monotonicity, noise, different densities of clusters, skewness, and the existence of subclusters in the underlying streaming data. The three winners of the first experimental evaluation were then further evaluated in a second phase of experiments, where the sensitivity of each of those three measures was tested w.r.t. six stream clustering errors. Differently from the results gained in a recent work on static data, our final experimental results on streaming data showed that the Calinski-Harabasz index (CH) [5] has, in general, the best performance in k-means-based streaming environments. It is robust to the combination of the five conventional aspects of clustering, and it also correctly penalizes the common errors in stream clustering. In the density-based case, we performed a set of experiments over different parameter settings using the DenStream [6] algorithm. We used external evaluation measures to extract some ground truth, which we used to define the best and the worst parameter settings. Then, we tested which of the internal measures has the highest correlation with the ground truth. Our results showed that the revised validity index S_Dbw [11] shows the best performance for density-based stream clustering algorithms. This is in line with the results reported over static data in [28]. Additionally, the Xie-Beni index (XB) [35] has also shown a good performance. In the future, we want to test those measures on different categories of advanced stream clustering algorithms like adaptive hierarchical density-based ones (e.g., HAStream [15]) or projected/subspace ones (e.g., PreDeConStream [24] and SubClusTree [14]).
Additionally, we want to evaluate the measures when streams of clusters available in subspaces [13] are processed by the above algorithms.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

1. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams. In: VLDB, pp. 81–92 (2003)
2. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for projected clustering of high dimensional data streams. In: VLDB, pp. 852–863 (2004)
3. Assent, I., Krieger, R., Müller, E., Seidl, T.: INSCY: indexing subspace clusters with in-process-removal of redundancy. In: Proceedings of the 8th IEEE International Conference on Data Mining, ICDM '08, pp. 719–724. IEEE (2008)
4. Bifet, A., Holmes, G., Pfahringer, B., Kranen, P., Kremer, H., Jansen, T., Seidl, T.: MOA: massive online analysis, a framework for stream classification and clustering. JMLR 11, 44–50 (2010)
5. Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Comm. Stat. 3(1), 1–27 (1974)
6. Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: SIAM SDM, pp. 328–339 (2006)
7. Davies, D., Bouldin, D.: A cluster separation measure. IEEE PAMI 1(2), 224–227 (1979)
8. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–38 (1977)
9. Dunn, J.: Well separated clusters and optimal fuzzy partitions. J. Cybern. 4(1), 95–104 (1974)
10. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. J. Intell. Inf. Syst. 17(2), 107–145 (2001)
11. Halkidi, M., Vazirgiannis, M.: Clustering validity assessment: finding the optimal partitioning of a data set. In: IEEE ICDM, pp. 187–194 (2001)
12. Halkidi, M., Vazirgiannis, M., Batistakis, Y.: Quality scheme assessment in the clustering process. In: PKDD, pp. 265–276 (2000)
13. Hassani, M., Kim, Y., Seidl, T.: Subspace MOA: subspace stream clustering evaluation using the MOA framework. In: DASFAA, pp. 446–449 (2013)
14. Hassani, M., Kranen, P., Saini, R., Seidl, T.: Subspace anytime stream clustering. In: SSDBM, p. 37 (2014)
15. Hassani, M., Spaus, P., Seidl, T.: Adaptive multiple-resolution stream clustering. In: MLDM '14, pp. 134–148 (2014)
16. Hassani, M.: Efficient clustering of big data streams. PhD thesis, RWTH Aachen University (2015)
17. Hassani, M., Kim, Y., Choi, S., Seidl, T.: Effective evaluation measures for subspace clustering of data streams. In: Trends and Applications in Knowledge Discovery and Data Mining, PAKDD 2013 International Workshops, pp. 342–353 (2013)
18. Hassani, M., Kim, Y., Choi, S., Seidl, T.: Subspace clustering of data streams: new algorithms and effective evaluation measures. J. Intell. Inf. Syst. 45(3), 319–335 (2015)
19. Hassani, M., Müller, E., Seidl, T.: EDISKCO: energy efficient distributed in-sensor-network k-center clustering with outliers. In: Proceedings of the 3rd International Workshop on Knowledge Discovery from Sensor Data, SensorKDD '09 @KDD '09, pp. 39–48. ACM (2009)
20. Hassani, M., Müller, E., Spaus, P., Faqolli, A., Palpanas, T., Seidl, T.: Self-organizing energy aware clustering of nodes in sensor networks using relevant attributes. In: Proceedings of the 4th International Workshop on Knowledge Discovery from Sensor Data, SensorKDD '10 @KDD '10, pp. 39–48. ACM (2010)
21. Hassani, M., Seidl, T.: Distributed weighted clustering of evolving sensor data streams with noise. J. Dig. Inf. Manag. (JDIM) 10(6), 410–420 (2012)
22. Hassani, M., Seidl, T.: Internal clustering evaluation of data streams. In: Trends and Applications in Knowledge Discovery and Data Mining, PAKDD 2015 Workshop: QIMIE, Revised Selected Papers, pp. 198–209 (2015)
23. Hassani, M., Spaus, P., Cuzzocrea, A., Seidl, T.: Adaptive stream clustering using incremental graph maintenance. In: Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine 2015 at KDD '15, pp. 49–64 (2015)
24. Hassani, M., Spaus, P., Gaber, M.M., Seidl, T.: Density-based projected clustering of data streams. In: Proceedings of the 6th International Conference on Scalable Uncertainty Management, SUM '12, pp. 311–324 (2012)
25. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
26. Kaufman, L., Rousseeuw, P.: Clustering by means of medoids. In: Statistical Data Analysis Based on the L1 Norm, pp. 405–416 (1987)
27. Kremer, H., Kranen, P., Jansen, T., Seidl, T., Bifet, A., Holmes, G., Pfahringer, B.: An effective evaluation measure for clustering on evolving data streams. In: ACM SIGKDD, pp. 868–876 (2011)
28. Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J.: Understanding of internal clustering validation measures. In: ICDM, pp. 911–916 (2010)
29. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press (1967)
30. Maulik, U., Bandyopadhyay, S.: Performance evaluation of some clustering algorithms and validity indices. IEEE PAMI 24, 1650–1654 (2002)
31. Rendón, E., Abundez, I., Arizmendi, A., Quiroz, E.M.: Internal versus external cluster validation indexes. Int. J. Comp. Comm. 5(1), 27–34 (2011)
32. Ramze Rezaee, M., Lelieveldt, B.B.F., Reiber, J.H.C.: A new cluster validity index for the fuzzy c-mean. Pattern Recogn. Lett. 19(3–4), 237–246 (1998)
33. Rousseeuw, P.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20(1), 53–65 (1987)
34. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison-Wesley Longman, Boston (2005)
35. Xie, X.L., Beni, G.: A validity measure for fuzzy clustering. IEEE PAMI 13(8), 841–847 (1991)