An Effective Clustering Mechanism for Uncertain Data Mining Using Centroid Boundary in UKmeans


2016 International Computer Symposium

An Effective Clustering Mechanism for Uncertain Data Mining Using Centroid Boundary in UKmeans

Kuan-Teng Liao, Chuan-Ming Liu
Computer Science and Information Engineering, National Taipei University of Technology, Taipei, TAIWAN

Abstract - Object errors affect the time cost and the effectiveness of uncertain data clustering. To decrease the time cost and increase the effectiveness, we propose two mechanisms for the centroid-based clustering algorithm UKmeans. The first mechanism is an improved similarity. Similarity is an intuitive factor that directly affects both the time cost and the effectiveness: similarity calculations based on integration favor the effectiveness of clustering but ignore the time cost, whereas simplified similarity calculations address the time cost but ignore the effectiveness. To consider both, we use a simplified similarity to reduce the time cost and add two further factors, the intersection and the density of clusters, to increase the effectiveness. The former increases the degree of belongingness of an object when a cluster overlaps it; the latter prevents objects from being attracted by clusters with large errors. The second mechanism is a new definition of the centroid boundary. In clustering, the position of a cluster centroid lies within a range averaged from the errors of its member objects, and a large range lowers the effectiveness of clustering. To shrink this range, we propose a square-root boundary that limits the upper bound of the possible centroid positions. The experimental results suggest that the two mechanisms perform well in both time cost and effectiveness and complement existing UKmeans approaches to uncertain data clustering.
Index Terms - Uncertain data clustering, similarity, cluster boundary, UKmeans

I. INTRODUCTION

UKmeans clustering approaches have been used to deal with uncertain objects for several years. Like other uncertain clustering approaches, UKmeans suffers from two imperfections, time cost and effectiveness, because it must consider object errors. Time cost is the common imperfection in uncertain data clustering (UDC): all possible positions of an uncertain cluster and an uncertain object should be considered when determining the belongingness of an object, so the clustering process is usually inefficient and time-consuming. Effectiveness is the other imperfection, since object errors often lead an object into a wrong cluster. In this study, we propose an improved similarity, UKmeans with maximum distance, density, and weighted intersection (KMDI_w), which considers both the time cost and the effectiveness, and we define a centroid boundary, the square-root boundary (SRB), which increases the effectiveness of clustering. These two mechanisms are introduced as follows.

KMDI_w extends the simplified similarity calculation in [7]. In that study, the belongingness of an object is determined by a ratio expected distance (ED), built from the minimum and maximum (minmax) distances, which replaces the ED calculated by integration. Compared with the integration-based ED, the ratio ED saves considerable calculation time because it ignores all possible positions of the object and the cluster and considers only the shortest and the farthest distance between them. Although the simplified similarity saves time, its effectiveness is sometimes low. In this study, two factors, the intersection and the cluster density, are added to the simplified similarity to increase the effectiveness of clustering.
The intersection factor is the first factor for increasing the effectiveness. It expresses the degree of overlap between an uncertain object and a cluster. In [1], the researchers showed that an object tends to belong to an overlapping cluster because both may be located in the overlapping area; using this factor, they obtained higher effectiveness than using the distance factor alone. In addition to the intersection factor, the cluster density guides the assignment when an object has the same distance and intersection degrees to different clusters. In certain (precise) data clustering, the cluster density is commonly used as a weight expressing the importance of a cluster. In uncertain clustering, however, such importance can produce a heavy cluster that holds most of the objects and attracts further objects more easily than the others, which reduces the effectiveness. In this study, a reversed cluster density is therefore used: if a cluster contains more objects than the others, it exerts a small centrifugal force that pushes objects away. Because this factor may cause objects to thrash between clusters, its effect should decay with the number of loops to ensure stable clustering results.

In addition to KMDI_w, the SRB is the other mechanism that we propose. It is used to describe the uncertainty of a centroid. In study [2], the researchers used a geometric boundary (GB), an average value of object errors, as the upper bound of the area of possible centroid positions. However, the GB is usually large because object errors are large. To tighten the boundary, we propose the SRB, which minimizes the size of the boundary and avoids clusters with large boundaries that cause excessive object overlapping. The SRB can therefore increase the effectiveness of clustering efficiently.

This study makes three contributions. First, we combine three factors that affect the clustering results to increase the effectiveness. Second, we inherit the concept of the simplified similarity to save clustering time. Last, we avoid the occurrence of large boundaries.

The paper is organized as follows. Section II briefly introduces the topics of similarity and boundary. Section III presents our proposed model for clustering uncertain data, and Section IV evaluates its performance and accuracy against different approaches. Section V concludes the paper.

II. RELATED WORK

UDC [1-6, 8] has been studied over the recent decade. Errors from objects and clusters affect both the effectiveness and the performance, and researchers have proposed many approaches against these errors. In the following, we briefly review improvements in the effectiveness term and then discuss the time cost term.

Similarity is an intuitive factor that affects the effectiveness, and its calculation differs across object representations. For example, the probability density function (pdf) is widely used to model a continuous uncertain object [8], while the errors-in-each-dimension (eied) representation, a simplified model constructed from the error in each dimension, is used in [5, 6].
In the pdf model, the similarity Sim is derived from the distance factor by integration, which evaluates the ED between a centroid C_c and an uncertain object Ō. The formula is shown in Formula 1, where f(x) is the pdf of Ō and d(.) is the Euclidean distance between a point x (x ∈ Ō) and the centroid:

    Sim(Ō, C_c) = ∫ f(x) d(x, C_c) dx    (1)

In addition to the distance factor, the researchers in [1] considered that the prioritized intersection factor can increase the effectiveness, and they showed that an uncertain object tends to belong to a cluster it intersects. Compared with the distance factor, the intersection factor was given higher priority (I_pr) when determining the belongingness of an object, because the positions of the object and of the cluster centroid may both fall in the overlapping area. Besides the pdf representation, computing similarity on eied-formatted objects, called the simplified similarity, is another popular approach. With an object center vector and the object error ψ(.) defined in Formula 2, where m is the dimensionality of an uncertain object Ō and O_{E_i} is the object error in dimension i, the similarity can be calculated easily:

    ψ(Ō) = sqrt( Σ_{i=1}^{m} O_{E_i}^2 )    (2)

Unlike the integration-based similarity, the simplified similarity considers only two kinds of distances, the minimum and the maximum, which are the lower and upper bounds of the actual distance (Dist_actual) between a cluster and an object. In [7], the researchers approximated Dist_actual with a ratio λ (0 ≤ λ ≤ 1) applied to the minimum distance MinDist(.) and the ratio (1 - λ) applied to the maximum distance MaxDist(.) between an uncertain cluster C̄ and an uncertain object Ō. According to their experiments, the maximum distance contributes more to the effectiveness of clustering than the minimum distance. The boundary of a centroid is another key that influences the effectiveness.
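As a concrete illustration of the eied error ψ (Formula 2) and the ratio ED of [7], the following Python sketch computes the minmax distances and the ratio expected distance from object and cluster centers and their error radii. The function names and the default λ are illustrative assumptions, not fixed by the paper.

```python
import math

def psi(errors):
    """Aggregate per-dimension errors of an eied object into one error
    radius, as in Formula 2: psi = sqrt(sum of squared errors)."""
    return math.sqrt(sum(e * e for e in errors))

def min_max_dist(center_o, psi_o, center_c, psi_c):
    """Lower and upper bounds on the actual distance between an uncertain
    object and an uncertain cluster, from their centers and error radii."""
    d = math.dist(center_o, center_c)
    return max(0.0, d - psi_o - psi_c), d + psi_o + psi_c

def ratio_expected_distance(center_o, psi_o, center_c, psi_c, lam=0.3):
    """Ratio ED of [7]: lam * MinDist + (1 - lam) * MaxDist, 0 <= lam <= 1."""
    lo, hi = min_max_dist(center_o, psi_o, center_c, psi_c)
    return lam * lo + (1.0 - lam) * hi
```

Because only two distances are evaluated, the cost per object-cluster pair is constant, whereas the integration of Formula 1 must visit many sample positions.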
It enhances the probability of object intersection, and the intersection factor in turn increases the effectiveness of clustering. In study [7], the researchers proposed the GB, derived from the errors of the objects in a cluster, to reflect the possible positions of a centroid; through the intersection factor, the GB increases the effectiveness of clustering.

The time cost is the other imperfection in UDC. For integration-based approaches, time is saved by reducing the number of ED calculations per object. Ngai et al. [6] proposed minmax distance pruning to decrease the ED calculations: with a minimum bounding rectangle (MBR) surrounding each object, the calculation for a cluster can be omitted if the minimum distance between the object and that cluster is farther than the maximum distance between the object and another cluster. Besides study [6], Lukic et al. [5] proposed a Voronoi diagram (VD) mechanism that further decreases the ED calculations and remedies an imperfection of minmax distance pruning. At the beginning, the VD is partitioned by the pairs of centroids into individual closed regions called Voronoi cells (VCs). The calculations for an object can be ignored if the object lies completely inside one VC; the belongingness of the remaining objects, which lie only partially in a VC, must still be determined by ED calculations. In addition to these two mechanisms, Ngai et al. [6] proposed the boundary of the distance (BD) mechanism, which uses the triangle inequality to reduce the number of ED calculations. The BD bound is shown in Equation 3, where Ō is an uncertain object, C_c is the centroid of a cluster, and y is a point in the MBR of Ō:

    ED(Ō, C_c) ≤ ED(Ō, y) + ED(C_c, y)    (3)

In the BD mechanism, the bound on ED(Ō, C_c) depends on (1) ED(Ō, y), which can be precomputed by integration with a fixed point y in the MBR of Ō, and (2) the Euclidean distance between the point y and the centroid, ED(C_c, y).
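The two time-saving ideas above, minmax pruning [6] and the BD bound of Equation 3, can be sketched in Python as follows. This is a simplified illustration with hypothetical helper names; the exact pruning bookkeeping in [5, 6] is more involved.

```python
import math

def minmax_prune(min_dists, max_dists):
    """Minmax pruning in the spirit of Ngai et al. [6]: cluster j needs no
    exact ED computation if its minimum distance to the object exceeds the
    smallest maximum distance to any cluster, since its ED can never be the
    smallest. Returns the indices of the clusters that survive pruning."""
    threshold = min(max_dists)
    return [j for j, lo in enumerate(min_dists) if lo <= threshold]

def bd_upper_bound(ed_o_y, y, centroid):
    """BD bound of Equation 3: ED(O, C_c) <= ED(O, y) + ED(C_c, y), where
    ED(O, y) is precomputed once by integration for a fixed point y in the
    object's MBR, and ED(C_c, y) is a plain Euclidean distance."""
    return ed_o_y + math.dist(y, centroid)
```

Only the surviving candidate clusters then pay for the expensive integration; the bound itself needs just one point-to-point distance per loop.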
Clearly, the BD mechanism saves time because ED(Ō, y) is calculated only once, in the first loop; much of the ED computation is therefore avoided. To decrease the time cost, some researchers instead use the simplified similarity to reduce the calculations. As mentioned previously, the simplified similarity uses the ratio of the minmax distances as the ED in place of the complex integration, so the similarity can be obtained far more efficiently.

III. PROPOSED APPROACHES

We propose two mechanisms in this study, the improved similarity KMDI_w and the SRB, to efficiently improve the time cost and the effectiveness. To consider both, KMDI_w builds on the concept of the simplified similarity and combines three factors that affect the effectiveness. Besides KMDI_w, we also propose a centroid boundary mechanism that affects the degree of intersection and thereby the effectiveness. We introduce the two mechanisms in turn.

Similarity is the common factor influencing both the time cost and the effectiveness in UDC. To decrease the time cost efficiently, the simplified similarity is appropriate because it considers only the minmax distances in the ED calculation. However, the effectiveness of the simplified similarity is lower than that of the integration-based one. To increase the effectiveness, we extend the simplified similarity with additional factors. The first is the intersection factor. As mentioned in Section II, the prioritized intersection factor can increase the effectiveness; however, it easily leads objects with large errors into wrong clusters. To avoid this situation, we use a weighted intersection instead of the prioritized intersection.
Unlike the prioritized intersection factor, the weighted intersection factor supplies a weight for a cluster when an object overlaps it. The weight expresses the degree of overlap, and its value lies between one and two, so the similarity increases only slightly when an overlap occurs. The intersection factor I(.) is defined in Formula 4, where d(.) is the Euclidean distance between the centers of a cluster and an object, and Cond = ψ(Ō) + ψ(C̄) - d(Ō, C̄) indicates whether the two overlap:

    I(Ō, C̄) = 1,                                             if Cond ≤ 0
    I(Ō, C̄) = 1 + (ψ(C̄) + ψ(Ō) - d(Ō, C̄)) / (2 ψ(Ō)),       if Cond > 0 and ψ(C̄) < d(Ō, C̄) + ψ(Ō)
    I(Ō, C̄) = 2,                                             if Cond > 0 and ψ(C̄) ≥ d(Ō, C̄) + ψ(Ō)    (4)

When an object is tangent to or does not overlap a cluster, the weighted intersection equals one, which implies no effect on the similarity. If the object and the cluster overlap without one completely covering the other, the weight equals 1 + (ψ(C̄) + ψ(Ō) - d(Ō, C̄)) / (2 ψ(Ō)), which reflects the degree of the intersection. The weight equals two when the cluster or the object completely contains the other.

The next factor is the cluster density. The density is commonly used to strengthen the importance of clusters that have special properties. In some situations, especially when an object has the same distance and intersection degree to several clusters, our density factor provides a slight reversed force that pushes the object toward the cluster with the lower density. This avoids the aggregation of most objects, especially objects with large errors, into one cluster. However, the density factor can also cause thrashing during clustering; to make the clustering results converge, its effect should decay with the number of loops.
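A direct Python transcription of the weighted intersection of Formula 4 might look as follows. Treating ψ as a single error radius per object and cluster is the paper's eied simplification; the function name is ours.

```python
import math

def weighted_intersection(center_o, psi_o, center_c, psi_c):
    """Weighted intersection factor of Formula 4. Returns 1 with no overlap
    (or tangency), 2 when the cluster boundary fully contains the object,
    and a value in (1, 2) that grows with the overlap depth otherwise."""
    d = math.dist(center_o, center_c)
    cond = psi_o + psi_c - d            # > 0 iff the two regions overlap
    if cond <= 0:
        return 1.0
    if psi_c >= d + psi_o:              # complete containment
        return 2.0
    return 1.0 + cond / (2.0 * psi_o)   # partial overlap, capped below 2
```

Bounding the weight by two keeps the intersection from dominating the similarity the way the prioritized intersection of [1] can.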
The density factor is shown in Formula 5, where l is the loop count and n is the number of objects belonging to the uncertain cluster C̄:

    M_smooth(C̄) = 1 + (1/n)^l    (5)

When the loop count grows large, M_smooth(C̄) → 1 except when n = 1. Combining these factors, the similarity between the two uncertain entities Ō and C̄ is given in Formula 6:

    Sim(Ō, C̄) = α · I(Ō, C̄) · M_smooth(C̄) / MaxDist(Ō, C̄)    (6)

where α is the coefficient weighing (1) the product of the intersection and density factors against (2) the distance. When α < 1, the distance factor contributes more than the product of the intersection and density; otherwise, the intersection and density factors are more important than the distance factor.

In addition to KMDI_w, we propose the boundary mechanism SRB to increase the effectiveness of clustering. In Section II, we introduced the definition of the GB, in which each point reflects a possible position of the centroid. In general, the GB is easily obtained from the object errors, with a radius R_GB formed from the average error of the uncertain objects. The definition of R_GB is shown in Formula 7:

    R_GB(C̄) = ( Σ_{k=1, Ō_k ∈ C̄}^{n} ψ(Ō_k) ) / n    (7)

When the objects in a cluster have large errors, R_GB is large and the cluster attracts objects too easily. To reduce the radius of the boundary, we use the SRB as the definition of the boundary uncertainty, which avoids large cluster boundaries. Unlike the GB, the SRB is computed from the square root of the sum of squared object errors and therefore yields a smaller boundary. The SRB radius is shown in Formula 8, and the proof follows:

    R_SRB(C̄) = sqrt( Σ_{k=1, Ō_k ∈ C̄}^{n} ψ(Ō_k)^2 ) / n    (8)
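The decaying density factor (Formula 5) and the two boundary radii (Formulas 7 and 8) are straightforward to compute; a small Python sketch, with illustrative function names, is given below.

```python
import math

def m_smooth(n, l):
    """Decaying density factor of Formula 5: 1 + (1/n)^l, where n is the
    cluster's object count and l the loop number. It shrinks toward 1 as
    the loops proceed (unless n == 1), so clustering can converge."""
    return 1.0 + (1.0 / n) ** l

def r_gb(psis):
    """Geometric boundary radius (Formula 7): average of object errors."""
    return sum(psis) / len(psis)

def r_srb(psis):
    """Square-root boundary radius (Formula 8): square root of the sum of
    squared errors, divided by n; never larger than r_gb, and equal to it
    only when the cluster holds a single object."""
    return math.sqrt(sum(p * p for p in psis)) / len(psis)
```

For example, two objects with errors 3 and 4 give R_GB = 3.5 but R_SRB = 2.5, matching the proof accompanying Formula 8.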

(a) Case: large intersection (b) Case: small intersection (c) Case: tangent (d) Case: no intersection
Fig. 1: Four cases of the areas of possible centroid locations, where Ō_1 and Ō_2 are uncertain objects, C_1 and C_2 are the centers of Ō_1 and Ō_2 respectively, the area surrounded by the red circle is the GB, and the area of the green circle is the SRB.

Proof.
    R_SRB(C̄) = (1/n) sqrt( Σ_{k=1, Ō_k ∈ C̄}^{n} ψ(Ō_k)^2 )
    (R_SRB(C̄))^2 = (1/n)^2 Σ_{k=1, Ō_k ∈ C̄}^{n} ψ(Ō_k)^2
                 ≤ (1/n)^2 ( Σ_{k=1, Ō_k ∈ C̄}^{n} ψ(Ō_k) )^2 = (R_GB(C̄))^2

Clearly, the GB and the SRB have the same size when the cluster contains only one object; otherwise, the SRB is smaller than the GB. To clearly show the sizes of the GB and the SRB, we discuss the boundaries formed by two objects in the four cases of Fig. 1a-1d. As shown in Fig. 1, assume Ō_1 and Ō_2 belong to the same cluster and have the same errors. The boundaries of the cluster can then be calculated directly from the individual definitions. Even though the intersection situations differ, both the GB and the SRB yield a fixed boundary size, and the size of the SRB is smaller than that of the GB.

Algorithm 1: Algorithm of KMDI_w
 1: Input: objects O = {o_1, o_2, o_3, ..., o_i, ..., o_n}
 2: Output: clusters C = {c_1, c_2, c_3, ..., c_k}, where each c_j contains objects
 3: define two variables, J and J'
 4: J: the sum of maximum similarities; J': the previous sum of maximum similarities
 5: repeat
 6:   J' ← J
 7:   for each object o_i and each cluster c_j do
 8:     calculate Sim(o_i, c_j)
 9:   select max Sim(o_i, c_j) and assign o_i to c_j
10:   add max Sim(o_i, c_j) to J
11:   update the SRB of the clusters
12: until |J - J'| ≈ 0

Combining KMDI_w with the SRB, the complete KMDI_w algorithm is shown as Algorithm 1.

IV. SIMULATIONS

We use two synthetic datasets and two real datasets to evaluate, in order, the time cost and the effectiveness of clustering with the different mechanisms.
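Before turning to the simulation details, the pieces of Section III can be combined into a runnable sketch of Algorithm 1 (Formulas 4-6 and 8). This is a hypothetical, simplified sketch: seeding from the first k objects, the convergence tolerance, and the running density counts are our assumptions, not details fixed by the paper.

```python
import math

def kmdi_w(objects, k, alpha=1.0, tol=1e-9, max_loops=50):
    """Sketch of Algorithm 1. Each object is a (center, psi) pair; each
    cluster keeps a centroid and an SRB radius (Formula 8). Similarity
    follows Formula 6: alpha * I * M_smooth / MaxDist, with the weighted
    intersection of Formula 4 and the decaying density of Formula 5."""
    centroids = [list(objects[j][0]) for j in range(k)]  # simple seeding
    radii = [0.0] * k                       # SRB radius of each cluster
    members = [[] for _ in range(k)]
    prev_j = -math.inf
    for loop in range(1, max_loops + 1):
        members = [[] for _ in range(k)]
        total = 0.0                         # J: sum of maximum similarities
        for center, psi_o in objects:
            best_j, best_sim = 0, -math.inf
            for j in range(k):
                d = math.dist(center, centroids[j])
                cond = psi_o + radii[j] - d
                if cond <= 0:
                    inter = 1.0             # tangent or disjoint
                elif radii[j] >= d + psi_o:
                    inter = 2.0             # boundary fully covers object
                else:
                    inter = 1.0 + cond / (2.0 * psi_o)
                n_j = max(len(members[j]), 1)
                decay = 1.0 + (1.0 / n_j) ** loop      # Formula 5
                sim = alpha * inter * decay / (d + psi_o + radii[j])
                if sim > best_sim:
                    best_sim, best_j = sim, j
            members[best_j].append((center, psi_o))
            total += best_sim
        for j in range(k):                  # recompute centroids, SRB radii
            if members[j]:
                n = len(members[j])
                dim = len(members[j][0][0])
                centroids[j] = [sum(c[i] for c, _ in members[j]) / n
                                for i in range(dim)]
                radii[j] = math.sqrt(sum(p * p for _, p in members[j])) / n
        if abs(total - prev_j) <= tol:      # until J - J' ~ 0
            break
        prev_j = total
    return centroids, members
```

On well-separated objects the sketch converges in a few loops; a full implementation would add the pruning techniques of Section II to cut the per-loop cost.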
In the time cost aspect, we measure the time for (1) the same number of objects and (2) the original number of objects under the different mechanisms. The former shows the time cost across various dimensions, whereas the latter shows the time spent on the different datasets.

Fig. 2: (a) The time cost of clustering one hundred objects with various mechanisms in the different datasets; (b) the time cost of the different datasets with various mechanisms.

In the effectiveness aspect, we first verify the correctness of clustering by three common external criteria, namely accuracy, F1 score, and purity. In addition, we analyze the effectiveness of the different boundaries. All simulations are programmed in C# and processed on an Intel(R) Core(TM) i GHz machine under Windows 7.

A. dataset

In the simulations, we use two synthetic datasets generated with MATLAB and two real datasets, iris and wine quality, provided by the UCI Machine Learning Repository 1. We generate a 2-dimensional and a 3-dimensional synthetic dataset with Gaussian distributions, where μ ranges over [4, 12] and σ over [0.2, 0.8]; each synthetic dataset contains four categories. Since the two synthetic datasets and the two real datasets consist of certain, precise values, we add Gaussian noise, with μ_err and σ_err around [0.15, 0.5] of the μ and σ of each original dimension, so that objects still belong to their original categories and the effectiveness can be measured. The information of the four datasets is shown in Table I.

1 UCI Machine Learning Repository. URL: Accessed:

Fig. 3: Effectiveness results on the various datasets. Panels (a)-(d) show the accuracy, (e)-(h) the F1 score, and (i)-(l) the purity on synthetic dataset I, synthetic dataset II, the iris dataset, and the wine quality dataset, respectively.

Fig. 4: Effectiveness results of the different boundaries on the various datasets, with the same panel layout as Fig. 3 (accuracy, F1 score, and purity on the four datasets).

TABLE I: The information of the datasets (the number of categories, the number of attributes, and the dataset size of artificial dataset I, artificial dataset II, the iris dataset, and the wine quality dataset).

B. compared methods

In the time cost aspect, we compare UKmeans with the popular mechanisms mentioned in Section II: the probability density function by integration (KP), maximum distance with geometric boundary (KMGB) [7], maximum distance with prioritized intersection and geometric boundary (KMI_pr GB) [1], and BD [6]. KP and BD calculate the similarity by integration. To keep their results as correct as possible, we construct 16 d slices per object in all datasets except the wine quality dataset, where d is the number of dimensions; for the wine quality dataset we adopt 8 d slices per object to save time. All object errors follow a uniform distribution. In the effectiveness aspect, we first compare the effectiveness of the aforementioned mechanisms; to simplify the calculations in KMDI_w, we set α to 1 in our simulations. In addition, we discuss the effectiveness of UKmeans with the two different boundaries, GB and SRB.

C. measurements

First, since the datasets contain different numbers of objects, we compare for fairness the time cost of clustering one hundred objects of each dataset under the various mechanisms. Clearly, KMGB spends the least time on all datasets. KMDI_w, the extended simplified similarity, spends more time than KMGB because of the additional density and weighted intersection calculations. The results are illustrated in Fig. 2a. Based on the results of clustering one hundred objects, the time cost on each full dataset is shown in Fig. 2b. The effectiveness results of clustering are shown in Fig. 3.
On the synthetic I, synthetic II, and iris datasets, KMDI_w obtains the highest accuracy, F1 score, and purity when the number of initial clusters K is smaller than the number of predefined categories. As the number of initial clusters increases, the accuracy of KMDI_w decreases because centroids with boundaries have a high probability of attracting other objects through intersections. On the wine quality dataset, all mechanisms show similar accuracy, F1 score, and purity because the number of dimensions is large. Besides the similarity, the boundary of the centroids is the other mechanism of concern. To compare the effect of the two boundaries, GB and SRB, we fix the number of initial clusters and measure the effectiveness of UKmeans with each; the two settings are abbreviated KMGB and KMSRB. The accuracy, F1 score, and purity of KMSRB are better than those of KMGB. The results are illustrated in Fig. 4.

V. CONCLUSION

We propose two mechanisms for clustering uncertain data in this study. First, we present a mechanism for modeling the cluster boundary. Previously, the GB was the only definition of the boundary; it honestly reflects the possible centroid locations induced by the errors of the member objects, but the clustering results with the GB are unfavorable when object errors are large. The SRB controls this effect through its smaller boundary and therefore provides stable clustering effectiveness. Next, we propose a mixed similarity mechanism that balances (1) clusters with large boundaries against (2) the fact that most objects do not absolutely belong to such clusters. Finally, we retain the low time cost of simplified clustering: although the time cost of our mechanisms cannot beat that of KMGB, they still cluster objects within a favorable time.

REFERENCES
[1] C. C. Aggarwal. On density based transforms for uncertain data mining. In Proc. IEEE 23rd International Conference on Data Engineering (ICDE). IEEE.
[2] C. C. Aggarwal and P. S. Yu. A framework for clustering uncertain data streams. In Proc. IEEE 24th International Conference on Data Engineering (ICDE). IEEE.
[3] B. Kao, S. D. Lee, D. W. Cheung, W.-S. Ho, and K. Chan. Clustering uncertain data using Voronoi diagrams. In Proc. Eighth IEEE International Conference on Data Mining (ICDM '08). IEEE.
[4] H.-P. Kriegel and M. Pfeifle. Density-based clustering of uncertain data. In Proc. Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM.
[5] I. Lukic, M. Kohler, and N. Slavek. Improved bisector pruning for uncertain data mining. In Proc. International Conference on Information Technology Interfaces (ITI). IEEE.
[6] W. K. Ngai, B. Kao, C. K. Chui, R. Cheng, M. Chau, and K. Y. Yip. Efficient clustering of uncertain data. In Proc. Sixth International Conference on Data Mining (ICDM '06). IEEE.
[7] Y. Peng, Q. Luo, and X. Peng. UCK-means: A customized k-means for clustering uncertain measurement data. In Proc. Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 2. IEEE, 2011.
[8] L. Xiao and E. Hung. An efficient distance calculation method for uncertain objects. In Proc. IEEE Symposium on Computational Intelligence and Data Mining (CIDM). IEEE.


More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

1. Use the Trapezium Rule with five ordinates to find an approximate value for the integral

1. Use the Trapezium Rule with five ordinates to find an approximate value for the integral 1. Use the Trapezium Rule with five ordinates to find an approximate value for the integral Show your working and give your answer correct to three decimal places. 2 2.5 3 3.5 4 When When When When When

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

Local Linear Approximation for Kernel Methods: The Railway Kernel

Local Linear Approximation for Kernel Methods: The Railway Kernel Local Linear Approximation for Kernel Methods: The Railway Kernel Alberto Muñoz 1,JavierGonzález 1, and Isaac Martín de Diego 1 University Carlos III de Madrid, c/ Madrid 16, 890 Getafe, Spain {alberto.munoz,

More information

Guidelines for proper use of Plate elements

Guidelines for proper use of Plate elements Guidelines for proper use of Plate elements In structural analysis using finite element method, the analysis model is created by dividing the entire structure into finite elements. This procedure is known

More information

Clustering uncertain data using voronoi diagrams and R-tree index. Kao, B; Lee, SD; Lee, FKF; Cheung, DW; Ho, WS

Clustering uncertain data using voronoi diagrams and R-tree index. Kao, B; Lee, SD; Lee, FKF; Cheung, DW; Ho, WS Title Clustering uncertain data using voronoi diagrams and R-tree index Author(s) Kao, B; Lee, SD; Lee, FKF; Cheung, DW; Ho, WS Citation Ieee Transactions On Knowledge And Data Engineering, 2010, v. 22

More information

GCSE Higher Revision List

GCSE Higher Revision List GCSE Higher Revision List Level 8/9 Topics I can work with exponential growth and decay on the calculator. I can convert a recurring decimal to a fraction. I can simplify expressions involving powers or

More information

On Indexing High Dimensional Data with Uncertainty

On Indexing High Dimensional Data with Uncertainty On Indexing High Dimensional Data with Uncertainty Charu C. Aggarwal Philip S. Yu Abstract In this paper, we will examine the problem of distance function computation and indexing uncertain data in high

More information

Chapter 4: Non-Parametric Techniques

Chapter 4: Non-Parametric Techniques Chapter 4: Non-Parametric Techniques Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Supervised Learning How to fit a density

More information

Constrained Clustering with Interactive Similarity Learning

Constrained Clustering with Interactive Similarity Learning SCIS & ISIS 2010, Dec. 8-12, 2010, Okayama Convention Center, Okayama, Japan Constrained Clustering with Interactive Similarity Learning Masayuki Okabe Toyohashi University of Technology Tenpaku 1-1, Toyohashi,

More information

Density Based Clustering using Modified PSO based Neighbor Selection

Density Based Clustering using Modified PSO based Neighbor Selection Density Based Clustering using Modified PSO based Neighbor Selection K. Nafees Ahmed Research Scholar, Dept of Computer Science Jamal Mohamed College (Autonomous), Tiruchirappalli, India nafeesjmc@gmail.com

More information

Clustering: Overview and K-means algorithm

Clustering: Overview and K-means algorithm Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin

More information

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York A Systematic Overview of Data Mining Algorithms Sargur Srihari University at Buffalo The State University of New York 1 Topics Data Mining Algorithm Definition Example of CART Classification Iris, Wine

More information

MODELING FOR RESIDUAL STRESS, SURFACE ROUGHNESS AND TOOL WEAR USING AN ADAPTIVE NEURO FUZZY INFERENCE SYSTEM

MODELING FOR RESIDUAL STRESS, SURFACE ROUGHNESS AND TOOL WEAR USING AN ADAPTIVE NEURO FUZZY INFERENCE SYSTEM CHAPTER-7 MODELING FOR RESIDUAL STRESS, SURFACE ROUGHNESS AND TOOL WEAR USING AN ADAPTIVE NEURO FUZZY INFERENCE SYSTEM 7.1 Introduction To improve the overall efficiency of turning, it is necessary to

More information

Math 7 Glossary Terms

Math 7 Glossary Terms Math 7 Glossary Terms Absolute Value Absolute value is the distance, or number of units, a number is from zero. Distance is always a positive value; therefore, absolute value is always a positive value.

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods S.Anusuya 1, M.Balaganesh 2 P.G. Student, Department of Computer Science and Engineering, Sembodai Rukmani Varatharajan Engineering

More information

Detecting Clusters and Outliers for Multidimensional

Detecting Clusters and Outliers for Multidimensional Kennesaw State University DigitalCommons@Kennesaw State University Faculty Publications 2008 Detecting Clusters and Outliers for Multidimensional Data Yong Shi Kennesaw State University, yshi5@kennesaw.edu

More information

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW ISSN: 9 694 (ONLINE) ICTACT JOURNAL ON COMMUNICATION TECHNOLOGY, MARCH, VOL:, ISSUE: WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW V Lakshmi Praba and T Vasantha Department of Computer

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information

Iterative random projections for high-dimensional data clustering

Iterative random projections for high-dimensional data clustering Iterative random projections for high-dimensional data clustering Ângelo Cardoso, Andreas Wichert INESC-ID Lisboa and Instituto Superior Técnico, Technical University of Lisbon Av. Prof. Dr. Aníbal Cavaco

More information

An Efficient Clustering Method for k-anonymization

An Efficient Clustering Method for k-anonymization An Efficient Clustering Method for -Anonymization Jun-Lin Lin Department of Information Management Yuan Ze University Chung-Li, Taiwan jun@saturn.yzu.edu.tw Meng-Cheng Wei Department of Information Management

More information

Using Natural Clusters Information to Build Fuzzy Indexing Structure

Using Natural Clusters Information to Build Fuzzy Indexing Structure Using Natural Clusters Information to Build Fuzzy Indexing Structure H.Y. Yue, I. King and K.S. Leung Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, New Territories,

More information

A Short SVM (Support Vector Machine) Tutorial

A Short SVM (Support Vector Machine) Tutorial A Short SVM (Support Vector Machine) Tutorial j.p.lewis CGIT Lab / IMSC U. Southern California version 0.zz dec 004 This tutorial assumes you are familiar with linear algebra and equality-constrained optimization/lagrange

More information

Level-set MCMC Curve Sampling and Geometric Conditional Simulation

Level-set MCMC Curve Sampling and Geometric Conditional Simulation Level-set MCMC Curve Sampling and Geometric Conditional Simulation Ayres Fan John W. Fisher III Alan S. Willsky February 16, 2007 Outline 1. Overview 2. Curve evolution 3. Markov chain Monte Carlo 4. Curve

More information

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Distance-based Outlier Detection: Consolidation and Renewed Bearing Distance-based Outlier Detection: Consolidation and Renewed Bearing Gustavo. H. Orair, Carlos H. C. Teixeira, Wagner Meira Jr., Ye Wang, Srinivasan Parthasarathy September 15, 2010 Table of contents Introduction

More information

Link Recommendation Method Based on Web Content and Usage Mining

Link Recommendation Method Based on Web Content and Usage Mining Link Recommendation Method Based on Web Content and Usage Mining Przemys law Kazienko and Maciej Kiewra Wroc law University of Technology, Wyb. Wyspiańskiego 27, Wroc law, Poland, kazienko@pwr.wroc.pl,

More information

Clustering: Overview and K-means algorithm

Clustering: Overview and K-means algorithm Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin

More information

AN IMPROVED DENSITY BASED k-means ALGORITHM

AN IMPROVED DENSITY BASED k-means ALGORITHM AN IMPROVED DENSITY BASED k-means ALGORITHM Kabiru Dalhatu 1 and Alex Tze Hiang Sim 2 1 Department of Computer Science, Faculty of Computing and Mathematical Science, Kano University of Science and Technology

More information

Spatial Outlier Detection

Spatial Outlier Detection Spatial Outlier Detection Chang-Tien Lu Department of Computer Science Northern Virginia Center Virginia Tech Joint work with Dechang Chen, Yufeng Kou, Jiang Zhao 1 Spatial Outlier A spatial data point

More information

ISSN: (Online) Volume 4, Issue 1, January 2016 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 4, Issue 1, January 2016 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 4, Issue 1, January 2016 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Year 8 Mathematics Curriculum Map

Year 8 Mathematics Curriculum Map Year 8 Mathematics Curriculum Map Topic Algebra 1 & 2 Number 1 Title (Levels of Exercise) Objectives Sequences *To generate sequences using term-to-term and position-to-term rule. (5-6) Quadratic Sequences

More information

Discrete geometry. Lecture 2. Alexander & Michael Bronstein tosca.cs.technion.ac.il/book

Discrete geometry. Lecture 2. Alexander & Michael Bronstein tosca.cs.technion.ac.il/book Discrete geometry Lecture 2 Alexander & Michael Bronstein tosca.cs.technion.ac.il/book Numerical geometry of non-rigid shapes Stanford University, Winter 2009 The world is continuous, but the mind is discrete

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Course: Geometry PAP Prosper ISD Course Map Grade Level: Estimated Time Frame 6-7 Block Days. Unit Title

Course: Geometry PAP Prosper ISD Course Map Grade Level: Estimated Time Frame 6-7 Block Days. Unit Title Unit Title Unit 1: Geometric Structure Estimated Time Frame 6-7 Block 1 st 9 weeks Description of What Students will Focus on on the terms and statements that are the basis for geometry. able to use terms

More information

Geometric Computations for Simulation

Geometric Computations for Simulation 1 Geometric Computations for Simulation David E. Johnson I. INTRODUCTION A static virtual world would be boring and unlikely to draw in a user enough to create a sense of immersion. Simulation allows things

More information

A Population Based Convergence Criterion for Self-Organizing Maps

A Population Based Convergence Criterion for Self-Organizing Maps A Population Based Convergence Criterion for Self-Organizing Maps Lutz Hamel and Benjamin Ott Department of Computer Science and Statistics, University of Rhode Island, Kingston, RI 02881, USA. Email:

More information

Improved Version of Kernelized Fuzzy C-Means using Credibility

Improved Version of Kernelized Fuzzy C-Means using Credibility 50 Improved Version of Kernelized Fuzzy C-Means using Credibility Prabhjot Kaur Maharaja Surajmal Institute of Technology (MSIT) New Delhi, 110058, INDIA Abstract - Fuzzy c-means is a clustering algorithm

More information

HOT asax: A Novel Adaptive Symbolic Representation for Time Series Discords Discovery

HOT asax: A Novel Adaptive Symbolic Representation for Time Series Discords Discovery HOT asax: A Novel Adaptive Symbolic Representation for Time Series Discords Discovery Ninh D. Pham, Quang Loc Le, Tran Khanh Dang Faculty of Computer Science and Engineering, HCM University of Technology,

More information

Distance based Clustering for Categorical Data

Distance based Clustering for Categorical Data Distance based Clustering for Categorical Data Extended Abstract Dino Ienco and Rosa Meo Dipartimento di Informatica, Università di Torino Italy e-mail: {ienco, meo}@di.unito.it Abstract. Learning distances

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Information Systems 36 (2011) Contents lists available at ScienceDirect. Information Systems

Information Systems 36 (2011) Contents lists available at ScienceDirect. Information Systems Information Systems 36 (211) 476 497 Contents lists available at ScienceDirect Information Systems journal homepage: www.elsevier.com/locate/infosys Metric and trigonometric pruning for clustering of uncertain

More information

Generating Decision Trees for Uncertain Data by Using Pruning Techniques

Generating Decision Trees for Uncertain Data by Using Pruning Techniques Generating Decision Trees for Uncertain Data by Using Pruning Techniques S.Vidya Sagar Appaji,V.Trinadha Abstract Current research techniques on data stream classification mainly focuses on certain data,

More information

Clustering on Uncertain Data using Kullback Leibler Divergence Measurement based on Probability Distribution

Clustering on Uncertain Data using Kullback Leibler Divergence Measurement based on Probability Distribution International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Clustering

More information

Aspects of Geometry. Finite models of the projective plane and coordinates

Aspects of Geometry. Finite models of the projective plane and coordinates Review Sheet There will be an exam on Thursday, February 14. The exam will cover topics up through material from projective geometry through Day 3 of the DIY Hyperbolic geometry packet. Below are some

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

Introduction to Computer Science

Introduction to Computer Science DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at

More information

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University

More information

CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS

CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS 4.1 Introduction Although MST-based clustering methods are effective for complex data, they require quadratic computational time which is high for

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

TOPIC LIST GCSE MATHEMATICS HIGHER TIER (Bold HIGHER TIER ONLY) Number Topic Red Amber Green

TOPIC LIST GCSE MATHEMATICS HIGHER TIER (Bold HIGHER TIER ONLY) Number Topic Red Amber Green TOPIC LIST GCSE MATHEMATICS HIGHER TIER (Bold HIGHER TIER ONLY) Number Order whole, decimal, fraction and negative numbers Use the symbols =,, Add, subtract, multiply, divide whole numbers using written

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Mapping Common Core State Standard Clusters and. Ohio Grade Level Indicator. Grade 5 Mathematics

Mapping Common Core State Standard Clusters and. Ohio Grade Level Indicator. Grade 5 Mathematics Mapping Common Core State Clusters and Ohio s Grade Level Indicators: Grade 5 Mathematics Operations and Algebraic Thinking: Write and interpret numerical expressions. Operations and Algebraic Thinking:

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

Ohio Tutorials are designed specifically for the Ohio Learning Standards to prepare students for the Ohio State Tests and end-ofcourse

Ohio Tutorials are designed specifically for the Ohio Learning Standards to prepare students for the Ohio State Tests and end-ofcourse Tutorial Outline Ohio Tutorials are designed specifically for the Ohio Learning Standards to prepare students for the Ohio State Tests and end-ofcourse exams. Math Tutorials offer targeted instruction,

More information

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases Jinlong Wang, Congfu Xu, Hongwei Dan, and Yunhe Pan Institute of Artificial Intelligence, Zhejiang University Hangzhou, 310027,

More information

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES 7.1. Abstract Hierarchical clustering methods have attracted much attention by giving the user a maximum amount of

More information

Sequences Modeling and Analysis Based on Complex Network

Sequences Modeling and Analysis Based on Complex Network Sequences Modeling and Analysis Based on Complex Network Li Wan 1, Kai Shu 1, and Yu Guo 2 1 Chongqing University, China 2 Institute of Chemical Defence People Libration Army {wanli,shukai}@cqu.edu.cn

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Superpixels Generating from the Pixel-based K-Means Clustering

Superpixels Generating from the Pixel-based K-Means Clustering Superpixels Generating from the Pixel-based K-Means Clustering Shang-Chia Wei, Tso-Jung Yen Institute of Statistical Science Academia Sinica Taipei, Taiwan 11529, R.O.C. wsc@stat.sinica.edu.tw, tjyen@stat.sinica.edu.tw

More information

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique Research Paper Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique C. Sudarsana Reddy 1 S. Aquter Babu 2 Dr. V. Vasu 3 Department

More information

Lifting Transform, Voronoi, Delaunay, Convex Hulls

Lifting Transform, Voronoi, Delaunay, Convex Hulls Lifting Transform, Voronoi, Delaunay, Convex Hulls Subhash Suri Department of Computer Science University of California Santa Barbara, CA 93106 1 Lifting Transform (A combination of Pless notes and my

More information

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Bumjoon Jo and Sungwon Jung (&) Department of Computer Science and Engineering, Sogang University, 35 Baekbeom-ro, Mapo-gu, Seoul 04107,

More information

Fuzzy Voronoi Diagram

Fuzzy Voronoi Diagram Fuzzy Voronoi Diagram Mohammadreza Jooyandeh and Ali Mohades Khorasani Mathematics and Computer Science, Amirkabir University of Technology, Hafez Ave., Tehran, Iran mohammadreza@jooyandeh.info,mohades@aut.ac.ir

More information

Revision of Inconsistent Orthographic Views

Revision of Inconsistent Orthographic Views Journal for Geometry and Graphics Volume 2 (1998), No. 1, 45 53 Revision of Inconsistent Orthographic Views Takashi Watanabe School of Informatics and Sciences, Nagoya University, Nagoya 464-8601, Japan

More information

Balanced Box-Decomposition trees for Approximate nearest-neighbor. Manos Thanos (MPLA) Ioannis Emiris (Dept Informatics) Computational Geometry

Balanced Box-Decomposition trees for Approximate nearest-neighbor. Manos Thanos (MPLA) Ioannis Emiris (Dept Informatics) Computational Geometry Balanced Box-Decomposition trees for Approximate nearest-neighbor 11 Manos Thanos (MPLA) Ioannis Emiris (Dept Informatics) Computational Geometry Nearest Neighbor A set S of n points is given in some metric

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

A GENTLE INTRODUCTION TO THE BASIC CONCEPTS OF SHAPE SPACE AND SHAPE STATISTICS

A GENTLE INTRODUCTION TO THE BASIC CONCEPTS OF SHAPE SPACE AND SHAPE STATISTICS A GENTLE INTRODUCTION TO THE BASIC CONCEPTS OF SHAPE SPACE AND SHAPE STATISTICS HEMANT D. TAGARE. Introduction. Shape is a prominent visual feature in many images. Unfortunately, the mathematical theory

More information

Chapter 14 Global Search Algorithms

Chapter 14 Global Search Algorithms Chapter 14 Global Search Algorithms An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Introduction We discuss various search methods that attempts to search throughout the entire feasible set.

More information

Unit Activity Correlations to Common Core State Standards. Geometry. Table of Contents. Geometry 1 Statistics and Probability 8

Unit Activity Correlations to Common Core State Standards. Geometry. Table of Contents. Geometry 1 Statistics and Probability 8 Unit Activity Correlations to Common Core State Standards Geometry Table of Contents Geometry 1 Statistics and Probability 8 Geometry Experiment with transformations in the plane 1. Know precise definitions

More information

C-NBC: Neighborhood-Based Clustering with Constraints

C-NBC: Neighborhood-Based Clustering with Constraints C-NBC: Neighborhood-Based Clustering with Constraints Piotr Lasek Chair of Computer Science, University of Rzeszów ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland lasek@ur.edu.pl Abstract. Clustering is

More information

Clustering from Data Streams

Clustering from Data Streams Clustering from Data Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting

More information

Lecture 2 September 3

Lecture 2 September 3 EE 381V: Large Scale Optimization Fall 2012 Lecture 2 September 3 Lecturer: Caramanis & Sanghavi Scribe: Hongbo Si, Qiaoyang Ye 2.1 Overview of the last Lecture The focus of the last lecture was to give

More information

Geometry Spring 2017 Item Release

Geometry Spring 2017 Item Release Geometry Spring 2017 Item Release 1 Geometry Reporting Category: Congruence and Proof Question 2 16743 20512 Content Cluster: Use coordinates to prove simple geometric theorems algebraically and to verify

More information

Shape spaces. Shape usually defined explicitly to be residual: independent of size.

Shape spaces. Shape usually defined explicitly to be residual: independent of size. Shape spaces Before define a shape space, must define shape. Numerous definitions of shape in relation to size. Many definitions of size (which we will explicitly define later). Shape usually defined explicitly

More information

Machine Learning for Signal Processing Clustering. Bhiksha Raj Class Oct 2016

Machine Learning for Signal Processing Clustering. Bhiksha Raj Class Oct 2016 Machine Learning for Signal Processing Clustering Bhiksha Raj Class 11. 13 Oct 2016 1 Statistical Modelling and Latent Structure Much of statistical modelling attempts to identify latent structure in the

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging

Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging 1 CS 9 Final Project Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging Feiyu Chen Department of Electrical Engineering ABSTRACT Subject motion is a significant

More information

THE discrete multi-valued neuron was presented by N.

THE discrete multi-valued neuron was presented by N. Proceedings of International Joint Conference on Neural Networks, Dallas, Texas, USA, August 4-9, 2013 Multi-Valued Neuron with New Learning Schemes Shin-Fu Wu and Shie-Jue Lee Department of Electrical

More information

Digital Image Stabilization and Its Integration with Video Encoder

Digital Image Stabilization and Its Integration with Video Encoder Digital Image Stabilization and Its Integration with Video Encoder Yu-Chun Peng, Hung-An Chang, Homer H. Chen Graduate Institute of Communication Engineering National Taiwan University Taipei, Taiwan {b889189,

More information

Conjectures concerning the geometry of 2-point Centroidal Voronoi Tessellations

Conjectures concerning the geometry of 2-point Centroidal Voronoi Tessellations Conjectures concerning the geometry of 2-point Centroidal Voronoi Tessellations Emma Twersky May 2017 Abstract This paper is an exploration into centroidal Voronoi tessellations, or CVTs. A centroidal

More information