Evolution-Based Clustering Technique for Data Streams with Uncertainty


Kasetsart J. (Nat. Sci.) 46 : (2012)

Wicha Meesuksabai*, Thanapat Kangkachit and Kitsana Waiyamai

ABSTRACT

The evolution-based stream clustering method supports the monitoring and change detection of clustering structures. This paper presented HUE-Stream, which extends E-Stream and E-Stream++ by introducing a distance function, cluster representation and histogram management for the different types of clustering structure evolution. Compared with UMicro and LuMicro, HUE-Stream produces higher clustering quality and is more robust over highly uncertain data streams; however, it requires longer processing time because it detects change in the clustering structure evolution too frequently (in every round). To improve the processing time, proper periods of clustering structure evolution change detection were determined. With these proper periods, the processing time was greatly improved, while retaining the clustering quality. Compared with the actual classes of data in the KDDCup 1999 network intrusion detection dataset, a comparable number of clusters was obtained in all stream progressions.

Keywords: data streams with uncertainty, heterogeneous data, heterogeneous attributes, clustering structure evolution detection, evolution-based clustering

INTRODUCTION

Recently, clustering data streams has become a research topic of growing interest. One main characteristic of data streams is an infinite, evolving structure that can be generated at a rapid rate. A stream clustering method that supports the monitoring and the change detection of clustering structures is called an evolution-based stream clustering method. Apart from an infinite data volume, data streams also contain errors or only partially complete information, called data uncertainty. This paper focused on developing an evolution-based stream clustering method that supports uncertainty in data.
Many techniques have been proposed for clustering data streams. Most research has focused on clustering techniques for numerical data (Aggarwal et al., 2003, 2004; Udommanetanakit et al., 2007; Aggarwal and Yu, 2008; Aggarwal, 2009). Few have been proposed to deal simultaneously with heterogeneous data streams, including numerical and categorical attributes (Yang and Zhou, 2006; Kosonpothisakun et al., 2009; Huang et al., 2010). However, very few have proposed to monitor and detect change in the clustering structures. E-Stream (Udommanetanakit et al., 2007) proposed an evolution-based clustering method for numerical data streams. E-Stream++ (Kosonpothisakun et al., 2009) extended it by integrating both numerical and categorical data streams. However, neither method supports uncertainty in data streams.

To support uncertainty in data streams, Aggarwal and Yu (2008) introduced the uncertain clustering feature for cluster representation and proposed a technique named UMicro. Later, they continued to study the problem of high dimensional projected clustering of uncertain data streams. However, both techniques still had low clustering quality (Aggarwal, 2009). The LuMicro technique (Chen et al., 2009) was proposed to improve clustering quality; it supports uncertainty in numerical attributes but not in categorical attributes. However, none of these works was proposed in the context of evolution-based stream clustering.

Meesuksabai et al. (2011) presented an evolution-based stream clustering technique called HUE-Stream that supports uncertainty in both numerical and categorical attributes. HUE-Stream extends the E-Stream technique (Udommanetanakit et al., 2007) and the E-Stream++ technique (Kosonpothisakun et al., 2009), an evolution-based stream clustering technique that integrates both numerical and categorical data streams without support for uncertainty. A distance function, cluster representation and histogram management were introduced for the different types of clustering structure evolution. A distance function with a probability distribution of two objects was introduced to support uncertainty in categorical attributes. To detect change in the clustering structure, the proposed distance function was used to merge clusters and to find the closest cluster of a given incoming data point, and the proposed histogram management was used to split clusters on categorical data. This paper further analyzes how HUE-Stream enhances the quality of clustering when compared with the state-of-the-art stream clustering methods that support data uncertainty.

Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok 10900, Thailand.
* Corresponding author, g @ku.ac.th
Received date : 24/01/12 Accepted date : 02/04/12
An efficient and effective clustering method must be able to group data with the same class into the same cluster. The number of resulting clusters should vary with the number of data classes rather than be bound by the threshold specifying the number of clusters. Further, a clustering method should support the different types of clustering structure evolution: appearance, disappearance, self-evolution, merge and split. These clustering structure evolution types have a direct impact on the efficiency and effectiveness of the stream clustering method. However, performing clustering structure evolution change detection in every round is time consuming. To avoid detection in every round, proper periods of clustering structure change detection need to be determined.

HUE-Stream was compared with UMicro and LuMicro to assess its clustering quality. The comparison was performed in terms of effectiveness (accuracy with progression of stream, accuracy with an increasing level of uncertainty) and efficiency (processing time and number of data points processed per second). HUE-Stream produces higher clustering quality and is more robust over highly uncertain data streams; however, it requires higher processing time. To improve the processing time, proper periods of clustering structure evolution change detection were determined. With these proper periods, processing time was greatly improved, while retaining clustering quality. Compared with the actual classes of data in the KDDCup 1999 network intrusion detection dataset, a comparable number of clusters was obtained in all stream progressions.

MATERIALS AND METHODS

Uncertain dataset

The KDD-CUP 99 benchmark dataset (a real-world dataset) in the UCI KDD archive was used. It contains 494,020 records and is composed of 34 numerical attributes and 7 categorical attributes. To evaluate the effectiveness of clustering algorithms over an uncertain data stream, attribute uncertainty was introduced into the KDD-CUP 99 dataset by converting each attribute value into a probability vector.

For numerical attributes, the conversion technique described by Chen et al. (2009) was used to simulate the discrete probability scenarios. Each numerical attribute in a given tuple can have several possible instances with different probabilities. The number of instances was generated using a random variable with uniform distribution on $[1, \tau]$, where $\tau$ is the maximum number of instances specified by the user. The probability for each instance is sampled from a corresponding Gaussian distribution satisfying $N(0, \delta)$. In addition, to ensure that the probabilities sum to 1, normalization was employed as the final step.

Categorical attributes were made uncertain by converting them into probability vectors using the approach described by Qin et al. (2009). For example, when introducing 10% uncertainty, an attribute takes its original value with 90% probability and any of the other values with 10% probability. Suppose that in the original accurate dataset $A_j = v_1$; then $v_1$ is assigned a probability of 90%, and the remaining values $v_2, \ldots, v_k$ are assigned probabilities that sum to 10%.

To compare the effectiveness of the HUE-Stream algorithm with the UMicro and LuMicro algorithms, all of these algorithms were implemented in C++ and run on a personal computer with a 3.2 GHz Intel Core i5 central processing unit and 8 GB of memory, running the Debian GNU/Linux 5.0.8 operating system.

Basic concepts of evolution-based stream clustering

The data stream consists of a set of $(n + c)$-dimensional tuples $X_1 \ldots X_k$ arriving at time stamps $T_1 \ldots T_k$. Each data point $X_i$ contains $n$ numerical attributes and $c$ categorical attributes, denoted by $X_i = (x_1, \ldots, x_n, x_{n+1}, \ldots, x_{n+c})$. The number of valid values for a categorical attribute $x_{n+k}$ is $V_k$, where $1 \le k \le c$, and the j-th valid value for $x_{n+k}$ is $V_{jk}$, where $1 \le j \le V_k$.

An isolated data point is a data point that is not a member of any cluster.
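As an aside on the data model, the categorical conversion described in the Uncertain dataset subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, and spreading the remaining probability mass evenly over the other valid values is an assumption, since the text only requires that the other values sum to the chosen uncertainty level. The protocol names are illustrative values.

```python
def categorical_to_probability_vector(value, valid_values, uncertainty):
    """Turn one observed categorical value into a probability vector.

    The observed value keeps probability (1 - uncertainty). The remaining
    mass is split evenly over the other valid values, which is an
    assumption made for this sketch.
    """
    if value not in valid_values:
        raise ValueError("observed value must be one of the valid values")
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty must lie in [0, 1]")

    others = [v for v in valid_values if v != value]
    if not others:
        # Only one valid value exists, so there is nothing to be uncertain about.
        return {value: 1.0}

    share = uncertainty / len(others)
    vector = {v: share for v in others}
    vector[value] = 1.0 - uncertainty
    return vector


# 10% uncertainty on an attribute with three valid values (illustrative).
vec = categorical_to_probability_vector("tcp", ["tcp", "udp", "icmp"], 0.10)
print(vec)
```

The returned dictionary always sums to 1, so no separate normalization step is needed for the categorical case.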
Isolated data remain in the system for cluster appearance computations.

An inactive cluster is a cluster that has a low weight. It can become an active cluster if its weight increases until it reaches the active cluster threshold.

An active cluster is a cluster that can assemble incoming data if there is a sufficient similarity score.

A cluster is a collection of data that has been memorized for processing in the system. It can be an isolated data point, an inactive cluster or an active cluster.

A fading function decreases the weight of data over time. In a data stream that has evolving data, older data should have less weight. Let $\lambda$ be the decay rate and $t$ be the elapsed time; then $f(t) = 2^{-\lambda t}$.

The weight of a cluster is the number of data elements in the cluster, determined according to the fading function. Initially, each data element has a weight of 1. A cluster can increase its weight by assembling incoming data points or by merging with other clusters.

Tuple-level and dimension-level uncertainty

Many categories of assumption exist for modeling the uncertainty in a data stream (Chen et al., 2009). This paper focused mainly on the discrete probability density function, which has been widely used and is easy to apply in practice. For each uncertain tuple $X_i$, its possible values in the d-th dimension can be defined by a probability distribution vector (Equation 1). The dimension-level uncertainty of the j-th dimension of a tuple $X_i$ is defined by Equation 2.

Let $X_i$ be a tuple. The tuple-level uncertainty of $X_i$, denoted by $U(X_i)$, is the average of its dimension-level uncertainties. Therefore, the uncertainty of all k-tuple data streams can be calculated as the average of their tuple-level uncertainties.

Cluster representation using fading cluster structure with histogram

A fading cluster structure with histogram (FCH) was introduced in E-Stream (Udommanetanakit et al., 2007). In the current paper, FCH is extended to support uncertainty in both numerical and categorical data. An uncertain cluster feature for a cluster C containing a set of tuples $X_i \ldots X_k$ arriving at time stamps $T_i \ldots T_k$ is defined as an FCH whose components are described below.

The first component is a vector of the weighted sum of tuple expectations for each dimension at time t.

The second component is a vector of the weighted sum of squares of tuple expectations for each dimension at time t.

$W(C,t)$ is the sum of all weights of data points in cluster C at time t.

$U(C,t)$ is the sum of all weights of tuple uncertainty in cluster C at time t.

$HN(t)$ is an α-bin histogram of numerical data values with α equal-width intervals; the l-th bin of the j-th numerical dimension of $HN(t)$ at time t is given by Equation 3.

The categorical histogram is a β-bin histogram of categorical data values with the top β accumulated probabilities of valid categorical values and their weighted frequency for each categorical dimension at time t. The histogram of the j-th categorical dimension of C is given by Equation 4, in which each entry pairs the a-th valid categorical value in the j-th categorical dimension with its accumulated probability (Equation 5) and its frequency, which is an accumulated weight. Note that these two quantities are used in calculating the distance between two sets of categorical data as described in the next section.
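The fading function and the faded cluster weight defined under Basic concepts can be illustrated with a small Python sketch. The function names and the list-of-arrival-times representation are hypothetical and chosen only for illustration; a real FCH would keep a running weight rather than the individual arrival times. The decay rate of 0.1 matches the decay_rate setting reported for the experiments.

```python
def fade(decay_rate, elapsed):
    """Fading function f(t) = 2^(-lambda * t)."""
    return 2.0 ** (-decay_rate * elapsed)


def cluster_weight(arrival_times, now, decay_rate):
    """Faded weight of a cluster at time `now`.

    Each data point starts with weight 1 on arrival and is then faded
    according to the time elapsed since it arrived.
    """
    return sum(fade(decay_rate, now - t) for t in arrival_times)


def fade_running_weight(weight, decay_rate, elapsed):
    """Incremental form: fade an already accumulated weight by `elapsed`.

    This is equivalent to re-summing the faded points because
    2^(-lambda * (a + b)) = 2^(-lambda * a) * 2^(-lambda * b).
    """
    return weight * fade(decay_rate, elapsed)


# Three points arriving at times 0, 5 and 10, evaluated at time 10.
w = cluster_weight([0, 5, 10], now=10, decay_rate=0.1)
print(w)
```

With a decay rate of 0.1, a point loses half of its weight every 10 time units, so a cluster that stops receiving data eventually drops below any fixed removal threshold.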
Distance functions

A distance function plays an important role in data clustering tasks. To deal with uncertainty in both categorical and numerical data, the current paper proposed new distance functions that take the uncertainty into account, described as follows.

The cluster-point distance is formulated as Equation 6. It combines two terms. The numerical term is the distance between the expected center of the numerical attributes in cluster C and the expected value of the j-th numerical attribute of a data point (Equation 7), whose parameters can be derived from the mean and standard deviation of the expected values of all data points in cluster C. The categorical term is the distance derived from categorical attributes (Equation 8, with auxiliary definitions in Equations 9 and 10).

The cluster-cluster distance likewise has two terms. The numerical term is the distance between the expected centers of the numerical attributes of the two clusters (Equation 11), and the categorical term is the distance derived from categorical attributes (Equation 12, with auxiliary definitions in Equations 13 and 14).

Evolution of stream clustering structure

To capture the evolving nature of data streams, the evolution-based stream clustering method is composed of different steps. In the beginning, incoming data are considered as isolated clusters. Then, a cluster is formed when a sufficiently dense region appears. An inactive cluster is changed to an active cluster when its weight reaches a given threshold. When a set of clusters has been identified, incoming data must be assigned to the closest cluster based on a similarity score. To detect change in the clustering structure, the following clustering evolutions are checked and handled: appearance, disappearance, self-evolution, merge and split. Appearance, self-evolution and merge evolution operations are determined using the appropriate distance function. The evolution types are described as follows.

Appearance: A new cluster can appear if there is a sufficiently dense group of data points in one area.
Initially, such elements appear as a group of outliers, but as more data appear in a neighborhood they are recognized as a cluster. Groups of dense data points that lie close to each other are identified by considering the point-point distance and the cluster-point distance as in Equation 11.

Disappearance: Existing clusters can disappear because the existence of data is diminished over time. Clusters that contain only old data fade and eventually disappear because they do not represent the presence of data. An existing cluster disappears when its data points have the least recent time stamp and its weight is less than the remove threshold.

Self-evolution: The characteristics of each active cluster can evolve over time because the weight of old data points fades, so that new data points can affect the characteristics of the cluster in a short time. To do this, the fading function and the distance between cluster and data point are used to find the closest cluster of a new data point, which will change the cluster characteristics.

Merge: A pair of clusters can be merged if their characteristics are very similar (two overlapping clusters). Merged clusters must cover the behavior of the pair by considering the distance between the two clusters.

Split: A cluster can be split into two smaller clusters if the behavior inside the cluster is obviously separated. Cluster split is based on the distribution of feature values as summarized by the cluster histogram. Thus, a histogram of cluster data values is utilized to identify cluster splits.

HUE-Stream algorithm

The HUE-Stream algorithm extended the E-Stream (Udommanetanakit et al., 2007) and E-Stream++ (Kosonpothisakun et al., 2009) algorithms to support uncertainty in a heterogeneous data stream. HUE-Stream supports the monitoring and the change detection of clustering structures evolving over time. Five types of clustering structure evolution are supported: appearance, disappearance, self-evolution, merge and split.

The main algorithm of HUE-Stream is given in Figure 1. In line 1, the algorithm starts by retrieving a new data point and labeling it with the current timestamp. In line 2, it fades the weights of all clusters. It is not necessary to check for a change of clustering structure at every incoming data point.
Therefore, lines 3 to 9 check for the change of clustering structure every T time periods. The process of cluster evolution detection starts with the deletion of any clusters having insufficient weight in line 4. Then, in line 5, it splits a cluster when the behavior inside the cluster is obviously separated. In line 6, it merges a pair of clusters for which the characteristics are very similar. In line 7, it checks the number of clusters and merges the closest pairs if the number of clusters exceeds the limit. In line 8, this process ends with scanning clusters in the system to find active clusters that have reached sufficient weight. Lines 10 to 16 find the closest cluster to contain the incoming data point with respect to the distance and uncertainty of that cluster. An isolated data point is created when there is no cluster to contain the new data point. The flow of control then returns to the top of the algorithm and waits for a new data point.

Algorithm HUE-Stream
1 retrieve new data X_i and label it with timestamp t_x,i
2 FadingAll
3 if (t_x,i mod T) = 0 then
4   DeleteCluster
5   CheckSplit
6   MergeOverlapCluster
7   LimitMaximumCluster
8   FlagActiveCluster
9 end if
10 (Uncertainty[], index[]) ← FindCandidateClosestCluster
11 if sizeof(index[]) > 0
12   index ← FindClosestCluster
13   add x_i to FCH_index
14 else
15   create new FCH from X_i
16 end if
17 waiting for new data

Figure 1 HUE-Stream algorithm.

In Figure 2, the details of each step are described.

FadingAll: This procedure performs fading of all the existing clusters in the system. New data points are the focus rather than old data points, so a fading function is used to decrease the weight of old data over time. When a cluster in the system has insufficient weight (FCH_i.w < fade_threshold), it is deleted from the system.

LimitMaximumCluster: This procedure is used to limit the number of clusters. It checks whether the number of clusters is greater than maximum_cluster. If the number of clusters exceeds this value, then the closest pair of clusters is merged until the number of remaining clusters is less than or equal to the threshold.

FlagActiveCluster: This procedure is used to check the status of the current active clusters. When the weight of any cluster is greater than or equal to active_threshold, it is flagged as an active cluster. Otherwise, the flag is cleared.

FindCandidateClosestCluster and FindClosestCluster: Both procedures are shown in Figure 2. They compute distances to find the indices of the candidate closest active clusters, and then choose the most appropriate candidate by computing the maximum change in uncertainty value (Equation 15).

MergeOverlapCluster: The merge procedure of HUE-Stream is given in Figure 2. This procedure scans all the clusters in the system for merging pairs of similar clusters.
If the cluster-to-cluster distance is less than the merge_threshold, then that pair of clusters is merged. The merged cluster should express the characteristics of the two similar clusters. As a consequence, the FCH of the merged cluster can be formulated as follows. Let $C_1$ and $C_2$ denote two clusters. $FCH(C_1 \cup C_2)$ can be calculated based on $FCH(C_1)$ and $FCH(C_2)$. The values of the entries $W(C,t)$ and $U(C,t)$ in the FCH are the sums of the corresponding entries in $FCH(C_1)$ and $FCH(C_2)$. To obtain the merged histogram of numerical data values, first the minimum and maximum value in each numerical dimension of the pair must be found. Then this range is divided into α intervals of equal length. Finally, the frequency of each merged interval is computed from the histograms of the pair. To calculate the merged histogram of categorical data values, the two sets of top β-bin categorical data values in each dimension of the pair are unioned. Then, the union set is ordered by frequency in descending order. Finally, only the top β-bin categorical data values are stored.

CheckSplit: The CheckSplit procedure of HUE-Stream is given in Figure 2. This procedure verifies the splitting condition of each cluster using the histogram; all attributes are checked to find the split-position. If the split-position occurs in a numerical or categorical attribute, the weight is recalculated based on the histogram of the splitting attribute. Then, the cluster representation of the new clusters is determined based on the calculated weight.

For numerical attributes (Figure 3), a valley that lies between two peaks of the histogram is considered as the splitting criterion. If a splitting valley is found at more than one point, the best splitting valley is the one with the minimum value. When the cluster splits, the histogram is split in that dimension and the other dimensions are weighted based on the split dimension. The splitting valley must be statistically significantly lower than the lower peak.

For categorical attributes, the splitting attribute is an attribute whose accumulated probability is significantly higher than that of others within the same cluster. The split-position is the position between a pair of adjacent values whose accumulated probabilities have the greatest difference, as shown in Figure 4. If the split-position occurs between the first and second bars, the cluster is not split into two smaller clusters because the first value is the only outstanding member in the top-β of the splitting attribute.

Procedure MergeOverlapCluster
1 for i ← 1 to |FCH|
2   for j ← i + 1 to |FCH|
3     overlap[i,j] ← dist(FCH_i, FCH_j)
4     m ← merge_threshold
5     if overlap[i,j] > m*(FCH_i.sd + FCH_j.sd)
6       if dist_cat[i,j] < m*mindist_cat(FCH_a, FCH_b)
7         if (i, j) not in S
8           merge(FCH_i, FCH_j)

Procedure CheckSplit
1 for i ← 1 to |FCH|
2   for j ← 1 to number of numerical attributes
3     find valley and peak
4     if chi-square test(valley, peak) > significance
5       split using numerical attribute
6   for j ← 1 to number of categorical attributes
7     find maximum difference of bin k, bin k+1
8     if chi-square test(bin k, bin k+1) > significance
9       split using categorical attribute
10  if split using only numerical or categorical
11    split FCH_i
12    S ← S ∪ {(i, |FCH|)}
13  else if numerical and categorical
14    (n1,n2) ← split using numerical
15    (c1,c2) ← split using categorical
16    if max(c1,c2) > max(n1,n2)
17      split FCH_i using categorical
18    if max(c1,c2) <= max(n1,n2)
19      split FCH_i using numerical
20    S ← S ∪ {(i, |FCH|)}

Procedure FindCandidateClosestCluster
1 for i ← 1 to |FCH|
2   if FCH_i is active cluster
3     dist[i] ← dist(FCH_i, x_i)
4     if dist[i] < radius_factor/4
5       set_candidate ← i
6 return set_candidate

Procedure FindClosestCluster
1 for i ← 1 to |set_candidate|
2   index_u[i, FCH_i.u] ← FCH_i.u
3   index[i, FCH_i.u] ← max_of_FCH_i.u
4 return i

Figure 2 MergeOverlapCluster, CheckSplit, FindCandidateClosestCluster and FindClosestCluster.

Figure 3 Histogram management in a split dimension and other dimension of numerical data (panels: split histogram, 1st split histogram, 2nd split histogram; rows: split dimension, other dimension).

Figure 4 Histogram management in a split dimension and other dimension of categorical data (panels: split histogram, 1st split histogram, 2nd split histogram; rows: split dimension, other dimension).

RESULTS AND DISCUSSION

Experimental setup

The performance of HUE-Stream was evaluated by comparison with UMicro and LuMicro in terms of effectiveness (accuracy with progression of stream, accuracy with increasing level of uncertainty), sensitivity (accuracy with varying number of clusters) and efficiency (processing time and number of data points processed per second). Parameter settings of the three algorithms are shown in Table 1.

Table 1 Parameter settings of HUE-Stream, UMicro and LuMicro algorithms.

Parameter        HUE-Stream   UMicro   LuMicro
stream_speed     100          100      100
horizon          2            2        2
decay_rate       0.1          0.1      0.1
radius_factor    4            3        3

Additional parameters listed in the table: remove_threshold 0.1, candidate_cluster 10, merge_threshold 1.25, active_threshold 5.

Effectiveness test with respect to number of clusters

The effectiveness of HUE-Stream, UMicro and LuMicro was evaluated in terms of purity and f-measure. More specifically, purity and f-measure were studied with regard to their sensitivity to the maximum number of clusters, a threshold that is set by the three algorithms. In Figures 5 and 6, the maximum number of clusters was increased by doubling as follows: 13, 26, 52 and 104. Figure 5 shows the purity and f-measure of the three methods with increasing progression of the streams. For these experiments, the time horizon was set to 2, the uncertainty level was set to 5% and T was set to 1. In terms of purity, comparable results were obtained, as purity is almost insensitive to the number of clusters for the three methods; however, this was not so for the f-measure. In terms of the f-measure, the experimental results showed that HUE-Stream outperforms UMicro and LuMicro for almost all of the maximum number of clusters thresholds.

Figure 5 Purity and f-measure with respect to different maximum numbers of clusters for (A) Purity of HUE-Stream; (B) f-measure of HUE-Stream; (C) Purity of UMicro; (D) f-measure of UMicro; (E) Purity of LuMicro; and (F) f-measure of LuMicro.
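To make the behavior of purity concrete, the following Python sketch computes an average cluster purity. It assumes one common definition (for each cluster, the fraction of points that belong to its dominant class, averaged over non-empty clusters); the paper does not spell out its exact formula here, so this definition, the function name and the toy labels are all assumptions for illustration.

```python
from collections import Counter


def average_purity(clusters):
    """Average purity over clusters.

    `clusters` is a list of clusters, each given as a list of the true
    class labels of its members. The purity of one cluster is the share
    of its members that carry the cluster's most frequent label.
    """
    non_empty = [labels for labels in clusters if labels]
    if not non_empty:
        raise ValueError("at least one non-empty cluster is required")

    total = 0.0
    for labels in non_empty:
        dominant_count = Counter(labels).most_common(1)[0][1]
        total += dominant_count / len(labels)
    return total / len(non_empty)


# One class kept together, the same class fragmented, and a mixed cluster.
one_cluster = [["normal"] * 6]
fragmented = [["normal"] * 2, ["normal"] * 2, ["normal"] * 2]
mixed = [["normal", "normal", "smurf", "smurf"]]

print(average_purity(one_cluster), average_purity(fragmented), average_purity(mixed))
```

Under this definition, splitting a single class across several small clusters leaves purity unchanged, whereas a measure that also rewards keeping a class together, such as the f-measure, would drop. That is one way to read why purity alone can look insensitive to the maximum number of clusters.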

11 648 Kasetsart J. (Nat. Sci.) 46(4) Figure 6 shows that the f-measure of UMicro and LuMicro had a tendency to decrease when increasing the maximum number of clusters. However, HUE-Stream was not sensitive to the increasing number of clusters because its number of clusters is derived from the behavior of the data streams. As long as the maximum number of clusters is not exceeded, HUE-Stream still yields good results. It is important to remember the fact that the purity score considers only the correctness of data points in each cluster and not how many of them are grouped all together within the same cluster (as does the f-measure). Figure 6 shows the number of actual classes compared to the number of clusters obtained by each algorithm in each stream progression. It can be clearly seen that HUE- Stream is independent of the maximum number of clusters threshold. HUE-Stream performed the data point grouping operation based on their real behavior. This is not the case with UMicro which is dependent on the maximum number of clusters threshold and in particular in the data stream range between 200,000 and 300,000, UMicro generates too many small clusters while HUE-Stream and LuMicro generate only one cluster and merge the overlap clusters. In the rest of the data streams, LuMicro produces significantly more clusters than HUE-Stream. This explains why HUE-Stream outperforms LuMicro in terms of the f-measure, since there is only one class in the actual data. Effectiveness test with respect to uncertainty level Facing uncertain data streams with different probability distributions, the clustering algorithm had to deal with the uncertainty of the record value. 
In order to evaluate the effectiveness and robustness of the three algorithms in integrating uncertain data streams, attribute uncertainty was performed by converting each of data streams Number of clusters in system Number of clusters in system Number of clusters in system Figure 6 Number of clusters at different maximum numbers of clusters for (A) UMicro; (B) LuMicro; and (C) Hue-Stream.

into probability vectors. Figure 7 shows purity and f-measure with increasing uncertainty levels. In these experiments, the maximum number of clusters for HUE-Stream, UMicro and LuMicro was set to 26, 100 and 100, respectively. In addition, T=1 was set for HUE-Stream to allow the detection of change in the clustering structure at every incoming data point. Purity and f-measure clearly decreased as the level of uncertainty increased. However, HUE-Stream still produced a better f-measure than LuMicro and UMicro.

Figure 7 Purity (A) and f-measure (B) for the three algorithms studied with increasing uncertainty level. (Axes: average purity and average f-measure against number of uncertainty levels.)

Processing time enhancement

Figure 8 shows the processing time for the three algorithms, all of which exhibit a linear relationship between the runtime and the number of data points. In this experiment, the parameter settings were as follows. The maximum number of clusters for HUE-Stream, LuMicro and UMicro was set to 26, 100 and 100, respectively. The uncertainty level was set to 5% and T=1. Although HUE-Stream produced the best clustering quality and was more robust with highly uncertain data streams, it required a longer processing time than UMicro and LuMicro. This was because HUE-Stream checked for change in the clustering structure evolution too frequently (in every round). Thus, proper periods of clustering structure change detection needed to be determined.

Figure 8 Processing time with progression of streams. (Axes: runtime (s) against data stream (points x10,000).)

To improve the processing time, the aim was to detect the clustering structure evolution at proper periods rather than in every round. Wan and Wang (2010) proposed the proper period of checking to be $T = \left[ \frac{1}{\lambda} \log\left( \frac{\varepsilon}{\varepsilon - 1} \right) \right]$, where λ corresponds to a decay factor in the fading function and ε is a remove-threshold. Therefore, the experiments were carried out by varying the periods in multiples of data points in each period (time horizon stream speed = 200). Figure 9 shows that the processing time and processing rate obtained with T = 200, 400, 600, 800 and 1,000 were significantly better than with T=1. Varying T from 200 to 1,000 made little difference to the processing time and processing rate.

Figure 10 shows the purity and f-measure obtained with different checking periods. In terms of purity, all periods generated results comparable with T=1 (almost equal to 1.0). In terms of f-measure, at a stream progression between 50,000 and 200,000, all the periods generated almost the same result (same number of clusters), as shown in Figure 11. In contrast, at a stream progression between 200,000 and 300,000, only the period T=1 captured the change in clustering structure and generated only one cluster, whereas the other periods maintained the number of clusters generated in the previous stream progression. In the rest of the stream progression, all the periods generated the same f-measure except the period T = 200, which also retained the same number of clusters. In summary, the proper periods for clustering structure evolution change detection were T=400, 600 and 800.

Figure 9 Processing time (A) and processing rate (B) of HUE-Stream with respect to different periods of clustering evolution detection. (Axes: runtime (s) and number of data points/s against data stream (points x10,000).)

Figure 10 Purity (A) and f-measure (B) with respect to different periods of clustering evolution detection. (Axes: average purity and average f-measure.)

CONCLUSION

The uncertainty in the data stream significantly affects the clustering structure. A HUE-Stream algorithm was proposed for clustering
heterogeneous data streams with uncertainty. A distance function, cluster representation and histogram management were introduced to support the different types of clustering structure evolution, namely appearance, disappearance, self-evolution, merge and split. HUE-Stream was compared against UMicro and LuMicro on a real-world dataset in terms of effectiveness, sensitivity and efficiency. The HUE-Stream algorithm outperformed UMicro and LuMicro in terms of f-measure, and had a comparable purity score. While UMicro and LuMicro were sensitive to the input parameter specifying the number of clusters, HUE-Stream was robust and able to determine almost the exact number of clusters based on the behavior of the data stream and the number of clusters in the system. HUE-Stream produced higher clustering quality and was more robust over highly uncertain data streams; however, it required longer processing time. To improve the processing time, proper periods of clustering structure evolution change detection were determined. With these proper periods, the processing time was greatly improved, while the clustering quality was retained. Compared with the actual classes of the data, a comparable number of clusters was obtained in all stream progressions.

Figure 11 Number of clusters with progression of stream and different periods of checking evolutions. (Vertical axis: number of clusters in system.)

LITERATURE CITED

Aggarwal, C.C. On high dimensional projected clustering of uncertain data streams, pp In Proceedings of the 9th International Conference on Data Engineering. Shanghai, China.
Aggarwal, C.C., J. Han, J. Wang and P.S. Yu. A framework for clustering evolving data streams, pp In Proceedings of the 29th International Conference on Very Large Data Bases. Berlin, Germany.
Aggarwal, C.C., J. Han, J. Wang and P.S. Yu. A framework for projected clustering of high dimensional data streams, pp In Proceedings of the 13th International Conference on Very Large Data Bases. Toronto, Canada.
Aggarwal, C.C. and P.S. Yu. A framework for clustering uncertain data streams, pp In Data Engineering, Proceedings of the 8th International Conference on Data Engineering. Cancun, Mexico.
Chen, Z., G. Ming and Z. Aoying. Tracking high quality clusters over uncertain data streams, pp In Proceedings of the 9th International Conference on Data Engineering. Shanghai, China.

Huang, G.Y., D.P. Liang, C.Z. Hu and J.D. Ren. An algorithm for clustering heterogeneous data streams with uncertainty, pp In Proceedings of International Conference on Machine Learning and Computing. Qingdao, China.
Kosonpothisakun, P., T. Kangkachit and K. Waiyamai. E-Stream++: Stream clustering technique for supporting numerical and categorical data, pp In Proceedings of the 13th National Computer Science and Engineering Conference. Bangkok, Thailand.
Meesuksabai, W., T. Kangkachit and K. Waiyamai. HUE-Stream: Evolution-based clustering technique for heterogeneous data streams with uncertainty, pp In Proceedings of the 7th International Conference on Advanced Data Mining and Applications. Beijing, China.
Qin, B., Y. Xia, S. Prabhakar and Y. Tu. A rule-based classification algorithm for uncertain data, pp In Proceedings of the 9th International Conference on Data Engineering. Shanghai, China.
Udommanetanakit, K., T. Rakthanmanon and K. Waiyamai. E-Stream: Evolution-based technique for stream clustering, pp In Proceedings of the 3rd International Conference on Advanced Data Mining and Applications. Harbin, China.
Wan, R. and L. Wang. 2010. Clustering over evolving data stream with mixed attributes. Journal of Computational Information Systems
Yang, C. and J. Zhou. HClustream: A novel approach for clustering evolving heterogeneous data stream, pp In Proceedings of the 6th IEEE International Conference on Data Mining Workshops. Hong Kong, China.
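
The processing-time experiments contrast checking for clustering structure evolution in every round (T=1) with checking once every T data points (T = 200 to 1,000). The following is a minimal sketch of that gating only, not the authors' implementation. The names `run_with_period`, `count_checks`, `update` and `detect_changes` are hypothetical: `update` stands in for the per-point cluster update and `detect_changes` for the evolution checks (appearance, disappearance, self-evolution, merge and split), both stubbed out here.

```python
from typing import Callable, Iterable


def run_with_period(points: Iterable[object],
                    period: int,
                    update: Callable[[object], None],
                    detect_changes: Callable[[int], None]) -> int:
    """Feed each point to `update`; call `detect_changes` once every `period` points.

    Returns the number of times change detection ran.
    """
    if period < 1:
        raise ValueError("period must be a positive integer")
    checks = 0
    for index, point in enumerate(points, start=1):
        update(point)               # per-point work (hypothetical placeholder)
        if index % period == 0:     # period == 1 means a check in every round
            detect_changes(index)
            checks += 1
    return checks


def count_checks(n_points: int, period: int) -> int:
    """Count detection passes over a synthetic stream of `n_points` points."""
    return run_with_period(range(n_points), period,
                           update=lambda point: None,
                           detect_changes=lambda index: None)
```

Under these assumptions, a stream of 10,000 points triggers 10,000 detection passes with T=1 but only 50 with T=200, which illustrates why less frequent checking reduces the processing time. The trade-off is that a structural change can go unnoticed for up to T points.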
