Evolution-Based Clustering Technique for Data Streams with Uncertainty


Kasetsart J. (Nat. Sci.) 46 : (2012)

Wicha Meesuksabai*, Thanapat Kangkachit and Kitsana Waiyamai

ABSTRACT

The evolution-based stream clustering method supports the monitoring and change detection of clustering structures. This paper presented HUE-Stream, which extends E-Stream and E-Stream++ by introducing a distance function, cluster representation and histogram management for the different types of clustering structure evolution. Compared with UMicro and LuMicro, HUE-Stream produces higher clustering quality and is more robust over highly uncertain data streams; however, it requires longer processing time because it checks for changes in the clustering structure in every round. To improve the processing time, proper periods of clustering structure evolution change detection were determined. With these proper periods, the processing time was greatly improved, while retaining the clustering quality. Compared with the actual classes of data in the KDDCup 1999 network intrusion detection dataset, a comparable number of clusters was obtained in all stream progressions.

Keywords: data streams with uncertainty, heterogeneous data, heterogeneous attributes, clustering structure evolution detection, evolution-based clustering

INTRODUCTION

Recently, clustering data streams has become a research topic of growing interest. One main characteristic of data streams is an infinite, evolving structure that can be generated at a rapid rate. A stream clustering method that supports the monitoring and the change detection of clustering structures is called an evolution-based stream clustering method. Apart from an infinite data volume, data streams also contain errors or only partially complete information, called data uncertainty. This paper focused on developing an evolution-based stream clustering method that supports uncertainty in data.
Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok 10900, Thailand.
* Corresponding author, g @ku.ac.th
Received date : 24/01/12 Accepted date : 02/04/12

Many techniques have been proposed for clustering data streams. Most research has focused on clustering techniques for numerical data (Aggarwal et al., 2003, 2004; Udommanetanakit et al., 2007; Aggarwal and Yu, 2008; Aggarwal, 2009). Few techniques deal simultaneously with heterogeneous data streams, which include both numerical and categorical attributes (Yang and Zhou, 2006; Kosonpothisakun et al., 2009; Huang et al., 2010). Moreover, very few monitor and detect change in the clustering structures. E-Stream (Udommanetanakit et al., 2007) proposed an evolution-based clustering method for numerical data streams. E-Stream++ (Kosonpothisakun et al., 2009) extended it by

integrating both numerical and categorical data streams. However, neither method supports uncertainty in data streams. To support uncertainty in data streams, Aggarwal and Yu (2008) introduced the uncertain clustering feature for cluster representation and proposed a technique named UMicro. Later, they continued to study the problem of high-dimensional projected clustering of uncertain data streams. However, both techniques still had low clustering quality (Aggarwal, 2009). The LuMicro technique (Chen et al., 2009) was proposed to improve clustering quality; it supports uncertainty in numerical attributes but not in categorical attributes. However, none of these works was proposed in the context of evolution-based stream clustering.

Meesuksabai et al. (2011) presented an evolution-based stream clustering technique called HUE-Stream that supports uncertainty in both numerical and categorical attributes. HUE-Stream was extended from the E-Stream technique (Udommanetanakit et al., 2007) and the E-Stream++ technique (Kosonpothisakun et al., 2009), the latter being an evolution-based stream clustering technique that integrates both numerical and categorical data streams without support for uncertainty. A distance function, cluster representation and histogram management were introduced for the different types of clustering structure evolution. A distance function based on the probability distributions of two objects was introduced to support uncertainty in categorical attributes. To detect change in the clustering structure, the proposed distance function was used to merge clusters and to find the closest cluster of a given incoming data point, and the proposed histogram management was used to split clusters in categorical data. This paper further analyzes how HUE-Stream enhances the quality of clustering when compared with the state-of-the-art stream clustering methods that support data uncertainty.
An efficient and effective clustering method must be able to group data with the same class into the same cluster. The number of resulting clusters should vary with the number of data classes and not be bound by the threshold specifying the number of clusters. Further, a clustering method should support the different types of clustering structure evolution: appearance, disappearance, self-evolution, merge and split. These clustering structure evolution types have a direct impact on the efficiency and effectiveness of the stream clustering method. However, performing clustering structure evolution change detection in every round is time consuming. To avoid detection in every round, proper periods of clustering structure change detection need to be determined.

HUE-Stream was compared with UMicro and LuMicro to assess its clustering quality. The comparison was performed in terms of effectiveness (accuracy with progression of stream, accuracy with an increasing level of uncertainty) and efficiency (processing time and number of data points processed per second). HUE-Stream produces higher clustering quality and is more robust over highly uncertain data streams; however, it requires longer processing time. To improve the processing time, proper periods of clustering structure evolution change detection were determined. With these proper periods, processing time was greatly improved, while retaining clustering quality. Compared with the actual classes of data in the KDDCup 1999 network intrusion detection dataset, a comparable number of clusters was obtained in all stream progressions.

MATERIALS AND METHODS

Uncertain dataset

The KDD-CUP 99 benchmark dataset (a real-world dataset) in the UCI KDD archive was used. It contains 494,020 records and is composed of 34 numerical attributes and 7 categorical attributes. To evaluate the effectiveness of clustering algorithms over an uncertain data

stream, attribute uncertainty was introduced into the KDD-CUP 99 dataset by converting each attribute into a probability vector. For numerical attributes, the conversion technique described by Chen et al. (2009) was used to simulate discrete probability scenarios. Each numerical attribute in a given tuple can have several possible instances with different probabilities. The number of instances was generated using a random variable with a uniform distribution on [1, τ], where τ is the maximum number of instances specified by the user. The probability of each instance was sampled from a corresponding Gaussian distribution N(0, δ). In addition, to ensure that the probabilities sum to 1, normalization was applied as the final step. Categorical attributes were made uncertain by converting them into probability vectors using the approach described by Qin et al. (2009). For example, when introducing 10% uncertainty, an attribute takes its original value with 90% probability and any of the other values with 10% probability. Suppose that $A_j = v_1$ in the original accurate dataset; then $v_1$ is assigned a probability of 90%, and the remaining values $v_k$ ($k \ge 2$) are assigned probabilities that sum to 10%.

To compare the effectiveness of the HUE-Stream algorithm with the UMicro and LuMicro algorithms, all of these algorithms were implemented in C++ and run on a personal computer with a 3.2 GHz Intel Core i5 central processing unit and 8 GB of memory, running the Debian GNU/Linux 5.0.8 operating system.

Basic concepts of evolution-based stream clustering

The data stream consists of a set of (n + c)-dimensional tuples $X_1, \ldots, X_k$ arriving at time stamps $T_1, \ldots, T_k$. Each data point $X_i$ contains n numerical attributes and c categorical attributes. The number of valid values for a categorical attribute $x_{n+k}$ is $V_k$, where $1 \le k \le c$, and the j-th valid value for $x_{n+k}$ is $V_{jk}$, where $1 \le j \le V_k$. An isolated data point is a data point that is not a member of any cluster.
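The categorical conversion described in the uncertain dataset subsection can be sketched in a few lines of Python. This is a minimal illustration and not the authors' C++ implementation: the function name `to_probability_vector` is hypothetical, the attribute values are only an example, and spreading the residual probability uniformly over the other values is an assumption, since the text only requires that the other values share the remaining probability mass.

```python
def to_probability_vector(original, valid_values, uncertainty):
    """Turn one certain categorical value into a probability vector.

    The original value keeps probability (1 - uncertainty). The remaining
    mass is split uniformly over the other valid values (an assumption).
    """
    if original not in valid_values:
        raise ValueError("original value must be one of the valid values")
    others = [v for v in valid_values if v != original]
    if not others:
        # With a single valid value there is nothing to be uncertain about.
        return {original: 1.0}
    share = uncertainty / len(others)
    vector = {v: share for v in others}
    vector[original] = 1.0 - uncertainty
    return vector


# Illustrative attribute with three valid values and 10% uncertainty.
vector = to_probability_vector("tcp", ["tcp", "udp", "icmp"], 0.10)
```

Here the original value keeps 90% of the probability mass and the two other values receive 5% each, so the vector still sums to 1.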
Isolated data remain in the system for cluster appearance computations. An inactive cluster is a cluster that has a low weight. It can become an active cluster if its weight increases until it reaches the active cluster's threshold. An active cluster is a cluster that can assemble incoming data if there is a sufficient similarity score. A cluster is a collection of data that has been memorized for processing in the system. It can be an isolated piece of data, an inactive cluster or an active cluster. A fading function decreases the weight of data over time. In a data stream that has evolving data, older data should have less weight. Let λ be the decay rate and t be the elapsed time; then $f(t) = 2^{-\lambda t}$. The weight of a cluster is the number of data elements in the cluster, determined according to the fading function. Initially, each data element has a weight of 1. A cluster can increase its weight by assembling incoming data points or merging with other clusters.

Tuple-level and dimension-level uncertainty

There are many existing categories of assumption for modeling the uncertainty in a data stream (Chen et al., 2009). This paper focuses mainly on the discrete probability density function, which has been widely used and is easy to apply in practice. For each uncertain tuple $X_i$, its possible values in the d-th dimension can be defined by a probability distribution vector (Equation 1). The dimension-level uncertainty of the j-th dimension of a tuple $X_i$ can be defined by Equation 2.

Let $X_i$ be a tuple. The tuple-level uncertainty of $X_i$, denoted by $U(X_i)$, is the average of its dimension-level uncertainties. Therefore, the uncertainty of a k-tuple data stream can be calculated as the average of its tuple-level uncertainties.

Cluster representation using fading cluster structure with histogram

A fading cluster structure with histogram (FCH) was introduced in E-Stream (Udommanetanakit et al., 2007). In the current paper, FCH is extended to support uncertainty in both numerical and categorical data. An uncertain cluster feature for a cluster C containing a set of tuples $X_i, \ldots, X_k$ arriving at time stamps $T_i, \ldots, T_k$ is defined as an FCH whose components are described below.

The first component is a vector of the weighted sum of tuple expectations for each dimension at time t.

The second component is a vector of the weighted sum of squares of tuple expectations for each dimension at time t.

W(C,t) is the sum of all weights of the data points in cluster C at time t.

U(C,t) is the sum of all weighted tuple uncertainties in cluster C at time t.

HN(t) is an α-bin histogram of numerical data values with α equal-width intervals; the l-th bin of the j-th numerical dimension of HN(t) at time t is given by Equation 3.

The last component is a β-bin histogram of categorical data values with the top β accumulated probabilities of valid categorical values and their weighted frequency for each categorical dimension at time t. The histogram of the j-th categorical dimension of C is given by Equation 4, in which each entry records the a-th valid categorical value in the j-th categorical dimension, its accumulated probability (Equation 5) and its frequency, which is an accumulated weight. Note that the accumulated probability and the accumulated weight are used in calculating the distance between two sets of categorical data as described in the next section.
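The fading behavior of the numerical part of an FCH can be illustrated with a short Python sketch. It is a simplification under stated assumptions rather than the HUE-Stream implementation: the class name `FadingClusterFeature` and its field names are hypothetical, both histograms are omitted, each new tuple enters with weight 1, and every accumulated sum is assumed to be scaled by the fading function f(t) = 2^(-λt) before new data are added.

```python
class FadingClusterFeature:
    """Hypothetical, histogram-free subset of an FCH for numerical attributes."""

    def __init__(self, dims, decay_rate):
        self.decay_rate = decay_rate        # lambda in f(t) = 2^(-lambda * t)
        self.sum_expect = [0.0] * dims      # weighted sum of tuple expectations
        self.sum_sq_expect = [0.0] * dims   # weighted sum of squared expectations
        self.weight = 0.0                   # sum of the weights of the data points
        self.uncertainty = 0.0              # weighted sum of tuple-level uncertainties

    def fade(self, elapsed):
        """Scale every accumulated entry by f(elapsed) = 2^(-lambda * elapsed)."""
        f = 2.0 ** (-self.decay_rate * elapsed)
        self.sum_expect = [f * s for s in self.sum_expect]
        self.sum_sq_expect = [f * s for s in self.sum_sq_expect]
        self.weight *= f
        self.uncertainty *= f

    def add(self, expectation, tuple_uncertainty):
        """Absorb one tuple (initial weight 1) given its per-dimension expected values."""
        for j, e in enumerate(expectation):
            self.sum_expect[j] += e
            self.sum_sq_expect[j] += e * e
        self.weight += 1.0
        self.uncertainty += tuple_uncertainty

    def center(self):
        """Expected center: weighted sum of expectations divided by the weight."""
        return [s / self.weight for s in self.sum_expect]


fch = FadingClusterFeature(dims=2, decay_rate=0.1)
fch.add([1.0, 2.0], tuple_uncertainty=0.2)
fch.fade(elapsed=10)                        # f = 2^(-0.1 * 10) = 0.5
fch.add([3.0, 4.0], tuple_uncertainty=0.4)
```

With a decay rate of 0.1, ten time units halve the contribution of the first tuple, so the cluster weight becomes 1.5 and the expected center moves toward the newer tuple.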
Distance functions

A distance function plays an important role in data clustering tasks. To deal with

uncertainty in both categorical and numerical data, the current paper proposed new distance functions that take the uncertainty into account, described as follows. The cluster-point distance is formulated as Equation 6 and combines two terms. The first term is the distance between the expected center of the numerical attributes in cluster C and the expected value of the j-th numerical attribute of a data point (Equation 7), where the center and the standard deviation can be derived from the mean and standard deviation of the expected values of all data points in cluster C. The second term is the distance derived from the categorical attributes (Equation 8), whose terms are defined in Equations 9 and 10. The cluster-cluster distance is formulated in the same way: the distance between the expected centers of the numerical attributes of the two clusters is defined by Equation 11, and the distance derived from the categorical attributes is defined by Equation 12, whose terms are defined in Equations 13 and 14.

Evolution of stream clustering structure

To capture the characteristics of data streams, which have an evolving nature, the evolution-based stream clustering method is composed of different steps. In the beginning, incoming data are considered as isolated clusters. Then, a cluster is formed when a sufficiently dense region appears. An inactive cluster is changed to an active cluster when its weight reaches a given threshold. When a set of clusters has been identified, incoming data must be assigned to the closest cluster based on a similarity score. To detect change in the clustering structure, the following clustering evolutions are checked and handled: appearance, disappearance, self-evolution, merge and split. Appearance, self-evolution and merge evolution operations are determined using the appropriate distance function. In the following, the different evolution types are described.

Appearance: A new cluster can appear if there is a sufficiently dense group of data points in one area.
Initially, such elements appear as a group of outliers, but (as more data appear in a neighborhood) they are recognized as a cluster. Groups of dense data points may be located close to each other, as determined by considering the point-point distance and the cluster-point distance as in Equation 11.

Disappearance: Existing clusters can disappear because the existence of data is

diminished over time. Clusters that contain only old data fade and eventually disappear because they do not represent the presence of data. The disappearance of existing clusters happens when their data points have the least recent time stamp and their weights are less than the remove threshold.

Self-evolution: In each active cluster, the characteristics can evolve over time because the weight of old data points fades, so that new data points can affect the characteristics of the cluster in a short time. To do this, the fading function and the distance between cluster and data point are used to find the closest cluster of a new data point, which then changes the cluster characteristics.

Merge: A pair of clusters can be merged if their characteristics are very similar (two overlapping clusters). Merged clusters must cover the behavior of the pair by considering the distance between the two clusters.

Split: A cluster can be split into two smaller clusters if the behavior inside the cluster is obviously separated. Cluster split is based on the distribution of feature values as summarized by the cluster histogram, so a histogram of cluster data values is used to identify cluster splits.

HUE-Stream algorithm

The HUE-Stream algorithm extended the E-Stream (Udommanetanakit et al., 2007) and E-Stream++ (Kosonpothisakun et al., 2009) algorithms to support uncertainty in a heterogeneous data stream. HUE-Stream supports the monitoring and the change detection of clustering structures evolving over time. Five types of clustering structure evolution are supported: appearance, disappearance, self-evolution, merge and split. The main algorithm of HUE-Stream is given in Figure 1. In line 1, the algorithm starts by retrieving a new data point and labeling it with the current timestamp. In line 2, it fades the weights of all clusters. It is not necessary to check for changes in the clustering structure at every incoming data point.
Therefore, lines 3 to 9 check for changes in the clustering structure every T time periods.

Figure 1 HUE-Stream algorithm.

Algorithm HUE-Stream
 1  retrieve new data X_i and label it with timestamp t_x,i
 2  FadingAll
 3  if (t_x,i mod T) = 0 then
 4      DeleteCluster
 5      CheckSplit
 6      MergeOverlapCluster
 7      LimitMaximumCluster
 8      FlagActiveCluster
 9  end if
10  (Uncertainty[], index[]) ← FindCandidateClosestCluster
11  if sizeof(index[]) > 0
12      index ← FindClosestCluster
13      add x_i to FCH_index
14  else
15      create new FCH from X_i
16  end if
17  wait for new data

The process of cluster evolution detection starts with the deletion of any clusters having insufficient weight in line 4. Then, in line 5, it splits a cluster when the behavior inside the cluster is obviously separated. In line 6, it merges a pair of clusters whose characteristics are very similar. In line 7, it checks the number of clusters and merges the closest pairs if the number of clusters exceeds the limit. In line

8, this process ends by scanning the clusters in the system to find active clusters that have reached sufficient weight. Lines 10 to 16 find the closest cluster to contain the incoming data point with respect to the distance and uncertainty of that cluster. An isolated data point is created when there is no cluster to contain the new data point. The flow of control then returns to the top of the algorithm and waits for a new data point. The details of each step are described below and in Figure 2.

FadingAll: This procedure performs fading of all the existing clusters in the system. Because new data points are the focus rather than old data points, a fading function is used to decrease the weight of old data over time. When a cluster in the system has insufficient weight (FCH_i.w < fade_threshold), it is deleted from the system.

LimitMaximumCluster: This procedure is used to limit the number of clusters. It checks whether the number of clusters is greater than maximum_cluster. If the number of clusters exceeds this value, the closest pair of clusters is merged repeatedly until the number of remaining clusters is less than or equal to the threshold.

FlagActiveCluster: This procedure is used to check the status of the current active clusters. When the weight of any cluster is greater than or equal to active_threshold, it is flagged as an active cluster. Otherwise, the flag is cleared.

FindCandidateClosestCluster and FindClosestCluster: Both procedures are shown in Figure 2. They compute distances to find the indices of the candidate closest active clusters and then choose the most appropriate candidate by computing the maximum change in uncertainty value (Equation 15).

MergeOverlapCluster: The merge procedure of HUE-Stream is given in Figure 2. This procedure scans all the clusters in the system for merging pairs of similar clusters.
If the cluster-to-cluster distance is less than the merge_threshold, then that pair of clusters is merged. The merged cluster should express the characteristics of the two similar clusters. As a consequence, the FCH of the merged cluster can be formulated as follows. Let $C_1$ and $C_2$ denote two clusters. $FCH(C_1 \cup C_2)$ can be calculated based on $FCH(C_1)$ and $FCH(C_2)$. The values of the entries W(C,t) and U(C,t) in the FCH are the sums of the corresponding entries in $FCH(C_1)$ and $FCH(C_2)$. To obtain the merged histogram of numerical data values, first the minimum and maximum values in each numerical dimension of the pair must be found. Then this range is divided into α intervals of equal length. Finally, the frequency of each merged interval is computed from the histograms of the pair. To calculate the merged histogram of categorical data values, the two sets of top-β categorical data values in each dimension of the pair are unioned. Then, the union set is ordered by frequency in descending order. Finally, only the top β categorical data values are stored in the merged histogram.

CheckSplit: The CheckSplit procedure of HUE-Stream is given in Figure 2. This procedure is used to verify the splitting condition of each cluster using the histogram; all attributes are checked to find the split-position. If the split-position occurs in a numerical or categorical attribute, the weight is recalculated based on the histogram of the splitting attribute. Then, the cluster representation of the new clusters is determined based on the calculated weight. For numerical attributes, as shown in Figure 3, a valley that lies between two peaks of the

histogram is considered as the splitting criterion. If a splitting valley is found at more than one point, the best splitting valley is the valley with the minimum value. When the cluster splits, the histogram is split in that dimension and the other dimensions are weighted based on the split dimension. The splitting valley must be statistically significantly lower than the lower peak.

Procedure MergeOverlapCluster
  for i ← 1 to |FCH|
    for j ← i + 1 to |FCH|
      overlap[i,j] ← dist(FCH_i, FCH_j)
      m ← merge_threshold
      if overlap[i,j] > m * (FCH_i.sd + FCH_j.sd)
        if dist_cat[i,j] < m * mindist_cat(FCH_a, FCH_b)
          if (i, j) not in S
            merge(FCH_i, FCH_j)

Procedure CheckSplit
  for i ← 1 to |FCH|
    for j ← 1 to number of numerical attributes
      find valley and peak
      if chi-square test(valley, peak) > significance
        split using numerical attribute
    for j ← 1 to number of categorical attributes
      find maximum difference of bin k, bin k+1
      if chi-square test(bin k, bin k+1) > significance
        split using categorical attribute
    if split using only numerical or categorical
      split FCH_i
      S ← S ∪ {(i, |FCH|)}
    else if numerical and categorical
      (n1, n2) ← split using numerical
      (c1, c2) ← split using categorical
      if max(c1, c2) > max(n1, n2)
        split FCH_i using categorical
      if max(c1, c2) <= max(n1, n2)
        split FCH_i using numerical
      S ← S ∪ {(i, |FCH|)}

Procedure FindCandidateClosestCluster
  for i ← 1 to |FCH|
    if FCH_i is active cluster
      dist[i] ← dist(FCH_i, x_i)
      if dist[i] < radius_factor/4
        set_candidate ← i
  return set_candidate

Procedure FindClosestCluster
  for i ← 1 to |set_candidate|
    index_u[i, FCH_i.u] ← FCH_i.u
    index[i, FCH_i.u] ← max_of_FCH_i.u
  return i

Figure 2 MergeOverlapCluster, CheckSplit, FindCandidateClosestCluster and FindClosestCluster.

For categorical attributes, the splitting-attribute is an attribute that has

a significantly higher accumulated probability than the others within the same cluster. The split-position is the position between a pair of adjacent values whose accumulated probabilities have the greatest difference, as shown in Figure 4. If the split-position occurs between the first and second bars, that cluster will not be split into two smaller clusters because the first value is the only outstanding member in the top-β of the splitting attribute.

Figure 3 Histogram management in a split dimension and other dimension of numerical data (panels: split histogram, 1st split histogram and 2nd split histogram, each for the split dimension and the other dimension).

Figure 4 Histogram management in a split dimension and other dimension of categorical data (panels: split histogram, 1st split histogram and 2nd split histogram, each for the split dimension and the other dimension).

RESULTS AND DISCUSSION

Experimental setup

The performance of HUE-Stream was evaluated by comparison with UMicro and LuMicro in terms of effectiveness (accuracy with progression of stream, accuracy with increasing level of uncertainty), sensitivity (accuracy with varying number of clusters) and efficiency (processing time and number of data points processed per second). Parameter settings of the three algorithms are shown in Table 1.

Table 1 Parameter settings of HUE-Stream, UMicro and LuMicro algorithms.

HUE-Stream                UMicro                LuMicro
stream_speed      100     stream_speed   100    stream_speed       100
horizon           2       horizon        2      horizon            2
decay_rate        0.1     decay_rate     0.1    decay_rate         0.1
radius_factor     4       radius_factor  3      radius_factor      3
remove_threshold  0.1                           candidate_cluster  10
merge_threshold   1.25
active_threshold  5

Effectiveness test with respect to number of clusters

The effectiveness of HUE-Stream, UMicro and LuMicro was evaluated in terms of

purity and f-measure. More specifically, purity and f-measure were studied with regard to their sensitivity to the maximum number of clusters, a threshold that is set in all three algorithms. In Figures 5 and 6, the maximum number of clusters is increased by doubling as follows: 13, 26, 52 and 104. Figure 5 shows the purity and f-measure of the three methods with increasing progression of the streams. For these experiments, the time horizon was set to 2, the uncertainty level was set to 5% and T was set to 1. In terms of purity, comparable results were obtained, as the purity is almost insensitive to the number of clusters for the three methods; however, this was not so for the f-measure. In terms of the f-measure, the experimental results showed that HUE-Stream outperforms UMicro and LuMicro for almost all of the different maximum number of clusters thresholds.

Figure 5 Purity and f-measure with respect to different maximum numbers of clusters for (A) Purity of HUE-Stream; (B) f-measure of HUE-Stream; (C) Purity of UMicro; (D) f-measure of LuMicro; (E) Purity of UMicro; and (F) f-measure of LuMicro (vertical axes: average purity and average f-measure).
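The different sensitivity of the two measures can be made concrete with a small Python sketch. The definitions used here are one common convention and are an assumption, not necessarily the exact formulas behind Figure 5: each cluster is matched to its dominant class, purity is the dominant-class fraction of the cluster, and the f-measure is the harmonic mean of that precision and the recall of the dominant class, with both scores averaged over clusters. The function name `purity_and_f_measure` is hypothetical.

```python
from collections import Counter, defaultdict


def purity_and_f_measure(cluster_ids, class_labels):
    """Return (average purity, average f-measure) over all clusters."""
    members_by_cluster = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, class_labels):
        members_by_cluster[cluster_id].append(label)
    class_sizes = Counter(class_labels)

    purities, f_scores = [], []
    for members in members_by_cluster.values():
        dominant, hits = Counter(members).most_common(1)[0]
        precision = hits / len(members)          # purity of this cluster
        recall = hits / class_sizes[dominant]    # share of the class captured
        purities.append(precision)
        f_scores.append(2 * precision * recall / (precision + recall))
    return sum(purities) / len(purities), sum(f_scores) / len(f_scores)


# Six points from a single class, grouped into three clusters or into one.
labels = ["normal"] * 6
split_scores = purity_and_f_measure([0, 0, 1, 1, 2, 2], labels)
merged_scores = purity_and_f_measure([0, 0, 0, 0, 0, 0], labels)
```

Under these assumed definitions, splitting one class into three small clusters leaves the purity at 1.0 but reduces the f-measure to about 0.5, whereas a single cluster scores 1.0 on both. This is why purity alone does not penalize an algorithm that produces too many small clusters.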

Figure 5 shows that the f-measure of UMicro and LuMicro tended to decrease as the maximum number of clusters increased. However, HUE-Stream was not sensitive to the increasing number of clusters because its number of clusters is derived from the behavior of the data streams. As long as the maximum number of clusters is not exceeded, HUE-Stream still yields good results. It is important to remember that the purity score considers only the correctness of data points in each cluster and not how many of them are grouped together within the same cluster (as does the f-measure). Figure 6 shows the number of actual classes compared with the number of clusters obtained by each algorithm in each stream progression. It can be clearly seen that HUE-Stream is independent of the maximum number of clusters threshold. HUE-Stream groups data points based on their real behavior. This is not the case with UMicro, which is dependent on the maximum number of clusters threshold. In particular, in the data stream range between 200,000 and 300,000, UMicro generates too many small clusters while HUE-Stream and LuMicro generate only one cluster and merge the overlapping clusters. In the rest of the data streams, LuMicro produces significantly more clusters than HUE-Stream. This explains why HUE-Stream outperforms LuMicro in terms of the f-measure, since there is only one class in the actual data.

Effectiveness test with respect to uncertainty level

Facing uncertain data streams with different probability distributions, the clustering algorithm had to deal with the uncertainty of the record value.
Figure 6 Number of clusters at different maximum numbers of clusters for (A) UMicro; (B) LuMicro; and (C) HUE-Stream (vertical axis: number of clusters in system).

In order to evaluate the effectiveness and robustness of the three algorithms on uncertain data streams, attribute uncertainty was introduced by converting each of the data streams

into probability vectors. Figure 7 shows purity and f-measure with increasing uncertainty levels. In these experiments, the maximum number of clusters of HUE-Stream, UMicro and LuMicro was set to 26, 100 and 100, respectively. In addition, T=1 was set for HUE-Stream to allow the detection of change in the clustering structure at every incoming data point. Purity and f-measure clearly decreased as the level of uncertainty increased. However, HUE-Stream still produced a better f-measure than LuMicro and UMicro.

Figure 7 Purity (A) and f-measure (B) for the three algorithms studied with increasing uncertainty level (axes: average purity and average f-measure against number of uncertainty levels).

Processing time enhancement

Figure 8 shows the processing time for the three algorithms, which all exhibit a linear relationship between the runtime and the number of data points. In this experiment, the parameter settings were as follows. The thresholds for the maximum number of clusters of HUE-Stream, LuMicro and UMicro were set to 26, 100 and 100, respectively. The uncertainty level was set to 5% and T=1.

Figure 8 Processing time with progression of streams (axes: runtime (s) against data stream (points x10,000)).

Although HUE-Stream produced the best clustering quality and was more robust with highly uncertain data streams, it required longer processing time compared with UMicro and LuMicro. This was because HUE-Stream checks for changes in the clustering structure evolution too frequently (in every round). Thus, proper periods of clustering structure change detection need to be determined. To improve the processing time, the aim was to detect the clustering structure evolution at proper periods (and not in each round). Wan and Wang (2010) proposed the proper period of checking to be $T = \left[ \frac{1}{\lambda} \log\left( \frac{\varepsilon}{\varepsilon - 1} \right) \right]$, where

λ corresponds to a decay factor in the fading function and ε is a remove-threshold. Therefore, the experiments were carried out by varying the periods in multiples of the number of data points in each period (time horizon × stream speed = 200). In Figure 9, the processing time and processing rate obtained by varying T to 200, 400, 600, 800 and 1,000 are shown to be significantly better than those of T=1. It can also be noticed that while varying T from 200 to 1,000, the processing time and processing rate were not much different. Figure 10 shows the purity and f-measure obtained with different checking periods. In terms of purity, all periods generated results comparable with T=1 (almost equal to 1.0). In terms of f-measure, at a stream progression between 50,000 and 200,000, all the periods generated almost the same result (same number of clusters) as shown in Figure 11. In contrast, at a stream progression between 200,000 and 300,000, only the period T=1 could capture the change in clustering structure and generate only one cluster. Meanwhile, the other periods still maintained the same number of clusters generated from the previous stream progression. In the rest of the stream progression, all the periods generated the same f-measure except at period T = 200, which also retained the same number of clusters. In summary, the proper periods for clustering structure evolution change detection were T=400, 600 and 800.

Figure 9 Processing time (A) and processing rate (B) of HUE-Stream with respect to different periods of clustering evolution detection (axes: runtime (s) and number of data points/s against data stream (points x10,000)).

Figure 10 Purity (A) and f-measure (B) with respect to different periods of clustering evolution detection (vertical axes: average purity and average f-measure).

CONCLUSION

The uncertainty in the data stream significantly affects the clustering structure. A HUE-Stream algorithm was proposed for clustering

14 Kasetsart J. (Nat. Sci.) 46(4) 651 Number of clusters in system Figure 11 Number of clusters with progression of stream and different periods of checking evolutions. heterogeneous data streams with uncertainty. A distance function, cluster representation and histogram management were introduced for supporting different types of clustering structure evolution namely, appearance, disappearance, self-evolution, merge and split. HUE-Stream was compared with a real-world dataset against UMicro and LuMicro in terms effectiveness, sensitivity and efficiency. The HUE-Stream algorithm outperformed UMicro and LuMicro in terms of f-measure, and had a comparable purity score. While UMicro and LuMicro were sensitive to the input parameter number of clusters, HUE-Stream was robust and able to determine almost the exact number of clusters that suited based on the behavior of the data stream and the number of clusters in the system. HUE-Stream produced higher clustering quality and was more robust over highly uncertain data streams; however, it required longer processing time. In order to improve processing time, proper periods of clustering structure evolution change detection were determined. With these proper periods, the processing time was able to be greatly improved, while retaining the clustering quality. Compared to the actual class of data, the comparable number of clusters was obtained in all stream progressions. LITERATURE CITED Aggarwal, C.C On high dimensional projected clustering of uncertain data streams, pp In Proceedings of the 9th International Conference on Data Engineering. Shanghai, China. Aggarwal, C.C., J. Han, J. Wang and P.S. Yu A framework for clustering evolving data streams, pp In Proceedings of the 29th International Conference on Very Large Data Bases. Berlin, Germany. Aggarwal, C.C., J. Han, J. Wang and P.S. 
Yu, A framework for projected clustering of high dimensional data streams, pp In Proceedings of the 13th International Conference on Very Large Data Bases. Toronto, Canada. Aggarwal, C.C. and P.S. Yu A framework for clustering uncertain data streams, pp In Data Engineering, Proceedings of the 8th International Conference on Data Engineering. Cancun, Mexico. Chen, Z., G. Ming and Z. Aoying, Tracking high quality clusters over uncertain data streams, pp In Proceedings of the 9th International Conference on Data Engineering. Shanghai, China.

Huang, G.Y., D.P. Liang, C.Z. Hu and J.D. Ren An algorithm for clustering heterogeneous data streams with uncertainty, pp In Proceedings of International Conference on Machine Learning and Computing. Qingdao, China.

Kosonpothisakun, P., T. Kangkachit and K. Waiyamai, E-Stream++: Stream clustering technique for supporting numerical and categorical data, pp In Proceedings of the 13th National Computer Science and Engineering Conference. Bangkok, Thailand.

Meesuksabai, W., T. Kangkachit and K. Waiyamai, HUE-Stream: Evolution-based clustering technique for heterogeneous data streams with uncertainty, pp In Proceedings of the 7th International Conference on Advanced Data Mining and Applications. Beijing, China.

Qin, B., Y. Xia, S. Prabhakar and Y. Tu, A rule-based classification algorithm for uncertain data, pp In Proceedings of the 9th International Conference on Data Engineering. Shanghai, China.

Udommanetanakit, K., T. Rakthanmanon and K. Waiyamai, E-Stream: Evolution-based technique for stream clustering, pp In Proceedings of the 3rd International Conference on Advanced Data Mining and Applications. Harbin, China.

Wan, R. and L. Wang, Clustering over evolving data stream with mixed attributes. Journal of Computational Information Systems

Yang, C. and J. Zhou HClustream: A novel approach for clustering evolving heterogeneous data stream, pp In Proceedings of the 6th IEEE International Conference on Data Mining Workshops. Hong Kong, China.
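
Note on the checking period. The processing-time section attributes to Wan and Wang (2010) a checking period computed from the decay factor λ and the remove-threshold ε, read here as T = (1/λ) log(ε / (ε - 1)) rounded to a whole number of data points. The following Python fragment is only a minimal sketch of that calculation and of running evolution detection every T points instead of every point. The function names, the rounding up, the base-2 logarithm and the example values are all assumptions made for illustration; none of them is taken from HUE-Stream or from Wan and Wang (2010).

```python
import math


def checking_period(decay: float, remove_threshold: float, base: float = 2.0) -> int:
    """Return T = ceil((1 / decay) * log(eps / (eps - 1))).

    Assumptions (not stated in the paper): the result is rounded up,
    and the logarithm base defaults to 2.
    """
    if decay <= 0:
        raise ValueError("decay factor must be positive")
    if remove_threshold <= 1:
        # eps / (eps - 1) is only positive and finite for eps > 1
        raise ValueError("remove-threshold must be greater than 1")
    ratio = remove_threshold / (remove_threshold - 1.0)
    return math.ceil(math.log(ratio, base) / decay)


def should_check(points_seen: int, period: int) -> bool:
    """True when evolution detection should run after this data point."""
    return points_seen % period == 0


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to keep the arithmetic simple.
    period = checking_period(decay=0.25, remove_threshold=2.0)
    checkpoints = [n for n in range(1, 13) if should_check(n, period)]
    print(period, checkpoints)
```

With T=1 the `should_check` test is true for every point, which corresponds to the per-round detection that the experiments found slow; larger periods skip most of those checks at the cost of reacting later to a change in the clustering structure.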

E-Stream: Evolution-Based Technique for Stream Clustering

E-Stream: Evolution-Based Technique for Stream Clustering E-Stream: Evolution-Based Technique for Stream Clustering Komkrit Udommanetanakit, Thanawin Rakthanmanon, and Kitsana Waiyamai Department of Computer Engineering, Faculty of Engineering Kasetsart University,

More information

Evolution-Based Clustering of High Dimensional Data Streams with Dimension Projection

Evolution-Based Clustering of High Dimensional Data Streams with Dimension Projection Evolution-Based Clustering of High Dimensional Data Streams with Dimension Projection Rattanapong Chairukwattana Department of Computer Engineering Kasetsart University Bangkok, Thailand Email: g521455024@ku.ac.th

More information

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Mining Frequent Itemsets for data streams over Weighted Sliding Windows Mining Frequent Itemsets for data streams over Weighted Sliding Windows Pauray S.M. Tsai Yao-Ming Chen Department of Computer Science and Information Engineering Minghsin University of Science and Technology

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

Comparative Study of Subspace Clustering Algorithms

Comparative Study of Subspace Clustering Algorithms Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that

More information

Towards New Heterogeneous Data Stream Clustering based on Density

Towards New Heterogeneous Data Stream Clustering based on Density , pp.30-35 http://dx.doi.org/10.14257/astl.2015.83.07 Towards New Heterogeneous Data Stream Clustering based on Density Chen Jin-yin, He Hui-hao Zhejiang University of Technology, Hangzhou,310000 chenjinyin@zjut.edu.cn

More information

A Framework for Clustering Massive Text and Categorical Data Streams

A Framework for Clustering Massive Text and Categorical Data Streams A Framework for Clustering Massive Text and Categorical Data Streams Charu C. Aggarwal IBM T. J. Watson Research Center charu@us.ibm.com Philip S. Yu IBM T. J.Watson Research Center psyu@us.ibm.com Abstract

More information

An Efficient RFID Data Cleaning Method Based on Wavelet Density Estimation

An Efficient RFID Data Cleaning Method Based on Wavelet Density Estimation An Efficient RFID Data Cleaning Method Based on Wavelet Density Estimation Yaozong LIU 1*, Hong ZHANG 1, Fawang HAN 2, Jun TAN 3 1 School of Computer Science and Engineering Nanjing University of Science

More information

Data Stream Clustering Using Micro Clusters

Data Stream Clustering Using Micro Clusters Data Stream Clustering Using Micro Clusters Ms. Jyoti.S.Pawar 1, Prof. N. M.Shahane. 2 1 PG student, Department of Computer Engineering K. K. W. I. E. E. R., Nashik Maharashtra, India 2 Assistant Professor

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

Classification model with subspace data-dependent balls

Classification model with subspace data-dependent balls Classification model with subspace data-dependent balls attapon Klakhaeng, Thanapat Kangkachit, Thanawin Rakthanmanon and Kitsana Waiyamai Data Analysis and Knowledge Discovery Lab Department of Computer

More information

Research on Parallelized Stream Data Micro Clustering Algorithm Ke Ma 1, Lingjuan Li 1, Yimu Ji 1, Shengmei Luo 1, Tao Wen 2

Research on Parallelized Stream Data Micro Clustering Algorithm Ke Ma 1, Lingjuan Li 1, Yimu Ji 1, Shengmei Luo 1, Tao Wen 2 International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2015) Research on Parallelized Stream Data Micro Clustering Algorithm Ke Ma 1, Lingjuan Li 1, Yimu Ji 1,

More information

Random Sampling over Data Streams for Sequential Pattern Mining

Random Sampling over Data Streams for Sequential Pattern Mining Random Sampling over Data Streams for Sequential Pattern Mining Chedy Raïssi LIRMM, EMA-LGI2P/Site EERIE 161 rue Ada 34392 Montpellier Cedex 5, France France raissi@lirmm.fr Pascal Poncelet EMA-LGI2P/Site

More information

A Novel Method of Optimizing Website Structure

A Novel Method of Optimizing Website Structure A Novel Method of Optimizing Website Structure Mingjun Li 1, Mingxin Zhang 2, Jinlong Zheng 2 1 School of Computer and Information Engineering, Harbin University of Commerce, Harbin, 150028, China 2 School

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

Sequences Modeling and Analysis Based on Complex Network

Sequences Modeling and Analysis Based on Complex Network Sequences Modeling and Analysis Based on Complex Network Li Wan 1, Kai Shu 1, and Yu Guo 2 1 Chongqing University, China 2 Institute of Chemical Defence People Libration Army {wanli,shukai}@cqu.edu.cn

More information

An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams *

An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams * An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams * Jia-Ling Koh and Shu-Ning Shin Department of Computer Science and Information Engineering National Taiwan Normal University

More information

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE Sinu T S 1, Mr.Joseph George 1,2 Computer Science and Engineering, Adi Shankara Institute of Engineering

More information

Query Independent Scholarly Article Ranking

Query Independent Scholarly Article Ranking Query Independent Scholarly Article Ranking Shuai Ma, Chen Gong, Renjun Hu, Dongsheng Luo, Chunming Hu, Jinpeng Huai SKLSDE Lab, Beihang University, China Beijing Advanced Innovation Center for Big Data

More information

K-means based data stream clustering algorithm extended with no. of cluster estimation method

K-means based data stream clustering algorithm extended with no. of cluster estimation method K-means based data stream clustering algorithm extended with no. of cluster estimation method Makadia Dipti 1, Prof. Tejal Patel 2 1 Information and Technology Department, G.H.Patel Engineering College,

More information

On Biased Reservoir Sampling in the Presence of Stream Evolution

On Biased Reservoir Sampling in the Presence of Stream Evolution Charu C. Aggarwal T J Watson Research Center IBM Corporation Hawthorne, NY USA On Biased Reservoir Sampling in the Presence of Stream Evolution VLDB Conference, Seoul, South Korea, 2006 Synopsis Construction

More information

Analyzing Outlier Detection Techniques with Hybrid Method

Analyzing Outlier Detection Techniques with Hybrid Method Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,

More information

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters A New Online Clustering Approach for Data in Arbitrary Shaped Clusters Richard Hyde, Plamen Angelov Data Science Group, School of Computing and Communications Lancaster University Lancaster, LA1 4WA, UK

More information

Spatial Outlier Detection

Spatial Outlier Detection Spatial Outlier Detection Chang-Tien Lu Department of Computer Science Northern Virginia Center Virginia Tech Joint work with Dechang Chen, Yufeng Kou, Jiang Zhao 1 Spatial Outlier A spatial data point

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

City, University of London Institutional Repository

City, University of London Institutional Repository City Research Online City, University of London Institutional Repository Citation: Andrienko, N., Andrienko, G., Fuchs, G., Rinzivillo, S. & Betz, H-D. (2015). Real Time Detection and Tracking of Spatial

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Survey Paper on Clustering Data Streams Based on Shared Density between Micro-Clusters

Survey Paper on Clustering Data Streams Based on Shared Density between Micro-Clusters Survey Paper on Clustering Data Streams Based on Shared Density between Micro-Clusters Dure Supriya Suresh ; Prof. Wadne Vinod ME(Student), ICOER,Wagholi, Pune,Maharastra,India Assit.Professor, ICOER,Wagholi,

More information

A New Feature Local Binary Patterns (FLBP) Method

A New Feature Local Binary Patterns (FLBP) Method A New Feature Local Binary Patterns (FLBP) Method Jiayu Gu and Chengjun Liu The Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA Abstract - This paper presents

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection Zhenghui Ma School of Computer Science The University of Birmingham Edgbaston, B15 2TT Birmingham, UK Ata Kaban School of Computer

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Video Inter-frame Forgery Identification Based on Optical Flow Consistency

Video Inter-frame Forgery Identification Based on Optical Flow Consistency Sensors & Transducers 24 by IFSA Publishing, S. L. http://www.sensorsportal.com Video Inter-frame Forgery Identification Based on Optical Flow Consistency Qi Wang, Zhaohong Li, Zhenzhen Zhang, Qinglong

More information

ProUD: Probabilistic Ranking in Uncertain Databases

ProUD: Probabilistic Ranking in Uncertain Databases Proc. 20th Int. Conf. on Scientific and Statistical Database Management (SSDBM'08), Hong Kong, China, 2008. ProUD: Probabilistic Ranking in Uncertain Databases Thomas Bernecker, Hans-Peter Kriegel, Matthias

More information

A Fast Algorithm for Data Mining. Aarathi Raghu Advisor: Dr. Chris Pollett Committee members: Dr. Mark Stamp, Dr. T.Y.Lin

A Fast Algorithm for Data Mining. Aarathi Raghu Advisor: Dr. Chris Pollett Committee members: Dr. Mark Stamp, Dr. T.Y.Lin A Fast Algorithm for Data Mining Aarathi Raghu Advisor: Dr. Chris Pollett Committee members: Dr. Mark Stamp, Dr. T.Y.Lin Our Work Interested in finding closed frequent itemsets in large databases Large

More information

COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS

COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS Mariam Rehman Lahore College for Women University Lahore, Pakistan mariam.rehman321@gmail.com Syed Atif Mehdi University of Management and Technology Lahore,

More information

Maintaining Frequent Itemsets over High-Speed Data Streams

Maintaining Frequent Itemsets over High-Speed Data Streams Maintaining Frequent Itemsets over High-Speed Data Streams James Cheng, Yiping Ke, and Wilfred Ng Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Kowloon,

More information

Role of big data in classification and novel class detection in data streams

Role of big data in classification and novel class detection in data streams DOI 10.1186/s40537-016-0040-9 METHODOLOGY Open Access Role of big data in classification and novel class detection in data streams M. B. Chandak * *Correspondence: hodcs@rknec.edu; chandakmb@gmail.com

More information

A reversible data hiding based on adaptive prediction technique and histogram shifting

A reversible data hiding based on adaptive prediction technique and histogram shifting A reversible data hiding based on adaptive prediction technique and histogram shifting Rui Liu, Rongrong Ni, Yao Zhao Institute of Information Science Beijing Jiaotong University E-mail: rrni@bjtu.edu.cn

More information

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning

More information

A Network Intrusion Detection System Architecture Based on Snort and. Computational Intelligence

A Network Intrusion Detection System Architecture Based on Snort and. Computational Intelligence 2nd International Conference on Electronics, Network and Computer Engineering (ICENCE 206) A Network Intrusion Detection System Architecture Based on Snort and Computational Intelligence Tao Liu, a, Da

More information

MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION

MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION Panca Mudjirahardjo, Rahmadwati, Nanang Sulistiyanto and R. Arief Setyawan Department of Electrical Engineering, Faculty of

More information

An ICA-Based Multivariate Discretization Algorithm

An ICA-Based Multivariate Discretization Algorithm An ICA-Based Multivariate Discretization Algorithm Ye Kang 1,2, Shanshan Wang 1,2, Xiaoyan Liu 1, Hokyin Lai 1, Huaiqing Wang 1, and Baiqi Miao 2 1 Department of Information Systems, City University of

More information

Efficient Range Query Processing on Uncertain Data

Efficient Range Query Processing on Uncertain Data Efficient Range Query Processing on Uncertain Data Andrew Knight Rochester Institute of Technology Department of Computer Science Rochester, New York, USA andyknig@gmail.com Manjeet Rege Rochester Institute

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/24/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 High dim. data

More information

High-Dimensional Incremental Divisive Clustering under Population Drift

High-Dimensional Incremental Divisive Clustering under Population Drift High-Dimensional Incremental Divisive Clustering under Population Drift Nicos Pavlidis Inference for Change-Point and Related Processes joint work with David Hofmeyr and Idris Eckley Clustering Clustering:

More information

Object Tracking Algorithm based on Combination of Edge and Color Information

Object Tracking Algorithm based on Combination of Edge and Color Information Object Tracking Algorithm based on Combination of Edge and Color Information 1 Hsiao-Chi Ho ( 賀孝淇 ), 2 Chiou-Shann Fuh ( 傅楸善 ), 3 Feng-Li Lian ( 連豊力 ) 1 Dept. of Electronic Engineering National Taiwan

More information

An Adaptive Threshold LBP Algorithm for Face Recognition

An Adaptive Threshold LBP Algorithm for Face Recognition An Adaptive Threshold LBP Algorithm for Face Recognition Xiaoping Jiang 1, Chuyu Guo 1,*, Hua Zhang 1, and Chenghua Li 1 1 College of Electronics and Information Engineering, Hubei Key Laboratory of Intelligent

More information

An Optimization Algorithm of Selecting Initial Clustering Center in K means

An Optimization Algorithm of Selecting Initial Clustering Center in K means 2nd International Conference on Machinery, Electronics and Control Simulation (MECS 2017) An Optimization Algorithm of Selecting Initial Clustering Center in K means Tianhan Gao1, a, Xue Kong2, b,* 1 School

More information

Histogram and watershed based segmentation of color images

Histogram and watershed based segmentation of color images Histogram and watershed based segmentation of color images O. Lezoray H. Cardot LUSAC EA 2607 IUT Saint-Lô, 120 rue de l'exode, 50000 Saint-Lô, FRANCE Abstract A novel method for color image segmentation

More information

High Capacity Reversible Watermarking Scheme for 2D Vector Maps

High Capacity Reversible Watermarking Scheme for 2D Vector Maps Scheme for 2D Vector Maps 1 Information Management Department, China National Petroleum Corporation, Beijing, 100007, China E-mail: jxw@petrochina.com.cn Mei Feng Research Institute of Petroleum Exploration

More information

An Efficient Clustering Method for k-anonymization

An Efficient Clustering Method for k-anonymization An Efficient Clustering Method for -Anonymization Jun-Lin Lin Department of Information Management Yuan Ze University Chung-Li, Taiwan jun@saturn.yzu.edu.tw Meng-Cheng Wei Department of Information Management

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei i Han 1 University of Illinois, IBM TJ Watson.

Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei i Han 1 University of Illinois, IBM TJ Watson. Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei i Han 1 University of Illinois, IBM TJ Watson Debapriya Basu Determine outliers in information networks Compare various algorithms

More information

Outlier Detection Scoring Measurements Based on Frequent Pattern Technique

Outlier Detection Scoring Measurements Based on Frequent Pattern Technique Research Journal of Applied Sciences, Engineering and Technology 6(8): 1340-1347, 2013 ISSN: 2040-7459; e-issn: 2040-7467 Maxwell Scientific Organization, 2013 Submitted: August 02, 2012 Accepted: September

More information

Further Applications of a Particle Visualization Framework

Further Applications of a Particle Visualization Framework Further Applications of a Particle Visualization Framework Ke Yin, Ian Davidson Department of Computer Science SUNY-Albany 1400 Washington Ave. Albany, NY, USA, 12222. Abstract. Our previous work introduced

More information

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011 International Journal of Innovative Computing, Information and Control ICIC International c 2012 ISSN 1349-4198 Volume 8, Number 7(B), July 2012 pp. 5165 5178 AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR

More information

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM 96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays

More information

Clustering Algorithms for Data Stream

Clustering Algorithms for Data Stream Clustering Algorithms for Data Stream Karishma Nadhe 1, Prof. P. M. Chawan 2 1Student, Dept of CS & IT, VJTI Mumbai, Maharashtra, India 2Professor, Dept of CS & IT, VJTI Mumbai, Maharashtra, India Abstract:

More information

DOI:: /ijarcsse/V7I1/0111

DOI:: /ijarcsse/V7I1/0111 Volume 7, Issue 1, January 2017 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey on

More information

Nesnelerin İnternetinde Veri Analizi

Nesnelerin İnternetinde Veri Analizi Bölüm 4. Frequent Patterns in Data Streams w3.gazi.edu.tr/~suatozdemir What Is Pattern Discovery? What are patterns? Patterns: A set of items, subsequences, or substructures that occur frequently together

More information

A Feature Point Matching Based Approach for Video Objects Segmentation

A Feature Point Matching Based Approach for Video Objects Segmentation A Feature Point Matching Based Approach for Video Objects Segmentation Yan Zhang, Zhong Zhou, Wei Wu State Key Laboratory of Virtual Reality Technology and Systems, Beijing, P.R. China School of Computer

More information

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection Based on Locality Preserving Projection 2 Information & Technology College, Hebei University of Economics & Business, 05006 Shijiazhuang, China E-mail: 92475577@qq.com Xiaoqing Weng Information & Technology

More information

Digital Image Processing. Prof. P.K. Biswas. Department of Electronics & Electrical Communication Engineering

Digital Image Processing. Prof. P.K. Biswas. Department of Electronics & Electrical Communication Engineering Digital Image Processing Prof. P.K. Biswas Department of Electronics & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Image Segmentation - III Lecture - 31 Hello, welcome

More information

Overview of Clustering

Overview of Clustering based on Loïc Cerfs slides (UFMG) April 2017 UCBL LIRIS DM2L Example of applicative problem Student profiles Given the marks received by students for different courses, how to group the students so that

More information

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

More information

Mining Data Streams. From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records.
