Entropy Based Adaptive Outlier Detection Technique for Data Streams

Yogita (1), Durga Toshniwal (1), and Bhavani Kumar Eshwar (2)
(1) Department of Computer Science and Engineering, IIT Roorkee, India
(2) IBM India Software Labs, Bangalore, India

Abstract

Outlier detection in data streams is an important problem in many application areas such as network intrusion detection, faulty sensor detection and fraud detection in online financial transactions. The majority of existing outlier detection techniques have been designed for static datasets and require a global view and multiple scans of the data, which is not feasible for streaming data. In this paper, we propose an entropy based outlier detection technique for streaming data that exploits the fact that the presence of an anomalous data object considerably increases the entropy of a clustering of normal data. The technique maintains clusters of the streaming data and computes the change in entropy caused by each incoming data object. If the increase in entropy is very large, the data object is marked as a candidate outlier, and its anomalous behaviour is confirmed over multiple sliding windows to minimize false alarms. The proposed method is incremental and dynamically updates the clustering structure and entropy statistics to deal with the heavy volume and concept evolution of data streams. The proposed scheme has been evaluated on both synthetic and real world data. Experimental results demonstrate its effectiveness on the following performance measures: outlier detection rate, false alarm rate and running time.

Keywords: Concept Evolution; Data Streams; Entropy; Outlier Detection

1. Introduction

Outlier detection identifies data objects that deviate significantly from the rest of the data [1], [2]. In many applications, such as network intrusion detection, fraud detection, fault diagnosis and finding criminal activities in electronic commerce, outliers are more interesting than the normal patterns of the data.
Data streams are potentially infinite sequences of data objects [3] and are produced by many applications such as real-time surveillance, environmental monitoring, medical systems, communication networks, online banking and internet traffic. Outlier detection in streaming data faces the following challenges [4]:

- The entire data stream cannot be stored for multiple scans because of its very large volume.
- The outlier detection model needs to be updated with incoming data to handle the dynamic nature of data streams.
- The high speed of data streams limits the memory space and processing time available to outlier mining techniques.

A large body of literature is available on outlier detection in static datasets, which can be classified into distance based, density based and clustering based methods [5], [6], [7], [8], [9], [10]. Most of these techniques do not fit the streaming data environment because they assume that the whole dataset is available in memory for multiple scans. Some work has also been done on distance and density based outlier detection in data streams [11], [12], [13]. These works involve the computation of nearest neighbors and depend upon the choice of several parameters.

Entropy is a powerful mechanism for measuring the information content or uncertainty of a variable [14]. It is also referred to as a measure of the randomness of a system. The concept of entropy is very intuitive for outlier detection because the presence of outliers increases the entropy (randomness) of a dataset [15], [16], [17], [18], and this increase can be used to measure the outlierness of an object. Entropy has also been used in a number of works for clustering data [19], [20], but in the present work our focus is on outlier detection as opposed to finding high-quality clusters. In this paper, we propose an outlier detection technique for streaming data which uses the concept of entropy to avoid pairwise distance computations and nearest neighbour parameter setting.
It makes use of the fact that the presence of an outlier is likely to increase the entropy of clusters comprising normal data. To exploit this fact, it maintains clusters of the normal data objects and, for each incoming data object, finds the change in the entropy of the clusters [21]. If the increase in entropy is very large, the data object is marked as a candidate outlier, and its anomalous behaviour is confirmed over multiple sliding windows to minimize false alarms [11], [22]. The proposed method is incremental and dynamically updates the clustering structure and entropy statistics to deal with the heavy volume and concept evolution of data streams. We have implemented and validated the proposed technique on both synthetic and real world datasets [23].

The rest of this paper is organized as follows. Section 2 discusses related work. The proposed outlier detection technique is presented in Section 3. Section 4 focuses on the experimental results, and the conclusion is given in Section 5.

2. Related Work

Outlier detection has applications in a large number of domains, which is why it has been a topic of importance for the research community. Good surveys can be found in [1], [2]. The majority of this literature focuses on static datasets. Distance-based outlier detection approaches are presented in [5], [24]. Their definitions of an outlier are simple and intuitive but require the user to specify the parameters k and d, which can be difficult to determine. This idea is further extended in [7], [10], where the outlier factor of each data object is calculated as the sum of the distances from its k nearest neighbors. Density based outlier detection methods were proposed by Breunig et al. [6], whose method captures the local outlierness of an object, and by Tao and Pi [8], whose method unifies density-based clustering and outlier detection. Liu et al. [25] proposed the first isolation method for outlier detection, called Isolation Forest (iForest), which detects outliers purely based on the concept of isolation without using any distance or density measure.

The concepts of distance, density and clustering based outlier detection have also been extended to data streams. Two distance based approaches for finding outliers in data streams are given in [11], [12]. Yogita and Toshniwal have proposed an unsupervised approach for identifying outliers in evolving data streams by weighting attributes in clustering [26]. Most of these methods apply the sliding window model to deal with data streams.

The concept of entropy has been widely used for solving clustering and outlier detection problems in static data [17], [18], [20] and the clustering problem in streaming data [19], [21]. Its use for finding outliers in data streams is equally intuitive but needs further exploration. Liu et al. [15] proposed an entropy-based method to approximate the number of outliers in a spatial data set by using a function of the local contrast and local contrast probability of non-spatial and spatial attributes.
An entropy-based approach to discover covert timing channels is proposed by Gianvecchio and Wang [27], based on the fact that the creation of a covert timing channel affects the entropy of the original process. By using an information entropy model to measure uncertainty in the rough sets framework, Jiang et al. presented a new definition of outliers known as IE (Information Entropy)-based outliers [16].

3. The Proposed Method

This section presents the proposed adaptive outlier detection scheme. First, preliminary concepts and notations are introduced; then the deviation criterion and the proposed method are discussed. At the end of this section, the time and space complexity of the proposed method are analysed.

3.1 Preliminary Concepts and Notations

1) Data Stream: A data stream is an infinite sequence of data objects x_1, x_2, ..., x_n, ..., arriving at time stamps T_1, T_2, ..., T_n, .... Each data object is a multidimensional point with m attributes. Data streams are of tremendous volume and flow at very high speed. In this work, the sliding window model has been used to process streaming data. It stores only a portion of the data in memory at a time. After the current window has been processed, only sufficient summaries of the data are maintained in memory and the detailed data is discarded.

2) Cluster Summary Structure: The cluster summary structure CS of a cluster C at time t is defined as follows:

$$CS = (FT, \delta t) \tag{1}$$

where FT is the frequency table, which stores the frequency of each attribute value pair of every attribute in the cluster C. A similar data structure is used in [17] for maintaining the attribute value frequencies of a complete dataset. FT can be updated incrementally when a new data object is assigned to the cluster by incrementing the frequency of the corresponding attribute value pair of each attribute. δt stores the timestamp of the data object that was most recently added to the cluster.

3) Entropy: Entropy is a measure of the information content and of the uncertainty or randomness of a variable [14].
Let x be a random variable, S(x) the set of values that x can take and P(x) the probability function of x. Then the entropy E(x) is defined by equation (2).

$$E(x) = -\sum_{x \in S(x)} P(x) \log_2 P(x) \tag{2}$$

The entropy of a multivariate vector X = (x_1, ..., x_i, ..., x_m) with m attributes, where each x_i is a discrete random variable, can be calculated as defined by equation (3).

$$E(X) = -\sum_{x_1 \in S(x_1)} \cdots \sum_{x_i \in S(x_i)} \cdots \sum_{x_m \in S(x_m)} P(X) \log_2 P(X) \tag{3}$$

where P(X) = P(x_1, ..., x_i, ..., x_m) is the multivariate probability distribution function.

3.2 Deviation Criterion for Outlier Detection

Given a data stream and clusters of normal data, outlier detection aims to identify data objects that deviate heavily from the clusters according to a deviation criterion. The deviation criterion is a very important factor in outlier detection because it measures the outlierness of a data object. In this paper, we propose an entropy based deviation criterion PCE(X), defined in equation (4). It gives the percentage change in the entropy of the clustering Cl on assigning a data object X to its nearest cluster (the nearest cluster is the one for which the increase in entropy is minimal among all clusters). As outliers strongly increase the entropy of the clustering, the value of PCE will be large and positive for them. In contrast, the value of PCE for normal data objects will be either negative or small and positive.

$$PCE(X) = \left( \frac{E(Cl + X) - E(Cl)}{E(Cl)} \right) \times 100 \tag{4}$$

where E(Cl) is the entropy of the clustering Cl and E(Cl + X) is the new entropy of the clustering Cl after assigning data object X to its nearest cluster C. E(Cl) is defined by equation (5) and is the weighted sum of the entropies of all the clusters [21].

$$E(Cl) = \sum_{k} \frac{|C_k|}{|D|} \, E(C_k) \tag{5}$$

Here |C_k| is the number of data objects in cluster C_k and |D| is the total number of data objects in the clustering. To simplify the computation of the entropy of a cluster in a streaming environment, we have assumed that the attributes of the data stream are independent. This assumption transforms equation (3) into equation (6), with which the entropy of a cluster C_k can be calculated, where E(x_i) is the entropy of the i-th attribute computed over the data objects that belong to cluster C_k.

$$E(C_k) = E(x_1) + E(x_2) + \cdots + E(x_i) + \cdots + E(x_r) \tag{6}$$

PCE is based upon the fact that when a data object is assigned to a cluster, the entropy of the cluster, and correspondingly the total entropy of the clustering, may increase or decrease. Intuitively, if the data object is inherently similar to the cluster, the entropy (randomness) of the cluster will either decrease or increase only slightly, while assigning a dissimilar (outlying) object will increase it sharply. The percentage change in entropy of the clustering (PCE) is therefore a justified criterion for differentiating between outliers and normal data objects.

3.3 Proposed Method in Detail

A pictorial representation of the proposed technique is given in Fig. 1. It mainly comprises four modules, which are detailed below. Data streams are of tremendous volume and flow at very high speed, which is why they cannot be stored in memory for processing. We have used the sliding window model to process streaming data. It stores only a portion of the data in memory at a time for processing. After the current window has been processed, only sufficient summaries of the data are maintained in memory and the detailed data is discarded to free space for the next window.
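As a rough illustration of equations (4) to (6), the following Python sketch computes the entropy of a cluster from per-attribute frequency tables, the weighted entropy of a clustering, and PCE for an incoming categorical object. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`build_cluster`, `add_object`, `cluster_entropy`, `clustering_entropy`, `pce`), the representation of a cluster as a `(tables, size)` pair, and the handling of a zero base entropy are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def build_cluster(objects):
    """Summarise categorical tuples as (per-attribute frequency tables, size)."""
    m = len(objects[0])
    tables = [Counter(obj[i] for obj in objects) for i in range(m)]
    return tables, len(objects)


def add_object(cluster, x):
    """Return a copy of the cluster summary with object x added."""
    tables, size = cluster
    new_tables = [Counter(t) for t in tables]
    for table, value in zip(new_tables, x):
        table[value] += 1
    return new_tables, size + 1


def cluster_entropy(cluster):
    """Eq. (6): sum of attribute entropies, assuming independent attributes."""
    tables, size = cluster
    return -sum((c / size) * math.log2(c / size)
                for table in tables for c in table.values())


def clustering_entropy(clusters):
    """Eq. (5): cluster entropies weighted by relative cluster size."""
    total = sum(size for _, size in clusters)
    return sum(size / total * cluster_entropy((tables, size))
               for tables, size in clusters)


def pce(clusters, x):
    """Eq. (4): percentage change in clustering entropy when x joins its
    nearest cluster (the one whose own entropy increases the least).
    Returns (pce_value, index_of_nearest_cluster)."""
    base = clustering_entropy(clusters)
    gains = [cluster_entropy(add_object(c, x)) - cluster_entropy(c)
             for c in clusters]
    nearest = gains.index(min(gains))
    trial = list(clusters)
    trial[nearest] = add_object(clusters[nearest], x)
    new = clustering_entropy(trial)
    if base == 0:
        # Assumption for this sketch: eq. (4) is undefined when E(Cl) = 0.
        return (math.inf if new > 0 else 0.0), nearest
    return (new - base) / base * 100.0, nearest


# Two small clusters of two-attribute categorical objects.
c1 = build_cluster([("red", "small")] * 3 + [("red", "large")])
c2 = build_cluster([("blue", "large")] * 3 + [("blue", "small")])
print(pce([c1, c2], ("red", "small")))   # object similar to the first cluster
print(pce([c1, c2], ("green", "huge")))  # object with unseen attribute values
```

In this toy setting the similar object yields a negative PCE, while the object with unseen attribute values yields a large positive PCE, which matches the behaviour described for normal objects and outliers.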
1) Initialization: In the proposed outlier detection technique, the normal behaviour of the data is represented by clusters. This module initializes the normal behaviour (clusters) by clustering sampled normal data, which should not contain outlying objects. These clusters are output to the candidate outlier detection module only once, when processing of the streaming data starts. Any clustering algorithm can be chosen for this initial clustering.

2) Candidate Outlier Detection: A candidate outlier is an object which satisfies the deviation criterion. We have used the entropy based deviation criterion PCE defined in equation (4). The algorithm for candidate outlier detection is shown in Algorithm 1. It assigns incoming data objects to clusters following the approach given in [21].

Algorithm 1 Candidate Outlier Detection
Input: Clustering (Cl) comprising k clusters, Current Window Data, Threshold
Output: Candidate Outliers
1: Repeat for all data objects of current window
2:   X = Read next data object in window
3:   Temporarily assign X to cluster Ci such that the increase in entropy of Ci on assigning X is minimum over all clusters Cj, where j = 1, ..., K
4:   Calculate PCE(X)
5:   if PCE(X) > Threshold then
6:     X is a candidate outlier; initialize the counter of X to one and save both X and its counter into the candidate outlier repository to further verify its deviation in the outlier detection module
7:   else
8:     X is a normal data object; output it to the updating module to update cluster Ci
9:   end if
10: End Repeat

In the candidate outlier detection algorithm, every object is classified either as a candidate outlier or as a normal data object. To keep count of the number of windows in which an object is found to be a candidate outlier, a counter is associated with it and is initially set to one. Candidate outliers are stored along with their counters in the candidate repository. Later, the outlier detection module takes candidate outliers from the repository to verify their outlying nature (deviation) over multiple data windows.
This repeated verification is done because a candidate outlier may be part of an emerging cluster and may be showing deviation only temporarily instead of actually being an outlier [11], [22]. To consider the local as well as the global characteristics of the data in outlier detection, we have used both the increase in individual cluster entropy (step 3 of Algorithm 1) and PCE, the percentage change in entropy of the clustering (step 5 of Algorithm 1), as criteria to decide upon the outlying nature (deviation) of a data object.

[Figure 1: Proposed Adaptive Outlier Detection Technique (block diagram: data stream, sliding window, initialization, candidate outlier detection module, updating module, outlier detection module, candidate outlier repository).]

3) Updating: Concept evolution occurs when new classes come into existence and old ones may disappear from the streaming data over time. It arises from changes in the underlying process that produces the stream. In the proposed method, smooth evolution of concepts is addressed in the following way:

- The proposed method is made adaptive by dynamically updating the data clustering and entropy statistics with the incoming data stream. This is explained in this section.
- The outlying nature of an object is verified over multiple data windows before it is declared an outlier, because it may be part of an emerging cluster and may be showing deviation only temporarily instead of actually being an outlier. This is part of the outlier detection module.

There are three types of possible updates to the clustering structure and entropy statistics:

1) Update Existing Clusters: Whenever a normal data object from the candidate outlier detection module or an inlier from the outlier detection module is input to the updating module, the update procedure takes the following steps:
- Assign object X to cluster Ci such that the increase in entropy of Ci on assigning X is minimum over all clusters Cj, where j = 1, ..., K.
- Update the cluster summary structure of cluster Ci.
- Calculate the new entropy of the clustering by using equation (5).
- Discard the data object X.

2) Incorporate Emerging Clusters: Candidate outliers from the candidate repository are clustered periodically. If the size of any resulting cluster is large enough and its entropy value is smaller than a threshold, the cluster represents a new class in the incoming data and hence must be incorporated into the clustering, because the clustering based outlier detection approach assumes that outliers are few in number and occur in sparse regions. The following actions are taken:
- K = k + 1, where k is the number of clusters.
- Initialize the cluster summary structure of cluster C_(k+1).
- Calculate the new entropy of the clustering by using equation (5).

3) Prune Obsolete Clusters: If no data object has been assigned to a cluster Ci for a long time, this signifies that cluster Ci no longer exists in the data stream. A time factor δt is associated with each cluster and stores the timestamp of the data object that was most recently added to the cluster. If the difference between the current timestamp and δt is greater than a threshold, cluster Ci is deleted and the following actions are taken:
- Delete the cluster summary structure of cluster Ci.
- K = k - 1, where k is the number of clusters.
- Calculate the new entropy of the clustering by using equation (5).

The cluster summary structure of a cluster comprises FT and δt. FT is the frequency table, which stores the frequency of each attribute value pair of every attribute in the cluster C. It can be updated incrementally when a new data object is assigned to the cluster by incrementing the frequency of the corresponding attribute value pair of each attribute. δt stores the timestamp of the data object that was most recently added to the cluster.

4) Outlier Detection: A candidate outlier becomes a real outlier if it fulfils the deviation criterion continuously over w sliding windows [11], [22]. We have used the entropy based deviation criterion PCE defined in equation (4). The outlying nature of a candidate outlier is confirmed over multiple data windows before it is flagged as a real outlier, because it may be part of an emerging cluster and may be showing deviation only temporarily instead of being a real outlier.
To keep count of the number of windows over which an object has been found to be a candidate outlier, a counter is associated with each candidate. It is set to one when the candidate outlier is first detected in the candidate outlier detection module. The algorithm for the outlier detection module is given in Algorithm 2. If a candidate outlier is found to be an inlier in the outlier detection module, it is output to the updating module, because an inlier represents normal behaviour and must therefore be incorporated into the clustering of the data stream.

Algorithm 2 Outlier Detection
Input: Clustering (Cl) comprising k clusters, Candidate Outliers, Threshold
Output: Real Outliers
1: Repeat for all candidate outliers
2:   Read next candidate outlier from repository
3:   Calculate PCE(candidate outlier)
4:   if PCE(candidate outlier) > Threshold & Counter(candidate outlier) == W then
5:     Declare candidate outlier as real outlier, remove it from the candidate outlier repository and save it to the outlier repository
6:   else
7:     if PCE(candidate outlier) > Threshold & Counter(candidate outlier) < W then
8:       Update Counter(candidate outlier) = Counter(candidate outlier) + 1
9:     else
10:      Candidate outlier is an inlier (normal data object); remove it from the candidate repository and output it to the updating module
11:    end if
12:  end if
13: End Repeat

3.4 Time Complexity

Let k be the maximum number of clusters that can occur at a time in the data stream, d the number of dimensions of the data, n the window size, c the maximum number of candidate outliers that may occur at a time, and D the data stream size. There will then be a total of D/n data windows to process. Each window is processed by three modules. The initialization module runs only once, on a small sample of the data, so its running time can be treated as a constant I in the analysis. The time taken by the candidate outlier detection module is k*n*(1 + d). Updating takes k*d*(n - c) + d*p*(k + 1 + c*k_c*i), and outlier detection takes c*k*(1 + d). Here i is the number of iterations in the clustering of candidate outliers, k_c is the number of clusters in the candidate outlier set and p is the number of windows after which candidate outliers are clustered. I, d, p and c are constants, and i and k_c will be very small because c is constant. The total running time of the proposed scheme can therefore be stated as follows:

$$\text{Running Time} = (D/n) \cdot k \cdot (n + 1) \tag{7}$$

As k is the number of clusters and does not grow with the data stream size, the time complexity of the proposed outlier detection technique is of order O(D), that is, linear in the data stream size.
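The verification loop of Algorithm 2 can be sketched in Python as follows. This is only an illustration of the control flow, not the authors' MATLAB implementation: the function name `verify_candidates`, the dictionary used as the candidate repository, and the `pce_of` callback (standing in for the PCE computation of equation (4)) are assumptions introduced for this example.

```python
def verify_candidates(candidates, pce_of, threshold, w):
    """Run one pass of Algorithm 2 over the candidate repository.

    candidates: dict mapping a candidate identifier to its window counter
                (updated in place).
    pce_of:     hypothetical callable returning the current PCE of a candidate.
    threshold:  PCE threshold of the deviation criterion.
    w:          number of windows a candidate must deviate in before it is
                declared a real outlier.
    Returns (real_outliers, inliers); inliers go to the updating module.
    """
    real_outliers, inliers = [], []
    for cand in list(candidates):  # copy the keys, the dict is modified below
        score = pce_of(cand)
        if score > threshold and candidates[cand] == w:
            # Deviated in w windows: confirmed as a real outlier.
            real_outliers.append(cand)
            del candidates[cand]
        elif score > threshold:
            # Still deviating but not yet confirmed (counter < w is assumed,
            # since counters start at one and confirmed candidates are removed).
            candidates[cand] += 1
        else:
            # No longer deviating: treat as a normal data object.
            inliers.append(cand)
            del candidates[cand]
    return real_outliers, inliers


# Toy run with made-up PCE values for three candidates.
repo = {"a": 1, "b": 3, "c": 2}
scores = {"a": 80.0, "b": 90.0, "c": 5.0}
print(verify_candidates(repo, scores.get, threshold=50.0, w=3))
print(repo)
```

Each call does a constant amount of bookkeeping per candidate on top of the PCE evaluation, so the pass is linear in the number of candidates c, in line with the c*k*(1 + d) term of the analysis above.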
3.5 Space Complexity

The memory space used to store a variable depends upon the operating system specification. For the present work, let us consider the Windows Vista operating system, and let k be the number of clusters that can occur at a time in the data stream, d the number of dimensions of the data, m the maximum number of domain values of an attribute and c the maximum number of candidate outliers that may occur at a time. The proposed method needs to store only the following information:

- Summary structures of the clusters
- Candidate outliers
- Current window data

The space required to store the summary structures of k clusters is 4*k*d*m bytes, the current window data takes 4*d*n bytes and the candidate outliers need 4*c*(d + 1) bytes. Hence the total memory space used is 4*k*d*m + 4*d*n + 4*c*(d + 1) bytes. As d and m are constants, the total space used is of order O(k + n + c). The window size n, the number of clusters k and the number of candidate outliers c are very small compared to the size of the data stream, so the space complexity of the proposed method is of order O(C), where C is a small constant. The proposed method is therefore efficient in terms of space complexity, as is required for data streams.

4. Experimental Results

We have implemented the method in MATLAB R2010a, and experiments have been conducted on synthetic as well as real data sets. The real data sets are taken from the UCI machine learning repository [23]. Threshold values are set by conducting experiments on a subset of the data.

4.1 Data Sets

We have worked on two real datasets and two synthetic datasets. For experimental purposes, a time stamp that specifies the order of processing of the streaming data is added to each record in all datasets. A 10% sample of each dataset is used in the initialization phase and the rest is processed over sliding data windows.

1) KDDCUP 99 Data Set: This data set was first used in the ACM KDD CUP Challenge and has since been widely referenced for the verification of outlier detection techniques.
It is a computer network intrusion detection dataset. Each record represents a network connection which was simulated in a military network environment and is labelled either as normal or as an intrusion. It contains 22 simulated attacks of the following categories: DOS, R2L, U2R and PROBE. We have removed the class labels for the experiments. The dataset has 41 attributes in total, of which 7 are categorical and 34 are numeric, and approximately 4,898,431 connection records. In its original form, the dataset is not suitable for outlier detection because the proportion of attack records is unrealistically high compared to normal records. We have therefore sampled a 10% subset of the original data. The sampled dataset contains 4895 attack records, which is approximately 1% of the sampled data set. Numeric attributes are discretized into categorical attributes using equal width binning.

2) Mushroom Data Set: It contains 8,124 data records of 23 species of mushrooms over 22 categorical attributes.

There are two classes of mushrooms: poisonous (48.2%) and edible (51.8%). We have planted 2% outliers in the data set based upon the frequency of each domain value of the attributes.

3) Synthetic Data Sets: Synthetic datasets are very useful for performance analysis because their data parameters are easy to control. We have generated two synthetic datasets using the GAClust [28] data generator, which is freely available online. The first dataset is named DS1. It consists of records with 5 categorical attributes, grouped into 5 clusters. One cluster contains only 500 records and is considered the outlier class in our experiments. The second synthetic data set is named DS2. It comprises data objects with 5 attributes, 5 clusters and 1% randomly generated outliers.

4.2 Metrics for Performance Evaluation

To assess the performance of the proposed method, we have examined two metrics in addition to running time: outlier detection rate and false alarm rate. The outlier detection rate is the ratio of the number of actual outliers detected to the total number of outliers in the data (eq. (8)). The false alarm rate is the ratio of the number of normal data objects that are mistakenly flagged as outliers to the total number of normal data objects (eq. (9)).

$$\text{Detection Rate} = \frac{\text{Number of Outliers Detected}}{\text{Total Number of Outliers}} \tag{8}$$

$$\text{False Alarm Rate} = \frac{\text{Number of False Alarms}}{\text{Total Normal Data Objects}} \tag{9}$$

4.3 Performance Evaluation

[Table 1: Performance of proposed method in terms of outlier detection rate and false alarm rate. Columns: Dataset, Total Outliers, Total Normal Objects, No. of Outliers Detected, No. of False Alarms, Detection Rate, False Alarm Rate. Rows: KDDCUP, Mushroom, DS1, DS2.]

1) Detection Rate & False Alarm Rate: Table 1 reports the outlier detection rate of the proposed technique on the different datasets, which shows its effectiveness.
This results from incorporating both the local and the global characteristics of the data in outlier detection, by using both individual cluster entropy and PCE (percentage change in entropy of the clustering) to decide upon the outlying nature (deviation) of a data object. The false alarm rate of an outlier detection method must be as low as possible, because dealing with false alarms requires extra effort and is expensive for both user and system. The false alarm rate of the proposed method on all four datasets is given in Table 1, and its value is small, as it should be. The proposed method checks an object's exceptional behaviour over multiple windows before declaring it an outlier, which leads to fewer false alarms.

2) Effect of Dataset Size on Running Time: Data stream processing methods must be efficient in terms of running time to meet the challenge of the high speed and tremendous volume of streaming data. To validate the efficiency of the proposed technique in terms of running time, experiments were carried out with increasing sizes of dataset DS2 and of the KDD CUP dataset (which is very large), and the results are shown in Fig. 2.

[Figure 2: Effect of Dataset Size on Running Time. x-axis: Data Set Size; y-axis: Running Time (Seconds); series: DS2, KDDCUP.]

It can be concluded from Fig. 2 that the running time increases linearly with dataset size, which shows the scalability of the proposed method. This is achieved by processing the streaming data incrementally in the sliding window model, which requires only a single scan of the data. The difference between the running times on the two datasets results from their different numbers of dimensions.

3) Increasing Percentage of Outliers versus Detection Rate: In this experiment, a subset of size 5000 of data set DS1 has been used. Outliers are placed in the data in increasing percentages. These outliers are of two types: group outliers and point outliers. The percentages of outliers are relative to the initial data set size.
It can be seen from Fig. 3 that the outlier detection rate of the proposed method remains steady as the percentage of outliers increases, up to a certain level, because it treats small clusters as outlier groups rather than as normal clusters, following the criterion of [9]. There is a rapid fall in the detection rate after 20% outliers. This fall is expected, because such objects no longer exhibit anomalous behaviour under the clustering based outlier detection approach, which assumes that outliers are only a small fraction of the whole data. For this analysis, however, we have still considered them anomalous.

[Figure 3: Effect of Increasing Percentage of Outliers on Detection Rate. x-axis: % of Outliers in Dataset; y-axis: Detection Rate.]

5. Conclusion

In this work, we have proposed an outlier detection method for data streams using the concept of entropy. It is incremental and adaptive in order to handle the dynamic nature of data streams. The proposed technique has been validated on both synthetic and real world datasets. Experimental results demonstrate its effectiveness on the outlier detection rate, false alarm rate and running time performance metrics. It also uses memory space efficiently, as its space complexity is of order O(C). It can be applied in a large number of fields such as banking databases, medical databases, network intrusion detection and weather prediction. In future, we will compare the proposed technique to other existing techniques and analyze the effect of the threshold value of the deviation criterion on the detection rate. An extension of this work to mixed datasets is in progress.

References

[1] V. Hodge and J. Austin, A survey of outlier detection methodologies, Artif. Intell. Rev., vol. 22, no. 2, Oct.
[2] V. Chandola, A. Banerjee, and V. Kumar, Anomaly detection: A survey, ACM Comput. Surv., vol. 41, no. 3, pp. 15:1-15:58, July.
[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Second Edition, J. Kacprzyk and L. C. Jain, Eds. Morgan Kaufmann, 2006, vol. 54.
[4] C. Aggarwal, Ed., Data Streams Models and Algorithms. Springer.
[5] S. Ramaswamy, R. Rastogi, and K. Shim, Efficient algorithms for mining outliers from large data sets, in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, ser. SIGMOD '00. New York, NY, USA: ACM, 2000.
[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, LOF: identifying density-based local outliers, in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, ser. SIGMOD '00. New York, NY, USA: ACM, 2000.
[7] F. Angiulli and C. Pizzuti, Fast outlier detection in high dimensional spaces, in Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, ser. PKDD '02. London, UK, UK: Springer-Verlag, 2002.
[8] Y. Tao and D. Pi, Unifying density-based clustering and outlier detection, in Proceedings of the 2009 Second International Workshop on Knowledge Discovery and Data Mining, ser. WKDD '09. Washington, DC, USA: IEEE Computer Society, 2009.
[9] Z. He, X. Xu, and S. Deng, Discovering cluster based local outliers, Pattern Recognition Letters, vol. 2003, pp. 9-10.
[10] F. Angiulli, S. Basta, and C. Pizzuti, Distance-based detection and prediction of outliers, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 2.
[11] F. Angiulli and F. Fassetti, Detecting distance-based outliers in streams of data, in Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, ser. CIKM '07, New York, NY, USA, 2007.
[12] M. S. Sadik and L. Gruenwald, DBOD-DS: Distance Based Outlier Detection for Data Streams. Springer, 2011, vol. 6261, pp. 122-136.
[13] D. Pokrajac, A. Lazarevic, and L. J. Latecki, Incremental local outlier detection for data streams, in CIDM, 2007.
[14] C. Shannon, A mathematical theory of communication, Bell System Technical Journal, vol. 27, July, October.
[15] X. Liu, C.-T. Lu, and F. Chen, An entropy-based method for assessing the number of spatial outliers, in IRI, 2008.
[16] F. Jiang, Y. Sui, and C. Cao, An information entropy-based approach to outlier detection in rough sets, Expert Syst. Appl., vol. 37, no. 9, Sept.
[17] A. Koufakou, E. G. Ortiz, M. Georgiopoulos, G. C. Anagnostopoulos, and K. M. Reynolds, A scalable and efficient outlier detection strategy for categorical data, in ICTAI (2). IEEE Computer Society, 2007.
[18] Z. He, S. Deng, and X. Xu, An optimization model for outlier detection in categorical data, in Advances in Intelligent Computing, ser. Lecture Notes in Computer Science, 2005, vol. 3644.
[19] S. Wang, Y. Fan, C. Zhang, H. Xu, X. Hao, and Y. Hu, Entropy based clustering of data streams with mixed numeric and categorical values, in ACIS-ICIS. IEEE Computer Society, 2008.
[20] P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik, LIMBO: Scalable Clustering of Categorical Data, in Adv. Database Technol. - EDBT 2004, 2004.
[21] D. Barbará, Y. Li, and J. Couto, Coolcat: An entropy-based algorithm for categorical clustering, in Proceedings of the Eleventh International Conference on Information and Knowledge Management, ser. CIKM '02, New York, NY, USA, 2002.
[22] M. Elahi, K. Li, W. Nisar, X. Lv, and H. Wang, Efficient clustering-based outlier detection algorithm for dynamic data stream, in Proceedings of the 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery - Volume 05, ser. FSKD '08. Washington, DC, USA: IEEE Computer Society, 2008.
[23] A. Frank and A. Asuncion, UCI machine learning repository. [Online].
[24] E. M. Knorr and R. T. Ng, Algorithms for mining distance-based outliers in large datasets, in Proceedings of the 24rd International Conference on Very Large Data Bases, ser. VLDB '98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998.
[25] F. T. Liu, K. M. Ting, and Z.-H. Zhou, Isolation-based anomaly detection, ACM Trans. Knowl. Discov. Data, vol. 6, no. 1, pp. 3:1-3:39, Mar.
[26] Yogita and D. Toshniwal, A framework for outlier detection in evolving data streams by weighting attributes in clustering, in Proceedings of the 2nd International Conference on Communication Computing and Security, India.
[27] S. Gianvecchio and H. Wang, An entropy-based approach to detecting covert timing channels, IEEE Transactions on Dependable and Secure Computing, vol. 8, no. 6.
[28] D. Cristofor and D. Simovici, Finding median partitions using information-theoretical-based genetic algorithms, Journal of Universal Computer Science, vol. 8.


More information

Chapter 5: Outlier Detection

Chapter 5: Outlier Detection Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 5: Outlier Detection Lecture: Prof. Dr.

More information

Classification of Concept-Drifting Data Streams using Optimized Genetic Algorithm

Classification of Concept-Drifting Data Streams using Optimized Genetic Algorithm Classification of Concept-Drifting Data Streams using Optimized Genetic Algorithm E. Padmalatha Asst.prof CBIT C.R.K. Reddy, PhD Professor CBIT B. Padmaja Rani, PhD Professor JNTUH ABSTRACT Data Stream

More information

Towards New Heterogeneous Data Stream Clustering based on Density

Towards New Heterogeneous Data Stream Clustering based on Density , pp.30-35 http://dx.doi.org/10.14257/astl.2015.83.07 Towards New Heterogeneous Data Stream Clustering based on Density Chen Jin-yin, He Hui-hao Zhejiang University of Technology, Hangzhou,310000 chenjinyin@zjut.edu.cn

More information

Intrusion Detection System Using K-SVMeans Clustering Algorithm

Intrusion Detection System Using K-SVMeans Clustering Algorithm Intrusion Detection System Using K-eans Clustering Algorithm 1 Jaisankar N, 2 Swetha Balaji, 3 Lalita S, 4 Sruthi D, Department of Computer Science and Engineering, Misrimal Navajee Munoth Jain Engineering

More information

E-Stream: Evolution-Based Technique for Stream Clustering

E-Stream: Evolution-Based Technique for Stream Clustering E-Stream: Evolution-Based Technique for Stream Clustering Komkrit Udommanetanakit, Thanawin Rakthanmanon, and Kitsana Waiyamai Department of Computer Engineering, Faculty of Engineering Kasetsart University,

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Performance Evaluation of Density-Based Outlier Detection on High Dimensional Data

Performance Evaluation of Density-Based Outlier Detection on High Dimensional Data Performance Evaluation of Density-Based Outlier Detection on High Dimensional Data P. Murugavel Research Scholar, Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu, India Dr. M. Punithavalli Research

More information

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Comparision between Quad tree based K-Means and EM Algorithm for Fault Prediction

Comparision between Quad tree based K-Means and EM Algorithm for Fault Prediction Comparision between Quad tree based K-Means and EM Algorithm for Fault Prediction Swapna M. Patil Dept.Of Computer science and Engineering,Walchand Institute Of Technology,Solapur,413006 R.V.Argiddi Assistant

More information

A Rough Set Approach for Generation and Validation of Rules for Missing Attribute Values of a Data Set

A Rough Set Approach for Generation and Validation of Rules for Missing Attribute Values of a Data Set A Rough Set Approach for Generation and Validation of Rules for Missing Attribute Values of a Data Set Renu Vashist School of Computer Science and Engineering Shri Mata Vaishno Devi University, Katra,

More information

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification Extended R-Tree Indexing Structure for Ensemble Stream Data Classification P. Sravanthi M.Tech Student, Department of CSE KMM Institute of Technology and Sciences Tirupati, India J. S. Ananda Kumar Assistant

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information