Entropy Based Adaptive Outlier Detection Technique for Data Streams


Yogita 1, Durga Toshniwal 1, and Bhavani Kumar Eshwar 2
1 Department of Computer Science and Engineering, IIT Roorkee, India
2 IBM India Software Labs, Bangalore, India

Abstract

Outlier detection in data streams is an important problem in many application areas such as network intrusion detection, faulty sensor detection and fraud detection in online financial transactions. The majority of existing outlier detection techniques were designed for static datasets and require a global view and multiple scans of the data, which is not feasible for streaming data. In this paper, we propose an entropy based outlier detection technique for streaming data that exploits the fact that the presence of an anomalous data object sharply increases the entropy of a clustering of normal data. It maintains clusters of the streaming data and computes the change in entropy caused by each incoming data object. If the increase in entropy is very large, the data object is marked as a candidate outlier, and its anomalous behaviour is confirmed over multiple sliding windows to minimize false alarms. The proposed method is incremental and dynamically updates the clustering structure and entropy statistics to deal with the heavy volume and concept evolution of data streams. The proposed scheme has been evaluated on both synthetic and real world data. Experimental results show its effectiveness on the following performance measures: outlier detection rate, false alarm rate and running time.

Keywords: Concept Evolution; Data Streams; Entropy; Outlier Detection

1. Introduction

Outlier detection identifies data objects that deviate significantly from the rest of the data [1], [2]. In many applications, outliers are more interesting than the normal patterns of the data, for example in network intrusion detection, fraud detection, fault diagnosis and finding criminal activities in electronic commerce.
Data streams are potentially infinite sequences of data objects [3] and are produced by many applications such as real-time surveillance, environmental monitoring, medical systems, communication networks, online banking and internet traffic. Outlier detection in streaming data faces the following challenges [4]: (i) the entire data stream cannot be stored for multiple scans because of its huge volume; (ii) the outlier detection model needs to be updated with incoming data to handle the dynamic nature of data streams; (iii) the high speed of data streams limits the memory space and processing time available to outlier mining techniques.

A large body of literature is available for outlier detection in static datasets, which can be classified into distance based, density based and clustering based methods [5], [6], [7], [8], [9], [10]. However, most existing techniques do not fit the streaming data environment because they assume that the whole dataset is available in memory for multiple scans. Some work has also been done on distance and density based outlier detection in data streams [11], [12], [13]. These works involve the computation of nearest neighbors and depend on the choice of several parameters.

Entropy is a powerful mechanism for measuring the information content or uncertainty of a variable [14]. It is also referred to as a measure of the randomness of a system. The concept of entropy is intuitive for outlier detection because the presence of outliers increases the entropy (randomness) of a dataset [15], [16], [17], [18], and this increase can be used to measure the outlierness of an object. Entropy has also been used in a number of works for clustering data [19], [20], but in the present work our focus is on outlier detection as opposed to finding quality clusters. In this paper, we propose an outlier detection technique for streaming data which uses the concept of entropy to avoid pairwise distance computations and nearest neighbour parameter setting.
It makes use of the fact that the presence of an outlier is likely to increase the entropy of clusters comprising normal data. To exploit this fact, it maintains clusters of the normal data objects and, for each incoming data object, finds the change in the entropy of the clusters [21]. If the increase in entropy is very large, the data object is marked as a candidate outlier and its anomalous behaviour is confirmed over multiple sliding windows to minimize false alarms [11], [22]. The proposed method is incremental and dynamically updates the clustering structure and entropy statistics to deal with the heavy volume and concept evolution in data streams. We have implemented and validated the proposed technique on both synthetic and real world datasets [23].

The rest of this paper is organized as follows: Section 2 discusses related work. The proposed outlier detection technique is presented in Section 3. Section 4 focuses on the experimental results and the conclusion is given in Section 5.

2. Related Work

Outlier detection has applications in a large number of domains, which is why it has been a topic of importance for the research community. A good survey can be found in [1], [2]. The majority of this literature focuses on static datasets. Distance-based outlier detection approaches are presented in [5], [24]. Their definitions of an outlier are simple and intuitive but require the user to specify the parameters k and d, which can be difficult to determine. This idea is further extended in [7], [10], where the outlier factor of each data object is calculated as the sum of distances from its k nearest neighbors. Density based outlier methods are proposed by Breunig et al. in [6], which captures the local outlierness of an object, and by Tao et al. in [8], which unifies density-based clustering and outlier detection. Liu et al. [25] proposed the first isolation method for outlier detection, called Isolation Forest (iForest), which detects outliers purely on the basis of isolation without using any distance or density measure.

The concepts of distance, density and clustering based outlier detection have also been extended to data streams. Two distance based approaches for finding outliers in data streams are given in [11], [12]. Yogita et al. have proposed an unsupervised approach for identifying outliers in evolving data streams by weighting attributes in clustering [26]. Most of these methods apply the sliding window model to deal with data streams.

Entropy has been used extensively for solving clustering and outlier detection problems in static data [17], [18], [20] and the clustering problem in streaming data [21], [19]. Its use for finding outliers in data streams is intuitive but needs further exploration. Liu et al. proposed an entropy-based method to approximate the number of outliers in a spatial data set [15] by using a function of local contrast and local contrast probability of non-spatial and spatial attributes.
An entropy-based approach to discovering covert timing channels is proposed by Gianvecchio and Wang in [27], based on the fact that the creation of a covert timing channel affects the entropy of the original process. Using an information entropy model to measure uncertainty in the rough sets framework, Jiang et al. presented a new definition of outliers known as IE (Information Entropy)-based outliers [16].

3. The Proposed Method

This section presents the proposed adaptive outlier detection scheme. First, preliminary concepts and notations are introduced; then the deviation criterion and the proposed method are discussed. At the end of this section, the time and space complexity of the proposed method are analysed.

3.1 Preliminary Concepts and Notations

1) Data Stream: A data stream is an infinite sequence of data objects x_1, x_2, ..., x_n, ... arriving at time stamps T_1, T_2, ..., T_n, .... Each data object is a multidimensional point with m attributes. Data streams are of tremendous volume and flow at very high speed. In this work, the sliding window model is used to process streaming data. It stores only a portion of the data in memory at a time. After processing the current window, only sufficient summaries of the data are maintained in memory and the detailed data is discarded.

2) Cluster Summary Structure: The cluster summary structure CS of a cluster C at time t is defined as follows:

$$CS = (FT, \delta t) \tag{1}$$

where FT is the frequency table, which stores the frequency of each attribute value pair of every attribute in the cluster C. A similar data structure is used in [17] for maintaining the attribute value frequencies of a complete dataset. It can be updated incrementally when a new data object is assigned to the cluster by incrementing the frequency of the corresponding attribute value pair of each attribute. δt stores the timestamp of the data object most recently added to the cluster.

3) Entropy: Entropy is a measure of the information, uncertainty or randomness of a variable [14].
Let x be a random variable, S(x) the set of values that x can take and P(x) the probability function of x. Then the entropy E(x) is defined by equation (2).

$$E(x) = -\sum_{x \in S(x)} P(x)\,\log_2 P(x) \tag{2}$$

The entropy of a multivariate vector X = (x_1, ..., x_i, ..., x_m) with m attributes, where each x_i is a discrete random variable, is given by equation (3).

$$E(X) = -\sum_{x_1 \in S(x_1)} \cdots \sum_{x_i \in S(x_i)} \cdots \sum_{x_m \in S(x_m)} P(X)\,\log_2 P(X) \tag{3}$$

where P(X) = P(x_1, ..., x_i, ..., x_m) is the multivariate probability distribution function.

3.2 Deviation Criterion for Outlier Detection

Given a data stream and clusters of normal data, outlier detection aims to identify data objects that deviate heavily from the clusters according to a deviation criterion. The deviation criterion is a key element of outlier detection because it measures the outlierness of a data object. In this paper, we propose an entropy based deviation criterion PCE(X), defined in equation (4). It gives the percentage change in the entropy of a clustering Cl when a data object X is assigned to its nearest cluster (the nearest cluster is the one whose entropy increases least among all clusters). Since outliers sharply increase the entropy of the clustering, the value of PCE is large and positive for them. In contrast, the value of PCE for normal data objects is either negative or small and positive.

$$PCE(X) = \left(\frac{E(Cl + X) - E(Cl)}{E(Cl)}\right) \times 100 \tag{4}$$

where E(Cl) is the entropy of the clustering Cl and E(Cl + X) is the new entropy of the clustering Cl after assigning data object X to its nearest cluster C. E(Cl) is defined by equation (5) and represents the weighted sum of the entropies of all the clusters [21].

$$E(Cl) = \sum_{k} \frac{|C_k|}{|D|}\,E(C_k) \tag{5}$$

Here |C_k| is the number of data objects in cluster C_k and |D| is the total number of data objects, as in [21]. To simplify the computation of the entropy of a cluster in a streaming environment, we assume that the attributes of the data stream are independent. This assumption transforms equation (3) into equation (6), with which the entropy of a cluster C_k can be calculated, where E(x_i) is the entropy of attribute x_i over the data objects belonging to cluster C_k and r is the number of attributes.

$$E(C_k) = E(x_1) + E(x_2) + \cdots + E(x_i) + \cdots + E(x_r) \tag{6}$$

PCE is based on the fact that when a data object is assigned to a cluster, the entropy of the cluster, and correspondingly the total entropy of the clustering, may increase or decrease. Intuitively, if the data object is inherently similar to the cluster, the entropy (randomness) of the cluster will either decrease or increase slightly, while assigning a dissimilar (outlying) object will increase it sharply. So the percentage change in entropy of the clustering (PCE) is a justified criterion for differentiating between outliers and normal data objects.

3.3 Proposed Method in Detail

A pictorial representation of the proposed technique is given in Fig. 1. It comprises four main modules, which are detailed below. Data streams are of tremendous volume and flow at very high speed, which is why they cannot be stored in memory for processing. We use the sliding window model to process streaming data. It stores only a portion of the data in memory at a time. After processing the current window, only sufficient summaries of the data are maintained in memory and the detailed data is discarded to free space for the next window.
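The deviation criterion of Section 3.2 can be illustrated with a small sketch. The following Python code is not the authors' implementation (the paper reports a MATLAB implementation); it is a minimal, non-incremental illustration of equations (4) to (6) on categorical tuples, under the attribute independence assumption. The function names `attribute_entropy`, `cluster_entropy`, `clustering_entropy` and `pce`, and the toy data, are hypothetical.

```python
import math
from collections import Counter


def attribute_entropy(counts, size):
    """Base-2 Shannon entropy of one attribute, from its value counts."""
    return -sum((c / size) * math.log2(c / size) for c in counts.values() if c > 0)


def cluster_entropy(cluster):
    """Equation (6): sum of per-attribute entropies (attributes assumed independent).

    `cluster` is a list of equal-length tuples of categorical values.
    """
    if not cluster:
        return 0.0
    n_attrs = len(cluster[0])
    return sum(
        attribute_entropy(Counter(obj[i] for obj in cluster), len(cluster))
        for i in range(n_attrs)
    )


def clustering_entropy(clusters):
    """Equation (5): cluster entropies weighted by relative cluster size."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


def pce(clusters, x):
    """Equation (4): percentage change in clustering entropy when x joins
    its nearest cluster (the one whose own entropy increases least)."""
    base = clustering_entropy(clusters)
    best = min(
        range(len(clusters)),
        key=lambda i: cluster_entropy(clusters[i] + [x]) - cluster_entropy(clusters[i]),
    )
    # Build a trial clustering without mutating the input.
    trial = [c + [x] if i == best else c for i, c in enumerate(clusters)]
    new = clustering_entropy(trial)
    if base == 0:
        # Assumption of this sketch: treat any increase from zero entropy as unbounded.
        return math.inf if new > 0 else 0.0
    return 100.0 * (new - base) / base


# Toy data (hypothetical): two small clusters with two categorical attributes.
clusters = [
    [("a", "x"), ("a", "x"), ("a", "y")],
    [("b", "z"), ("b", "z"), ("b", "w")],
]
print(pce(clusters, ("a", "x")))  # object similar to the first cluster
print(pce(clusters, ("c", "q")))  # object with unseen values in both attributes
```

On this toy data the first call yields a negative PCE and the second a large positive one, which matches the behaviour described for normal and outlying objects. A streaming implementation would keep frequency tables per cluster instead of recomputing counts from stored objects.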
1) Initialization: In the proposed outlier detection technique, the normal behaviour of the data is represented by clusters. This module initializes the normal behaviour (clusters) by clustering sampled normal data, which should not contain outlying objects. These clusters are passed to the candidate outlier detection module only once, when processing of the streaming data starts. Any clustering algorithm can be chosen for this initial clustering.

2) Candidate Outlier Detection: A candidate outlier is an object which satisfies the deviation criterion. We use the entropy based deviation criterion PCE defined in equation (4). The algorithm for candidate outlier detection is shown in Algorithm 1. It assigns incoming data objects to clusters following the approach given in [21].

Algorithm 1 Candidate Outlier Detection
Input: Clustering (Cl) comprising k clusters, current window data, Threshold
Output: Candidate outliers
1: Repeat for all data objects of the current window
2:   X = read next data object in window
3:   Temporarily assign X to the cluster C_i for which the increase in entropy of C_i on assigning X is minimum over all clusters C_j, j = 1, ..., k
4:   Calculate PCE(X)
5:   if PCE(X) > Threshold then
6:     X is a candidate outlier; initialize the counter of X to one and save both X and its counter in the candidate outlier repository so that its deviation can be verified further in the outlier detection module
7:   else
8:     X is a normal data object; output it to the updating module to update cluster C_i
9:   end if
10: End Repeat

In the candidate outlier detection algorithm, every object is classified either as a candidate outlier or as a normal data object. To keep a count of the number of windows in which an object is found to be a candidate outlier, a counter is associated with it and is initially set to one. Candidate outliers are stored, along with their counters, in the candidate repository. Later, the outlier detection module takes candidate outliers from the repository to verify their outlying nature (deviation) over multiple data windows.
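Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the class `ClusterSummary`, the functions `clustering_entropy` and `detect_candidates`, and the representation of a window as (timestamp, object) pairs are assumptions made for this example. Each cluster is represented only by per-attribute frequency tables and a timestamp, in the spirit of the cluster summary structure of equation (1).

```python
import math
from collections import Counter


class ClusterSummary:
    """Hypothetical summary CS = (FT, delta_t): frequency tables plus a timestamp."""

    def __init__(self, n_attrs):
        self.ft = [Counter() for _ in range(n_attrs)]  # one frequency table per attribute
        self.size = 0
        self.last_ts = None  # timestamp of the most recent assignment

    def add(self, x, ts):
        for table, value in zip(self.ft, x):
            table[value] += 1
        self.size += 1
        self.last_ts = ts

    def entropy(self, extra=None):
        """Equation (6); `extra` is an object added tentatively, without storing it."""
        n = self.size + (1 if extra is not None else 0)
        if n == 0:
            return 0.0
        total = 0.0
        for i, table in enumerate(self.ft):
            counts = Counter(table)  # copy, so the stored table is untouched
            if extra is not None:
                counts[extra[i]] += 1
            total -= sum(c / n * math.log2(c / n) for c in counts.values())
        return total


def clustering_entropy(clusters, target=None, x=None):
    """Equation (5), optionally with x tentatively placed in cluster index `target`."""
    total = sum(c.size for c in clusters) + (1 if target is not None else 0)
    if total == 0:
        return 0.0
    result = 0.0
    for i, c in enumerate(clusters):
        if i == target:
            result += (c.size + 1) / total * c.entropy(x)
        else:
            result += c.size / total * c.entropy()
    return result


def detect_candidates(clusters, window, threshold):
    """Sketch of Algorithm 1. Returns (candidates, normals)."""
    candidates, normals = [], []
    for ts, x in window:
        # Step 3: nearest cluster = least increase in that cluster's entropy.
        best = min(range(len(clusters)),
                   key=lambda i: clusters[i].entropy(x) - clusters[i].entropy())
        # Step 4: percentage change in the entropy of the whole clustering.
        base = clustering_entropy(clusters)
        new = clustering_entropy(clusters, best, x)
        if base > 0:
            pce = 100.0 * (new - base) / base
        else:
            pce = math.inf if new > 0 else 0.0  # assumption for a zero-entropy clustering
        if pce > threshold:
            candidates.append({"object": x, "counter": 1})  # step 6
        else:
            clusters[best].add(x, ts)  # step 8: hand over to the updating step
            normals.append(x)
    return candidates, normals
```

In this sketch the update of the chosen cluster is performed inline for brevity; in the paper it belongs to a separate updating module. The threshold is a free parameter that the paper sets experimentally on a subset of the data.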
This repeated verification is done because a candidate outlier may be part of an emerging cluster and may be showing deviation only temporarily instead of actually being an outlier [11], [22]. To consider the local as well as the global characteristics of the data in outlier detection, we use both the increase in individual cluster entropy (step 3 of Algorithm 1) and PCE, the percentage change in entropy of the clustering (step 5 of Algorithm 1), as criteria to decide on the outlying nature (deviation) of a data object.

3) Updating: Concept evolution occurs when new classes come into existence and old ones may disappear from the streaming data over time. It arises from changes in the underlying process that produces the stream. In the proposed method, smooth evolution of concepts is addressed in the following way:

The proposed method is made adaptive by dynamically updating the data clustering and entropy statistics with the incoming data stream. This is explained in this section itself.

The outlying nature of an object is verified over multiple data windows before declaring it an outlier, because it may be part of an emerging cluster and may be showing deviation only temporarily instead of actually being an outlier. This is part of the outlier detection module.

Figure 1: Proposed Adaptive Outlier Detection Technique (blocks: Data Stream, Sliding Window, Initialization, Candidate Outlier Detection Module, Updating Module with Update Existing Clusters, Incorporate Emerging Cluster and Prune Obsolete Clusters, Outlier Detection Module, Candidate Outliers Repository, Outlier Repository; the current window data is discarded after processing).

There are the following three types of possible updates to the clustering structure and entropy statistics:

a) Update Existing Clusters: Whenever a normal data object from the candidate outlier detection module, or an inlier from the outlier detection module, is passed to the update module, the update procedure takes the following steps:
Assign object X to the cluster C_i for which the increase in entropy of C_i on assigning X is minimum over all clusters C_j, j = 1, ..., k.
Update the cluster summary structure of cluster C_i.
Calculate the new entropy of the clustering using equation (5).
Discard the data object X.

b) Incorporate Emerging Clusters: Candidate outliers from the candidate repository are clustered periodically. If any resulting cluster is large enough and its entropy is smaller than a threshold, the cluster represents a new class in the incoming data and must be incorporated into the clustering, since the clustering based outlier detection approach assumes that outliers are few in number and occur in sparse regions. The following actions are taken:
Set k = k + 1, where k is the number of clusters.
Initialize the cluster summary structure of cluster C_{k+1}.
Calculate the new entropy of the clustering using equation (5).

c) Prune Obsolete Clusters: If no data object has been assigned to a cluster C_i for a long time, this signifies that cluster C_i no longer exists in the data stream. A time factor δt is associated with each cluster and stores the timestamp of the data object most recently added to the cluster. If the difference between the current timestamp and δt is greater than a threshold, cluster C_i is deleted. The following actions are taken:
Delete the cluster summary structure of cluster C_i.
Set k = k - 1, where k is the number of clusters.
Calculate the new entropy of the clustering using equation (5).

The cluster summary structure of a cluster comprises FT and δt. FT is the frequency table, which stores the frequency of each attribute value pair of every attribute in the cluster C. It can be updated incrementally when a new data object is assigned to the cluster by incrementing the frequency of the corresponding attribute value pair of each attribute. δt stores the timestamp of the data object most recently added to the cluster.

4) Outlier Detection: A candidate outlier becomes a real outlier if it fulfils the deviation criterion continuously over w sliding windows [11], [22]. We use the entropy based deviation criterion PCE defined in equation (4). The outlying nature of a candidate outlier is confirmed over multiple data windows before flagging it as a real outlier, because it may be part of an emerging cluster and may be showing deviation only temporarily instead of being a real outlier.
To keep a count of the number of windows over which an object is found to be a candidate outlier, a counter is associated with each candidate; it is set to one when the candidate outlier is first detected in the candidate outlier detection module. The algorithm for the outlier detection module is given in Algorithm 2. If a candidate outlier is found to be an inlier in the outlier detection module, it is output to the updating module. Because an inlier represents normal behaviour, it must be incorporated into the clustering of the data stream.

Algorithm 2 Outlier Detection
Input: Clustering (Cl) comprising k clusters, candidate outliers, Threshold
Output: Real outliers
1: Repeat for all candidate outliers
2:   Read next candidate outlier from repository
3:   Calculate PCE(candidate outlier)
4:   if PCE(candidate outlier) > Threshold and Counter(candidate outlier) == W then
5:     Declare the candidate outlier a real outlier, remove it from the candidate outlier repository and save it to the outlier repository
6:   else
7:     if PCE(candidate outlier) > Threshold and Counter(candidate outlier) < W then
8:       Update Counter(candidate outlier) = Counter(candidate outlier) + 1
9:     else
10:      The candidate outlier is an inlier (normal data object); remove it from the candidate repository and output it to the updating module
11:    end if
12:  end if
13: End Repeat

3.4 Time Complexity

Let k be the maximum number of clusters that can occur at a time in the data stream, d the number of dimensions of the data, n the window size, c the maximum number of candidate outliers that may occur at a time and D the data stream size. There will then be a total of D/n data windows to process. Each window is processed by three modules. The initialization module runs only once, on a small data sample, so its running time can be treated as a constant I in the analysis. The time taken by the candidate outlier detection module is k*n*(1 + d). Updating takes k*d*(n - c) + d*p*(k + 1 + c*k_c*i), and outlier detection takes c*k*(1 + d). Here i is the number of iterations in the clustering of candidate outliers, k_c is the number of clusters in the candidate outlier set and p is the number of windows after which candidate outliers are clustered. I, d, p and c are constant, and i and k_c are very small because c is constant. So the total running time of the proposed scheme can be stated as follows:

$$\text{Running Time} = (D/n) \cdot k \cdot (n + 1) \tag{7}$$

As k is the number of clusters and does not grow with the data stream size, the time complexity of the proposed outlier detection technique is of order O(D), that is, linear in the data stream size.
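The verification logic of Algorithm 2 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name `verify_candidates`, the dictionary layout of a candidate and the callable `pce_fn` (assumed to return the PCE of an object against the current clustering) are all hypothetical.

```python
def verify_candidates(candidates, pce_fn, threshold, w):
    """Sketch of Algorithm 2.

    candidates: list of dicts {"object": ..., "counter": int}
    pce_fn:     assumed callable giving PCE(object) for the current clustering
    threshold:  PCE threshold of the deviation criterion
    w:          number of windows over which deviation must persist
    Returns (outliers, inliers, remaining).
    """
    outliers, inliers, remaining = [], [], []
    for cand in candidates:
        score = pce_fn(cand["object"])
        if score > threshold and cand["counter"] == w:
            # Deviated in w consecutive windows: declare a real outlier.
            outliers.append(cand["object"])
        elif score > threshold and cand["counter"] < w:
            # Still deviating: keep it in the repository and count this window.
            cand["counter"] += 1
            remaining.append(cand)
        else:
            # No longer deviating: treat as an inlier for the updating module.
            inliers.append(cand["object"])
    return outliers, inliers, remaining


# Toy usage with made-up PCE scores.
scores = {"a": 90.0, "b": 80.0, "c": 5.0}
candidates = [
    {"object": "a", "counter": 3},
    {"object": "b", "counter": 1},
    {"object": "c", "counter": 2},
]
print(verify_candidates(candidates, scores.get, threshold=20.0, w=3))
```

The function would be called once per window on the current candidate repository, and the returned `remaining` list becomes the repository for the next window. Inliers would be passed on to the cluster update step.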
3.5 Space Complexity

The memory used to store a variable depends on the operating system specification; for the present work, consider the Windows Vista operating system. Let k be the number of clusters that can occur at a time in the data stream, d the number of dimensions of the data, m the maximum number of domain values of an attribute, n the window size and c the maximum number of candidate outliers that may occur at a time. The proposed method needs to store only the following information: the summary structures of the clusters, the candidate outliers and the current window data. The space required to store the summary structures of k clusters is 4*k*d*m bytes, the current window data takes 4*d*n bytes and the candidate outliers need 4*c*(d + 1) bytes. Hence the total memory space used is 4*k*d*m + 4*d*n + 4*c*(d + 1) bytes. Since d and m are constant, the total space used is of order O(k + n + c). The window size n, the number of clusters k and the number of candidate outliers c are very small compared with the size of the data stream, so the space complexity of the proposed method is of order O(C), where C is a small constant. The proposed method is therefore efficient in terms of space, as is required for data streams.

4. Experimental Results

We implemented the method in MATLAB R2010a and conducted experiments on synthetic as well as real data sets. The real data sets are taken from the UCI machine learning repository [23]. Threshold values are set by conducting experiments on a subset of the data.

4.1 Data Sets

We worked with two real datasets and two synthetic datasets. For experimental purposes, a time stamp is added to each record in all datasets to specify the order in which the streaming data is processed. A 10% sample of each dataset is used in the initialization phase and the rest is processed over sliding data windows.

1) KDDCUP 99 Data Set: This data set was first used in the ACM KDD CUP Challenge of 1999. Since then it has been widely referenced for the verification of outlier detection techniques.
It is a computer network intrusion detection dataset. Each record represents a network connection which was simulated in a military network environment and labelled as either normal or an intrusion. It contains 22 simulated attacks of the following categories: DOS, R2L, U2R and PROBE. We removed the class labels for the experiments. The dataset has 41 attributes in total, of which 7 are categorical and 34 are numeric, and approximately 4,898,431 connection records. In its original form, the dataset is not suitable for outlier detection because the proportion of attack records is unrealistically high compared with normal records. So we sampled a 10% subset of the original data. In the sampled dataset there are 4895 attack records, which is approximately 1% of the sampled data set. Numeric attributes are discretized into categorical attributes using equal width binning.

2) Mushroom Data Set: It contains 8,124 data records of 23 species of mushrooms over 22 categorical attributes.

There are two classes of mushrooms: poisonous (48.2%) and edible (51.8%). We planted 2% outliers in the data set based on the frequency of each domain value of the attributes.

3) Synthetic Data Sets: Synthetic datasets are very useful for performance analysis because the data parameters are easy to control. We generated two synthetic datasets using the GAClust [28] data generator, which is freely available online. The first dataset is named DS1. It consists of records with 5 categorical attributes and 5 clusters. One cluster contains only 500 records and is considered the outlier class in our experiments. The second synthetic data set is named DS2. It comprises data objects with 5 attributes, 5 clusters and 1% randomly generated outliers.

4.2 Metrics for Performance Evaluation

To assess the performance of the proposed method, we examined running time along with two other metrics: outlier detection rate and false alarm rate. The detection rate is the ratio of the number of actual outliers detected to the total number of outliers in the data (equation (8)). The false alarm rate is the ratio of the number of normal data objects that are mistakenly flagged as outliers to the total number of normal data objects (equation (9)).

$$\text{Detection Rate} = \frac{\text{Number of Outliers Detected}}{\text{Total Outliers}} \tag{8}$$

$$\text{False Alarm Rate} = \frac{\text{Number of False Alarms}}{\text{Total Normal Data Objects}} \tag{9}$$

4.3 Performance Evaluation

Table 1: Performance of proposed method in terms of outlier detection rate and false alarm rate (columns: Dataset, Total Outliers, Total Normal Objects, No. of Outliers Detected, No. of False Alarms, Detection Rate, False Alarm Rate; rows: KDDCUP, Mushroom, DS1, DS2).

1) Detection Rate and False Alarm Rate

Table 1 reports the outlier detection rate of the proposed technique on the different datasets, which shows its effectiveness. This results from incorporating both local and global characteristics of the data in outlier detection, by using both individual cluster entropy and PCE (percentage change in entropy of the clustering) to decide on the outlying nature (deviation) of a data object. The false alarm rate of an outlier detection method must be as low as possible, because dealing with false alarms requires extra effort and is expensive for both user and system. The false alarm rate of the proposed method on all four datasets is given in Table 1 and its value is small, as it should be. The proposed method checks the exceptional behaviour of an object over multiple windows before declaring it an outlier, which leads to fewer false alarms.

2) Effect of Dataset Size on Running Time

Data stream processing methods must be efficient in terms of running time to meet the challenge of the high speed and tremendous volume of streaming data. To validate the efficiency of the proposed technique in terms of running time, experiments were carried out with increasing sizes of dataset DS2 and of the KDD CUP dataset (which is very large); the results are shown in Fig. 2.

Figure 2: Effect of Dataset Size on Running Time (x-axis: Data Set Size; y-axis: Running Time (Seconds); series: DS2, KDDCUP).

It can be concluded from Fig. 2 that the running time increases linearly with dataset size. This shows the scalability of the proposed method. It is achieved by processing the streaming data incrementally in the sliding window model, which requires only a single scan of the data. The difference between the running times of the two datasets results from their different numbers of dimensions.

3) Increasing Percentage of Outliers versus Detection Rate

In this experiment, a subset of size 5000 of data set DS1 is used. Outliers are placed in the data in increasing percentages. These outliers are of two types: group outliers and point outliers. The percentages of outliers are relative to the initial data set size.
It can be seen from Fig. 3 that the outlier detection rate of the proposed method is stable with increasing percentages of outliers up to a certain level, because it treats small clusters as outlier groups rather than normal clusters, following the criterion of [9]. There is a rapid fall in detection rate after 20% outliers. This fall is expected, as these objects no longer exhibit anomalous behaviour under the clustering based outlier detection approach, which assumes that outliers are only a small fraction of the whole data. For this analysis, however, we have still considered them anomalous.

Figure 3: Effect of Increasing Percentage of Outliers on Detection Rate (x-axis: % of Outliers in Dataset; y-axis: Detection Rate).

5. Conclusion

In this work, we have proposed an outlier detection method for data streams using the concept of entropy. It is made incremental and adaptive to handle the dynamic nature of data streams. The proposed technique has been validated on both synthetic and real world datasets. Experimental results show its effectiveness on the outlier detection rate, false alarm rate and running time performance metrics. It also uses memory space efficiently, as its space complexity is of order O(C). It can be applied in a large number of fields such as banking databases, medical databases, network intrusion detection and weather prediction. In future, we will compare the proposed technique with other existing techniques and analyze the effect of the threshold value of the deviation criterion on the detection rate. An extension of this work to mixed datasets is in progress.

References

[1] V. Hodge and J. Austin, A survey of outlier detection methodologies, Artif. Intell. Rev., vol. 22, no. 2, pp , Oct
[2] V. Chandola, A. Banerjee, and V. Kumar, Anomaly detection: A survey, ACM Comput. Surv., vol. 41, no. 3, pp. 15:1-15:58, July
[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques, J. Kacprzyk and L. C. Jain, Eds. Morgan Kaufmann, 2006, vol. 54, no. Second Edition.
[4] C. Aggarwal, Ed., Data Streams Models and Algorithms. Springer,
[5] S. Ramaswamy, R. Rastogi, and K. Shim, Efficient algorithms for mining outliers from large data sets, in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, ser. SIGMOD '00. New York, NY, USA: ACM, 2000, pp
[6] M. M.
Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, LOF: identifying density-based local outliers, in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, ser. SIGMOD '00. New York, NY, USA: ACM, 2000, pp
[7] F. Angiulli and C. Pizzuti, Fast outlier detection in high dimensional spaces, in Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, ser. PKDD '02. London, UK: Springer-Verlag, 2002, pp
[8] Y. Tao and D. Pi, Unifying density-based clustering and outlier detection, in Proceedings of the 2009 Second International Workshop on Knowledge Discovery and Data Mining, ser. WKDD '09. Washington, DC, USA: IEEE Computer Society, 2009, pp
[9] Z. He, X. Xu, and S. Deng, Discovering cluster based local outliers, Pattern Recognition Letters, vol. 2003, pp. 9-10,
[10] F. Angiulli, S. Basta, and C. Pizzuti, Distance-based detection and prediction of outliers, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 2, pp ,
[11] F. Angiulli and F. Fassetti, Detecting distance-based outliers in streams of data, in Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, ser. CIKM '07, New York, NY, USA, 2007, pp
[12] M. S. Sadik and L. Gruenwald, DBOD-DS: Distance Based Outlier Detection for Data Streams. Springer, 2011, vol. 6261, pp. 122-136.
[13] D. Pokrajac, A. Lazarevic, and L. J. Latecki, Incremental local outlier detection for data streams, in CIDM, 2007, pp
[14] C. Shannon, A mathematical theory of communication, Bell System Technical Journal, vol. 27, pp , , July, October
[15] X. Liu, C.-T. Lu, and F. Chen, An entropy-based method for assessing the number of spatial outliers, in IRI, 2008, pp
[16] F. Jiang, Y. Sui, and C. Cao, An information entropy-based approach to outlier detection in rough sets, Expert Syst. Appl., vol. 37, no. 9, pp , Sept
[17] A. Koufakou, E. G. Ortiz, M. Georgiopoulos, G. C. Anagnostopoulos, and K. M. Reynolds, A scalable and efficient outlier detection strategy for categorical data, in ICTAI (2). IEEE Computer Society, 2007, pp
[18] Z. He, S. Deng, and X. Xu, An optimization model for outlier detection in categorical data, in Advances in Intelligent Computing, ser. Lecture Notes in Computer Science, 2005, vol. 3644, pp
[19] S. Wang, Y. Fan, C. Zhang, H. Xu, X. Hao, and Y. Hu, Entropy based clustering of data streams with mixed numeric and categorical values, in ACIS-ICIS. IEEE Computer Society, 2008, pp
[20] P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik, LIMBO: Scalable Clustering of Categorical Data, in Adv. Database Technol. - EDBT 2004, 2004, pp
[21] D. Barbará, Y. Li, and J. Couto, Coolcat: An entropy-based algorithm for categorical clustering, in Proceedings of the Eleventh International Conference on Information and Knowledge Management, ser. CIKM '02, New York, NY, USA, 2002, pp
[22] M. Elahi, K. Li, W. Nisar, X. Lv, and H. Wang, Efficient clustering-based outlier detection algorithm for dynamic data stream, in Proceedings of the 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery - Volume 05, ser. FSKD '08. Washington, DC, USA: IEEE Computer Society, 2008, pp
[23] A. Frank and A. Asuncion, UCI machine learning repository, [Online]. Available:
[24] E. M. Knorr and R. T. Ng, Algorithms for mining distance-based outliers in large datasets, in Proceedings of the 24th International Conference on Very Large Data Bases, ser. VLDB '98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp
[25] F. T. Liu, K. M. Ting, and Z.-H. Zhou, Isolation-based anomaly detection, ACM Trans. Knowl. Discov. Data, vol. 6, no. 1, pp. 3:1-3:39, Mar
[26] Yogita and D. Toshniwal, A framework for outlier detection in evolving data streams by weighting attributes in clustering, in Proceedings of the 2nd International Conference on Communication Computing and Security, India,
[27] S. Gianvecchio and H. Wang, An entropy-based approach to detecting covert timing channels, IEEE Transactions on Dependable and Secure Computing, vol. 8, no. 6, pp ,
[28] D. Cristofor and D. Simovici, Finding median partitions using information-theoretical-based genetic algorithms, Journal of Universal Computer Science, vol. 8, pp
