Detection of Anomalies using Online Oversampling PCA

Miss Supriya A. Bagane, Prof. Sonali Patil

Abstract— Anomaly detection is the process of identifying unexpected behaviour, and it is an important research topic in data mining and machine learning. Many real-world applications require an effective and efficient framework to identify such data instances, so suitable outlier detection methods need to be proposed. Previously proposed anomaly detection methods are implemented in batch mode, so they cannot be extended to large-scale or online problems. The earlier Principal Component Analysis (PCA) formulation requires storing the entire data matrix or covariance matrix, which increases the computational cost and memory requirements. To handle this issue, the method proposed here, oversampling Principal Component Analysis (osPCA), is useful. PCA performs dimension reduction and yields the principal directions of the data, on which anomaly detection is based. The oversampling step duplicates the target instance multiple times to amplify the effect of outliers. The method can also be applied to normal data with a multi-clustering structure. In osPCA, an online updating technique is used to handle online, large-scale, or streaming data problems.

Index Terms— Anomaly detection, Principal Component Analysis, outlier, oversampling

Manuscript received July 2015. Miss Supriya A. Bagane, Computer Engineering, Bhivarabai Sawant Institute of Technology and Research, Pune, India, 9049720859. Prof. Sonali Patil, Computer Engineering, Bhivarabai Sawant Institute of Technology and Research, Pune, India, 8308885446.

I. INTRODUCTION

Anomalies are patterns in data that do not conform to expected behaviour, and anomaly detection refers to the problem of finding such patterns. These patterns are referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants in different application domains [5], and the terms anomaly and outlier are used interchangeably here. Anomaly or outlier detection is basically used to find the group of instances which deviate from the original data [1], and it is the first step in a number of data mining applications. Many methods are available for outlier detection, and there are many applications where anomaly detection is important, such as homeland security, credit card fraud detection, intrusion and insider threat detection in cyber security, fault detection, and malignant diagnosis.

Anomalous data are not available in large amounts, but their presence may affect the solution model, such as the distribution or the principal directions of the data [1]. Consider, for example, the calculation of the data mean or the least-squares solution of an associated linear regression model: both are sensitive to outliers and may be distorted by them. A leave-one-out (LOO) strategy can be used to calculate the principal directions of the data set without the target instance and of the original data set; this helps determine the variation of the resulting principal directions. By comparing these two eigenvectors, the anomaly of the target instance is calculated.
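As a rough illustration of this LOO idea (not the authors' implementation), the following NumPy sketch computes the dominant principal direction with and without each instance and scores the deviation between the two eigenvectors; the helper names and the toy data are assumptions made for this example.

```python
import numpy as np

def dominant_direction(X):
    """Dominant principal direction: the eigenvector of the sample
    covariance matrix with the largest eigenvalue."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

def loo_scores(X):
    """Leave-one-out deviation: 1 - |cos angle| between the principal
    direction of all data and of the data with one instance removed."""
    u_all = dominant_direction(X)
    scores = np.empty(len(X))
    for t in range(len(X)):
        u_t = dominant_direction(np.delete(X, t, axis=0))
        scores[t] = 1.0 - abs(np.dot(u_all, u_t))  # both eigenvectors are unit length
    return scores

# Toy usage: a Gaussian cluster plus one obvious outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), [[8.0, -8.0]]])
print(loo_scores(X).argmax())  # the appended outlier (index 100) should score highest
```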
To address this problem in the existing system, we use an oversampling strategy to duplicate the target instance and perform oversampling PCA (osPCA). The effect of an outlier instance is amplified by its duplicates in the PCA formulation, which makes it easier to detect outlier data [3]. The method proposed here works on the dimensionality reduction provided by PCA and the calculation of a threshold value; using this threshold as a parameter, outlier detection is performed. It is therefore not required to store the entire data or covariance matrix, so the computation and memory requirements are reduced.

II. EXISTING SYSTEM

The available approaches are divided into three categories: distribution (statistical), distance-based, and density-based methods [1][2][3][4][5]. One more approach, angle-based outlier detection, is also popular [1].

The statistical approach assumes that the data follow some standard or predetermined distribution and aims to find outliers that deviate from that distribution. Most distribution models assumed here are univariate, so they lack robustness for multidimensional data.

A distance-based method calculates the distances between each data point of interest and its neighbours and uses a predetermined threshold: if the result is above this threshold, the target instance is considered an anomaly. This method needs no prior knowledge of the data distribution, but such approaches encounter problems when the data distribution is complex (e.g., a multi-clustered structure). In such cases the approach determines improper neighbours, and outliers cannot be correctly identified.
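A minimal sketch of the distance-based category described above, assuming scikit-learn is available; the k value, the percentile cut-off, and the function name are illustrative choices for this sketch, not parameters from the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_distance_outliers(X, k=5, threshold=None):
    """Flag points whose mean distance to their k nearest neighbours
    exceeds a threshold (a simple percentile rule is assumed here)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbour
    dists, _ = nn.kneighbors(X)
    score = dists[:, 1:].mean(axis=1)                # drop the zero self-distance
    if threshold is None:
        threshold = np.percentile(score, 95)         # assumed cut-off, not from the paper
    return score > threshold, score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)), [[10.0, 10.0, 10.0]]])
flags, _ = knn_distance_outliers(X)
print(np.where(flags)[0])  # the far-away point (index 200) should be among the flagged
```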

The problems faced by distance-based methods are addressed by density-based methods. These use the density-based Local Outlier Factor (LOF) to measure the outlierness of each data instance. LOF determines the degree of outlierness from the local density around each data instance, which provides suspicious ranking scores for all samples. Because LOF estimates the local data structure via density estimation, it allows users to identify outliers hidden within the global data structure. However, estimating the local data density for each instance is computationally very expensive when the data set is large.

Along with these three approaches, one more approach is quite popular and unique: Angle-Based Outlier Detection (ABOD). Simply speaking, ABOD calculates the variation of the angles between each target instance and the remaining data points, since it is observed that an outlier produces a smaller angle variance than normal points do. The main concern regarding ABOD is its computational complexity, because a huge number of instance pairs must be considered. A fast ABOD algorithm has therefore been proposed to approximate the original ABOD solution. However, the search for nearest neighbours still prohibits its extension to large-scale problems (batch or online modes), since the user needs to keep all data instances to calculate the required angle information [1][10].

There are some problems with the methods described here. They are typically implemented in batch mode, so they cannot be extended to large-scale or online data. The distribution models are assumed univariate, so robustness is lacking when high-dimensional data are concerned. Moreover, these models are typically applied in the original data space directly, so their solutions may suffer from the noise present in the data.
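For reference, the LOF scoring discussed above is available off the shelf; the snippet below uses scikit-learn's LocalOutlierFactor purely as an illustration of the density-based idea, with toy data and a neighbourhood size chosen for the example, and is not part of the proposed system.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[6.0, 6.0]]])

lof = LocalOutlierFactor(n_neighbors=20)  # density-based local outlier factor
labels = lof.fit_predict(X)               # -1 marks outliers, 1 marks inliers
scores = -lof.negative_outlier_factor_    # larger value = more anomalous
print(np.where(labels == -1)[0], scores.argmax())  # the far point should rank highest
```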
III. ANOMALY DETECTION USING OVERSAMPLING PCA

The problems of the existing system do not allow anomaly detection on large-scale or online data: with large data there is the problem of high memory and computational requirements. To overcome this problem, we can use an oversampling technique with Principal Component Analysis (PCA). Outlier detection is the process of identifying unusual behaviour that differs from the normal one, and it is widely used in data mining, for example to identify customer behavioural change, fraud, and manufacturing flaws. In recent years many researchers have proposed several concepts to obtain the optimal result in detecting anomalies. However, the computations involved make PCA somewhat different from these approaches, and in order to overcome the computational complexity, online oversampling PCA has been used. The algorithm uses an online updating technique for the principal directions, which keeps the computation effective and satisfies the online detection demand, while oversampling amplifies the effect of outliers and leads to more accurate detection. Experimental results show that this method is effective in computation time and needs less memory; a clustering technique is also added to it for optimization purposes.

A. Principal Component Analysis

Principal component analysis is appropriate when measurements have been obtained on a number of observed variables and a smaller number of artificial variables (called principal components) is desired that accounts for most of the variance in the observed variables. The principal components may then be used as predictor or criterion variables in subsequent analyses. PCA performs dimension reduction and also determines the principal directions of the data, and we use it here to calculate those principal directions. To generate the principal directions, we construct the data covariance matrix and calculate its dominant eigenvectors. These eigenvectors are the most informative among all the vectors in the original data space, so they are considered the principal directions [1]. Through dimensionality reduction, the dimensionality of the data set is reduced and the reduced items are treated as attributes. Anomaly detection is performed by calculating the threshold value over these attributes; the minimum threshold is chosen to perform this task.

B. Use of PCA for Anomaly Detection

Using PCA we can find the principal directions of the data instances, and we study the variation of these principal directions to perform detection. When we add or remove a data instance, the principal direction varies: when a normal instance is added there is only a slight change in the direction, whereas when an outlier instance is added there is a drastic change. Detection is based on this observation, as illustrated in Fig. 1.

Fig. 1 The effect of adding or removing an outlier on the principal directions

In Fig. 1, the clustered blue circles represent normal data instances, the red square denotes an outlier, and the green arrow is the dominant principal direction. From Fig. 1 we see that the principal direction deviates when an outlier instance is added; more specifically, the presence of such an outlier produces a large angle between the resulting and the original principal directions. On the other hand, this angle is small when a normal data point is added. Therefore, this is used to determine the outlierness of the target data point using the LOO strategy [1].
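The effect sketched in Fig. 1 can be reproduced numerically. The hedged example below uses a deliberately small training set so that adding a single point visibly rotates the dominant direction; the function names and toy data are assumptions made for this sketch, not the paper's experiment.

```python
import numpy as np

def dominant_direction(X):
    Xc = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return vecs[:, -1]  # eigenvector of the largest eigenvalue (eigh sorts ascending)

def direction_change_deg(X_train, x_new):
    """Angle (degrees) between the principal direction of the training data
    and that of the training data plus one newly added instance."""
    u = dominant_direction(X_train)
    v = dominant_direction(np.vstack([X_train, [x_new]]))
    return np.degrees(np.arccos(np.clip(abs(np.dot(u, v)), 0.0, 1.0)))

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[3, 2], [2, 3]], size=30)  # small, elongated cluster
print(direction_change_deg(X, [1.0, 1.0]))   # normal-looking point: angle stays small
print(direction_change_deg(X, [10.0, 0.0]))  # distant outlier: angle grows noticeably
```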

C. Oversampling PCA

When the data set is large, adding or removing a single outlier instance will not significantly affect the resulting principal direction of the data. For this reason an oversampling strategy is used, giving an oversampling Principal Component Analysis (osPCA) algorithm for online or large-scale anomaly detection problems. The osPCA scheme duplicates the target instance multiple times [1] and, unlike previously available methods, it does not store the entire covariance matrix. By oversampling the target instance, osPCA determines the anomaly of the target instance from the variation of the dominant eigenvector [3]. osPCA duplicates the target instance n times in the data set and computes an outlier score; if the score is above the threshold, the instance is determined to be an outlier. Solving PCA to calculate the principal directions n times for n instances is very costly and prohibits anomaly detection on online data.

Oversampling is basically used to amplify the outlierness of each data point. To identify outliers with the LOO strategy, the target instance is duplicated instead of removed: we duplicate the target instance many times and observe how much the principal direction varies. With this oversampling scheme, the principal directions and the mean of the data are only affected slightly if the target instance is a normal data point, as shown in Fig. 2(a). On the contrary, the variations are enlarged if we duplicate an outlier, as shown in Fig. 2(b). In other words, the oversampling scheme can also be applied within the LOO strategy; the main idea is to enlarge the difference in effect between a normal data point and an outlier, and this is possible with the help of oversampling PCA [6].

Fig. 2 The effect of oversampling an outlier and a normal instance
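The following is a brute-force sketch of the oversampling idea only: it duplicates the target instance and recomputes the eigendecomposition from scratch, whereas the osPCA of the paper avoids this cost with an online updating rule that is not shown here. The duplication count and helper names are assumed for illustration.

```python
import numpy as np

def dominant_direction(X):
    Xc = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return vecs[:, -1]

def ospca_score(X, t, n_dup=10):
    """Oversampling-style score for instance t: duplicate it n_dup times,
    recompute the dominant direction, and measure how far it rotates
    away from the direction of the original data (1 - |cos angle|)."""
    u = dominant_direction(X)
    X_over = np.vstack([X, np.repeat(X[t:t + 1], n_dup, axis=0)])  # amplify instance t
    v = dominant_direction(X_over)
    return 1.0 - abs(np.dot(u, v))

rng = np.random.default_rng(4)
X = np.vstack([rng.multivariate_normal([0, 0], [[3, 2], [2, 3]], size=300), [[10.0, 0.0]]])
scores = np.array([ospca_score(X, t) for t in range(len(X))])
print(scores.argmax())  # the appended outlier (index 300) should score highest
```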
IV. ARCHITECTURE OF PROPOSED SYSTEM

For online anomaly detection applications such as spam mail filtering, it is desirable to design an initial classifier using the training normal data and then update this classifier with newly received normal or outlier data. In practical scenarios the training normal data collected in advance can be contaminated by noise or incorrect data labeling. To build a simple and effective online detection model, these potentially deviated data instances must be disregarded from the training set of normal data.

Fig. 3 Online Anomaly Detection Framework

The flow chart of this scheme is shown in Fig. 3. Two main phases are needed to implement this technique: data cleaning and online detection. The data cleaning phase filters out the most deviated data using osPCA; this first phase is offline, and the percentage of the training normal data to be disregarded can be determined by the user. In the second phase of online detection, a threshold is used to determine the anomaly of each received data point. In this phase the dominant principal direction of the filtered training normal data, extracted in the data cleaning phase, is used to evaluate each arriving target instance [1].

V. IMPLEMENTATION DETAILS

This online anomaly detection can be carried out in three phases: data cleaning, detection and clustering.

A. Data Cleaning

Data preprocessing is the first step in data mining; it transforms raw data into an understandable format, since real-world data is incomplete, inconsistent and/or lacking in certain behaviours, and may contain errors. A set of data instances from the original data set is selected as the predefined input, and the PCA formulation is applied to the selected data set to perform reduction. Only the reduced attributes are used for anomaly detection, instead of the entire data matrix or covariance matrix. This is shown in Fig. 4.
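A hedged sketch of the two-phase framework, using the brute-force score above as a stand-in for osPCA: the data cleaning step discards a user-chosen fraction of the most deviated training points offline, and the online phase compares each arriving point against a threshold. The filtering fraction and threshold rule here are assumptions, not values from the paper.

```python
import numpy as np

def dominant_direction(X):
    Xc = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return vecs[:, -1]

def score_against(X, x, n_dup=10):
    """Oversampling-style score of a single point x with respect to data X."""
    u = dominant_direction(X)
    v = dominant_direction(np.vstack([X, np.tile(x, (n_dup, 1))]))
    return 1.0 - abs(np.dot(u, v))

rng = np.random.default_rng(5)
train = rng.multivariate_normal([0, 0], [[3, 2], [2, 3]], size=300)
train[:5] += [9.0, -9.0]  # pretend a few contaminated points slipped into the training set

# Phase 1 (offline data cleaning): score every training point and discard the most
# deviated fraction; the 5% fraction and the threshold rule are assumptions.
scores = np.array([score_against(np.delete(train, i, axis=0), train[i])
                   for i in range(len(train))])
clean = train[scores <= np.quantile(scores, 0.95)]
clean_scores = np.array([score_against(np.delete(clean, i, axis=0), clean[i])
                         for i in range(len(clean))])
threshold = clean_scores.max()

# Phase 2 (online detection): compare each arriving instance against the threshold.
for x in ([1.0, 1.5], [15.0, -2.0]):
    s = score_against(clean, np.asarray(x))
    print(x, "outlier" if s > threshold else "normal")
```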

Fig. 4 Dimensionality Reduction

B. Detection

This phase detects the outlierness of the user input. After reduction of the data set, the reduced attributes are calculated and a threshold is calculated for each attribute; the smallest value is selected as the threshold for anomaly detection. When the user gives an input to the system, the system calculates the St value of that input and compares it with the previously calculated threshold. If the St value of the new data instance is greater than the threshold value, the input data is identified as an outlier and is discarded by the system; otherwise it is considered a normal data instance, and the PCA value of that data instance is updated accordingly. Based on these threshold values, the predicted outliers are calculated first, as shown in Fig. 5.

Fig. 5 Predicted Outliers

C. Clustering

The training data is selected by assumption, so some outlier data may be considered normal data in the previous method because of the training data. The clustering method is used to tackle this problem: clusters are created for the input data instances, and the outlier calculation is then done within each cluster to find the outliers exactly. Using a simple clustering algorithm, the data is divided into multiple clusters and detection is performed per cluster based on threshold values. Fig. 6 shows this, where the outliers from each cluster are calculated.

Fig. 6 Proposed System
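A sketch of the clustering step under the same assumptions as the earlier snippets: the data is partitioned with k-means and the threshold-based detection is run inside each cluster. The number of clusters and the per-cluster quantile cut-off are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_direction(X):
    Xc = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return vecs[:, -1]

def score_against(X, x, n_dup=10):
    u = dominant_direction(X)
    v = dominant_direction(np.vstack([X, np.tile(x, (n_dup, 1))]))
    return 1.0 - abs(np.dot(u, v))

rng = np.random.default_rng(6)
X = np.vstack([rng.multivariate_normal([0, 0], [[3, 2], [2, 3]], size=150),
               rng.multivariate_normal([12, 12], [[3, 2], [2, 3]], size=150),
               [[6.0, -2.0]]])  # an off-axis point that belongs to neither cluster

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
outliers = []
for c in np.unique(labels):
    idx = np.where(labels == c)[0]
    cluster = X[idx]
    scores = np.array([score_against(np.delete(cluster, j, axis=0), cluster[j])
                       for j in range(len(cluster))])
    cutoff = np.quantile(scores, 0.99)  # assumed per-cluster threshold rule
    outliers.extend(idx[scores > cutoff].tolist())
print(sorted(outliers))  # should include index 300, the injected outlier
```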

VI. RESULTS

The methods used in the existing system are implemented in batch mode, so they cannot be extended to large-scale problems; if they are extended, they sacrifice computation and memory requirements. The proposed system overcomes this limitation: by using oversampling PCA to perform large-scale or online anomaly detection, the memory requirements are reduced.

Fig. 7 Graph comparing the Existing and Proposed System

VII. CONCLUSION

The method proposed here is an online updating technique with oversampling PCA. Oversampling PCA with the LOO strategy amplifies the effect of outliers, so the variation of the dominant principal direction can be used to identify the presence of rare but abnormal data. Although there are various methods for anomaly detection, the method proposed here does not need to keep the entire covariance or data matrices during the online detection process. Therefore, compared with other anomaly detection methods, this approach achieves satisfactory results while significantly reducing computational cost and memory. Thus osPCA is preferable for online, large-scale, or streaming data problems. In future work, the same oversampling Principal Component Analysis strategy will be extended to outlier detection in high-dimensional data, using the online updating technique together with multidimensional reduction techniques.

ACKNOWLEDGMENT

I would like to acknowledge all the people who helped and assisted me throughout my work. First of all I would like to thank the respected Principal Dr. T. K. Nagaraj, the respected H.O.D. Prof. G. M. Bhandari, my guide Prof. Sonali Patil, and all the professors in our department for their constant support.

REFERENCES

[1] Yuh-Jye Lee, Yi-Ren Yeh, and Yu-Chiang Frank Wang, "Anomaly Detection via Online Oversampling Principal Component Analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 7, July 2013.
[2] Gal I., "Outlier Detection," in Maimon O. and Rockach L. (Eds.), Data Mining.
[3] Priyanka R. Patil and R. D. Kadu, "A Novel Data Mining Approach to Calculate Outlier from Large Data Set Using Advance Principal Component Analysis," Proceedings of IRF International Conference, Pune, India, 5-6 February 2014, ISBN: 978-93-82702-56-6.
[4] Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang, "A Novel Anomaly Detection Scheme Based on Principal Component Classifier," Naval Research Laboratory, Center for High Assurance Computer Systems.
[5] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, pp. 15:1-15:58, 2009.
[6] Y. Srilaxmi and D. Ratna Kishore, "Online Anomaly Detection under Oversampling PCA," International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, vol. 3, issue 9, September 2014.
[7] Y.-R. Yeh, Z.-Y. Lee, and Y.-J. Lee, "Anomaly Detection via Oversampling Principal Component Analysis," Proc. First KES Int'l Symp. Intelligent Decision Technologies, pp. 449-458, 2009.
[8] T. Ahmed, "Online Anomaly Detection using KDE," Proc. IEEE Conf. Global Telecommunications, 2009.
[9] D. M. Hawkins, Identification of Outliers. Chapman and Hall, 1980.
[10] H.-P. Kriegel, M. Schubert, and A. Zimek, "Angle-Based Outlier Detection in High-Dimensional Data," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2008.