Anomaly Detection
Based on V. Chandola, A. Banerjee, and V. Kumar, Anomaly detection: A survey, ACM Computing Surveys, 41 (2009), Article 15, 58 pages.
Outline
- What are anomalies and why do we care?
- Different aspects of anomaly detection
- Anomaly detection applications
- Classification based anomaly detection methods
- Nearest neighbor based anomaly detection methods
- Clustering based anomaly detection methods
- Statistical anomaly detection methods
- Information theoretic methods
- Spectral anomaly detection methods
- Handling contextual anomalies
What are anomalies and why do we care?
Anomaly (outlier, discordant observation, exception, aberration, peculiarity, or contaminant) detection refers to finding patterns in data that do not conform to expected behavior. First studied in statistics in an 1887 paper by Edgeworth. Important because anomalies in data translate into significant and actionable information in applications. Useful in cyber-security, fraud or intrusion detection, health care, sensor or equipment failure, and surveillance. The relevance or interestingness of anomalies is a key feature. Related to noise accommodation/removal and novelty detection.
Major challenges in detection:
- Defining a normal region that covers all normal behavior patterns.
- Anomalies due to malicious behavior adapt over time to evade detection.
- Normal behavior itself evolves over time.
- Different application domains have different notions of what an anomaly is, making a general theory difficult to establish.
- Labeled data for training and validating detection algorithms is often unavailable.
- Noise resembles anomalies and leads to false detections.
Diversity of techniques:
- Statistics
- Machine learning
- Information theory
- Spectral theory
- Graph theory and topology
These are used in different application areas and may be specific to a single one of them (i.e., not useful for developing an abstract theory).
Different aspects of anomaly detection
A key aspect of any anomaly detection technique is the nature of the input data. Input is generally a collection of data instances (each also called an object, record, point, vector, pattern, event, case, sample, observation, or entity). Each data instance consists of one (univariate) or multiple (multivariate) attributes. Attributes can be of different types, such as binary, categorical, or continuous; mixed attribute types are common in the multivariate case.
The attribute types are key to choosing which technique(s) to use in anomaly detection. Data instances can also be related to one another (e.g., sequence data, spatial data, and graph data). In sequence data, data instances are linearly ordered. In spatial data, data instances are related to their neighboring instances; similarly for spatio-temporal data. In graph data, data instances are vertices related through edges.
Anomaly classification:
1. Point anomalies: an individual data instance is anomalous with respect to the rest of the data; each instance is considered independently of the others.
2. Contextual anomalies: a data instance is anomalous only in a specific context. Each instance has two kinds of attributes:
   a. Contextual attributes, which determine the context (e.g., latitude and longitude).
   b. Behavioral attributes, which determine the non-contextual characteristics (e.g., average rainfall at a given location).
3. Collective anomalies: a collection of related data instances is anomalous with respect to the entire data set, even if the individual instances are not. An example is an anomalous subsequence in a sequence of system events such as: ssh, buffer-overflow, http-web, http-web, ssh, buffer-overflow, ftp, http-web, ftp,
Data labels are associated with data instances and can indicate, among other things, whether an instance is normal or anomalous. Obtaining accurately labeled data is extremely expensive, and labels are typically provided via training sets. Anomaly detection operates in one of three modes:
- Supervised mode: a training set with both normal and anomaly labeled data is available.
- Semi-supervised mode: a training set with only normal labeled data is available.
- Unsupervised mode: no training set is required; it is assumed that normal data is far more common than anomalies.
Anomaly detection output takes one of two forms:
1. Scores: each instance receives a numeric anomaly score. Analysts usually examine only the highest scored instances to verify anomalies; domain-specific score thresholds are common and useful.
2. Labels: each test instance is assigned a normal or anomaly label.
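The score-based output above can be converted into labels with a simple threshold; a minimal sketch (the threshold value, scores, and function name are all illustrative):

```python
def label_by_threshold(scores, threshold):
    """Map each numeric anomaly score to a 'normal'/'anomaly' label."""
    return ["anomaly" if s >= threshold else "normal" for s in scores]

scores = [0.1, 0.4, 3.2, 0.2, 5.7]
print(label_by_threshold(scores, threshold=3.0))
# ['normal', 'normal', 'anomaly', 'normal', 'anomaly']
```

In practice the threshold is domain specific and often tuned so that analysts see a manageable number of high-scoring instances.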
Anomaly detection applications
We consider a number of applications:
- Intrusion detection
- Fraud detection
- Medical and public health anomaly detection
- Industrial damage detection
- Image processing
- Text processing
- Sensor networks
There are many more application fields with specialized methodologies.
Intrusion detection
Intrusion detection refers to detecting malicious activity (break-ins, hacking attempts, penetrations, and other forms of computer abuse) on computer systems. Key challenges:
- Huge volume of data: the detection techniques must be computationally efficient.
- Data is typically streamed and requires online analysis.
- The false alarm rate can be too high.
Host-based intrusion detection systems must handle the sequential nature of the data; point anomaly detection techniques are not applicable in this domain. The techniques must either model the sequence data or compute similarity between sequences. Network-based intrusion detection systems deal with network traffic data. The intrusions typically occur as point anomalies, though certain techniques model the data sequentially and detect collective anomalies. The big challenge here is that the anomalies evolve over time as intruders refine their techniques.
Methods used in intrusion detection:
- Statistical profiling using histograms
- Parametric or nonparametric statistical modeling
- Bayesian networks
- Mixture of models
- Neural networks
- Support vector machines
- Rule-based systems
- Clustering based
- Nearest neighbor based
- Spectral
- Information theoretic
Fraud detection
Fraud detection refers to detecting criminal activities in commercial organizations such as banks, credit card companies, insurance agencies, cell phone companies, and stock markets, and may involve employees or customers. Immediate detection is desired.
Credit card fraud: the data is multidimensional (user id, amount spent, frequency of use, location, distance from the last location, time since last use, history of items purchased, ...).
Point anomaly techniques are typically used, profiling either by owner or by operation. Detection is wanted during the first fraudulent transaction, without irritating the cardholder with false alarms that freeze a card (which can be especially painful overseas during a hotel checkout or conference registration).
Methods used in credit card fraud detection:
- Neural networks
- Rule-based systems
- Clustering based
Mobile phone fraud detection
The task is to scan a large set of accounts, examining the calling behavior of each, and to issue an alarm when an account appears to have been misused. Methods used:
- Statistical profiling using histograms
- Parametric statistical modeling
- Neural networks
- Rule-based systems
Insurance claim fraud detection is handled similarly.
Insider trading detection
Insider trading refers to trading on inside information: knowledge of a pending merger or acquisition, a terrorist attack affecting a particular industry, pending legislation affecting a particular industry, or any other information that would affect stock prices. It can be detected by identifying anomalous trading activities in the regular and options markets and in tax declarations. Methods used:
- Statistical profiling using histograms
- Information theoretic
Medical and public health anomaly detection
These techniques work with patient records. The data can contain anomalies for several reasons, e.g., an abnormal patient condition, instrumentation errors, or recording errors. Several techniques have also focused on detecting disease outbreaks in a specific area. Methods used:
- Parametric statistical modeling
- Neural networks
- Rule-based systems
- Bayesian networks
- Nearest neighbor based
Industrial damage detection
Industrial units suffer damage due to continuous usage and normal wear and tear. Such damage needs to be detected early to prevent escalation and losses. The data in this domain is usually referred to as sensor data because it is recorded by different sensors and collected for analysis. There are two categories:
1. Fault detection in mechanical units
2. Structural defect detection
Methods used in industrial damage detection:
- Statistical profiling using histograms
- Parametric or nonparametric statistical modeling
- Bayesian networks
- Mixture of models
- Neural networks
- Rule-based systems
- Spectral
Image processing
Look for interesting features in images, e.g., in video surveillance, satellite imagery, or X-rays/CT scans. Methods used in image processing:
- Bayesian networks
- Mixture of models
- Regression
- Neural networks
- Support vector machines
- Clustering based
- Nearest neighbor based
Text data anomaly detection
Detect novel topics, events, or news stories in a collection of documents or news articles. The anomalies are caused by a new interesting event or an anomalous topic. The data in this domain is typically high dimensional and very sparse, and it has a temporal aspect since documents are collected over time.
Methods used in text data anomaly detection:
- Statistical profiling using histograms
- Mixture of models
- Neural networks
- Support vector machines
- Clustering based
Sensor networks
Anomalies in data collected from a sensor network can mean either that one or more sensors are faulty or that they are detecting events (e.g., intrusions); anomaly detection in sensor networks can thus capture sensor fault detection, intrusion detection, or both. By the nature of the setting, the data is collected online and in a distributed manner, so distributed data mining is used. Methods used in sensor networks:
- Parametric statistical modeling
- Bayesian networks
- Nearest neighbor based
- Rule-based systems
- Spectral
Classification based anomaly detection methods
Classification-based anomaly detection techniques operate in a two-phase fashion: the training phase learns a classifier from the available labeled training data, and the testing phase classifies each test instance as normal or anomalous using that classifier.
Assumption: a classifier that can distinguish between normal and anomalous classes can be learned in the given feature space.
One-class classification based techniques assume that all training instances have a single class label (normal); a test instance falling outside the learned class boundary is anomalous. Multi-class classification based techniques assume that the training data contains labeled instances from multiple normal classes. A test instance is tested against each class at a chosen confidence level and is declared anomalous if no class accepts it as normal.
Neural networks based
A basic multi-class anomaly detection technique using neural networks operates in two steps:
1. A neural network is trained on the normal training data to learn the different normal classes.
2. Each test instance is provided as input to the network; if the network accepts the input, the instance is normal, otherwise anomalous.
Replicator neural networks have been used for one-class anomaly detection.
Some neural net classification methods:
- Multi-layered perceptrons
- Neural trees
- Auto-associative networks
- Adaptive resonance theory based
- Radial basis function based
- Hopfield networks
- Oscillatory networks
Bayesian networks based
A basic technique for a univariate categorical data set uses a naive Bayes model, in which the different attributes are assumed independent. Given a test data instance, it estimates the posterior probability of observing each class label, drawn from the set of normal class labels plus the anomaly class label. The class label with the largest posterior is chosen as the predicted class for the test instance. The likelihood of observing the test instance given a class, and the priors on the class probabilities, are estimated from the training data set.
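A minimal sketch of this naive Bayes scheme on categorical attributes. The class labels, attribute values, and training records are all invented, and Laplace smoothing is added to avoid zero likelihoods (a standard choice, not one specified by the source):

```python
from collections import Counter, defaultdict

def train_naive_bayes(instances, labels):
    """Estimate class priors and per-attribute value counts
    (attributes are assumed independent, per the naive Bayes model)."""
    priors = Counter(labels)
    # counts[class][attribute_index][value] = occurrences in training data
    counts = defaultdict(lambda: defaultdict(Counter))
    for x, y in zip(instances, labels):
        for i, v in enumerate(x):
            counts[y][i][v] += 1
    return priors, counts, len(labels)

def posterior(x, cls, priors, counts, total, alpha=1.0):
    """Unnormalized posterior P(cls) * prod_i P(x_i | cls),
    with Laplace smoothing parameter alpha."""
    p = priors[cls] / total
    n_cls = priors[cls]
    for i, v in enumerate(x):
        vocab = len(counts[cls][i]) + 1  # +1 slot for unseen values
        p *= (counts[cls][i][v] + alpha) / (n_cls + alpha * vocab)
    return p

# Hypothetical records: two normal classes plus one anomaly class.
X = [("low", "tcp"), ("low", "tcp"), ("high", "udp"),
     ("high", "udp"), ("low", "udp")]
y = ["normal_a", "normal_a", "normal_b", "normal_b", "anomaly"]

priors, counts, total = train_naive_bayes(X, y)
test = ("low", "udp")
pred = max(set(y), key=lambda c: posterior(test, c, priors, counts, total))
print(pred)  # anomaly
```

The predicted class is simply the one with the largest (unnormalized) posterior, exactly as described above.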
Support vector machines based
This method is applied in a one-class setting and learns a region (boundary) that contains the training data instances. Kernels, such as the radial basis function kernel, can be used to learn complex regions. For each test instance, the basic technique determines whether the instance falls within the learned region: if it does, it is declared normal; otherwise it is an anomaly. Audio anomaly detection is one of the major uses of this method.
Rule based
This method learns rules that capture the normal behavior of a system; a test instance not covered by any rule is considered an anomaly. Rule-based techniques have been applied in both one-class and multi-class settings. A basic multi-class rule-based technique consists of two steps:
1. Learn rules (each with a confidence value) from the training data using a rule learning algorithm.
2. For each test instance, find the rule that best captures it; the inverse of that rule's confidence gives the anomaly score.
Complexity: depends on which classification algorithm is used; e.g., decision trees tend to be faster than SVMs.
+/- of classification based methods:
+ Classification-based techniques (especially the multi-class techniques) can use powerful algorithms that distinguish between instances belonging to different classes.
+ The testing phase is fast, since each test instance is compared against only a precomputed model.
- Multi-class methods rely on the availability of accurate labels for the various normal classes, which is often impossible.
- Classification based methods assign a label to each test instance, which becomes a disadvantage when a meaningful anomaly score is desired.
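The one-class rule-based idea can be illustrated with a toy sketch. The rules are written by hand here rather than learned, and the rule predicates, confidences, and record fields are all invented for illustration:

```python
# Hypothetical rules describing normal behavior, each with a confidence.
rules = [
    (lambda r: r["port"] in (80, 443) and r["bytes"] < 10_000, 0.95),
    (lambda r: r["port"] == 22 and r["user"] == "admin", 0.90),
]

def detect(record, rules):
    """Return ('normal', confidence of best matching rule) if some
    rule covers the record, else ('anomaly', None)."""
    matched = [conf for rule, conf in rules if rule(record)]
    if matched:
        return "normal", max(matched)
    return "anomaly", None

print(detect({"port": 80, "bytes": 512, "user": "alice"}, rules))
print(detect({"port": 6667, "bytes": 50_000, "user": "alice"}, rules))
```

In a real system the rules would come from a rule learning algorithm (step 1 above) rather than being hand-written.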
Nearest neighbor based anomaly detection methods
Assumption: normal data instances occur in dense neighborhoods, while anomalies occur far from their closest neighbors.
These methods fall broadly into two categories:
1. Methods that use the distance of a data instance to its kth nearest neighbor as the anomaly score.
2. Methods that compute the relative density around each data instance to obtain its anomaly score.
Using distance to the kth nearest neighbor
The anomaly score of a data instance is defined as its distance to its kth nearest neighbor in the data set. Three extensions:
1. Modify the definition of the anomaly score.
2. Use different distance/similarity measures to handle different data types.
3. Improve the efficiency of the basic technique, whose complexity is O(N²) for data size N.
For point 3, prune the search space by ignoring instances that cannot be anomalous, or by focusing on the instances most likely to be anomalous.
Using relative density
Density based techniques estimate the density of the neighborhood of each data instance: low density implies an anomaly; high density implies normal. These techniques perform poorly if the data has regions of varying density; approaches that weigh a neighborhood's density relative to that of its neighboring neighborhoods (e.g., the Local Outlier Factor) have been developed.
Complexity: O(N²).
+/- of nearest neighbor based methods:
+ They are unsupervised and make no assumptions about the generative distribution of the data; they are purely data driven.
+ Semi-supervised techniques miss fewer anomalies than unsupervised ones, since the likelihood that an anomaly forms a close neighborhood in the (normal-only) training set is very low.
+ Adapting these methods to a different data type is easy: just change the distance measure.
- Unsupervised methods miss anomalies when normal instances lack enough close neighbors or when anomalies have enough close neighbors.
- Semi-supervised methods produce many false positives when normal test instances lack enough similar normal instances in the training data.
- The computational complexity of the testing phase is significant, since it involves computing the distance of each test instance to many stored instances.
- Performance relies heavily on a distance measure that can distinguish normal instances from anomalies.
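The basic kth-nearest-neighbor score described above can be sketched in a few lines of brute-force (O(N²)) pure Python; the toy data set is invented:

```python
import math

def knn_score(point, data, k):
    """Anomaly score = Euclidean distance to the k-th nearest neighbor.
    Brute force: O(N) per query, O(N^2) to score the whole data set.
    Note: excluding `q != point` also drops exact duplicates of `point`."""
    dists = sorted(math.dist(point, q) for q in data if q != point)
    return dists[k - 1]

# Toy data: one dense cluster plus an obvious outlier.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = {p: knn_score(p, data, k=2) for p in data}
print(max(scores, key=scores.get))  # the outlier (10, 10) scores highest
```

The pruning ideas from point 3 (e.g., abandoning a candidate once its running kth-NN distance falls below the current cutoff) speed this up without changing the scores of the top anomalies.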
Clustering based anomaly detection methods
Clustering groups similar data instances into clusters. It is primarily an unsupervised technique, though semi-supervised clustering has also been explored. There are three formulations based on different assumptions:
1. Normal data instances belong to a cluster in the data, while anomalies do not belong to any cluster.
2. Normal data instances lie close to their closest cluster centroid, while anomalies are far away from their closest cluster centroid.
3. Normal data instances belong to large, dense clusters, while anomalies belong to small or sparse clusters.
Assumption 1 remarks: apply a clustering algorithm that does not force every instance into a cluster (e.g., DBSCAN) and declare any data instance that does not belong to any cluster anomalous. A disadvantage of such techniques is that they are not optimized to find anomalies, since the main aim of the underlying clustering algorithm is to find clusters.
Assumption 2 remarks:
These methods consist of two steps: (1) cluster the data using a clustering algorithm; (2) for each data instance, compute its distance to its closest cluster centroid as its anomaly score. They can operate in semi-supervised mode by clustering only the normal training data and scoring test instances against those clusters. If the anomalies in the data form clusters of their own, these techniques cannot detect them.
Assumption 3 remarks: these methods declare instances belonging to clusters whose size or density is below a threshold anomalous.
There are linear-time algorithms, and there are many similarities between clustering based and nearest neighbor based anomaly detection methods.
Complexity: depends on the training and detection algorithms, but a few O(N) ones exist.
+/- of clustering based methods:
+ Unsupervised mode is viable.
+ Complex data types can be handled by choosing a clustering algorithm that supports the particular data type.
+ The testing phase is fast, since the number of clusters against which each test instance must be compared is a small constant.
- Performance is highly dependent on how well the clustering algorithm captures the cluster structure of the normal instances.
- Many techniques detect anomalies as a byproduct of clustering and hence are not optimized for anomaly detection.
- Some clustering algorithms force every instance into some cluster, causing anomalies to be missed; anomalies that form clusters of their own are likewise missed.
- The clustering step itself can be slow; quadratic-time clustering algorithms cost O(N²d) for N instances of dimension d.
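Formulation 2 (distance to the closest centroid) can be sketched with a plain k-means, run here in semi-supervised mode on normal-only training data; all data values are invented:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's-algorithm k-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster goes empty).
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

def cluster_score(point, centroids):
    """Anomaly score = distance to the closest cluster centroid."""
    return min(math.dist(point, c) for c in centroids)

# Normal-only training data: two tight clusters.
train = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.1, 4.9), (4.9, 5.2)]
centroids = kmeans(train, k=2)
print(cluster_score((0.1, 0.1), centroids))  # small: near a centroid
print(cluster_score((9, 9), centroids))      # large: far from both clusters
```

Because the centroids come only from normal data, a test instance far from every centroid is scored as anomalous, matching the semi-supervised mode described above.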
Statistical anomaly detection methods
Underlying principle: an anomaly is an observation that is suspected of being partially or wholly irrelevant because it is not generated by the assumed stochastic model.
Assumption: normal data instances occur in high probability regions of a stochastic model, while anomalies occur in its low probability regions.
Statistical techniques fit a statistical model (usually for normal behavior) to the given data and then apply a statistical inference test to determine whether an unseen instance belongs to this model.
Parametric methods
Parametric methods assume that the normal data is generated by a parametric distribution with parameters θ and probability density function f(x, θ), where x is an observation. The anomaly score of a test instance (or observation) x is the inverse of f(x, θ); the parameters θ are estimated from the given data.
Gaussian modeling based
These methods assume that the data is generated from a Gaussian distribution, with parameters estimated by maximum likelihood estimation (MLE). The distance of a data instance from the estimated mean is its anomaly score, and a threshold applied to the scores determines the anomalies. Techniques in this category differ in how they compute the distance and the threshold. Commonly used statistical rules and tests:
- Box plot rule
- Grubbs's test
- Student's t-test
- χ² test
Regression model based
Two steps:
1. A regression model is fitted to the data.
2. For each test instance, the residual is used to determine the anomaly score.
The residual is the part of the instance left unexplained by the regression model. Its magnitude can be used directly as the anomaly score, though statistical tests have been proposed to declare anomalies with a given confidence. One variant detects anomalies in multivariate time-series data generated by an Autoregressive Moving Average (ARMA) model.
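The two-step regression approach can be sketched with an ordinary least squares line fit; the data values are invented, and the raw residual magnitude is used as the score (a statistical test on the residuals would be the more rigorous alternative mentioned above):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def residual_score(x, y, a, b):
    """Anomaly score = magnitude of the regression residual."""
    return abs(y - (a + b * x))

xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 1.9, 4.1, 6.0, 7.9, 10.1]   # roughly y = 2x
a, b = fit_line(xs, ys)
print(residual_score(6, 12.1, a, b))   # near the line: small residual
print(residual_score(6, 25.0, a, b))   # far off the line: large residual
```

A threshold on the residual magnitude then separates normal instances from anomalies.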
Mixture of parametric distributions based
These methods model the data as a mixture of parametric statistical distributions. Two subcategories:
1. Those that model the normal instances and anomalies as separate parametric distributions.
2. Those that model only the normal instances as a mixture of parametric distributions.
Subcategory remarks:
1. The testing phase involves determining which distribution, normal or anomalous, the test instance belongs to.
2. A test instance that does not belong to any of the learned normal models is declared an anomaly.
Nonparametric methods
These methods use nonparametric statistical models, in which the model structure is not defined a priori but is determined dynamically from the data. They make fewer assumptions about the data (e.g., smoothness of the density) than parametric techniques.
Histogram based
The simplest nonparametric statistical method uses histograms to maintain a profile of the normal data. The bin size used when building the histogram is key for anomaly detection:
- Too small: many normal test instances fall in empty or rare bins, giving a high false alarm rate.
- Too large: many anomalous test instances fall in frequent bins, giving a high false negative rate.
For univariate data there are two steps:
1. Build a histogram of the values taken by the feature in the training data.
2. Check whether a test instance falls in any of the histogram's bins; if it does, the instance is normal, otherwise anomalous.
For multivariate data, a basic technique constructs attribute-wise histograms. During testing, the anomaly score for each attribute value of a test instance is computed from the height of the bin containing that value.
The per-attribute anomaly scores are aggregated into an overall anomaly score for the test instance.
Complexity: depends entirely on the method(s) used.
+/- of statistical based methods:
+ If the distributional assumptions hold, the methods provide a statistically justifiable solution.
+ The anomaly score comes with a confidence interval.
+ Unsupervised mode works if the distribution estimation step is robust to anomalies in the data.
- These methods rely on the assumption that the data is generated from a particular distribution, which often does not hold, especially for high dimensional real data sets.
- Histograms are simple to implement, but attribute-wise histograms cannot capture interactions between attributes: a rare combination of individually frequent attribute values goes undetected. You get what you pay for.
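The univariate histogram profile can be sketched as follows. The bin width and data are illustrative, and the score used here (inverse bin frequency, with empty bins maximally anomalous) is one common choice among several:

```python
from collections import Counter

def build_hist(train, bin_width):
    """Histogram profile of the normal data: bin index -> count."""
    return Counter(int(x // bin_width) for x in train)

def hist_score(x, hist, bin_width):
    """Anomaly score in [0, 1]: 1.0 for an empty bin, otherwise
    1 - (bin frequency), so rarer bins score higher."""
    count = hist[int(x // bin_width)]
    total = sum(hist.values())
    return 1.0 if count == 0 else 1.0 - count / total

train = [1.2, 1.4, 1.3, 1.5, 1.1, 2.0, 1.8, 1.9]
hist = build_hist(train, bin_width=0.5)
print(hist_score(1.3, hist, 0.5))  # populated bin: lower score
print(hist_score(7.0, hist, 0.5))  # empty bin: score 1.0, anomalous
```

For multivariate data one such histogram would be built per attribute and the per-attribute scores aggregated, as described above.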
Information theoretic methods
These methods analyze the information content of a data set using information theoretic measures such as Kolmogorov complexity, entropy, and relative entropy.
Assumption: anomalies in data induce irregularities in the information content of the data set.
Let C(D) denote the complexity of a given data set D. A basic information theoretic technique: given a data set D, find the minimal subset of instances I such that C(D) − C(D − I) is maximal. All instances in the subset thus obtained are deemed anomalous. This is a dual-objective (Pareto-optimal) problem with no single optimum, since both the size of I and the complexity reduction must be optimized.
Complexity: the basic technique has exponential time complexity, so approximate or heuristic variants are used in practice.
+/- of information theoretic based methods:
+ They can operate in an unsupervised setting.
+ They make no assumptions about the underlying distribution of the data.
- Performance is highly dependent on the choice of information theoretic measure; often such measures detect anomalies only when there is a significantly large number of them.
- When applied to sequences and spatial data sets, the techniques depend on a choice of substructure size, which is often nontrivial to obtain.
- It is difficult to associate an anomaly score with a test instance.
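Since the exact formulation above is exponential, greedy approximations are typical. A toy sketch using Shannon entropy as C(·) on categorical data; the greedy scheme and data are illustrative, not the survey's specific algorithm:

```python
import math
from collections import Counter

def entropy(data):
    """Shannon entropy (bits) of the empirical value distribution."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def greedy_anomalies(data, k):
    """Greedy approximation: k times, remove the instance whose removal
    most decreases the entropy of the remaining data, i.e. greedily
    maximize C(D) - C(D - I) one instance at a time."""
    data = list(data)
    removed = []
    for _ in range(k):
        best = min(range(len(data)),
                   key=lambda i: entropy(data[:i] + data[i + 1:]))
        removed.append(data.pop(best))
    return removed

data = list("aaaaaaaaaabbbbbbbbbbz")  # two common values, one rare
print(greedy_anomalies(data, 1))  # ['z']
```

Removing the rare value 'z' lowers the entropy the most, so it is flagged first; this greedy loop is polynomial (quadratic per step here) rather than exponential.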
Spectral anomaly detection methods
Spectral techniques approximate the data using a combination of attributes that captures the bulk of the variability in the data.
Assumption: data can be embedded into a lower dimensional subspace in which normal instances and anomalies appear significantly different.
The goal is to determine subspaces in which the anomalous instances are easily identified. Such techniques can work in unsupervised as well as semi-supervised settings.
Principal component analysis (PCA) is the major algorithm used.
Complexity: typically linear in the number of instances but quadratic in the number of dimensions; singular value decomposition (SVD) based variants can be O(N²).
+/- of spectral based methods:
+ They automatically perform dimensionality reduction and so are suitable for high dimensional data sets.
+ They can be used as a preprocessing step, followed by any existing anomaly detection technique in the transformed space.
+ They can be used in an unsupervised setting.
- They are useful only if the normal and anomalous instances are separable in the lower dimensional embedding of the data.
- They typically have high computational complexity.
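A PCA-based sketch of the idea: the anomaly score is the residual norm after projecting onto the leading principal component, found here by power iteration in pure Python (the data set and the choice of a single component are illustrative):

```python
import math
import random

def first_pc(data, iters=100, seed=0):
    """Mean and leading principal component of `data`, the latter via
    power iteration on the covariance matrix (applied implicitly)."""
    n, d = len(data), len(data[0])
    mean = [sum(p[i] for p in data) / n for i in range(d)]
    centered = [[p[i] - mean[i] for i in range(d)] for p in data]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iters):
        proj = [sum(x[i] * v[i] for i in range(d)) for x in centered]  # X v
        w = [sum(proj[j] * centered[j][i] for j in range(n)) / n       # X^T X v / n
             for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]  # renormalize each iteration
    return mean, v

def spectral_score(p, mean, v):
    """Anomaly score = norm of the residual of the centered point
    after projecting onto the leading principal direction."""
    x = [p[i] - mean[i] for i in range(len(p))]
    t = sum(xi * vi for xi, vi in zip(x, v))
    return math.sqrt(sum((xi - t * vi) ** 2 for xi, vi in zip(x, v)))

# Normal data lies near the line y = x; (0, 5) is far off that subspace.
data = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
mean, v = first_pc(data)
print(spectral_score((3, 3.1), mean, v))  # small residual: normal
print(spectral_score((0, 5), mean, v))    # large residual: anomalous
```

Normal instances lie close to the learned subspace and have small residuals; instances far from it stand out, which is exactly the separability assumption these methods rely on.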
Handling contextual anomalies
Contexts can be defined via:
- Spatial attributes
- Graph structure
- Sequential order
- Profiles
There is very little literature in this area; it is ripe for Ph.D. dissertations.
Quick summary
A general theory of anomaly detection is still an open research problem that will reward numerous students with Ph.D.s in the future. That said, many application areas have been developed over a long time.