What are anomalies and why do we care?


Anomaly Detection Based on V. Chandola, A. Banerjee, and V. Kumar, Anomaly detection: A survey, ACM Computing Surveys, 41 (2009), Article 15, 58 pages.

Outline What are anomalies and why do we care? Different aspects of anomaly detection Anomaly detection applications Classification based anomaly detection methods Nearest neighbor based anomaly detection methods Clustering based anomaly detection methods Statistical anomaly detection methods Information theoretic methods Spectral anomaly detection methods Handling conceptual anomalies 2

What are anomalies and why do we care? Anomaly (outlier, discordant observation, exception, aberration, peculiarity, or contaminant) detection refers to finding patterns in data that do not conform to expected behavior. First studied in statistics in an 1887 paper by Edgeworth. Important since anomalies in data translate into significant and actionable information in many application domains. Useful in cyber-security, fraud or intrusion detection, health care, sensor or equipment failure detection, and surveillance. The relevance or interestingness of anomalies to the analyst is a key feature. Related to noise accommodation/removal and novelty detection. 3

Major challenges to making detections: Defining a normal region that covers all normal behavior patterns. Anomalies due to malicious behavior adapt over time so that they continue to appear normal. Normal behavior itself evolves over time. Different application domains have different views on what is an anomaly, making a general theory difficult to establish. General lack of labeled data available for training and validating detection algorithms. Noise is similar to an anomaly and leads to false detections. 4

Diversity of techniques: Statistics Machine learning Information theory Spectral theory Graph theory and topology which are used in different application areas and may be specific to a single one of them (i.e., not useful for developing an abstract theory). 5

Different aspects of anomaly detection A key aspect of any anomaly detection technique is the nature of the input data. Input is generally a collection of data instances (an object, record, point, vector, pattern, event, case, sample, observation, or entity). The attributes can be of different types such as binary, categorical, or continuous. Each data instance consists of one (univariate) or multiple (multivariate) attributes. Multiple data types are common in the multivariate case. 6

Attributes are key to which technique(s) to use in anomaly detection. Data instances can also be related to one another (e.g., sequence data, spatial data, and graph data). In sequence data, data instances are linearly ordered. In spatial data, data instances are related to neighboring data instances; similarly for spatio-temporal data. In graph data, data instances are vertices related through edges. 7

Anomaly classification: 1. Point anomalies: an individual data instance is anomalous with respect to the rest of the data. 2. Contextual anomalies: a data instance is anomalous only within a specific context, defined by two kinds of attributes. a. Contextual attributes determine the context: e.g., latitude and longitude. b. Behavioral attributes determine the non-contextual characteristics: e.g., average rainfall at a given point on the planet. 3. Collective anomalies: a collection of related data instances is anomalous, even if the individual instances are not. An example is an anomalous subsequence within a sequence like ssh, buffer overflow, http-web, http-web, ssh, buffer overflow, ftp, http-web, ftp, ... 8

Data labels are associated with data instances and can indicate many things, including whether the instance is normal or an anomaly. Getting accurately labeled data is extremely expensive, since labeling for training sets is frequently done by hand. Anomaly detection operates with labeled data in one of three modes: Supervised mode: a training set with both normal and anomaly labeled data is available. Semi-supervised mode: a training set with only normal labeled data is available. Unsupervised mode: no training set is required because it is assumed that normal data is far more common than anomalies. 9

Anomaly detection output takes one of two forms: 1. Scores: Each test instance receives a numeric anomaly score. Analysts usually examine only the highest-scored instances to verify anomalies. Domain-specific scoring thresholds are common and useful. 2. Labels: Normal or anomaly is assigned to each test instance. 10

Anomaly detection applications We consider a number of applications: Intrusion detection Fraud detection Medical and health anomaly detection Industrial damage detection Image processing Text processing Sensor networks There are many more application fields that can be considered with specialized methodologies. 11

Intrusion detection Intrusion detection refers to detection of malicious activity (break-ins, hacking attempts, penetrations, and other forms of computer abuse) on computer systems. Key challenges: Huge volume of data: the anomaly detection techniques must be computationally efficient. Data is typically streamed and requires online analysis. The false alarm rate can be too high. 12

Host based intrusion detection systems are required to handle the sequential nature of data. Moreover, point anomaly detection techniques are not applicable in this domain. The techniques have to either model the sequence data or compute similarity between sequences. Network based intrusion detection systems deal with network data. The intrusions typically occur as point anomalies though certain techniques model the data in a sequential fashion and detect collective anomalies. The big challenge here is that the anomalies evolve over time as hackers refine their techniques. 13

Methods used in intrusion detection: Statistical profiling using histograms Parametric or nonparametric statistical modeling Bayesian networks Mixture of models Neural networks Support vector machines Rule-based systems Clustering based Nearest neighbor based Spectral Information theoretic 14

Fraud detection Fraud detection refers to detection of criminal activities occurring in commercial organizations such as banks, credit card companies, insurance agencies, cell phone companies, the stock market, etc., and may involve employees or customers. Immediate detection is desired. Credit card fraud: Data is multidimensional (user id, amount spent, frequency of use, location, distance from last location, time between last use, history of items purchased in the past, ...). 15

Point anomaly techniques are typically used, profiling both by owner and by operation. We want detection during the first fraudulent transaction, and we do not want to irritate the cardholder with false alarms that freeze a card (this can be really irritating when overseas during a hotel checkout or conference registration). Methods used in credit card fraud detection: Neural networks Rule based systems Clustering based 16

Mobile phone fraud detection The task is to scan a large set of accounts, examining the calling behavior of each, and to issue an alarm when an account appears to have been misused. Methods used in mobile phone fraud detection: Statistical profiling using histograms Parametric statistical modeling Neural networks Rule based systems Insurance claim fraud detection handled similarly. 17

Insider trading detection Insider trading refers to trading based on non-public information: knowledge of a pending merger or acquisition, a terrorist attack affecting a particular industry, pending legislation affecting a particular industry, or any information that would affect the stock prices in a particular industry. Insider trading can be detected by identifying anomalous trading activities in the regular and options markets and in tax declarations. Methods used in insider trading detection: Statistical profiling using histograms Information theoretic 18

Medical and public health anomaly detection It works with patient records. The data can have anomalies due to several reasons, e.g., abnormal patient condition, instrumentation errors, or recording errors. Several techniques have also focused on detecting disease outbreaks in a specific area. Methods used in medical and public health anomaly detection: Parametric statistical modeling Neural networks Rule based systems 19

Bayesian networks Nearest neighbor based Industrial damage detection Industrial units suffer damage due to continuous usage and normal wear and tear. Such damage needs to be detected early to prevent further escalation and losses. The data in this domain is usually referred to as sensor data because it is recorded using different sensors and collected for analysis. There are two categories: 20

1. Fault Detection in Mechanical Units 2. Structural Defect Detection The methods used in industrial damage detection: Statistical profiling using histograms Parametric or nonparametric statistical modeling Bayesian networks Mixture of models Neural networks Rule-based systems Spectral 21

Image processing Look for interesting features, e.g., in video surveillance, satellite imagery, x-rays/CT scans, etc. Methods used in image processing: Bayesian networks Mixture of models Regression Neural networks Support vector machines Clustering based 22

Nearest neighbor based Text data anomaly detection Detect novel topics or events or news stories in a collection of documents or news articles. The anomalies are caused due to a new interesting event or an anomalous topic. The data in this domain is typically high dimensional and very sparse. The data also has a temporal aspect since the documents are collected over time. 23

Methods used in text data anomaly detection: Statistical profiling using histograms Mixture of models Neural networks Support vector machines Clustering based Sensor networks Anomalies in data collected from a sensor network can either mean that one or more sensors are faulty or they are detecting events (e.g., intrusions). Anomaly detection 24

in sensor networks can capture sensor fault detection or intrusion detection or both. By definition, the data is collected online and in a distributed manner. Distributed data mining is used. Methods used in sensor networks: Parametric statistical modeling Bayesian networks Nearest neighbor based Rule-based systems Spectral 25

Classification based anomaly detection methods Classification-based anomaly detection techniques operate in a two-phase fashion. The training phase learns a classifier using the available labeled training data. The testing phase classifies a test instance as normal or anomalous using the learned classifier. Assumption: A classifier that can distinguish between normal and anomalous classes can be learned in the given feature space. 26

One-class classification based anomaly detection techniques assume that all training instances have only one class label (normal). Multi-class classification based anomaly detection techniques assume that the training data contains labeled instances belonging to multiple normal classes. A test instance is tested against each normal class within a confidence level and is declared anomalous if no class accepts it as normal. 27

Neural networks based A basic multi-class anomaly detection technique using neural networks operates in two steps: 1. A neural network is trained on the normal training data to learn the different normal classes. 2. Each test instance is provided as an input to the neural network. If the network accepts the test input, it is normal; if the network rejects it, it is an anomaly. Replicator neural networks have been used for one-class anomaly detection. 28

Some neural net classification methods: Multi-layered perceptrons Neural trees Auto-associative networks Adaptive resonance theory based Radial basis function based Hopfield networks Oscillatory networks 29

Bayesian networks based The different attributes are assumed to be independent. This is a basic technique for a univariate categorical dataset. Given a test data instance, it uses a naïve Bayesian network to estimate the posterior probability of observing a class label from a set of normal class labels and the anomaly class label. The class label with the largest posterior is chosen as the predicted class for the given test instance. The likelihood of observing the test instance given a class, and the prior on the class probabilities, are estimated from the training data set. 30
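A minimal sketch of the naïve Bayes step above, in Python. The traffic-style attributes, class names, and Laplace smoothing choice are illustrative assumptions, not from the survey:

```python
from collections import Counter, defaultdict

def train_nb(instances, labels):
    """Estimate class priors and per-attribute likelihoods from training
    data (Laplace-smoothed); predict the class with the largest posterior."""
    priors = Counter(labels)
    n = len(labels)
    counts = defaultdict(lambda: defaultdict(Counter))  # counts[cls][attr][value]
    for x, y in zip(instances, labels):
        for i, v in enumerate(x):
            counts[y][i][v] += 1

    def posterior(x, cls):
        p = priors[cls] / n  # class prior
        for i, v in enumerate(x):
            c = counts[cls][i]
            p *= (c[v] + 1) / (priors[cls] + len(c) + 1)  # smoothed likelihood
        return p

    return lambda x: max(priors, key=lambda cls: posterior(x, cls))

# hypothetical categorical traffic records: (protocol, port_class)
X = [("tcp", "low"), ("tcp", "low"), ("udp", "low"),
     ("tcp", "high"), ("icmp", "high")]
y = ["normal", "normal", "normal", "anomaly", "anomaly"]
predict = train_nb(X, y)
print(predict(("tcp", "low")))    # "normal"
print(predict(("icmp", "high")))  # "anomaly"
```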

Support vector machines based This method is applied in a one-class setting and learns a region (a boundary) that contains the training data instances. Kernels, such as the radial basis function kernel, can be used to learn complex regions. For each test instance, the basic technique determines whether the test instance falls within the learned region. If it does, it is declared normal; otherwise it is an anomaly. Audio anomaly detection is one of the major uses of this method. 31
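The learn-a-region-then-test idea can be sketched without an actual SVM. The region below is a crude stand-in (a hypersphere around the training centroid, radius covering a quantile of the training points) rather than a kernelized SVM boundary; the data and quantile are made up:

```python
import math

def fit_region(train, q=0.95):
    """Learn a simple one-class region: a hypersphere around the training
    centroid whose radius covers a fraction q of the training points.
    (A stand-in for the boundary a one-class SVM would learn.)"""
    d = len(train[0])
    center = [sum(x[i] for x in train) / len(train) for i in range(d)]
    dists = sorted(math.dist(x, center) for x in train)
    radius = dists[min(int(q * len(dists)), len(dists) - 1)]
    return center, radius

def is_normal(x, center, radius):
    """Test phase: inside the learned region means normal."""
    return math.dist(x, center) <= radius

train = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0), (0.9, 0.95)]
center, radius = fit_region(train, q=1.0)       # cover all training points
print(is_normal((1.05, 1.0), center, radius))   # True: inside the region
print(is_normal((5.0, 5.0), center, radius))    # False: anomaly
```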

Rule based This method learns rules that capture the normal behavior of a system. A test instance that is not covered by any such rule is considered as an anomaly. Rule based techniques have been applied in both one class and multiclass settings. A basic multi-class rule-based technique consists of two steps: 1. Learn rules (each with a confidence level) from the training data using a rule learning algorithm. 32

2. Find for each test instance the rule that best captures the test instance. Complexity: This is dependent on which classification algorithm is used. Decision trees are faster than SVMs. +/- of classification based methods: + Classification-based techniques (especially the multi-class techniques) can use powerful algorithms that can distinguish between instances belonging to different classes. 33

+ The testing phase of classification based methods is fast since each test instance needs to be compared against only a precomputed model. Multi-class classification based methods rely on the availability of accurate labels for various normal classes, which is often impossible. Classification based methods assign a label to each test instance, which can also become a disadvantage when a meaningful anomaly score is desired for the test instances. 34
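The rule-based scheme described above (anomaly score = inverse of the confidence of the best rule covering the instance; no covering rule means anomaly) can be sketched as follows. The two rules and their attributes are hypothetical stand-ins for what a rule learner would produce:

```python
# Hypothetical rules a rule-learning algorithm might output: (predicate, confidence).
rules = [
    (lambda x: x["bytes"] < 10_000 and x["port"] in (80, 443), 0.95),
    (lambda x: x["port"] == 22 and x["bytes"] < 1_000, 0.80),
]

def anomaly_score(x):
    """Score = inverse confidence of the best rule covering x.
    An instance covered by no rule gets an infinite score (anomaly)."""
    covering = [conf for pred, conf in rules if pred(x)]
    return 1.0 / max(covering) if covering else float("inf")

print(anomaly_score({"bytes": 512, "port": 443}))   # ~1.05: covered, normal
print(anomaly_score({"bytes": 9e9, "port": 6667}))  # inf: no rule covers it
```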

Nearest neighbor based anomaly detection methods Assumption. Normal data instances occur in dense neighborhoods, while anomalies occur far from their closest neighbors. Nearest neighbor based anomaly detection methods can be broadly grouped into two categories: 1. Methods that use the distance of a data instance to its kth nearest neighbor as the anomaly score; 2. Methods that compute the relative density of each data instance to compute its anomaly score. 35

Using distance to kth nearest neighbor The anomaly score of a data instance is defined as its distance to its kth nearest neighbor in a given data set. Three extensions: 1. Modify the definition to obtain the anomaly score of a data instance. 2. Use different distance/similarity measures to handle different data types. 3. Improve the efficiency of the basic technique: the complexity of the basic technique is O(N^2), where N is the data size. Use faster methods. 36

For point 3, prune the search space by either ignoring instances that cannot be anomalous or by focusing on instances that are most likely to be anomalous. Using relative density Density based anomaly detection techniques estimate the density of the neighborhood of each data instance: Low density implies an anomaly. High density implies normal. Density based techniques perform poorly if the data has regions of varying densities. Approaches that compare the density of an instance's neighborhood with the densities of its neighbors' neighborhoods (relative density) have been developed. Complexity: O(N^2) +/- of nearest neighbor based methods: + They are unsupervised in nature and do not make any assumptions regarding the generative distribution for the data. Instead, they are purely data driven. + Semi-supervised techniques perform better than unsupervised techniques in terms of missed anomalies, since the likelihood that an anomaly will form a close neighborhood in the training data set is very low. + Adapting these methods to a different data type is easy: just modify the distance measure. Missed anomalies for unsupervised methods: if the data has normal instances that do not have enough close neighbors, or if the data has anomalies that do have enough close neighbors, the technique fails to label them correctly. Many false positives for semi-supervised methods: if the normal instances in the test data do not have enough similar normal instances in the training data. 39

The computational complexity of the testing phase is also a significant challenge, since it involves computing the distance of each test instance to many other instances. The performance of a nearest neighbor based technique relies greatly on the choice of distance measure. 40
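The basic kth-nearest-neighbor distance score from this section can be sketched as a brute-force routine (the toy 2-D data set is made up):

```python
import math

def knn_score(data, x, k=3):
    """Anomaly score of x = distance to its kth nearest neighbor in data.
    Brute force: O(N) per query, so O(N^2) to score the whole data set."""
    dists = sorted(math.dist(x, p) for p in data if p != x)
    return dists[k - 1]

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (4.0, 4.0)]
scores = {p: knn_score(data, p, k=2) for p in data}
# (4.0, 4.0) is far from its nearest neighbors, so it gets the largest score
print(max(scores, key=scores.get))  # (4.0, 4.0)
```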

Clustering based anomaly detection methods Clustering is used to group similar data instances into clusters. Clustering is primarily an unsupervised technique though semi-supervised clustering has also been explored lately. Three formulations based on different assumptions: 1. Normal data instances belong to a cluster in the data, while anomalies do not belong to any cluster. 2. Normal data instances lie close to their closest cluster centroid, while anomalies are far away from their closest cluster centroid. 41

3. Normal data instances belong to large and dense clusters, while anomalies either belong to small or sparse clusters. Assumption 1 remarks: Apply a known clustering based algorithm to the data set and declare any data instance that does not belong to any cluster as anomalous. A disadvantage of such techniques is that they are not optimized to find anomalies, since the main aim of the underlying clustering algorithm is to find clusters. Assumption 2 remarks: 42

Methods consist of two steps: (1) the data is clustered using a clustering algorithm, and (2) for each data instance, its distance to its closest cluster centroid is calculated as its anomaly score. Can operate in semi-supervised mode. If the anomalies in the data form clusters by themselves, these techniques will not be able to detect such anomalies. Assumption 3 remarks: Methods declare instances belonging to clusters whose size or density is below a threshold as anomalous. 43

There are linear time algorithms. There are many similarities between clustering and nearest neighbor based anomaly detection methods. Complexity: Depends on the training and detection algorithms, but there are a few O(N) ones. +/- of clustering based methods: + Unsupervised mode is viable. + Complex data types handled by using a clustering algorithm that can handle the particular data type. 44

+ The testing phase is fast, since the number of clusters against which every test instance needs to be compared is a small constant. Performance is highly dependent on the effectiveness of the clustering algorithm in capturing the cluster structure of normal instances. Many techniques detect anomalies as a byproduct of clustering, and hence are not optimized for anomaly detection. Missed anomalies: some clustering algorithms force every instance to be assigned to some cluster, and a cluster of anomalies is still a cluster, so its members are missed. Sloooow: O(dN), where d is the dimensionality of the data. 45
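The two-step procedure under Assumption 2 (cluster the data, then score each instance by its distance to the closest centroid) can be sketched with a tiny k-means. The deterministic initialization and toy data are illustrative assumptions:

```python
import math

def kmeans(data, k, iters=20):
    """Tiny k-means (Lloyd's algorithm) with deterministic initialization,
    for illustration only."""
    centroids = list(data[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in data:  # assign each point to its nearest centroid
            j = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[j].append(p)
        centroids = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl
                     else centroids[j] for j, cl in enumerate(clusters)]
    return centroids

def anomaly_score(x, centroids):
    """Step 2: distance of x to its closest cluster centroid."""
    return min(math.dist(x, c) for c in centroids)

data = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0),
        (5.0, 5.0), (5.1, 4.9), (4.9, 5.1), (10.0, 0.0)]
cents = kmeans(data, k=2)
scores = {p: anomaly_score(p, cents) for p in data}
print(max(scores, key=scores.get))  # (10.0, 0.0): farthest from any centroid
```

Note that (10.0, 0.0) is still assigned to a cluster by k-means; it is the distance-to-centroid score, not cluster membership, that exposes it.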

Statistical anomaly detection methods Underlying principle: An anomaly is an observation that is suspected of being partially or wholly irrelevant because it is not generated by the stochastic model assumed. Assumption. Normal data instances occur in high probability regions of a stochastic model while anomalies occur in low probability regions of the stochastic model. Statistical techniques fit a statistical model (usually for normal behavior) to the given data and then apply a 46

statistical inference test to determine if an unseen instance belongs to this model or not. Parametric methods Parametric methods assume that the normal data is generated by a parametric distribution with parameters θ and probability density function f(x, θ), where x is an observation. The anomaly score of a test instance (or observation) x is the inverse of the probability density f(x, θ). The parameters θ are estimated from the given data. 47

Gaussian modeling based Such methods assume that the data is generated from a Gaussian distribution. The parameters are estimated using Maximum Likelihood Estimates (MLE). The distance of a data instance to the estimated mean is the anomaly score for that instance. A threshold is applied to the anomaly scores to determine the anomalies. Different techniques in this category calculate the distance to the mean and the threshold in different ways. Statistical rules commonly used: 48

Box plot rule Grubbs' test Student's t-test χ² test Regression model based Two steps: 1. A regression model is fitted to the data. 2. For each test instance, the residual for the test instance is used to determine the anomaly score. 49

The residual is the part of the instance that is unexplained by the regression model. The magnitude of the residual can be used as the anomaly score for the test instance, though statistical tests have been proposed to determine anomalies with certain confidence. Another variant detects anomalies in multivariate time-series data generated by an Autoregressive Moving Average (ARMA) model. 50
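The Gaussian modeling recipe above (fit by MLE, score by distance to the mean, threshold) reduces to a z-score test in the univariate case. A minimal sketch; the data and the threshold of 2 standard deviations are made-up choices:

```python
import math

def gaussian_scores(data):
    """Fit a univariate Gaussian by MLE; score each point by its distance
    from the mean in standard deviations (a z-score)."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)  # MLE variance
    return [(x, abs(x - mu) / sigma) for x in data]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]
anomalies = [x for x, z in gaussian_scores(data) if z > 2.0]  # threshold rule
print(anomalies)  # [25.0]
```

Note that the outlier itself inflates the estimated mean and standard deviation (masking), which is why robust estimation matters in practice.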

Mixture of parametric distributions based Such methods use a mixture of parametric statistical distributions to model the data. Two subcategories: 1. Those that model the normal instances and anomalies as separate parametric distributions. 2. Those that model only the normal instances as a mixture of parametric distributions. Subcategory remarks: 1. The testing phase involves determining which distribution, normal or anomalous, the test instance belongs to. 51

2. Model the normal instances as a mixture of parametric distributions. A test instance that does not belong to any of the learned models is declared to be an anomaly. Nonparametric methods These methods use nonparametric statistical models, such that the model structure is not defined a priori but is instead determined dynamically from the data. These methods make fewer assumptions regarding the data (e.g., smoothness of the density) when compared to parametric techniques. 52

Histogram based The simplest nonparametric statistical method is to use histograms to maintain a profile of the normal data. The size of the bin used when building the histogram is key for anomaly detection: Too small: many normal test instances will fall in empty or rare bins, resulting in a high false alarm rate. Too large: many anomalous test instances will fall in frequent bins, resulting in a high false negative rate. 53

For univariate data there are two steps: 1. Build a histogram based on the different values taken by that feature in the training data. 2. Check if a test instance falls in any one of the bins of the histogram. If it does, then the test instance is normal. Otherwise it is anomalous. For multivariate data, a basic technique is to construct attributewise histograms. During testing, for each test instance the anomaly score for each attribute value of the test instance is calculated as the height of the bin that contains the attribute value. 54

The per-attribute anomaly scores are aggregated to obtain an overall anomaly score for the test instance. Complexity: Completely dependent on the method(s) used. Good luck. +/- of statistical based methods: + If the distributional assumptions hold, the statistical conclusions hold. + Confidence levels provide, well, confidence. + Unsupervised mode works if the distribution estimation step is robust to anomalies in the data. Methods rely on the assumption that the data is generated from a particular distribution. This assumption often does not hold true, especially for high dimensional real data sets. What was that famous quote about statistics??? Histograms are simple to implement and can easily lie about the results. You get what you pay for. 56
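The attribute-wise histogram procedure above can be sketched as follows. The aggregation rule (sum of one minus bin frequency per attribute) and the toy traffic records are illustrative assumptions:

```python
from collections import Counter

def build_histograms(train):
    """Attribute-wise histograms of the training data (categorical bins)."""
    d = len(train[0])
    return [Counter(x[i] for x in train) for i in range(d)]

def anomaly_score(x, hists, n_train):
    """Aggregate per-attribute scores: rarer (or empty) bins contribute more.
    Here each attribute contributes 1 minus its bin's relative frequency."""
    return sum(1.0 - hists[i][v] / n_train for i, v in enumerate(x))

train = [("tcp", 80), ("tcp", 80), ("tcp", 443), ("udp", 53), ("tcp", 80)]
hists = build_histograms(train)
print(anomaly_score(("tcp", 80), hists, len(train)))      # low: frequent bins
print(anomaly_score(("icmp", 31337), hists, len(train)))  # 2.0: empty bins
```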

Information theoretic methods Analyze the information content of a data set using information theoretic measures such as Kolmogorov complexity, entropy, relative entropy, etc. Assumption: Anomalies in data induce irregularities in the information content of the data set. Let C(D) denote the complexity of a given data set D. A basic information theoretic technique can be described as follows. Given a data set D, find the minimal subset of instances I such that C(D) − C(D − I) is maximal. All instances in the subset thus obtained are deemed anomalous. The problem addressed by this basic technique is to find a Pareto-optimal solution, which does not have a single optimum, since there are two different objectives that need to be optimized. Complexity: This has exponential time complexity. Never, ever use it unless you have no other choice. +/- of information theoretic based methods: + Unsupervised mode works like a charm. + No assumptions about the underlying data. 58

The performance of such techniques is highly dependent on the choice of the information theoretic measure. Often such measures detect anomalies only when there are a significantly large number of them. When applied to sequences and spatial data sets, these techniques rely on a substructure size that is often nontrivial to obtain. It is difficult to associate an anomaly score with a test instance using this method. 59
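A toy sketch of the C(D) − C(D − I) idea, using Shannon entropy as the complexity measure C and a greedy one-instance-at-a-time removal in place of the exponential exact search. The data set and the choice of entropy as C are illustrative assumptions:

```python
import math
from collections import Counter

def entropy(values):
    """C(D): Shannon entropy of a categorical data set, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def remove_one(data, v):
    out = list(data)
    out.remove(v)  # drop one occurrence of v
    return out

def greedy_anomalies(data, k):
    """Greedily remove the instance whose removal most reduces C(D);
    the removed instances are deemed anomalous. (The exact minimal-subset
    formulation is exponential; this is a cheap approximation.)"""
    data = list(data)
    removed = []
    for _ in range(k):
        v = min(set(data), key=lambda v: entropy(remove_one(data, v)))
        data.remove(v)
        removed.append(v)
    return removed

data = ["a"] * 10 + ["b"] * 10 + ["c"]  # the lone "c" inflates the entropy
print(greedy_anomalies(data, 1))        # ["c"]
```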

Spectral anomaly detection methods Spectral techniques try to find an approximation of the data using a combination of attributes that capture the bulk of the variability in the data. Assumption: Data can be embedded into a lower dimensional subspace in which normal instances and anomalies appear significantly different. The goal is to determine subspaces in which the anomalous instances can be easily identified. Such techniques can work in an unsupervised as well as a semi-supervised setting. 60

Principal component analysis is the major algorithm used. Complexity: typically linear in the number of instances but quadratic in the number of dimensions. Singular value decompositions are frequently used and are O(N^2). +/- of spectral based methods: + Spectral techniques automatically perform dimensionality reduction and are suitable for handling high dimensional data sets. + They can be used as a preprocessing step, followed by application of any existing anomaly detection technique in the transformed space. 61

+ They can be used in an unsupervised setting. Spectral techniques are useful only if the normal and anomalous instances are separable in the lower dimensional embedding of the data. Have high computational complexity. 62
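The PCA approach can be sketched in 2-D, where the top principal component of the covariance matrix has a closed form; the anomaly score is the squared residual after projecting onto that component, i.e., how poorly the low-dimensional embedding reconstructs the point. The near-diagonal toy data is made up:

```python
import math

def pca_scores(data):
    """Project 2-D data onto its top principal component (closed form for a
    2x2 covariance matrix); anomaly score = squared reconstruction residual."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    cxx = sum((x - mx) ** 2 for x, _ in data) / n
    cyy = sum((y - my) ** 2 for _, y in data) / n
    cxy = sum((x - mx) * (y - my) for x, y in data) / n
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)  # top eigenvector angle
    ux, uy = math.cos(theta), math.sin(theta)
    scores = []
    for x, y in data:
        dx, dy = x - mx, y - my
        proj = dx * ux + dy * uy
        scores.append((dx * dx + dy * dy) - proj * proj)  # squared residual
    return scores

# points near the line y = x, plus one point far off that line
data = [(i, i + 0.01 * (-1) ** i) for i in range(10)] + [(5.0, -5.0)]
scores = pca_scores(data)
print(data[scores.index(max(scores))])  # (5.0, -5.0)
```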

Defining concepts: Spatial, Graphs, Sequential, Profile. Handling conceptual anomalies There is very little literature in this area. It is ripe for Ph.D. dissertations. 63

Quick summary A general theory is still an open research problem that will reward numerous students with Ph.D.s in the future. That said, there are many areas that have been developed over a long time. 64