DATA MINING II - 1DL460, Spring 2016
A second course in data mining
http://www.it.uu.se/edu/course/homepage/infoutv2/vt16
Kjell Orsborn
Uppsala Database Laboratory
Department of Information Technology, Uppsala University, Uppsala, Sweden
09/03/16
Anomaly Detection (Tan, Steinbach, Kumar ch. 10)
Kjell Orsborn
Department of Information Technology, Uppsala University, Uppsala, Sweden
What is an anomaly or outlier?
- Single data points, or sets of data points, that are considerably different from the remainder of the data (i.e., the normal data)
  - E.g., an unusual credit card purchase; in sports: Usain Bolt, Leo Messi
- Outliers are different from noise
  - Noise is random error or variance in a measured variable
  - Noise should be removed before outlier detection
- Outliers are interesting: they violate the mechanism that generates the normal data
- Outlier detection vs. novelty detection: a novelty is treated as an outlier at an early stage, but is later merged into the model
- Applications: credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection, customer segmentation, medical analysis
Anomaly/outlier detection
Variants of anomaly/outlier detection problems:
- Given a database D, find all data points x ∈ D with anomaly scores greater than some threshold t
- Given a database D, find all data points x ∈ D having the top-n largest anomaly scores f(x)
- Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
Types of outliers (I)
Three kinds: global, contextual and collective outliers
- Global outlier (or point anomaly)
  - An object is a global outlier if it deviates significantly from the rest of the data set
  - Ex.: auditing stock trading transactions
  - Issue: finding an appropriate measure of deviation
- Contextual outlier (or conditional outlier; note: the local outlier is a special case)
  - An object is a contextual outlier if it deviates significantly within a selected context
  - Ex.: is -20 °C in Uppsala an outlier? (it depends on whether it is summer or winter)
  - Attributes of data objects are divided into two groups:
    - Contextual attributes: define the context, e.g., time and location
    - Behavioral attributes: characteristics of the object used in outlier evaluation, e.g., temperature
  - Can be viewed as a generalization of local outliers, whose density deviates significantly from that of their local area
  - Issue: how to define or formulate a meaningful context?
Types of outliers (II)
- Collective outliers
  - A subset of data objects that collectively deviates significantly from the whole data set, even if the individual data objects are not outliers themselves
  - Applications: e.g., intrusion detection, when a number of computers keep sending denial-of-service packets to each other
- Detection of collective outliers
  - Considers not only the behavior of individual objects, but also that of groups of objects
  - Requires background knowledge of the relationships among data objects, such as a distance or similarity measure on objects
- A data set may contain multiple types of outliers, and one object may belong to more than one type of outlier
Challenges of outlier detection
- Modeling normal objects and outliers properly
  - It is hard to enumerate all possible normal behaviors in an application
  - The border between normal and outlier objects is often a gray area
- Application-specific outlier detection
  - The choice of distance measure among objects and the model of relationships among objects are often application-dependent
  - E.g., in clinical data a small deviation can be an outlier, while marketing analysis tolerates larger fluctuations
- Handling noise in outlier detection
  - Noise may distort the normal objects and blur the distinction between normal objects and outliers; it may hide outliers and reduce the effectiveness of outlier detection
Challenges of outlier detection (cont.)
- Understandability
  - Understanding why objects are outliers: justification of the detection
  - Specifying the degree of an outlier: the unlikelihood of the object being generated by the normal mechanism
- How many outliers are there in the data?
- When the method is unsupervised, validation can be quite challenging (just as for clustering)
- Outlier detection can be compared to finding a needle in a haystack
- Working assumption: there are considerably more normal observations than abnormal observations (outliers/anomalies) in the data
Ozone depletion history: the importance of anomaly detection
- In 1985, three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
- The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!
Sources:
http://undsci.berkeley.edu/article/0_0_0/ozone_depletion_09
http://ozonewatch.gsfc.nasa.gov/facts/history.html
http://ozonewatch.gsfc.nasa.gov/index.html
Anomaly detection schemes
General steps:
- Build a profile of the normal behavior
  - The profile can be patterns or summary statistics for the overall population
- Use the normal profile to detect anomalies
  - Anomalies are observations whose characteristics differ significantly from the normal profile
Types of anomaly detection schemes:
- Graphical and statistical-based
- Proximity-based
- Density-based
- Clustering-based
Graphical approaches
- Boxplot (1-D), scatter plot (2-D), spin plot (3-D)
- Limitations: time consuming, subjective
Convex hull method
- Extreme points are assumed to be outliers
- Use the convex hull method to detect extreme values
- Data points are assigned to layers of convex hulls that are peeled off to detect outliers
- What if the outlier occurs in the middle of the data?
Statistical approaches
- Assume a parametric model describing the distribution of the data (e.g., a normal distribution)
- Apply a statistical test that depends on:
  - The data distribution
  - The parameters of the distribution (e.g., mean, variance)
  - The number of expected outliers (confidence limit)
The Grubbs test
- Detects outliers in univariate data (i.e., data with a single attribute), assuming the sample comes from a normal distribution
- Also called the maximum normed residual test
- For each object x in the data set, compute its z-score; the test statistic is the largest one:

    G_exp = max |x − x̄| / s

  where x̄ is the sample mean and s the sample standard deviation
- The corresponding x is declared an outlier if G_exp > G_critical, where

    G_critical = ((N − 1) / √N) · √( t² / (N − 2 + t²) )

  with t the critical value of a two-sided t-distribution with N − 2 degrees of freedom at significance level α/(2N), and N the number of objects in the data set
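A minimal sketch of the Grubbs test in Python. The data values are invented for illustration, and the critical value is taken from a published Grubbs table (two-sided, N = 8, α = 0.05) rather than computed from the t-distribution:

```python
import math
import statistics

def grubbs_statistic(data):
    """G_exp = max |x - mean| / s, the maximum normed residual."""
    mean = statistics.fmean(data)
    s = statistics.stdev(data)              # sample standard deviation
    suspect = max(data, key=lambda x: abs(x - mean))
    return abs(suspect - mean) / s, suspect

# Hypothetical sample with one suspicious value
data = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 15.0]
G_CRITICAL = 2.126                          # tabulated for N = 8, alpha = 0.05
g, suspect = grubbs_statistic(data)
if g > G_CRITICAL:
    print(f"{suspect} is an outlier (G_exp = {g:.3f})")
```

Note that the plain Grubbs test flags at most one outlier per pass; repeated application on the reduced sample is a common extension.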
Statistical-based likelihood approach
- Identify outliers by calculating the change in likelihood when moving a point from one distribution to the other in a mixture of two distributions
- The overall probability distribution of the data:

    D(x) = (1 − λ) M(x) + λ A(x)

  where λ is the expected fraction of outliers
- M is a probability distribution estimated from the data
  - Usually Gaussian, but it can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
- A is assumed to be a uniform distribution
- Likelihood and log likelihood at time t:

    L_t(D) = ∏_{i=1}^{N} P_D(x_i)
           = ( (1 − λ)^{|M_t|} ∏_{x_i ∈ M_t} P_{M_t}(x_i) ) · ( λ^{|A_t|} ∏_{x_i ∈ A_t} P_{A_t}(x_i) )

    LL_t(D) = |M_t| log(1 − λ) + Σ_{x_i ∈ M_t} log P_{M_t}(x_i) + |A_t| log λ + Σ_{x_i ∈ A_t} log P_{A_t}(x_i)
Statistical-based likelihood approach
- Assume the data set D contains samples from a mixture of two probability distributions:
  - M, the majority distribution (typically Gaussian)
  - A, the anomalous distribution (typically uniform)
- General approach of Algorithm 10.1 (Tan et al.):
  - Initially, assume all the data points belong to M
  - Let LL_t(D) be the log likelihood of D at time t
  - For each point x_t that belongs to M, move it to A
    - Let LL_{t+1}(D) be the new log likelihood
    - Compute the difference Δ = LL_t(D) − LL_{t+1}(D)
    - If Δ exceeds some threshold c, then x_t is declared an anomaly and is moved permanently from M to A
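The steps above can be sketched in Python. This is a simplified 1-D version under stated assumptions: M is a Gaussian refit to its current members, A is uniform over the data range, and a point is kept in A when the move improves the total log likelihood by more than the threshold c (moving a true outlier out of the Gaussian both removes its tiny density term and tightens the fit for every remaining point). The data, λ and c values are illustrative:

```python
import math

def gauss_logpdf(v, mu, sigma):
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((v - mu) / sigma) ** 2

def log_likelihood(M, A, lam, width):
    """LL(D) = |M| log(1-lam) + sum_M log P_M  +  |A| log lam + sum_A log P_A,
    with M refit as a Gaussian to its current members and A uniform."""
    ll = 0.0
    if M:
        mu = sum(M) / len(M)
        sigma = max(math.sqrt(sum((v - mu) ** 2 for v in M) / len(M)), 1e-6)
        ll += len(M) * math.log(1 - lam)
        ll += sum(gauss_logpdf(v, mu, sigma) for v in M)
    if A:
        ll += len(A) * (math.log(lam) + math.log(1.0 / width))
    return ll

def likelihood_outliers(data, lam=0.05, c=1.0):
    """Tentatively move each point from M to A; keep it in A only when the
    move raises the total log likelihood by more than c."""
    width = max(data) - min(data)
    M, A = list(data), []
    for x in list(data):
        before = log_likelihood(M, A, lam, width)
        M.remove(x)
        A.append(x)
        if log_likelihood(M, A, lam, width) - before <= c:
            A.pop()          # no real improvement: x stays in M
            M.append(x)
    return A

data = [9.8, 10.0, 10.1, 10.2, 9.9, 10.0, 10.1, 10.3, 9.7, 25.0]
print(likelihood_outliers(data))  # the far point 25.0 ends up in A
```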
Statistical-based likelihood approach
Algorithm 10.1 (Tan et al.)
Limitations of statistical approaches
- Most of the tests are for a single attribute
- In many cases, the data distribution may not be known
- For high-dimensional data, it may be difficult to estimate the true distribution
Proximity-based outlier detection
- In proximity-based outlier detection, an object is an outlier if it is distant from most other points (also called distance-based outliers)
- More general and more easily applied than statistical approaches, since it is usually easier to define a proximity measure
- There are various ways to define such outliers:
  - Data points for which there are fewer than p neighboring points within a distance D
  - Data points whose distance to the k-th nearest neighbor is greatest
    - Can be sensitive to the value of k
  - Data points whose average distance to the k nearest neighbors is greatest
    - More robust than the distance to the k-th nearest neighbor alone
- Computing the distance between every pair of data points is expensive, O(m²)
  - Grid-based methods and indexing can improve performance and complexity
- Does not handle widely varying densities well, since it uses global thresholds
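A brute-force sketch of the k-th-nearest-neighbor score (the O(m²) pairwise variant; the function name and example points are illustrative):

```python
import math

def knn_distance_scores(points, k):
    """Outlier score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_distance_scores(points, k=2)
# the isolated point (10, 10) gets the largest score
```

Switching `dists[k - 1]` to `sum(dists[:k]) / k` gives the more robust average-distance variant from the bullet list above.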
Nearest-neighbor based approach
Example where the outlier score is given by the distance to the k-th nearest neighbor
Density-based outlier detection
- Density-based outlier: outliers are points in regions of low density
- The outlier score of an object is the inverse of the density around the object
- Inverse-distance density (inverse of the average distance to the k nearest neighbors):

    density(x, k) = ( Σ_{y ∈ N(x,k)} dist(x, y) / |N(x, k)| )⁻¹

  where N(x, k) is the set of k nearest neighbors of x, |N(x, k)| is the size of that set, and y is a nearest neighbor
- Count-based density (DBSCAN): the density around an object equals the number of objects within a specified distance d of the object
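The inverse-distance density is a small computation; a sketch with illustrative points:

```python
import math

def inverse_distance_density(points, i, k):
    """density(x, k) = 1 / (average distance from x to its k nearest neighbors)."""
    x = points[i]
    dists = sorted(math.dist(x, q) for j, q in enumerate(points) if j != i)
    return 1.0 / (sum(dists[:k]) / k)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
densities = [inverse_distance_density(points, i, k=2) for i in range(len(points))]
# the isolated point (10, 10) has the lowest density
```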
Density-based outlier detection (the LOF approach)
- For each point, compute the density of its local neighborhood
- Compute the local outlier factor (LOF) of a sample p as the average ratio of the density of p's nearest neighbors to the density of p itself
- Outliers are points with the largest LOF values
- In the figure, the nearest-neighbor approach does not consider p₂ an outlier, while the LOF approach finds both p₁ and p₂ to be outliers
Density-based outlier detection using relative density
- The average relative density (ard) of a point x is the ratio of the density of x to the average density of its nearest neighbors:

    ard(x, k) = density(x, k) / ( Σ_{y ∈ N(x,k)} density(y, k) / |N(x, k)| )   (Eq. 10.7)

- A simplified version of the LOF technique uses ard(x, k): points whose ard is well below 1 lie in regions sparser than their neighborhoods and are flagged as outliers
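Eq. 10.7 can be sketched directly on top of the inverse-distance density (self-contained, brute-force neighbor search; the example points are illustrative):

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (brute force)."""
    order = sorted((math.dist(points[i], q), j)
                   for j, q in enumerate(points) if j != i)
    return [j for _, j in order[:k]]

def density(points, i, k):
    """Inverse of the average distance to the k nearest neighbors."""
    nbrs = knn(points, i, k)
    return 1.0 / (sum(math.dist(points[i], points[j]) for j in nbrs) / k)

def avg_relative_density(points, i, k):
    """Eq. 10.7: density of x divided by the mean density of its neighbors."""
    nbrs = knn(points, i, k)
    return density(points, i, k) / (sum(density(points, j, k) for j in nbrs) / k)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
ards = [avg_relative_density(points, i, k=2) for i in range(len(points))]
# the isolated point (8, 8) gets an ard well below 1
```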
Example of the relative density (LOF) approach (using k = 10)
Clustering-based outlier detection
- Clustering-based outlier: an object is a cluster-based outlier if it does not strongly belong to any cluster
- Basic idea:
  - Cluster the data into groups of differing density
  - Choose points in small clusters as candidate outliers
  - Compute the distance between the candidate points and the non-candidate clusters; if the candidate points are far from all non-candidate clusters, they are outliers
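The basic idea above can be sketched as follows. The greedy distance-based grouping is only a stand-in for a real clustering algorithm (e.g., DBSCAN or k-means), and `eps` and `min_size` are illustrative parameters, not from the slides:

```python
import math

def cluster_by_distance(points, eps):
    """Greedy grouping: a point joins the first cluster that has a member
    within eps of it, otherwise it starts a new cluster."""
    clusters = []
    for p in points:
        for c in clusters:
            if any(math.dist(p, q) <= eps for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

def clustering_outliers(points, eps, min_size):
    """Members of small clusters are candidates; a candidate is an outlier
    if it is far from the centroid of every large cluster."""
    clusters = cluster_by_distance(points, eps)
    big = [c for c in clusters if len(c) >= min_size]
    centroids = [tuple(sum(v) / len(c) for v in zip(*c)) for c in big]
    outliers = []
    for c in clusters:
        if len(c) >= min_size:
            continue
        for p in c:
            if all(math.dist(p, m) > eps for m in centroids):
                outliers.append(p)
    return outliers

points = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5),
          (10, 10), (10.5, 10), (10, 10.5), (10.5, 10.5), (5, 5)]
print(clustering_outliers(points, eps=2.0, min_size=3))  # flags (5, 5)
```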
Clustering-based outlier example
Outliers in lower-dimensional projections (a grid-based approach)
- In high-dimensional space, data is sparse and the notion of proximity becomes meaningless
  - Every point is an almost equally good outlier from the perspective of proximity-based definitions
- Lower-dimensional projection methods
  - A point is an outlier if, in some lower-dimensional projection, it is present in a local region of abnormally low density
Outliers in lower-dimensional projections (a grid-based approach)
- Divide each attribute into φ equal-depth intervals
  - Each interval contains a fraction f = 1/φ of the records
- Consider a k-dimensional cube created by picking grid ranges from k different dimensions
- If the attributes are independent, we expect such a region to contain a fraction f^k of the records
- If there are N points, the sparsity of a cube D containing n points can be measured by the sparsity coefficient S(D):

    S(D) = (n − N·f^k) / √( N·f^k (1 − f^k) )

  since the expected number and standard deviation of points in a k-dimensional cube are N·f^k and √(N·f^k (1 − f^k)), respectively
- Negative sparsity indicates that the cube contains fewer points than expected
Ref: Charu C. Aggarwal and Philip S. Yu, "Outlier Detection for High Dimensional Data", ACM SIGMOD 2001, May 21-24, Santa Barbara, California, USA, 2001.
Example for the sparsity coefficient
N = 100, φ = 5, f = 1/5 = 0.2, N·f² = 4 (expected number of points per 2-D cell)
09/03/16