Detection of Outliers

1 Detection of Outliers TNM033 - Data Mining by Anton Auoja, Albert Backenhof & Mikael Dalkvist

2 Holy Outliers, Batman!! An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs. - Frank E. Grubbs

4 Holy Causes, Batman!! Apparatus malfunction. Fraudulent behavior. Human error. Natural deviations. Contamination.

5 Holy Applications, Batman!! Fraud detection. Medicine. Public health. Sports statistics. Detecting measurement errors.

6 Holy WEKA, Batman!! Interquartile Range. One Class Classifier. DBSCAN.

7 Holy Common Methods, Batman!! Statistical. Distance. Kernel. High Dimensional.

8 Holy Statistical Methods, Batman!! An outlier is an object with low probability with respect to the probability distribution model of the data. Model Based. Assume Gaussian distribution. Calculate the mean and standard deviation of the data. The probability of each object under the distribution can then be calculated.
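The model-based steps above can be sketched in a few lines of Python. This is only an illustration under the slide's Gaussian assumption; the function name `gaussian_outliers`, the `z_threshold` cut-off and the sample data are assumptions made for the example, not part of the original deck.

```python
import math
import statistics

def gaussian_outliers(values, z_threshold=3.0):
    """Fit a Gaussian to `values` and flag low-probability objects.

    Returns a list of (value, density, is_outlier) tuples.
    `z_threshold` is an illustrative cut-off, not a value from the slides.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    results = []
    for x in values:
        z = (x - mean) / std
        # Gaussian probability density of x under the fitted model
        density = math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))
        results.append((x, density, abs(z) > z_threshold))
    return results

# Hypothetical measurements with one suspicious value
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 15.0]
flagged = [x for x, _, is_out in gaussian_outliers(data, z_threshold=2.5) if is_out]
```

Note that the outlier itself inflates the estimated mean and standard deviation. With a sample of size n, no z-score can exceed (n - 1)/sqrt(n), which is about 2.85 for n = 10, so a threshold of 3 would flag nothing here.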

9 Holy Examples, Batman!! Box Plots. Trimmed Means. Grubbs' Test.

10 Holy Box and Whisker Plots, Batman!! Interquartile Range: IQR = Q3 - Q1. Lower Inner Fence: Q1 - 1.5*IQR. Upper Inner Fence: Q3 + 1.5*IQR. Lower Outer Fence: Q1 - 3*IQR. Upper Outer Fence: Q3 + 3*IQR.
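A minimal sketch of the fence calculation, assuming Python's `statistics.quantiles` with the "inclusive" method for the quartiles (other quartile conventions give slightly different fences). The labels "mild" (outside an inner fence) and "extreme" (outside an outer fence) follow common box-plot usage and are not taken from the slide; the data are made up.

```python
import statistics

def tukey_fences(values):
    """Return the inner and outer fences built from the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "lower_outer": q1 - 3.0 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3.0 * iqr,
    }

def classify(values):
    """Split the outliers into mild (inner fence) and extreme (outer fence)."""
    f = tukey_fences(values)
    mild, extreme = [], []
    for x in values:
        if x < f["lower_outer"] or x > f["upper_outer"]:
            extreme.append(x)
        elif x < f["lower_inner"] or x > f["upper_inner"]:
            mild.append(x)
    return mild, extreme

# Hypothetical sample: Q1 = 4.25, Q3 = 7.75, IQR = 3.5
data = [2, 4, 4, 5, 6, 6, 7, 8, 14, 30]
mild, extreme = classify(data)  # mild == [14], extreme == [30]
```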

11 Holy Trimmed Means, Batman!! Delete percentage of extreme values. Calculate mean. Use new mean for comparison.
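As a sketch of the trimmed mean, the snippet below drops the same share of values from each end of the sorted data before averaging. Symmetric trimming and the `proportion` parameter are assumptions of this example; the slide only says that a percentage of extreme values is deleted.

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the values from each tail."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each tail
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Hypothetical measurements with one suspicious value
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 15.0]
plain = statistics.mean(data)     # pulled upward by 15.0
robust = trimmed_mean(data, 0.1)  # drops 9.7 and 15.0 before averaging
```

Comparing `plain` with `robust` shows how much the extreme value shifts the ordinary mean.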

12 Holy Test, Grubbs!! Calculate the normal logarithm. Sort data. Calculate Z. Compare Z to the critical Z value.
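The test statistic can be sketched as below. The sketch skips the logarithm step listed on the slide (a caller could apply `math.log` to the data first) and takes the critical value as an argument, since it depends on the sample size and significance level and is normally read from a table. The value 2.290 used here is the commonly tabulated two-sided 5% critical value for n = 10; treat it as an assumption and check it against a published table.

```python
import statistics

def grubbs_statistic(values):
    """Return the most extreme value and its Z = |x - mean| / s."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    suspect = max(values, key=lambda x: abs(x - mean))
    return suspect, abs(suspect - mean) / std

def grubbs_test(values, critical_value):
    """Compare Z with a caller-supplied critical value."""
    suspect, z = grubbs_statistic(values)
    return suspect, z, z > critical_value

# Hypothetical measurements with one suspicious value
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 15.0]
suspect, z, is_outlier = grubbs_test(data, critical_value=2.290)
```

The test examines a single suspect value at a time, so it is usually rerun on the reduced data set after a value has been rejected.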

14 Holy Issues, Batman!! Identifying the distribution of the data set. The number of attributes. Mixtures of distributions.

15 Holy Distance Based Methods, Batman!! DB(p,D). k-nearest Neighbor. Local Distance Based.

16 Holy DB(p,D), Knorr & Ng, Batman!! An object o is an outlier if at least a fraction p of all objects in the database are at a distance greater than D from o.
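A brute-force sketch of this definition, assuming Euclidean distance and points given as coordinate tuples. The function name `db_outliers` and the sample points are illustrative only.

```python
import math

def db_outliers(points, p, D):
    """Return the points that are DB(p, D) outliers.

    A point is reported when at least a fraction p of all points
    lie farther than D from it (Euclidean distance assumed).
    """
    n = len(points)
    outliers = []
    for i, o in enumerate(points):
        far = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(o, q) > D
        )
        if far >= p * n:
            outliers.append(o)
    return outliers

# Hypothetical data: a tight cluster plus one distant point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
result = db_outliers(points, p=0.8, D=3.0)  # [(10, 10)]
```

The double loop compares every pair of points, which is the quadratic cost typical of distance-based methods.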

17 Holy Distance to k-nearest Neighbors, Batman!! Outlier score. Score each object in [0, ∞) depending on the distance to its k-nearest neighbors. Highly dependent on the choice of k. Can be modified to use the mean of the distances from a point to its 1NN, 2NN, ..., kNN as an outlier score.
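Both variants of the score can be sketched as follows, assuming Euclidean distance. The `use_mean` flag is a name chosen for this example: when it is false the score is the distance to the k-th nearest neighbor, otherwise it is the mean distance to the 1st through k-th neighbors.

```python
import math

def knn_scores(points, k, use_mean=False):
    """Outlier score per point based on its k nearest neighbors."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        nearest = dists[:k]
        # Mean over the 1NN..kNN distances, or just the kNN distance
        scores.append(sum(nearest) / k if use_mean else nearest[-1])
    return scores

# Hypothetical data: a tight cluster plus one distant point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
kth = knn_scores(points, k=2)
avg = knn_scores(points, k=2, use_mean=True)
```

Higher scores mean more isolated objects; in this data the last point gets by far the largest score under either variant.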

19 Holy Local distance-based algorithms, Batman!! Determine the difference of an object from its nearest neighbors. A threshold value is set. All objects whose outlier factors exceed this value are considered to be outliers. Local Outlier Factor (LOF).
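A compact sketch of the Local Outlier Factor, assuming Euclidean distance and exactly k neighbors per point (ties at the k-th distance are broken arbitrarily, and duplicate points would cause a division by zero). It follows the usual LOF construction: k-distance, reachability distance, local reachability density, then the ratio of the neighbors' densities to the point's own.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor for each point (simplified, no tie handling)."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]

    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        neighbors.append(order[:k])                # k nearest neighbors of i
        k_distance.append(dist[i][order[k - 1]])   # distance to the k-th one

    # Local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(k / sum(reach))

    # LOF: average density of the neighbors relative to the point's own density
    return [
        sum(lrd[j] for j in neighbors[i]) / (k * lrd[i])
        for i in range(n)
    ]

# Hypothetical data: a tight cluster plus one distant point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = lof_scores(points, k=2)
threshold = 1.5  # illustrative threshold, not a value from the slides
outliers = [p for p, s in zip(points, scores) if s > threshold]
```

Scores close to 1 indicate a density similar to that of the neighbors; scores well above 1 indicate a point in a sparser region than its neighbors.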

20 Holy Advantages, Batman!! More general and easier to apply than statistical approaches. No probabilistic model needed. Can find local outliers.

21 Unholy Disadvantages, Batman!! Methods are typically O(n^2). Sensitive to choice of parameters. Dependent on pre-defined parameters. Can't handle datasets with regions that have widely differing density.

22 Holy Kernel Based Methods, Batman!!

24 Original space Hilbert (Feature) space

25 X H

27 Holy Implicitly, Batman!! No additional memory or computation cost.
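To illustrate what "implicitly" means, the sketch below uses a degree-2 polynomial kernel on 2-D points as an assumed example (the slides do not name a specific kernel). The kernel value equals the dot product of the explicitly mapped feature vectors, and distances in feature space follow from kernel values alone, so the mapped vectors never have to be built.

```python
import math

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def explicit_map(x):
    """Explicit feature map for 2-D inputs that matches poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def feature_distance(x, y, kernel):
    """Distance between the images of x and y, using kernel values only."""
    sq = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # guard against tiny negative round-off

x, y = (1.0, 2.0), (3.0, 1.0)
implicit = poly2_kernel(x, y)
explicit = sum(a * b for a, b in zip(explicit_map(x), explicit_map(y)))
d_kernel = feature_distance(x, y, poly2_kernel)
d_explicit = math.dist(explicit_map(x), explicit_map(y))
```

Here `implicit` and `explicit` agree, as do `d_kernel` and `d_explicit`, up to floating-point round-off.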

28 Holy High Dimensional, Batman!! Curse of Dimensionality

29 One way is to create subspaces of original space.

30 Another is Angle Based Outlier Degree.
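A simplified sketch of the angle-based idea: for each point, look at the angles it forms with every pair of other points and measure how much they vary. A point far outside the data sees all other points in roughly the same direction, so its variance is small. This sketch uses an unweighted variance of the distance-scaled term <a - p, b - p> / (|a - p|^2 |b - p|^2); published formulations may weight the pairs differently, and duplicate points would cause a division by zero here.

```python
import itertools
import statistics

def abod_scores(points):
    """Angle-based score per point; LOW values suggest outliers."""
    scores = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        terms = []
        for a, b in itertools.combinations(others, 2):
            pa = [ai - pi for ai, pi in zip(a, p)]
            pb = [bi - pi for bi, pi in zip(b, p)]
            dot = sum(u * v for u, v in zip(pa, pb))
            na2 = sum(u * u for u in pa)  # squared length of a - p
            nb2 = sum(v * v for v in pb)  # squared length of b - p
            terms.append(dot / (na2 * nb2))
        scores.append(statistics.pvariance(terms))
    return scores

# Hypothetical data: a tight cluster plus one distant point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = abod_scores(points)
most_outlying = points[scores.index(min(scores))]
```

The loop over all pairs for every point makes this cubic in the number of points, so it is only practical for small data sets as written.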
