Chapter 5: Outlier Detection

Similar documents
Epilog: Further Topics

Outlier Detection Techniques

Outlier Detection Techniques

DATA MINING II - 1DL460

Detection of Outliers

OUTLIER MINING IN HIGH DIMENSIONAL DATASETS

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

Kapitel 4: Clustering

Chapter 7: Numerical Prediction

Anomaly Detection on Data Streams with High Dimensional Data Environment

Chapter 9: Outlier Analysis

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016

Course Content. What is an Outlier? Chapter 7 Objectives

Large Scale Data Analysis for Policy

DBSCAN. Presented by: Garrett Poppe

Statistics 202: Data Mining. c Jonathan Taylor. Outliers Based in part on slides from textbook, slides of Susan Holmes.

Knowledge Discovery in Databases

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Machine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017

Knowledge Discovery in Databases II Summer Term 2017

Network Traffic Measurements and Analysis

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Statistics 202: Statistical Aspects of Data Mining

Clustering Part 4 DBSCAN

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2016

数据挖掘 Introduction to Data Mining

OPTICS-OF: Identifying Local Outliers

University of Florida CISE department Gator Engineering. Clustering Part 4

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

CISC 4631 Data Mining

Compare the density around a point with the density around its local neighbors. The density around a normal data object is similar to the density

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

DS504/CS586: Big Data Analytics Big Data Clustering II

Anomaly detection. Luboš Popelínský. Luboš Popelínský KDLab, Faculty of Informatics, MU

Detection of Anomalies using Online Oversampling PCA

Local Context Selection for Outlier Ranking in Graphs with Multiple Numeric Node Attributes

OUTLIER DATA MINING WITH IMPERFECT DATA LABELS

CS570: Introduction to Data Mining

Lecture 7: Linear Regression (continued)

DS504/CS586: Big Data Analytics Big Data Clustering II

Data Mining Algorithms

Overview of Clustering

Data Mining. Lecture 03: Nearest Neighbor Learning

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

A Meta analysis study of outlier detection methods in classification

Clustering CS 550: Machine Learning

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Mean-shift outlier detection

10701 Machine Learning. Clustering

Generative and discriminative classification techniques

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining, 2 nd Edition

Clustering Algorithms for Data Stream

Package ldbod. May 26, 2017

Filtered Clustering Based on Local Outlier Factor in Data Mining

A Fast Randomized Method for Local Density-based Outlier Detection in High Dimensional Data

Data Mining: Exploring Data. Lecture Notes for Chapter 3

ECLT 5810 Clustering

Spatial Outlier Detection

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

Outlier Detection with Two-Stage Area-Descent Method for Linear Regression

Chapter 4: Text Clustering

Distance-based Methods: Drawbacks

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\

Authors: Coman Gentiana. Asparuh Hristov. Daniel Corteso. Fernando Nunez

Knowledge Discovery and Data Mining I

INF 4300 Classification III Anne Solberg The agenda today:

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

I. INTRODUCTION II. RELATED WORK.

University of Florida CISE department Gator Engineering. Clustering Part 5

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2015

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.

Computer Vision I - Filtering and Feature detection

Partition Based with Outlier Detection

Global Journal of Engineering Science and Research Management

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Region-based Segmentation

Outlier Detection. Chapter 12

Data Mining 4. Cluster Analysis

UVA CS 4501: Machine Learning. Lecture 10: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia

AN IMPROVED DENSITY BASED k-means ALGORITHM

A Review on Cluster Based Approach in Data Mining

Clustering Lecture 4: Density-based Methods

ECLT 5810 Clustering

Machine Learning using MapReduce

CHAPTER 4: CLUSTER ANALYSIS

Generative and discriminative classification techniques

A Nonparametric Outlier Detection for Effectively Discovering Top-N Outliers from Engineering Data

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS

Nearest neighbor classification DSE 220

Unsupervised Learning : Clustering

COMP 465: Data Mining Still More on Clustering

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

Computer Department, Savitribai Phule Pune University, Nashik, Maharashtra, India

Clustering in Data Mining

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Transcription:

Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 5: Outlier Detection Lecture: Prof. Dr. Thomas Seidl Tutorials: Julian Busch, Evgeniy Faerman, Florian Richter, Klaus Schmid Knowledge Discovery in Databases I: Clustering 1

Überblick Introduction Clustering based approach Statistical approaches Distance-based Outliers Density-based Outliers und Local Outliers Angle-based Outliers Summary 2

Introduction What is an outlier? Definition nach Hawkins [Hawkins 1980]: An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism Statistics-based intuition: Normal data objects follow a generating mechanism, e.g. some given statistical process Abnormal objects deviate from this generating mechanism 3

Introduction Example: Hadlum vs. Hadlum (1949) [Barnett 1978] The birth of a child to Mrs. Hadlum happened 349 days after Mr. Hadlum left for military service. Average human gestation period is 280 days (40 weeks). Statistically, 349 days is an outlier. 4

Introduction Beispiel: Hadlum vs. Hadlum (1949) [Barnett 1978] blue: statistical basis (13634 observations of gestation periods) green: assumed underlying Gaussian process Very low probability for the birth of Mrs. Hadlums child being generated by this process red: assumption of Mr. Hadlum (another Gaussian process responsible for the observed birth, where the gestation period starts later) 5

Introduction Applications: Fraud detection Purchasing behavior of a credit card owner usually changes when the card is stolen Abnormal buying patterns can characterize credit card abuse Medicine Unusual symptoms or test results may indicate potential health problems of a patient Whether a particular test result is abnormal may depend on other characteristics of the patients (e.g. gender, age, ) Public health The occurrence of a particular disease, e.g. tetanus, scattered across various hospitals of a city indicate problems with the corresponding vaccination program in that city Whether an occurrence is abnormal depends on different aspects like frequency, spatial correlation, etc. 6

Introduction Applications: Sports statistics In many sports, various parameters are recorded for players in order to evaluate the players performances Outstanding (in a positive as well as a negative sense) players may be identified as having abnormal parameter values Sometimes, players show abnormal values only on a subset or a special combination of the recorded parameters Detecting measurement errors Data derived from sensors (e.g. in a given scientific experiment) may contain measurement errors Abnormal values could provide an indication of a measurement error Removing such errors can be important in other data mining and data analysis tasks One person s noise could be another person s signal. 7

Introduction Important properties of Outlier Models: Global vs. local approach: Outlierness regarding whole dataset (global) or regarding a subset of data (local)? Labeling vs. Scoring Binary decision or outlier degree score? Assumptions about Outlierness : What are the characteristics of an outlier object? 8

Überblick Introduction Clustering based approach Statistical approaches Distance-based Outliers Density-based Outliers und Local Outliers Angle-based Outliers Summary 9

Clustering-based An object is a cluster-based outlier if it does not strongly belong to any cluster: Basic idea: Cluster the data into groups Choose points in small clusters as candidate outliers. Compute the distance between candidate points and non-candidate clusters. If candidate points are far from all other non-candidate points and clusters, they are outliers A more systematic approach Find clusters and then assess the degree to which a point belongs to any cluster e.g. for k-means distance to the centroid In case of k-means (or in general, clustering algorithms with some objective function), if the elimination of a point results in substantial improvement of the objective function, we could classify it as an outlier i.e., clustering creates a model of the data and the outliers distort that model. 10

Überblick Introduction Clustering based approach Statistical approaches Distance-based Outliers Density-based Outliers und Local Outliers Angle-based Outliers Summary 11

Statistical Tests General idea Given a certain kind of statistical distribution (e.g., Gaussian) Compute the parameters assuming all data points have been generated by such a statistical distribution (e.g., mean and standard deviation) Outliers are points that have a low probability to be generated by the overall distribution (e.g., deviate more than 3 times the standard deviation from the mean) Basic assumption Normal data objects follow a (known) distribution and occur in a high probability region of this model Outliers deviate strongly from this distribution 12

Statistical Tests A huge number of different tests are available differing in Type of data distribution (e.g. Gaussian) Number of variables, i.e., dimensions of the data objects (univariate/multivariate) Number of distributions (mixture models) Parametric versus non-parametric (e.g. histogram-based) Example on the following slides Gaussian distribution Multivariate 1 model Parametric 13

Statistical Tests Probability density function of a multivariate normal distribution T 1 ( x ) Σ ( x ) 1 2 N( x) e d 2 is the mean value of all points (usually data are normalized such that =0) is the covariance matrix from the mean MDist( x, ) 1 ( x ) Σ ( x ) is the Mahalanobis distance of point x to MDist follows a c 2 -distribution with d degrees of freedom (d = data dimensionality) All points x, with MDist(x,) > c 2 (0,975) [ 3. s] 14

Statistical Tests Visualization (2D) [Tan et al. 2006] 15

Statistical Tests Problems Curse of dimensionality The larger the degree of freedom, the more similar the MDist values for all points x-axis: observed MDist values y-axis: frequency of observation 16

Statistical Tests Problems (cont.) Robustness Mean and standard deviation are very sensitive to outliers These values are computed for the complete data set (including potential outliers) The MDist is used to determine outliers although the MDist values are influenced by these outliers Discussion Data distribution is fixed Low flexibility (if no mixture models) Global method DB 17

Überblick Introduction Clustering based approach Statistical approaches Distance-based Outliers Density-based Outliers und Local Outliers Angle-based Outliers Summary 18

Distance-based Approaches General Idea Judge a point based on the distance(s) to its neighbors Several variants proposed Basic Assumption Normal data objects have a dense neighborhood Outliers are far apart from their neighbors, i.e., have a less dense neighborhood 19

Distance-based Approaches DB(e,)-Outliers Basic model [Knorr and Ng 1997] Given a radius e and a percentage A point p is considered an outlier if at most percent of all other points have a distance to p less than e Card({ q DB dist ( p, q) e}) OutlierSet ( e, ) { p } Card ( DB) p 1 e range-query with radius e p 3 p 2 20

Distance-based Approaches Outlier scoring based on knn distances General models Take the knn distance of a point as its outlier score k-dist p 1 p 3 p 2 21

k th nearest neighbor based The outlier score of an object is given by the distance to its k-nearest neighbor. - theoretically lowest outlier score: 0 k=5 [Tan, Steinbach, Kumar 2006] 22

[Tan, Steinbach, Kumar 2006] DATABASE k th nearest neighbor based The outlier score is highly sensitive to the value of k If k is too small, then a small number of close neighbors can cause low outlier scores. If k is too large, then all objects in a cluster with less than k objects might become outliers. 23

k th nearest neighbor based cannot handle datasets with regions of widely different densities due to the global threshold sparse cluster dense cluster [Tan, Steinbach, Kumar 2006] 24

Überblick Introduction Clustering based approach Statistical approaches Distance-based Outliers Density-based Outliers und Local Outliers Angle-based Outliers Summary 25

Density-based Approaches General idea Compare the density around a point with the density around its local neighbors. The relative density of a point compared to its neighbors is computed as an outlier score. Approaches also differ in how to estimate density. Basic assumption The density around a normal data object is similar to the density around its neighbors. The density around an outlier is considerably different to the density around its neighbors. 26

[Tan, Steinbach, Kumar 2006] DATABASE Density-based Approaches Different definitions of density: e.g., # points within a specified distance d from the given object The choice of d is critical If d is to small many normal points might be considered outliers If d is to large, many outlier points will be considered as normal A global notion of density is problematic (as it is in clustering) fails when data contain regions of different densities Solution: use a notion of density that is relative to the neighborhood of the object D has a higher absolute density than A but compared to its neighborhood, D s density is lower. 27

Density-based Approaches Local Outlier Factor (LOF) [Breunig et al. 1999, 2000] Motivation: Distance-based outlier detection models have problems with different densities How to compare the neighborhood of points from areas of different densities? Example DB(e,)-outlier model» Parameters e and cannot be chosen so that o 2 is an outlier but none of the points in cluster C 1 (e.g. q) is an outlier Outliers based on knn-distance» knn-distances of objects in C 1 (e.g. q) C 1 q are larger than the knn-distance of o 2 Solution: consider relative density C 2 o 2 o 1 28

Density-based Approaches Model Reachability distance Introduces a smoothing factor reach-distk ( p, o) max{ k -distance( o), dist( p, o)} Local reachability density (lrd) of point p Inverse of the average reach-dists of the knns of p lrd k ( p) oknn ( p) reachdist Card Local outlier factor (LOF) of point p Average ratio of lrds of neighbors of p and lrd of p k knn( p) oknn ( p) LOFk ( p) Card ( p, o) lrd lrd k k knn( p) 1 ( o) ( p) 29

Density-based Approaches Properties LOF 1: point is in a cluster (region with homogeneous density around the point and its neighbors) LOF >> 1: point is an outlier Data set LOFs (MinPts = 40) Discussion Choice of k (MinPts in the original paper) specifies the reference set Originally implements a local approach (resolution depends on the user s choice for k) Outputs a scoring (assigns an LOF value to each point) 30

[Tan, Steinbach, Kumar 2006] DATABASE Density-based Approaches 31

Überblick Introduction Clustering based approach Statistical approaches Distance-based Outliers Density-based Outliers und Local Outliers Angle-based Outliers Summary 32

Angle-based Approach ABOD angle-based outlier degree [Kriegel et al. 2008] Rational Angles are more stable than distances in high dimensional spaces (cf. e.g. the popularity of cosine-based similarity measures for text data) Object o is an outlier if most other objects are located in similar directions Object o is no outlier if many other objects are located in varying directions 73 73 72 72 71 70 71 70 o 69 68 o 69 68 outlier 67 66 31 32 33 34 35 36 37 38 39 40 41 67 no outlier 66 31 32 33 34 35 36 37 38 39 40 41 33

Angle-based Approach Basic assumption Outliers are at the border of the data distribution Normal points are in the center of the data distribution Model Consider for a given point p the angle between px and py for any two x,y from the database Consider the spectrum of all these angles angle between px and py The broadness of this spectrum is a score for the outlierness of a point p px py x y 0.3 1 211-0.2-0.7 inner point outlier -1.2 34

Angle-based Approach Model (cont.) Measure the variance of the angle spectrum Weighted by the corresponding distances (for lower dimensional data sets where angles are less reliable) ABOD p = VAR x,y DB xp, yp xp 2 yp 2 Properties Small ABOD => outlier High ABOD => no outlier 35

Angle-based Approach Algorithms Naïve algorithm is in O n 3 Approximate algorithm based on random sampling for mining top-n outliers Discussion Do not consider all pairs of other points x,y in the database to compute the angles Compute ABOD based on samples => lower bound of the real ABOD Filter out points that have a high lower bound Refine (compute the exact ABOD value) only for a small number of points Global approach to outlier detection Outputs an outlier score (inversely scaled: high ABOD score => inlier, low ABOD score => outlier) 36

Überblick Introduction Clustering-based approach Statistical approaches Distance-based Outliers Density-based Outliers und Local Outliers Angle-based Outliers Summary 37

Summary Algorithm properties: global / local labeling / scoring model assumptions Clustering-based outliers: Identification of not cluster members Statistical outliers: Assumed probability distribution The probability for the objects to be generated by this distribution is small 38

Summary Distance-based outliers: Distance to the neighbors as outlier metric Density-based outliers: Density around the point as outlier metric Angle-based outliers: Angles between outliers and random point pairs vary slightly 39