OUTLIER MINING IN HIGH DIMENSIONAL DATASETS

Similar documents
DATA MINING II - 1DL460

A Meta analysis study of outlier detection methods in classification

Chapter 5: Outlier Detection

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

Anomaly Detection on Data Streams with High Dimensional Data Environment

Course Content. What is an Outlier? Chapter 7 Objectives

OPTICS-OF: Identifying Local Outliers

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2016

Chapter 9: Outlier Analysis

Filtered Clustering Based on Local Outlier Factor in Data Mining

Clustering methods: Part 7 Outlier removal Pasi Fränti

PCA Based Anomaly Detection

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

Keywords: Clustering, Anomaly Detection, Multivariate Outlier Detection, Mixture Model, EM, Visualization, Explanation, Mineset.

Clustering Part 4 DBSCAN

Database and Knowledge-Base Systems: Data Mining. Martin Ester

OUTLIER DETECTION. Short Course Session 1. Nedret BILLOR Auburn University Department of Mathematics & Statistics, USA

OUTLIER DATA MINING WITH IMPERFECT DATA LABELS

Computer Department, Savitribai Phule Pune University, Nashik, Maharashtra, India

University of Florida CISE department Gator Engineering. Clustering Part 4

Machine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017

Statistics 202: Data Mining. c Jonathan Taylor. Outliers Based in part on slides from textbook, slides of Susan Holmes.

Analysis and Extensions of Popular Clustering Algorithms

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

Outlier Detection with Two-Stage Area-Descent Method for Linear Regression

CS570: Introduction to Data Mining

White Rose Consortium eprints Repository

Detection of Outliers

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Data Clustering With Leaders and Subleaders Algorithm

UNSUPERVISED LEARNING FOR ANOMALY INTRUSION DETECTION Presented by: Mohamed EL Fadly

Clustering part II 1

The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers Detection

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

CPSC 340: Machine Learning and Data Mining. Hierarchical Clustering Fall 2017

Clustering Lecture 4: Density-based Methods

I. INTRODUCTION II. RELATED WORK.

HIERARCHICAL CLUSTERED OUTLIER DETECTION IN LASER SCANNER POINT CLOUDS

Dimension Induced Clustering

A Nonparametric Outlier Detection for Effectively Discovering Top-N Outliers from Engineering Data

FILTERING AND REFINEMENT: A TWO-STAGE APPROACH FOR EFFICIENT AND EFFECTIVE ANOMALY DETECTION XIAO YU THESIS

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

Large Scale Data Analysis for Policy

Outlier Detection Techniques

CS145: INTRODUCTION TO DATA MINING

Mining Common Outliers for Intrusion Detection

Implementation of Fuzzy C-Means and Possibilistic C-Means Clustering Algorithms, Cluster Tendency Analysis and Cluster Validation

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES

Mean-shift outlier detection

Outlier Detection Techniques

Compare the density around a point with the density around its local neighbors. The density around a normal data object is similar to the density

AN IMPROVED DENSITY BASED k-means ALGORITHM

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Clustering Large Dynamic Datasets Using Exemplar Points

Research issues in Outlier mining: High- Dimensional Stream Data

ENHANCED DBSCAN ALGORITHM

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Authors: Coman Gentiana. Asparuh Hristov. Daniel Corteso. Fernando Nunez

University of Florida CISE department Gator Engineering. Clustering Part 5

Spatial Outlier Detection

Detecting Outliers in Data streams using Clustering Algorithms

CS570: Introduction to Data Mining

A Review on Cluster Based Approach in Data Mining

Review: Identification of cell types from single-cell transcriptom. method

DETECTION OF ANOMALIES FROM DATASET USING DISTRIBUTED METHODS

OBE: Outlier by Example

Automatic Group-Outlier Detection

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Outlier Detection in Stream Data by Clustering Method

Package ldbod. May 26, 2017

Local Context Selection for Outlier Ranking in Graphs with Multiple Numeric Node Attributes

Detection of Anomalies using Online Oversampling PCA

Network Traffic Measurements and Analysis

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

Data Warehousing. Data Warehousing and Mining. Lecture 8. by Hossen Asiful Mustafa

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Kapitel 4: Clustering

Outlier Detection Using Random Walks

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Performance Evaluation of Density-Based Outlier Detection on High Dimensional Data

Angle-Based Outlier Detection in High-dimensional Data

CPSC 340: Machine Learning and Data Mining

DATA MINING II - 1DL460

Ying Liu* and Alan P. Sprague

An Efficient Clustering and Distance Based Approach for Outlier Detection

Local projections for high-dimensional outlier detection

Anomaly detection. Luboš Popelínský. Luboš Popelínský KDLab, Faculty of Informatics, MU

DATA DEPTH AND ITS APPLICATIONS IN CLASSIFICATION

INF 4300 Classification III Anne Solberg The agenda today:

Statistical Pattern Recognition

D-GridMST: Clustering Large Distributed Spatial Databases

Partition Based with Outlier Detection

Statistical Pattern Recognition

An Experimental Analysis of Outliers Detection on Static Exaustive Datasets.

UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING. Clustering is unsupervised classification: no predefined classes

Data mining fundamentals

CS570: Introduction to Data Mining

D B M G Data Base and Data Mining Group of Politecnico di Torino

Clustering CS 550: Machine Learning

Transcription:

OUTLIER MINING IN HIGH DIMENSIONAL DATASETS DATA MINING DISCUSSION GROUP OUTLINE MOTIVATION OUTLIERS IN MULTIVARIATE DATA OUTLIERS IN HIGH DIMENSIONAL DATA Distribution-based Distance-based NN-based Density-based Clustering-based Depth-based methods CONCLUSION 1

MOTIVATION: CFC has ability to breakdown ozone Estimation in 75 s: 7% drop within 60yrs. In 1985, British Antarctic Survey showed that ozone levels had dropped to 10% below WHY? What was wrong? Ozone hole had been covered up by a computer-program Evidence of the ozone-hole was seen as far back as 1976. Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html 2

MOTIVATION Colon Data (Alon et al, 1999) X : 62 2000, gene expression data, The colon cancer dataset is known to be heterogeneous because the tissue samples contain a mixture of cell types!!! MOTIVATION Data often (always) contain outliers. Statistical methods are severely affected by outliers. We have to identify OUTLIER (s) accurately!!! 3

What is outlier? No universally accepted definition!!! Hawkins (1980) An observation (few) that deviates (differs) so much from other observations as to arouse suspicion that it was generated by a different mechanism. Barnett and Lewis (1994) An observation (few) which appears to be inconsistent (different) with the remainder of that set of data. The main reasons for outliers in a data set: Data errors Unspecified missing observations Data which do not come from the target population intended to be sampled Correct but extreme responses (Rare event syndrome) 4

A few applications of outlier detection: Fraud detection Network intrusion detection Satellite image analysis Structural defect detection Loan application processing Discovery of astronomical objects Motion segmentation Detection of unexpected entries in databases And many more OUTLIERS IN MULTIVARIATE DATA Visual tools Scatter plots and 3D scatter plots Higher dimensions??? 5

OUTLIERS IN MULTIVARIATE DATA OUTLIERS IN MULTIVARIATE DATA Numerical methods aim to detect outliers by computing a measure of how far a particular point is from the center of the data. The usual measure of outlyingness for a data point, is the Mahalanobis distance: x i D i = ( x i X )' S 1 ( x i X ), i = 1,2,..., N 6

OUTLIERS IN MULTIVARIATE DATA Primary goal is robust estimation OGK (Maronna and Zamar,1992) MVE (Rousseeuw, 1985) MCD (Rousseeuw and Van Driessen, 1999) Primary goal is outlier detection MULTOUT (Woodruff and Rocke, 1994) BACON ( Billor et al., 2000) Kurtosis 1 (Pena and Prieto, 2001) None of these methods works quite as well when the dimensionality is high!!! Curse of Dimensionality 7

OUTLIERS IN HIGH DIMENSIONAL DATA (Hodge et al., 2004 ; Lazarevic et al., 2000) Distribution-based Distance-based NN based Density-based Cluster-based Depth-Based Distribution-based approaches (Barnett&Lewis, 1994; Hawkins, 1980) Data points are modeled using a stochastic distribution Outliers are observations which deviate from the given distribution. Drawbacks: Unsuitable even for moderately highdimensional data sets. Perform expensive tests to determine which model fits the data best, if any! 8

Distance-based Methods: NN-based Methods: There are various ways to define outliers: Data points for which there are fewer than r neighboring points within a distance D (Knorr and Ng, 1998): The top n data points whose distance to the k th nearest neighbor is greatest (Ramaswamy et al., 2000) n data points whose average distance to the k th nearest neighbor is greatest (Acuna and Rodriguez, 2004) 9

Lower-dimensional projection methods: Barbara et al. (1996) Aggarwal and Yu (2001) Filzmoser et al. (2008) Aggarwal and Yu (2001) The method assumes that outliers abnormally sparse in certain lower dimensional projections. They use evolutionary search algorithm to determine the projections 10

Filzmoser et al., 2008: PCOUT Algorithm PCOUT is a recent outlier identification algorithm that is particularly effective in high dimensions. Based on the robustly sphered data, semi-robust principal components are computed which are needed for determining distances for each observation. Separate weights for location and scatter outliers are computed based on these distances. The combined weights are used for outlier identification (See R: pcout) PCOUT: Colon Data 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 10 20 30 40 50 60 70 11

PCOUT: Leukemia Data (72x7129) We will try to identify multivariate outliers among the 7129 genes, without using the information of the two leukemia types ALL and AML. The outlying genes will then be used for differentiating between the cases. 2609 genes Density based Methods: Local Outlying Factor: LOF (Breunig et al., 2000) For each point, compute the density of its local neighborhood Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors Outliers are points with largest LOF value Distance from p 2 to nearest neighbor p 2 p 1 12

Local Correlation Integral (LOCI) (Papadimitriou, et al, 2002) LOCI computes the neighborhood size (the number of neighbors) for each point and identifies as outliers points whose neighborhood size significantly vary with respect to the neighborhood size of their neighbors. This approach not only finds outlying points but also outlying micro-clusters. LOCI algorithm provides LOCI plot which contains information such as inter cluster distance and cluster diameter Clustering-based Methods: Key assumption: normal data records belong to large and dense clusters, while outliers do not belong to any of the clusters or form very small clusters Cluster the data into groups of different density Choose points in small cluster as candidate outliers Compute the distance between candidate points and non-candidate clusters. If candidate points are far from all other noncandidate points, they are outliers 13

Clustering-based Methods: 1- CLARANS (Ng and Han, 1994) 2- BIRCH (Zhang et al., 1996) 3- DBSCAN (Ester et al., 1996) 4- CURE (Guha et al., 1998) : : Advantages: No need to be supervised Easily adaptable to on-line / incremental mode suitable for anomaly detection from temporal data Drawbacks Computationally expensive If normal points do not create any clusters the techniques may fail In high dimensional spaces, data is sparse and distances between any two data records may become quite similar. They are not optimized for outlier detection. The outlier detection criteria are implicit and cannot easily be inferred from the clustering procedures (Papadimitriu et al., 2002) 14

Depth-based Methods: Depth is is a quantitative measurement of how central a point is with respect to a data set. Mahalanobis depth; spatial Depth; halfspace depth; projection depth, zonoid depth, (Zuo and Serfling, 2000) web source: http://www.cs.tufts.edu/research/geometry/research.php Depth-based Methods: Preparata and Shamos (1988); Serfling and Wang (2005) 15

What if outlier is in the middle of the data? Conclusions Outlier detection can detect critical information in data Highly applicable in various application areas There is no single universally applicable or generic outlier detection approach. Researcher should select an algorithm that is suitable for their data set in terms of the correct distribution model, the correct attribute types, the scalability, the speed, any incremental capabilities. 16

THANKS!!! e-mail: turkmen@stat.osu.edu 17