Statistics 202: Data Mining. © Jonathan Taylor. Outliers. Based in part on slides from the textbook and slides of Susan Holmes.

Similar documents
Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

DATA MINING II - 1DL460

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

10-701/15-781, Fall 2006, Final

Chapter 9: Outlier Analysis

Clustering and The Expectation-Maximization Algorithm

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

K-means clustering Based in part on slides from textbook, slides of Susan Holmes. December 2, Statistics 202: Data Mining.

Chapter 5: Outlier Detection

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

Sections 4.3 and 4.4

Spatial Outlier Detection

Outlier Detection. Chapter 12

IBL and clustering. Relationship of IBL with CBR

Expectation Maximization (EM) and Gaussian Mixture Models

Contents. Preface to the Second Edition

Network Traffic Measurements and Analysis

COMS 4771 Clustering. Nakul Verma

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

What is machine learning?

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

Large Scale Data Analysis for Policy

Mixture Models and the EM Algorithm

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Expectation Maximization: Inferring model parameters and class labels

Supervised vs. Unsupervised Learning

What are anomalies and why do we care?

Unsupervised Learning

Note Set 4: Finite Mixture Models and the EM Algorithm

Machine Learning. Semi-Supervised Learning. Manfred Huber

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Clustering Lecture 5: Mixture Model

Lecture 3 January 22

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Machine Learning Department School of Computer Science Carnegie Mellon University. K- Means + GMMs

CS 229 Midterm Review

Anomaly Detection on Data Streams with High Dimensional Data Environment

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

10. MLSP intro. (Clustering: K-means, EM, GMM, etc.)

Data Preprocessing. Javier Béjar AMLT /2017 CS - MAI. (CS - MAI) Data Preprocessing AMLT / / 71

Data Warehousing. Data Warehousing and Mining. Lecture 8. by Hossen Asiful Mustafa

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78

OUTLIER MINING IN HIGH DIMENSIONAL DATASETS

Unsupervised Learning: Clustering

Hierarchical clustering

Scalable Bayes Clustering for Outlier Detection Under Informative Sampling

Introduction to Mobile Robotics

Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson.

Introduction to machine learning, pattern recognition and statistical data modelling Coryn Bailer-Jones

Inference and Representation

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Today. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time

Automatic Detection Of Suspicious Behaviour

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

9. Conclusions. 9.1 Definition KDD

Using Machine Learning to Optimize Storage Systems

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Unsupervised Learning

Bus Detection and recognition for visually impaired people

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme

Introduction to Machine Learning CMU-10701

Unsupervised Learning

ALTERNATIVE METHODS FOR CLUSTERING

Classification. 1 o Semestre 2007/2008

Clustering & Classification (chapter 15)

Content-based image and video analysis. Machine learning

Clustering: Classic Methods and Modern Views

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2016

ECE 5424: Introduction to Machine Learning

Dynamic Thresholding for Image Analysis

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Supervised Learning for Image Segmentation

CS Introduction to Data Mining Instructor: Abdullah Mueen

Machine Learning. Unsupervised Learning. Manfred Huber

Unsupervised Learning and Clustering

Expectation-Maximization. Nuno Vasconcelos ECE Department, UCSD

A popular method for moving beyond linearity. 2. Basis expansion and regularization 1. Examples of transformations. Piecewise-polynomials and splines

Generative and discriminative classification techniques

An Introduction to Cluster Analysis. Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs

Mixture Models and EM

Bioinformatics - Lecture 07

Motivation. Technical Background

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Lecture 25: Review I

Machine Learning Lecture 3

1 Training/Validation/Testing

Use of Extreme Value Statistics in Modeling Biometric Systems

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions

Transcription:

Outliers. Based in part on slides from the textbook and slides of Susan Holmes. December 2, 2012.

Concepts. What is an outlier? A set of data points that are considerably different from the remainder of the data. When do outliers appear in data mining tasks? Given a data matrix $X$, find all the cases $x_i \in X$ with anomaly/outlier scores greater than some threshold $t$, or find the cases with the top $n$ outlier scores. Given a data matrix $X$ containing mostly normal (but unlabeled) data points and a test case $x_{\text{new}}$, compute an anomaly/outlier score of $x_{\text{new}}$ with respect to $X$. Applications: credit card fraud detection; network intrusion detection; misspecification of a model.

What is an outlier?

Issues. How many outliers are there in the data? The method is unsupervised, similar to clustering, or to finding clusters that contain only one point. Usual assumption: there are considerably more normal observations than abnormal observations (outliers/anomalies) in the data.

General steps. Build a profile of the normal behavior; the profile generally consists of summary statistics of this normal population. Use these summary statistics to detect anomalies, i.e. points whose characteristics are very far from the normal profile. General schemes involve a statistical model of normal behavior, with "far" measured in terms of likelihood. Other schemes, based on distances, can be quasi-motivated by such statistical techniques.

Statistical approach. Assume a parametric model describing the distribution of the data (e.g., a normal distribution). Apply a statistical test that depends on: the data distribution (e.g. normal); the parameters of the distribution (e.g., mean, variance); the number of expected outliers (confidence limit, $\alpha$ or Type I error).

Grubbs' Test. Suppose we have a sample of $n$ numbers $Z = \{Z_1, \ldots, Z_n\}$, i.e. an $n \times 1$ data matrix. Assuming the data is from a normal distribution, Grubbs' test uses the distribution of
$$\max_{1 \leq i \leq n} \frac{Z_i - \bar{Z}}{SD(Z)}$$
to search for outlying large values.

Grubbs' Test. Lower tail variant:
$$\min_{1 \leq i \leq n} \frac{Z_i - \bar{Z}}{SD(Z)}$$
Two-sided variant:
$$\max_{1 \leq i \leq n} \frac{|Z_i - \bar{Z}|}{SD(Z)}$$

Grubbs' Test. Having chosen a test statistic, we must determine a threshold that defines our decision rule. Often this threshold is set via a hypothesis test to control the Type I error. For a large positive outlier, the threshold is based on choosing some acceptable Type I error $\alpha$ and finding $c_\alpha$ so that
$$P_0\left(\max_{1 \leq i \leq n} \frac{Z_i - \bar{Z}}{SD(Z)} \geq c_\alpha\right) \leq \alpha.$$
Above, $P_0$ denotes the distribution of $Z$ under the assumption that there are no outliers. If the $Z_i$ are IID $N(\mu, \sigma^2)$, it is generally possible to compute a decent approximation of this probability using Bonferroni.

Grubbs' Test. The two-sided critical level has the form
$$c_\alpha = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n - 2 + t^2_{\alpha/(2n),\,n-2}}}$$
where $P(T_k \geq t_{\gamma,k}) = \gamma$, i.e. $t_{\gamma,k}$ is the upper tail quantile of $T_k$. In R, you can use the functions pnorm, qnorm, pt, qt for these quantities.
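As a concrete illustration, here is a minimal R sketch (on made-up data, not from the lecture) that computes the two-sided Grubbs statistic and the critical level above using qt:

```r
# Sketch: two-sided Grubbs test for a single outlier (synthetic data).
set.seed(202)
z <- c(rnorm(50), 4.5)                  # 50 normal points plus one planted outlier

n <- length(z)
G <- max(abs(z - mean(z))) / sd(z)      # two-sided Grubbs statistic

alpha <- 0.05
t_q <- qt(alpha / (2 * n), df = n - 2, lower.tail = FALSE)       # upper-tail t quantile
c_alpha <- ((n - 1) / sqrt(n)) * sqrt(t_q^2 / (n - 2 + t_q^2))   # critical level from the slide

G > c_alpha   # TRUE here: declare the most extreme point an outlier at level alpha
```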

Model-based techniques. Fit a model to the data (e.g., least squares linear regression); points which don't fit the model are identified as outliers, and their residuals can be fed into a test. Figure: Residuals from the model can be fed into Grubbs' test or a Bonferroni variant.

Multivariate data. If the non-outlying data is assumed to be multivariate Gaussian, what is the analogue of the Grubbs statistic
$$\max_{1 \leq i \leq n} \frac{Z_i - \bar{Z}}{SD(Z)}\,?$$
Answer: use the Mahalanobis distance
$$\max_{1 \leq i \leq n} (Z_i - \bar{Z})^T \Sigma^{-1} (Z_i - \bar{Z}).$$
Above, each individual statistic has what looks like a Hotelling's $T^2$ distribution.
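A minimal R sketch of this multivariate analogue, using the base function mahalanobis (which returns squared distances). The synthetic data and the chi-squared cutoff are illustrative assumptions, not part of the slides:

```r
# Sketch: Mahalanobis-distance outlier scores for multivariate data (synthetic example).
set.seed(202)
X <- matrix(rnorm(200), ncol = 2)    # 100 roughly Gaussian bivariate points
X <- rbind(X, c(5, 5))               # one planted outlier

# Squared distances (Z_i - Zbar)' Sigma^{-1} (Z_i - Zbar), with Sigma estimated from the data
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# A common rule of thumb: flag points whose squared distance exceeds a chi-squared quantile
which(d2 > qchisq(0.999, df = ncol(X)))
```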

Likelihood approach. Assume the data is a mixture $F = (1-\lambda) M + \lambda A$. Above, $M$ is the distribution of most of the data. The distribution $A$ is an outlier distribution; it could be, for example, uniform on a bounding box for the data. This is a mixture model, and if $M$ is parametric, the EM algorithm fits naturally here. Any points assigned to $A$ are outliers.

Likelihood approach. Do we estimate $\lambda$ or fix it? The book starts by describing an algorithm that tries to maximize the equivalent classification likelihood
$$L(\theta_M, \theta_A; l) = (1-\lambda)^{\#l_M} \prod_{i \in l_M} f_M(x_i; \theta_M) \cdot \lambda^{\#l_A} \prod_{i \in l_A} f_A(x_i; \theta_A),$$
where $l = (l_M, l_A)$ is the partition of the cases into normal points and outliers.

Likelihood approach: Algorithm. The algorithm tries to maximize this likelihood by forming iterative estimates $(M_t, A_t)$ of the normal and outlying data points.
1. At each stage, try moving individual points from $M_t$ to $A_t$.
2. Find $(\hat{\theta}_M, \hat{\theta}_A)$ based on the new partition (if necessary).
3. If the increase in likelihood is large enough, call the new sets $(M_{t+1}, A_{t+1})$.
4. Repeat until no further changes.
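A greedy sketch of such a procedure in R, under stated assumptions: $M$ is univariate Gaussian, $A$ is uniform on the range of the data, $\lambda$ is fixed rather than estimated, and one point is moved per iteration. The data and all parameter values are made up for illustration:

```r
# Sketch: greedy maximization of the classification likelihood (univariate, lambda fixed).
set.seed(202)
x <- c(rnorm(100), runif(5, 10, 15))   # mostly normal data plus a few planted outliers

lambda <- 0.05                         # assumed proportion of outliers
f_A <- 1 / diff(range(x))              # uniform outlier density on the data's bounding interval

# Classification log-likelihood for a given assignment of points to the outlier set A
loglik <- function(in_A) {
  M <- x[!in_A]
  ll_M <- sum(log(1 - lambda) + dnorm(M, mean = mean(M), sd = sd(M), log = TRUE))
  ll_A <- sum(in_A) * (log(lambda) + log(f_A))
  ll_M + ll_A
}

in_A <- rep(FALSE, length(x))          # start with every point in the "normal" set M
repeat {
  current <- loglik(in_A)
  candidates <- which(!in_A)
  gains <- sapply(candidates, function(i) {    # try moving each normal point into A
    trial <- in_A; trial[i] <- TRUE
    loglik(trial) - current
  })
  if (max(gains) < 1e-6) break         # stop when no move increases the likelihood enough
  in_A[candidates[which.max(gains)]] <- TRUE   # accept the single best move
}
which(in_A)                            # indices of points declared outliers
```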

Nearest neighbour approach. There are many ways to define outliers. Example: data points for which there are fewer than $k$ neighbouring points within a distance $\epsilon$. Example: the $n$ points whose distance to their $k$-th nearest neighbour is largest. Example: the $n$ points whose average distance to their first $k$ nearest neighbours is largest. Each of these methods depends on the choice of some parameters ($k$, $n$, $\epsilon$), and it is difficult to choose them in a systematic way.
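A short R sketch of the second definition above (distance to the $k$-th nearest neighbour), on made-up data; the choices of $k$ and $n$ here are arbitrary:

```r
# Sketch: distance to the k-th nearest neighbour as an outlier score (synthetic data).
set.seed(202)
X <- rbind(matrix(rnorm(200), ncol = 2), c(6, 6))   # 100 points plus one planted outlier

k <- 5
D <- as.matrix(dist(X))                  # pairwise Euclidean distances
diag(D) <- Inf                           # a point is not its own neighbour

kth_dist <- apply(D, 1, function(d) sort(d)[k])     # distance to the k-th nearest neighbour
order(kth_dist, decreasing = TRUE)[1:3]             # the n = 3 most outlying points
```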

Density approach. For each point $x_i$, compute a density estimate $f_{x_i,k}$ using its $k$ nearest neighbours $N(x_i, k)$. The density estimate used is
$$f_{x_i,k} = \left( \frac{\sum_{y \in N(x_i,k)} d(x_i, y)}{\#N(x_i,k)} \right)^{-1}.$$
Define
$$LOF(x_i) = \frac{f_{x_i,k}}{\left( \sum_{y \in N(x_i,k)} f_{y,k} \right) / \#N(x_i,k)},$$
the density of $x_i$ relative to the average density of its nearest neighbours.
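A sketch of this density estimate in R, on the same kind of synthetic data as before. Note that the score computed below is the reciprocal of the ratio above, following the convention of the textbook figure on the next slide, so that outliers receive the largest values:

```r
# Sketch: k-NN density estimate and a LOF-style score (synthetic data).
set.seed(202)
X <- rbind(matrix(rnorm(200), ncol = 2), c(6, 6))   # 100 points plus one planted outlier

k <- 5
D <- as.matrix(dist(X))
diag(D) <- Inf

knn_idx  <- apply(D, 1, function(d) order(d)[1:k])        # k nearest neighbours (one column per point)
avg_dist <- apply(D, 1, function(d) mean(sort(d)[1:k]))   # average distance to those neighbours
f        <- 1 / avg_dist                                  # density estimate f_{x_i, k}

# Average neighbour density divided by the point's own density: large values = outliers
lof <- sapply(seq_len(nrow(X)), function(i) mean(f[knn_idx[, i]]) / f[i])
order(lof, decreasing = TRUE)[1:3]                        # most outlying points
```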

Figure (from the Tan, Steinbach & Kumar slides): Nearest neighbour vs. density based. The local outlier factor (LOF) of a sample is computed from the ratios of its density and the densities of its nearest neighbours; outliers are points with the largest LOF values. In the figure, the nearest-neighbour approach does not flag $p_2$, while the density-based approach flags both $p_1$ and $p_2$.

Detection rate. Set $P(O)$ to be the proportion of outliers or anomalies. Set $P(D \mid O)$ to be the probability of declaring an outlier when it truly is an outlier; this is the detection rate. Set $P(D \mid O^c)$ to be the probability of declaring an outlier when it is truly not an outlier.

Bayesian detection rate. The Bayesian detection rate is
$$P(O \mid D) = \frac{P(D \mid O)\,P(O)}{P(D \mid O)\,P(O) + P(D \mid O^c)\,P(O^c)}.$$
The false alarm rate, or false discovery rate, is
$$P(O^c \mid D) = \frac{P(D \mid O^c)\,P(O^c)}{P(D \mid O^c)\,P(O^c) + P(D \mid O)\,P(O)}.$$
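A tiny worked example with made-up rates (not from the slides), showing how a low base rate $P(O)$ drags the Bayesian detection rate down even when the detector itself looks accurate:

```r
# Worked example of the Bayesian detection rate (illustrative numbers).
P_O      <- 0.02   # proportion of outliers
P_D_O    <- 0.90   # detection rate P(D | O)
P_D_notO <- 0.05   # probability of a false alarm P(D | O^c)

P_O_D <- (P_D_O * P_O) / (P_D_O * P_O + P_D_notO * (1 - P_O))
P_O_D   # about 0.27: most declared outliers are actually normal points
```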
