CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2016

Similar documents
CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

CPSC 340: Machine Learning and Data Mining. Hierarchical Clustering Fall 2017

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CPSC 340: Machine Learning and Data Mining. Hierarchical Clustering Fall 2016

CPSC 340: Machine Learning and Data Mining. Multi-Dimensional Scaling Fall 2017

DATA MINING II - 1DL460

CPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017

CPSC 340: Machine Learning and Data Mining. Density-Based Clustering Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

Chapter 5: Outlier Detection

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

1 Training/Validation/Testing

OUTLIER MINING IN HIGH DIMENSIONAL DATASETS

Network Traffic Measurements and Analysis

MODULE 7 Nearest Neighbour Classifier and its variants LESSON 11. Nearest Neighbour Classifier. Keywords: K Neighbours, Weighted, Nearest Neighbour

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

TELCOM2125: Network Science and Analysis

CPSC 340: Machine Learning and Data Mining. Recommender Systems Fall 2017

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2018

CPSC 340: Machine Learning and Data Mining. Density-Based Clustering Fall 2017

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2016

Chapter 9: Outlier Analysis

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

CPSC 340: Machine Learning and Data Mining. Density-Based Clustering Fall 2015

CPSC 340: Machine Learning and Data Mining. Non-Linear Regression Fall 2016

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

CPSC 340: Machine Learning and Data Mining

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Course on Microarray Gene Expression Analysis

数据挖掘 Introduction to Data Mining

Statistics 202: Data Mining. c Jonathan Taylor. Outliers Based in part on slides from textbook, slides of Susan Holmes.

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

Part I. Graphical exploratory data analysis. Graphical summaries of data. Graphical summaries of data

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Course Content. What is an Outlier? Chapter 7 Objectives

Unsupervised Learning : Clustering

Clustering CS 550: Machine Learning

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

CPSC 340: Machine Learning and Data Mining. More Linear Classifiers Fall 2017

Anomaly Detection on Data Streams with High Dimensional Data Environment

CPSC 340: Machine Learning and Data Mining. Robust Regression Fall 2015

Clustering Lecture 4: Density-based Methods

Outlier Detection. Chapter 12

Outlier detection using autoencoders

Lecture 6: Chapter 6 Summary

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering

Big Data Analytics The Data Mining process. Roger Bohn March. 2016

Case-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric.

CS 188: Artificial Intelligence Fall 2008

How do microarrays work

Machine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

COMP 465 Special Topics: Data Mining

Lecture 25: Review I

CPSC 340: Machine Learning and Data Mining. More Regularization Fall 2017

CPSC 340 Assignment 0 (due Friday September 15 ATE)

Understanding Clustering Supervising the unsupervised

Nature Methods: doi: /nmeth Supplementary Figure 1

Data Informatics. Seon Ho Kim, Ph.D.

Content-based image and video analysis. Machine learning

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked

CSE 158. Web Mining and Recommender Systems. Midterm recap

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

CSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection

CPSC 340: Machine Learning and Data Mining. Logistic Regression Fall 2016

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

Cluster Analysis: Basic Concepts and Algorithms

Local Context Selection for Outlier Ranking in Graphs with Multiple Numeric Node Attributes

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

Summary. Machine Learning: Introduction. Marcin Sydow

Detection of Anomalies using Online Oversampling PCA

Clustering and Visualisation of Data

Voronoi Region. K-means method for Signal Compression: Vector Quantization. Compression Formula 11/20/2013

CSE 5243 INTRO. TO DATA MINING

Bayesian Network & Anomaly Detection

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2016

Regression III: Advanced Methods

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

University of Florida CISE department Gator Engineering. Clustering Part 2

Applying Supervised Learning

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Introduction to Data Mining

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

Advanced Applied Multivariate Analysis

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

15 Wyner Statistics Fall 2013

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

DATA MINING II - 1DL460

Unsupervised Learning

INF4820, Algorithms for AI and NLP: Hierarchical Clustering

CSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

Data Science and Statistics in Research: unlocking the power of your data Session 3.4: Clustering

Transcription:

CPSC 340: Machine Learning and Data Mining Outlier Detection Fall 2016

Admin Assignment 1 solutions will be posted after class. Assignment 2 is out: Due next Friday, but start early! Calculus and linear algebra terms to review for next week: Vector addition and multiplication: αx + βy. Inner-product: x T y. Matrix multiplication: Xw. Solving linear systems: Ax = b. Matrix inverse: X 1. Norms: x. Gradient: f(x). Stationary points: f(x) = 0. Convex functions: f x 0.

Last Time: Hierarchical Clustering We discussed hierarchical clustering: Perform clustering at multiple scales. Output is usually a tree diagram ( dendrogram ). Reveals much more structure in data. Usually non-parametric: At finest scale, every point is its own clusters. Important application is phylogenetics. Scientific American yesterday: Scientists Trace Society s Myths to Primordial Origins Cosmic Hunt : Man hunts animal that becomes constellation. http://www.nature.com/nature/journal/v438/n7069/fig_tab/nature04338_f10.html

Motivating Example: Finding Holes in Ozone Layer The huge Antarctic ozone hole was discovered in 1985. It had been in satellite data since 1976: But it was flagged and filtered out by quality-control algorithm. https://en.wikipedia.org/wiki/ozone_depletion

Outlier Detection Outlier detection: Find observations that are unusually different from the others. Also known as anomaly detection. May want to remove outliers, or be interested in the outliers themselves. Some sources of outliers: Measurement errors. Data entry errors. Contamination of data from different sources. Rare events. http://mathworld.wolfram.com/outlier.html

Applications of Outlier Detection Data cleaning. Security and fault detection (network intrusion, DOS attacks). Fraud detection (credit cards, stocks, voting irregularities). Detecting natural disasters (earthquakes, particularly underwater). Astronomy (find new classes of stars/planets). Genetics (identifying individuals with new/ancient genes).

Classes of Methods for Outlier Detection 1. Model-based methods. 2. Graphical approaches. 3. Cluster-based methods. 4. Distance-based methods. 5. Supervised-learning methods. Warning: this is the topic with the most ambiguous solutions. Next week we ll get back to topics with more concrete solutions.

Model-Based Outlier Detection Model-based outlier detection: 1. Fit a probabilistic model. 2. Outliers are examples with low probability. Simplest approach is z-score: If z i > 3, 97% of data is larger than x i? http://mathworld.wolfram.com/outlier.html

Problems with Z-Score The z-score relies on mean and standard deviation: These measure are sensitive to outliers. Possible fixes: use quantiles, or sequentially remove worse outlier. The z-score also assumes that data is uni-modal http://mathworld.wolfram.com/outlier.html

Is the red point an outlier? Global vs. Local Outliers

Global vs. Local Outliers Is the red point an outlier? What if add the blue points?

Global vs. Local Outliers Is the red point an outlier? What if add the blue points? Red point has the lowest z-score. In the first case it was a global outlier. In this second case it s a local ouliter: It s within the range of the data, but is far away from other points. In general, hard to give precise definition of outliers.

Global vs. Local Outliers Is the red point an outlier? What if add the blue points? Red point has the lowest z-score. In the first case it was a global outlier. In this second case it s a local ouliter: It s within the range of the data, but is far away from other points. In general, hard to give precise definition of outliers. Can we have outlier groups?

Global vs. Local Outliers Is the red point an outlier? What if add the blue points? Red point has the lowest z-score. In the first case it was a global outlier. In this second case it s a local ouliter: It s within the range of the data, but is far away from other points. In general, hard to give precise definition of outliers. Can we have outlier groups? What about repeating patterns?

Graphical Outlier Detection Graphical approach to outlier detection: 1. Look at a plot of the data. 2. Human decides if data is an outlier. Examples: 1. Box plot: Visualization of quantiles/outliers. Only 1 variable at a time. http://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/boxplot/

Graphical Outlier Detection Graphical approach to outlier detection: 1. Look at a plot of the data. 2. Human decides if data is an outlier. Examples: 1. Box plot. 2. Scatterplot: Can detect complex patterns. Only 2 variables at a time. http://mathworld.wolfram.com/outlier.html

Graphical Outlier Detection Graphical approach to outlier detection: 1. Look at a plot of the data. 2. Human decides if data is an outlier. Examples: 1. Box plot. 2. Scatterplot. 3. Scatterplot array: Look at all combinations of variables. But laborious in high-dimensions. Still only 2 variables at a time. https://randomcriticalanalysis.wordpress.com/2015/05/25/standardized-tests-correlations-within-and-between-california-public-schools/

Graphical Outlier Detection Graphical approach to outlier detection: 1. Look at a plot of the data. 2. Human decides if data is an outlier. Examples: 1. Box plot. 2. Scatterplot. 3. Scatterplot array. 4. Scatterplot of 2-dimensional PCA: See high-dimensional structure. But PCA is sensitive to outliers. There might be info in higher PCs. http://scienceblogs.com/gnxp/2008/08/14/the-genetic-map-of-europe/

Cluster-Based Outlier Detection Detect outliers based on clustering: 1. Cluster the data. 2. Find points that don t belong to clusters. Examples: 1. K-means: Find points that are far away from any mean. Find clusters with a small number of points.

Cluster-Based Outlier Detection Detect outliers based on clustering: 1. Cluster the data. 2. Find points that don t belong to clusters. Examples: 1. K-means. 2. Density-based clustering: Outliers are points not assigned to cluster. http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap10_anomaly_detection.pdf

Cluster-Based Outlier Detection Detect outliers based on clustering: 1. Cluster the data. 2. Find points that don t belong to clusters. Examples: 1. K-means. 2. Density-based clustering. 3. Hierarchical clustering: Outliers take longer to join other groups. Also good for outlier groups. http://www.nature.com/nature/journal/v438/n7069/fig_tab/nature04338_f10.html

Distance-Based Outlier Detection Most of these approaches are based on distances. Can we skip the models/plot/clusters and directly use distances? Directly measure of how close objects are to their neighbours. Examples: How many points lie in a radius r? What is distance to k th nearest neighbour? https://en.wikipedia.org/wiki/local_outlier_factor

Global Distance-Based Outlier Detection: KNN KNN outlier detection: For each point, compute the average distance to its KNN. Sort these values. Choose the biggest values as outliers. Goldstein and Uchida [2016]: Compared 19 methods on 10 datasets. KNN best for finding global outliers. Local outliers better detected by LOF http://journals.plos.org/plosone/article?id=10.1371%2fjournal.pone.0152173

Local Distance-Based Outlier Detection As with density-based clustering, problem with differing densities: Outlier o 2 has similar density as elements of cluster C 1. Solution: local outlier factor (LOF) and variations like outlierness: Is point relatively far away from its neighbours? http://www.dbs.ifi.lmu.de/publikationen/papers/lof.pdf

Outlierness Let N k (x i ) be the k-nearest neighbours of x i. Let D k (x i ) be the average distance to k-nearest neighbours: Outlierness is ratio of D k (x i ) to average D k (x j ) for its neighbours j : If outlierness > 1, x i is further away from neighbours than expected.

Outlierness Ratio Outlierness finds o 1 and o 2 : More complicated data: http://www.dbs.ifi.lmu.de/publikationen/papers/lof.pdf https://en.wikipedia.org/wiki/local_outlier_factor

Outlierness with Close Clusters If clusters are close, outlierness gives unintuitive results: In this example, p has higher outlierness than q and r : The green points are not part of the KNN list of p for small k. http://www.comp.nus.edu.sg/~atung/publication/pakdd06_outlier.pdf

Outlierness with Close Clusters Influenced outlierness (INFLO) ratio: Include in denominator the reverse k-nearest neighbours: Points that have p in KNN list. Adds s and t from bigger cluster that includes p : But still has problems: Dealing with hierarchical clusters. Yields many false positives if you have global outliers. Goldstein and Uchida [2016] recommend just using KNN. http://www.comp.nus.edu.sg/~atung/publication/pakdd06_outlier.pdf

Supervised Outlier Detection Final approach to outlier detection is to use supervised learning: y i = 1 if x i is an outlier. y i = 0 if x i is a regular point. Let s us use our great methods for supervised learning: We can find very complicated outlier patterns. But it needs supervision: We need to know what outliers look like. We may not detect new types of outliers.

Summary Outlier detection is task of finding unusually different object. A concept that is very difficult to define. Model-based methods check if objects are unlikely in fitted model. Graphical methods plot data and use human to find outliers. Cluster-based methods check whether objects belong to clusters. Distance-based methods measure relative distance to neighbours. Supervised-learning methods just turn it into supervised learning. Next time: customers who bought this item also bought.