DATA MINING II - 1DL460

DATA MINING II - 1DL460, Spring 2016
A second course in data mining
http://www.it.uu.se/edu/course/homepage/infoutv2/vt16
Kjell Orsborn, Uppsala Database Laboratory
Department of Information Technology, Uppsala University, Uppsala, Sweden
09/03/16

Anomaly Detection (Tan, Steinbach, Kumar, ch. 10)
Kjell Orsborn, Department of Information Technology, Uppsala University, Uppsala, Sweden

What are anomalies/outliers?

- Single data points, or sets of data points, that are considerably different from the remainder of the data (i.e., the normal data). Examples: an unusual credit card purchase, or exceptional athletes such as Usain Bolt or Leo Messi.
- Outliers are different from noise. Noise is random error or variance in a measured variable, and it should be removed before outlier detection.
- Outliers are interesting: they violate the mechanism that generates the normal data.
- Outlier detection vs. novelty detection: at an early stage a novel observation looks like an outlier, but it is later merged into the model of normal data.
- Applications: credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection, customer segmentation, medical analysis.

Anomaly/outlier detection

Variants of anomaly/outlier detection problems:
- Given a database D, find all data points x ∈ D with anomaly scores greater than some threshold t
- Given a database D, find all data points x ∈ D having the top-n largest anomaly scores f(x)
- Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
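The first two problem variants differ only in how anomaly scores are turned into a decision. A minimal sketch; the score values and point names are invented for illustration, not taken from the slides:

```python
def outliers_by_threshold(scores, t):
    """Variant 1: all points whose anomaly score exceeds the threshold t."""
    return [x for x, s in scores.items() if s > t]

def outliers_top_n(scores, n):
    """Variant 2: the n points with the largest anomaly scores f(x)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]

# Hypothetical anomaly scores f(x) for five points:
scores = {"a": 0.1, "b": 0.3, "c": 2.5, "d": 0.2, "e": 1.8}
print(outliers_by_threshold(scores, 1.0))  # points scoring above t = 1.0
print(outliers_top_n(scores, 2))           # the two highest-scoring points
```

With a well-chosen threshold the two variants agree, as here; in practice top-n is used when the number of expected outliers is known and a threshold when a score scale is interpretable.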

Types of outliers (I)

Three kinds: global, contextual and collective outliers.
- Global outlier (or point anomaly): an object is a global outlier if it significantly deviates from the rest of the data set. Example: auditing stock trading transactions. Issue: finding an appropriate measurement of deviation.
- Contextual outlier (or conditional outlier; a special case is the local outlier): an object is a contextual outlier if it deviates significantly within a selected context. Example: is -20 °C in Uppsala an outlier? (It depends on whether it is summer or winter.) The attributes of the data objects are divided into two groups:
  - Contextual attributes define the context, e.g., time and location
  - Behavioral attributes are the characteristics of the object used in the outlier evaluation, e.g., temperature
  Contextual outliers can be viewed as a generalization of local outliers, i.e., objects whose density significantly deviates from that of their local area. Issue: how to define or formulate a meaningful context?

Types of outliers (II)

- Collective outliers: a subset of data objects that collectively deviates significantly from the whole data set, even if the individual data objects are not outliers themselves. Application example: intrusion detection, where a number of computers keep sending denial-of-service packets to each other.
- Detection of collective outliers:
  - Consider not only the behavior of individual objects, but also that of groups of objects
  - Requires background knowledge about the relationship among data objects, such as a distance or similarity measure on objects
- A data set may have multiple types of outliers, and one object may belong to more than one type of outlier.

Challenges of outlier detection

- Modeling normal objects and outliers properly
  - It is hard to enumerate all possible normal behaviors in an application
  - The border between normal and outlier objects is often a gray area
- Application-specific outlier detection
  - The choice of distance measure among objects, and the model of relationships among objects, are often application-dependent
  - E.g., in clinical data a small deviation could be an outlier, while marketing analysis tolerates much larger fluctuations
- Handling noise in outlier detection
  - Noise may distort the normal objects and blur the distinction between normal objects and outliers; it may hide outliers and reduce the effectiveness of outlier detection

Challenges of outlier detection (cont.)

- Understandability
  - Understand why these objects are outliers: justification of the detection
  - Specify the degree of an outlier: the unlikelihood of the object being generated by a normal mechanism
- How many outliers are there in the data?
- When the method is unsupervised, validation can be quite challenging (just as for clustering)
- Outlier detection can be compared to finding a needle in a haystack
- Working assumption: there are considerably more normal observations than abnormal observations (outliers/anomalies) in the data

Ozone depletion history: the importance of anomaly detection

In 1985, three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels over Antarctica had dropped 10% below normal levels. Why had the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not recorded similarly low ozone concentrations? The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!

Sources:
http://undsci.berkeley.edu/article/0_0_0/ozone_depletion_09
http://ozonewatch.gsfc.nasa.gov/facts/history.html
http://ozonewatch.gsfc.nasa.gov/index.html

Anomaly detection schemes

General steps:
- Build a profile of the normal behavior; the profile can be patterns or summary statistics for the overall population
- Use the normal profile to detect anomalies, i.e., observations whose characteristics differ significantly from the normal profile

Types of anomaly detection schemes:
- Graphical and statistical-based
- Proximity-based
- Density-based
- Clustering-based

Graphical approaches

Boxplot (1-D), scatter plot (2-D), spin plot (3-D)

Limitations:
- Time consuming
- Subjective

Convex hull method

- Extreme points are assumed to be outliers
- Use the convex hull method to detect extreme values
- Data points are assigned to layers of convex hulls that are peeled off to detect outliers
- But what if the outlier occurs in the middle of the data?

Statistical approaches

Assume a parametric model describing the distribution of the data (e.g., the normal distribution), then apply a statistical test that depends on:
- The data distribution
- The parameters of the distribution (e.g., mean, variance)
- The number of expected outliers (confidence limit)

The Grubbs test

Grubbs' test (also called the maximum normed residual test) detects outliers in univariate data (i.e., data with only one attribute), assuming the data sample comes from a normal distribution. The outlier condition is G_exp > G_critical.

For each object x in the data set, compute its z-score (G_exp):

  G_exp = |x - x̄| / s

where x̄ is the sample mean and s is the sample standard deviation.

x is an outlier if G_exp exceeds the critical value

  G_critical = ((N - 1) / √N) · √( t² / (N - 2 + t²) )

where t is the value taken by a two-sided t-distribution at a significance level of α/(2N) with N - 2 degrees of freedom, and N is the number of objects in the data set.
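A minimal sketch of the test in Python. To stay dependency-free it uses a hard-coded critical value for N = 6 at α = 0.05 (≈ 1.887 from standard Grubbs tables) instead of computing G_critical from the t-distribution; the sample data is invented for illustration:

```python
import math
import statistics

def grubbs_statistic(data):
    """G_exp: the largest absolute z-score in the sample."""
    mean = statistics.fmean(data)
    s = statistics.stdev(data)  # sample standard deviation
    return max(abs(x - mean) for x in data) / s

# Hypothetical univariate sample with one suspicious value:
data = [5.1, 5.0, 4.9, 5.2, 5.0, 12.0]

# Critical value for N = 6, alpha = 0.05, taken from a standard Grubbs
# table; it could equally be computed from the t-distribution as above.
G_CRITICAL = 1.887

g = grubbs_statistic(data)
if g > G_CRITICAL:
    print(f"G_exp = {g:.3f} > {G_CRITICAL}: the extreme value is an outlier")
```

Note that Grubbs' test finds at most one outlier per pass; repeated application (removing the detected outlier and re-testing) handles multiple outliers.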

Statistical-based likelihood approach

Identify outliers by calculating the change in likelihood when moving a point from one distribution to the other in a mixture of two distributions. The overall probability distribution of the data is

  D = (1 - λ) M + λ A

where λ is the expected fraction of outliers, M is a probability distribution estimated from the data (usually Gaussian, but it can be based on any modeling method: naïve Bayes, maximum entropy, etc.), and A is assumed to be a uniform distribution.

The likelihood and log likelihood at time t are

  L_t(D) = ∏_{i=1}^{N} P_D(x_i)
         = ( (1 - λ)^{|M_t|} ∏_{x_i ∈ M_t} P_{M_t}(x_i) ) · ( λ^{|A_t|} ∏_{x_i ∈ A_t} P_{A_t}(x_i) )

  LL_t(D) = |M_t| log(1 - λ) + ∑_{x_i ∈ M_t} log P_{M_t}(x_i) + |A_t| log λ + ∑_{x_i ∈ A_t} log P_{A_t}(x_i)

Statistical-based likelihood approach

Assume the data set D contains samples from a mixture of two probability distributions:
- M, the majority distribution (typically Gaussian)
- A, the anomalous distribution (typically uniform)

General approach of Algorithm 10.1 (Tan et al.):
- Initially, assume all data points belong to M
- Let LL_t(D) be the log likelihood of D at time t
- For each point x_t that belongs to M, move it to A:
  - Let LL_{t+1}(D) be the new log likelihood
  - Compute the difference Δ = LL_t(D) - LL_{t+1}(D)
  - If Δ > c (some threshold), then x_t is declared an anomaly and moved permanently from M to A

Statistical-based likelihood approach: Algorithm 10.1 (Tan et al.)
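A sketch of the idea in Python, under stated assumptions: 1-D data, M re-estimated as a Gaussian after each tentative move, A uniform over the data range, and illustrative values for λ and c. One note on the sign convention: here a point stays in A when moving it there *raises* the total log likelihood by more than c, which is the direction in which a genuine outlier changes the likelihood (a true outlier has a very small P_M, so removing it from M increases LL):

```python
import math
import statistics

def gauss_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_likelihood(M, A, lam, log_p_a):
    """LL(D) = |M| log(1-lam) + sum log P_M + |A| (log lam + log P_A)."""
    mu = statistics.fmean(M)
    sigma = max(statistics.pstdev(M), 1e-9)  # guard against zero spread
    ll = len(M) * math.log(1 - lam) + sum(gauss_logpdf(x, mu, sigma) for x in M)
    if A:
        ll += len(A) * (math.log(lam) + log_p_a)  # uniform P_A is constant
    return ll

def likelihood_outliers(data, lam=0.1, c=1.0):
    log_p_a = -math.log(max(data) - min(data))  # uniform density over the range
    M, A = list(data), []
    for x in data:
        M.remove(x)
        A.append(x)
        # Likelihood gain from tentatively moving x into A:
        gain = (log_likelihood(M, A, lam, log_p_a)
                - log_likelihood(M + [x], A[:-1], lam, log_p_a))
        if gain > c:
            continue       # keep x in A: declared an anomaly
        A.pop()
        M.append(x)        # move x back to M: considered normal
    return A

print(likelihood_outliers([1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 10.0]))
```

On this toy data the isolated value is flagged: while it sits in M it inflates the Gaussian's variance, and moving it to the uniform component raises the likelihood sharply, whereas moving any of the clustered values lowers it.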

Limitations of statistical approaches

- Most tests are for a single attribute
- In many cases, the data distribution may not be known
- For high-dimensional data, it may be difficult to estimate the true distribution

Proximity-based outlier detection

In proximity-based outlier detection, an object is an outlier if it is distant from most other points; such outliers are also called distance-based outliers. The approach is more general and more easily applied than statistical approaches, since it is usually easier to define a proximity measure than to determine the data distribution. There are various ways to define such outliers:
- Data points for which there are fewer than p neighboring points within a distance D
- Data points whose distance to the k-th nearest neighbor is greatest (can be sensitive to the value of k)
- Data points whose average distance to the k nearest neighbors is greatest (more robust than the distance to the k-th nearest neighbor alone)

Computing the distance between every pair of data points can make this expensive, O(m²); grid-based methods and indexing can improve performance and complexity. Proximity-based detection does not handle widely varying densities well, since it uses global thresholds.
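The k-th-nearest-neighbor variant above can be sketched with a brute-force O(m²) pass; the 2-D points are invented for illustration:

```python
import math

def knn_outlier_scores(points, k):
    """Score of each point = distance to its k-th nearest neighbor (brute force, O(m^2))."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])  # distance to the k-th nearest neighbor
    return scores

# Hypothetical 2-D data: a tight cluster plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = knn_outlier_scores(points, k=3)
print(max(range(len(points)), key=scores.__getitem__))  # prints 5, the distant point
```

Replacing `dists[k - 1]` with `sum(dists[:k]) / k` gives the more robust average-distance variant from the list above.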

Nearest-neighbor based approach

Example where the outlier score is given by the distance to the k-th nearest neighbor.

Density-based outlier detection

Density-based outliers are points in regions of low density: the outlier score of an object is the inverse of the density around the object. Two common density definitions:
- Inverse distance density (the inverse of the average distance to the k nearest neighbors):

    density(x, k) = ( ∑_{y ∈ N(x,k)} dist(x, y) / |N(x,k)| )⁻¹

  where N(x,k) is the set of k nearest neighbors of x, |N(x,k)| is the size of that set, and y is a nearest neighbor.
- Count-based density (as in DBSCAN): the density around an object equals the number of objects within a specified distance d of the object.
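Both density definitions can be sketched directly; the data set and parameter values are invented for illustration:

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (brute force)."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def inverse_distance_density(points, i, k):
    """density(x, k): inverse of the average distance to the k nearest neighbors."""
    nbrs = knn(points, i, k)
    avg = sum(math.dist(points[i], points[j]) for j in nbrs) / len(nbrs)
    return 1.0 / avg

def count_density(points, i, d):
    """DBSCAN-style density: number of points within distance d of points[i]."""
    return sum(1 for j in range(len(points))
               if j != i and math.dist(points[i], points[j]) <= d)

# Hypothetical data: a dense cluster plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(inverse_distance_density(points, 0, k=2))  # high: the point sits in the cluster
print(inverse_distance_density(points, 4, k=2))  # low: the isolated point
```

Taking the inverse of these densities as the outlier score ranks the isolated point first under either definition.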

Density-based outlier detection (the LOF approach)

- For each point, compute the density of its local neighborhood
- Compute the local outlier factor (LOF) of a sample p as the average of the ratios between the density of p's nearest neighbors and the density of p itself
- Outliers are the points with the largest LOF values

In the example from the slides with a dense cluster and a sparse cluster, the nearest-neighbor approach does not consider the point p₂ an outlier, while the LOF approach finds both p₁ and p₂ to be outliers.

Density-based outlier detection using relative density

The average relative density (ard) of a point x is the ratio of the density of x to the average density of its nearest neighbors:

  ard(x, k) = density(x, k) / ( ∑_{y ∈ N(x,k)} density(y, k) / |N(x,k)| )   (Eq. 10.7)

A simplified version of the LOF technique uses ard(x, k): points whose density is much lower than that of their neighbors have ard far below 1, and the inverse of ard(x, k) can serve as the outlier score.
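A self-contained sketch of this simplified relative-density (LOF-style) score, using the inverse-average-distance density and taking 1/ard as the outlier score; the data is invented for illustration:

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (brute force)."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def density(points, i, k):
    """Inverse of the average distance to the k nearest neighbors."""
    nbrs = knn(points, i, k)
    return len(nbrs) / sum(math.dist(points[i], points[j]) for j in nbrs)

def ard(points, i, k):
    """Average relative density (Eq. 10.7): density of x over the mean density of its neighbors."""
    nbrs = knn(points, i, k)
    neighbor_avg = sum(density(points, j, k) for j in nbrs) / len(nbrs)
    return density(points, i, k) / neighbor_avg

# Hypothetical data: a cluster and one isolated point (index 4).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = [1 / ard(points, i, k=2) for i in range(len(points))]
print(max(range(len(points)), key=scores.__getitem__))  # prints 4, the isolated point
```

Points inside the cluster have ard ≈ 1 (their density matches their neighbors'), so their scores stay near 1, while the isolated point's score is an order of magnitude larger.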

Example of the relative density (LOF) approach, using k = 10.

Clustering-based outlier detection

A clustering-based outlier is an object that does not strongly belong to any cluster. Basic idea:
- Cluster the data into groups of differing density
- Choose points in small clusters as candidate outliers
- Compute the distance between the candidate points and the non-candidate clusters; if the candidate points are far from all non-candidate clusters, they are outliers
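The three steps above can be sketched with a tiny k-means and a small-cluster check. Everything here is illustrative: the deterministic initialization, the cluster count k, the minimum cluster size, and the distance threshold are all choices for this toy example, not part of the slide's method:

```python
import math

def kmeans(points, k, iters=20):
    """Tiny k-means; deterministic init from the first k points for reproducibility."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[c]
                     for c, cl in enumerate(clusters)]
    return centroids

def cluster_outliers(points, k, min_size, dist_threshold):
    """Points in small clusters that are far from every large cluster's centroid."""
    centroids = kmeans(points, k)
    labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
    sizes = [labels.count(c) for c in range(k)]
    outliers = []
    for p, lab in zip(points, labels):
        if sizes[lab] < min_size:  # candidate: member of a small cluster
            if all(math.dist(p, centroids[c]) > dist_threshold
                   for c in range(k) if sizes[c] >= min_size):
                outliers.append(p)
    return outliers

# Two hypothetical clusters plus one point belonging to neither:
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (20, 0)]
print(cluster_outliers(points, k=3, min_size=2, dist_threshold=5.0))
```

With k = 3, the stray point ends up alone in its own cluster (size 1 < min_size) and is far from both large clusters' centroids, so it is reported as an outlier.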

Clustering-based outlier example.

Outliers in lower-dimensional projections (a grid-based approach)

- In high-dimensional space, data is sparse and the notion of proximity becomes meaningless: every point is an almost equally good outlier from the perspective of proximity-based definitions
- Lower-dimensional projection methods: a point is an outlier if, in some lower-dimensional projection, it is present in a local region of abnormally low density

Outliers in lower-dimensional projections (a grid-based approach)

- Divide each attribute into φ equal-depth intervals; each interval contains a fraction f = 1/φ of the records
- Consider a k-dimensional cube created by picking grid ranges from k different dimensions
- If the attributes are independent, we expect a region to contain a fraction f^k of the records
- If there are N points, the sparsity of a cube D containing n(D) points is measured by the sparsity coefficient S:

    S(D) = ( n(D) - N·f^k ) / √( N·f^k·(1 - f^k) )

  since the expected number of points in a k-dimensional cube is N·f^k and its standard deviation is √(N·f^k·(1 - f^k))
- Negative sparsity indicates that the cube contains fewer points than expected

Ref: Outlier Detection for High Dimensional Data, Charu C. Aggarwal and Philip S. Yu, ACM SIGMOD 2001, May 21-24, Santa Barbara, California, USA, 2001.

Example for sparsity coefficient N=100, φ = 5, f = 1/5 = 0.2, N f 2 = 4 (expected fraction) 09/03/16 29