COMP20008 Elements of Data Processing. Outlier Detection and Clustering

Arnold Collins
5 years ago

1 COMP20008 Elements of Data Processing Outlier Detection and Clustering

2 Today Outlier detection for high dimensional data (part I); a digression on clustering algorithms (K-means, hierarchical clustering); outlier detection for high dimensional data (part II)

3 Outlier analysis Outlier: A data object that deviates significantly from the normal objects, as if it were generated by a different mechanism (Hawkins, 1980). Examples: an unusual credit card purchase; in sports, Michael Jordan or Lance Franklin. From a statistics perspective: normal (non-outlier) objects are generated by some statistical process, and outlier objects deviate from this generating process.

4 What if the data is multidimensional? Which players are outliers? [Table: 2014 AFL player statistics for Adelaide. Columns: #, Player, GM, KI, MK, HB, DI, DA, GL, BH, HO, TK, RB, IF, CL, CG, FF, FA, BR, CP, UP, CM, MI, 1%, BO, GA, %P. Players: Patrick Dangerfield, Rory Sloane, Scott Thompson, Brodie Smith, Matthew Jaensch, Richard Douglas, Matthew Wright, Sam Jacobs, David Mackay, Eddie Betts, James Podsiadly, Luke Brown, Brad Crouch, Brodie Martin, Daniel Talia, Rory Laird, Josh Jenkins, Taylor Walker, Brent Reilly, Sam Kerridge (values not reproduced).] Multidimensional case: Who are the outliers? [From

5 Data in higher dimensions For datasets with more than 4 dimensions: The data is difficult to visualise. It may be difficult to estimate the parameters of the data in order to adapt statistical methods such as Grubbs' test. The data may not follow a normal distribution, making statistical methods less suitable. This motivates the use of distance based methods as an alternative means of outlier detection.

6 Setting: Distance function Distance based outlier detection methods use the well-known Euclidean distance as a building block. Commonly normalise each attribute into the range [0,1] via a preprocessing step before computing distances. Given $X = (x_1, x_2, x_3, \ldots, x_n)$ and $Y = (y_1, y_2, y_3, \ldots, y_n)$: $d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2 + \cdots + (x_n - y_n)^2}$
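As a rough illustration of the preprocessing and distance step on this slide, the sketch below rescales each attribute into [0,1] and then computes the Euclidean distance. The function names `min_max_normalise` and `euclidean`, the toy data, and the choice to map a constant attribute to 0.0 are assumptions made for the example, not part of the lecture.

```python
import math

def min_max_normalise(rows):
    """Rescale every attribute (column) of `rows` into the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    normalised = []
    for row in rows:
        normalised.append([
            # A constant attribute has no spread; mapping it to 0.0 is an assumption.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalised

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two attributes on very different scales (illustrative values only).
data = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
scaled = min_max_normalise(data)
print(scaled)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(euclidean(scaled[0], scaled[2]))
```

Without the rescaling, the second attribute would dominate the distance simply because its values are larger.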

7 Basic method: distance from the centre of the data A basic method is to compute the distance of each object from the centre of the data. The further an object is from the centre (nearer the edge), the more likely it is to be an outlier. The following plots demonstrate the idea.

8 Dataset: Cross indicates centroid

9 Larger circles mean more like an outlier Dataset: bubble plot of outlier scores (distance from centroid)
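The centroid-based score from the previous slides can be sketched in a few lines. This is a minimal example on made-up 2D points; the names `centroid` and `centroid_outlier_scores` are hypothetical, and the data is assumed to be already normalised.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def centroid_outlier_scores(points):
    """Outlier score of each point = its Euclidean distance from the centroid."""
    centre = centroid(points)
    return [
        math.sqrt(sum((v - c) ** 2 for v, c in zip(point, centre)))
        for point in points
    ]

# Four points in a tight square plus one point far away (illustrative data).
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0), (10.0, 10.0)]
scores = centroid_outlier_scores(points)
most_outlying = max(range(len(points)), key=lambda i: scores[i])
print(most_outlying, round(scores[most_outlying], 2))
```

Note that the far-away point also drags the centroid towards itself, which is one reason this basic method is only a starting point.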

10 Clustering based outlier detection An outlier is expected to be far away from any groups of normal objects What if there are multiple localities in the data? Normal activities are diverse. The outlier score of an instance should be relative to its locality. So we can generalise the previous method to consider multiple (more than one) clusters. Each instance is associated with exactly one cluster and its outlier score is equal to the distance from its cluster centre.

11 Clustering based outlier detection. 2 clusters

12 3 clusters

13 6 clusters

14 9 clusters
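The clustering based score described on slide 10 might be sketched as follows, assuming the cluster centres are already known and that each object belongs to the cluster with the nearest centre (the slides only say each instance is associated with exactly one cluster, so nearest-centre assignment is an assumption). The function name `cluster_outlier_scores` and the data are illustrative.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_outlier_scores(points, centres):
    """Return (cluster index, outlier score) for every point.

    Assumption: a point belongs to the cluster whose centre is nearest,
    and its score is the distance to that centre.
    """
    results = []
    for point in points:
        distances = [euclidean(point, centre) for centre in centres]
        nearest = min(range(len(centres)), key=lambda i: distances[i])
        results.append((nearest, distances[nearest]))
    return results

# Two hypothetical cluster centres and five objects.
centres = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 0.0), (0.0, 1.0), (9.0, 10.0), (10.0, 11.0), (4.0, 4.0)]
results = cluster_outlier_scores(points, centres)
for point, (cluster, score) in zip(points, results):
    print(point, cluster, round(score, 3))
```

The last point sits between the two localities, so it receives the highest score even though it is not far from the centre of the whole dataset.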

15 How to obtain the clusters? Need an automatic algorithm that computes the cluster centroids and assigns each object to exactly one cluster.

16 Quality: What Is a Good Clustering? A good clustering method will produce high quality clusters: objects within the same cluster are close together, and objects in different clusters are far apart. Clustering is a major task in data analysis and visualisation, useful not just for outlier detection: market segmentation, image analysis, search engine result presentation.

17 What is Cluster Analysis? Intra-cluster distances are minimized; inter-cluster distances are maximized. (Figure from Tan, Steinbach and Kumar 2004.) We will look at two classic clustering algorithms: K-means and hierarchical clustering.

18 K-Means: The best known clustering method Given parameter k, the k-means algorithm is implemented in four steps: 1. Select k seed points as the initial cluster centres. 2. Assign each object to the cluster with the nearest seed point. 3. Compute new seed points as the centroids of the clusters of the current partitioning (the centroid is the centre, i.e., mean point, of the cluster). 4. Go back to Step 2; stop when the assignment does not change.

19 K-means 1. Ask user how many clusters they'd like (e.g. K=5). (Example from Andrew Moore kmeans11.pdf)

20 1. Ask user how many clusters they'd like (e.g. K=5). 2. Randomly guess K cluster Center locations.

21 1. Ask user how many clusters they'd like (e.g. K=5). 2. Randomly guess K cluster Center locations. 3. Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints.)

22 1. Ask user how many clusters they'd like (e.g. K=5). 2. Randomly guess K cluster Center locations. 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns.

23 1. Ask user how many clusters they'd like (e.g. K=5). 2. Randomly guess K cluster Center locations. 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns. 5. New Centers => new boundaries. 6. Repeat until no change.

24 K-means: Further detail Typically choose the initial seed points randomly, so different runs of the algorithm can produce different results. Closeness is measured by Euclidean distance (other distance functions can also be used). The algorithm can be shown to converge (to a local optimum) and typically doesn't require many iterations.
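The four k-means steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the `seed` and `max_iter` parameters, and the choice to keep the old centre when a cluster becomes empty are all assumptions made for the example.

```python
import math
import random

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_centre(point, centres):
    """Index of the centre closest to `point`."""
    return min(range(len(centres)), key=lambda i: euclidean(point, centres[i]))

def mean_point(points):
    """Centroid (component-wise mean) of a non-empty list of points."""
    return tuple(sum(column) / len(points) for column in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: select k of the data points at random as the initial centres.
    centres = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centre.
        new_assignment = [nearest_centre(p, centres) for p in points]
        # Step 4: stop when the assignment does not change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centre as the centroid of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centre
                centres[c] = mean_point(members)
    return centres, assignment

# Two well-separated groups of three points (illustrative data).
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, labels = k_means(pts, k=2)
print(labels)
print(centres)
```

Because the seeds are random, the cluster numbering (which group is 0 and which is 1) depends on the seed, and on less cleanly separated data different seeds can give different partitions.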

25 K-Means Interactive demo AppletKM.html

26 Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

27 Strengths of Hierarchical Clustering Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by cutting the dendrogram at the proper level. They may correspond to meaningful taxonomies, for example in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, ...).

28 Tree of Life (

29 Hierarchical Clustering Two main types of hierarchical clustering. Agglomerative (our focus in this subject): start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left. Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters). Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

30 Agglomerative Clustering Algorithm The more popular hierarchical clustering technique. The basic algorithm is straightforward: 1. Compute the proximity matrix. 2. Let each data point be a cluster. 3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains. The key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms.
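A naive sketch of the agglomerative loop is given below. To stay short it recomputes cluster-to-cluster distances on every pass instead of maintaining and updating a proximity matrix, and it uses single linkage (closest pair of points) as the default cluster distance. The function names, the `k` stopping parameter and the data are assumptions for illustration.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(a, b, points):
    """Distance between clusters a and b (lists of point indices): closest pair."""
    return min(euclidean(points[i], points[j]) for i in a for j in b)

def agglomerative(points, k=1, cluster_distance=single_link):
    """Merge the two closest clusters until only k clusters remain.

    Returns the final clusters (lists of point indices) and the merge history.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > k:
        best = None
        # Find the closest pair of clusters (recomputed each pass for brevity).
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], points)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# One-dimensional illustrative data: two pairs and one distant point.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
clusters, merges = agglomerative(points, k=2)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 2, 3], [4]]
```

Stopping at `k` clusters corresponds to cutting the dendrogram at the level where k clusters remain; the `merges` list records the information a dendrogram would display.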

31 Starting Situation Start with clusters of individual points and a proximity matrix. [Figure: points p1, p2, p3, p4, ..., p9, p10, p11, p12 and the proximity matrix over p1, p2, p3, p4, p5, ...]

32 Intermediate Situation After some merging steps, we have some clusters. [Figure: clusters C1, C2, C3, C4, C5 and their proximity matrix]

33 Intermediate Situation We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. [Figure: clusters C1, C2, C3, C4, C5 and their proximity matrix]

34 After Merging The question is: how do we update the proximity matrix? [Figure: clusters C1, C2 U C5, C3, C4; the proximity matrix entries involving C2 U C5 are marked "?"]

35 How to Define Inter-Cluster Similarity Options: MIN (Single Linkage), MAX (Complete Linkage), Group Average (Average Linkage), Distance Between Centroids. [Figure: two clusters of points and the proximity matrix over p1, p2, p3, p4, p5, ...]

40 Cluster Similarity: MIN or Single Linkage Similarity of two clusters is based on the two most similar (closest) points in the different clusters Determined by one pair of points, i.e., by one link in the proximity graph.

41 Hierarchical Clustering: MIN [Figure panels: Nested Clusters; Dendrogram]

42 Strength of MIN Can handle non-elliptical shapes. [Figure panels: Original Points; Two Clusters]

43 Cluster Similarity: MAX or Complete Linkage Similarity of two clusters is based on the two least similar (most distant) points in the different clusters. Determined by all pairs of points in the two clusters.

44 Hierarchical Clustering: MAX [Figure panels: Nested Clusters; Dendrogram]

45 Strength of MAX Less susceptible to noise and outliers. [Figure panels: Original Points; Two Clusters]

46 Limitations of MAX Tends to break large clusters. Biased towards globular clusters. [Figure panels: Original Points; Two Clusters]

47 Cluster Similarity: Group Average (Average Linkage) Proximity of two clusters is the average of pairwise proximity between points in the two clusters: $\mathrm{proximity}(\mathrm{Cluster}_i, \mathrm{Cluster}_j) = \dfrac{\sum_{p_i \in \mathrm{Cluster}_i,\ p_j \in \mathrm{Cluster}_j} \mathrm{proximity}(p_i, p_j)}{|\mathrm{Cluster}_i| \times |\mathrm{Cluster}_j|}$ Need to use average connectivity for scalability, since total proximity favors large clusters.

48 Hierarchical Clustering: Group Average Nested Clusters Dendrogram

49 Hierarchical Clustering: Group Average Compromise between single and complete link. Strengths: less susceptible to noise and outliers. Limitations: biased towards globular clusters.

50 Hierarchical Clustering: Comparison MIN MAX Group Average
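To make the inter-cluster measures concrete, the sketch below computes MIN, MAX, group average and centroid distance for two small clusters. It assumes proximity is measured as Euclidean distance (so smaller means more similar); the function names and the two clusters are invented for the example.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise(a, b):
    """All distances between a point of cluster a and a point of cluster b."""
    return [euclidean(p, q) for p in a for q in b]

def single_linkage(a, b):
    """MIN: distance between the two closest points."""
    return min(pairwise(a, b))

def complete_linkage(a, b):
    """MAX: distance between the two most distant points."""
    return max(pairwise(a, b))

def average_linkage(a, b):
    """Group average: sum of pairwise distances divided by |a| * |b|."""
    return sum(pairwise(a, b)) / (len(a) * len(b))

def centroid_distance(a, b):
    """Distance between the two cluster centroids."""
    mean_a = [sum(column) / len(a) for column in zip(*a)]
    mean_b = [sum(column) / len(b) for column in zip(*b)]
    return euclidean(mean_a, mean_b)

# Two vertical pairs of points, 3 units apart (illustrative data).
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
print(single_linkage(a, b))     # 3.0
print(complete_linkage(a, b))   # sqrt(13), about 3.61
print(average_linkage(a, b))    # lies between the MIN and MAX values
print(centroid_distance(a, b))  # 3.0
```

Any of these functions could serve as the cluster distance in an agglomerative procedure; the choice is what distinguishes single, complete and average linkage.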

51 Hierarchical Clustering: Problems and Limitations Once a decision is made to combine two clusters, it cannot be undone. No objective function is directly minimized. Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling different sized clusters and convex shapes; breaking large clusters.

52 Limitations of distance based clustering outlier detection Strengths: Works for many types of data. Clusters can be regarded as summaries of the data. Once the clusters are obtained, we need only compare an object against the clusters to determine whether it is an outlier (fast). Weaknesses: Since there are many clustering methods, there are many clustering-based outlier detection methods as well. Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale up well to large data sets. An alternative: proximity based methods, which tackle outliers directly.

53 Clustering summary We have reviewed two well known algorithms for clustering: K-means and hierarchical clustering. Many implementations exist, including in Python: scikit-learn (sklearn.cluster) and SciPy (scipy.cluster.vq for k-means, scipy.cluster.hierarchy for hierarchical clustering). Now back to finish off our discussion on outlier detection.

54 Proximity based methods Distance based outlier detection K-Nearest neighbor distance

55 Distance based outlier detection (Knorr and Ng) An object is an outlier if its D-neighborhood (the objects less than distance D away) contains only a few objects (less than a fraction (1-p) of the data). [Figure: a circle of radius D around an object]

56 Distance based outlier detection Need to choose: a distance radius D, and a fraction parameter p (with a higher p, an object must have fewer objects inside its circle to count as an outlier). [Figure: a circle of radius D around an object]
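The rule above can be sketched as a direct count. This is only an illustration under stated assumptions: the object itself is not counted as its own neighbour, the fraction is taken over the whole dataset, and the name `distance_based_outliers` and the data are invented for the example.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_based_outliers(points, D, p):
    """Indices of objects whose D-neighbourhood holds less than (1 - p) of the data.

    Assumptions: an object is not its own neighbour, and the fraction is
    computed relative to the total number of objects.
    """
    n = len(points)
    outliers = []
    for i, x in enumerate(points):
        neighbours = sum(
            1 for j, y in enumerate(points)
            if j != i and euclidean(x, y) < D
        )
        if neighbours / n < 1 - p:
            outliers.append(i)
    return outliers

# Four points in a unit square plus one isolated point (illustrative data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(distance_based_outliers(points, D=2.0, p=0.8))  # [4]
```

Each of the four square points has three neighbours within distance 2 (a fraction 0.6 of the data), while the isolated point has none, so only it falls below the 1 - p = 0.2 threshold.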

57 Dataset

58 D=0.5, p=0.6: Too many objects are outliers. (Objects with circles are outliers; circle size is not significant.)

59 D=0.3, p=0.8: Looks reasonable

60 D=0.3, p=0.9: Too aggressive?

61 D=0.1, p=0.98: Just right!?

62 Distance based outlier detection: observations Choosing parameters is hard! We can do it for this toy 2D example, but it would be even more challenging for a real dataset. Could we use our domain knowledge? E.g. a student in this class is a (spatial) outlier if less than 5% of the class is sitting within 2 metres of them. Also, the method doesn't provide a ranking for the outliers (e.g. "object x is more of an outlier than object y"). Also consider: computational issues.

63 K nearest neighbor outlier detection [Ramaswamy et al 2000] Having to choose 2 parameter values to decide on outliers is hard. Instead, employ just one parameter, k, and compute the k-th nearest neighbor of each object. The outlier score of an object is the distance to its k-th nearest neighbor (k-NN distance). [Figure: the 3-NN distance of an object]

64 Example dataset

65 K=1: Larger circle means higher outlier score

66 K=2

67 K=6

68 K=16

69 K-NN Outlier detection Given an outlier score associated with each object: sort the objects in order of score (highest to lowest) and select the n objects with the highest outlier score. Easier for the user to specify parameters compared to the distance based outlier detection we just looked at. Nested loop algorithm: for each object, compute the distance to its k-th nearest neighbor with a sequential scan.
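A minimal sketch of the nested loop idea follows: for each object, scan all other objects, take the distance to the k-th nearest one as the score, then rank and keep the top n. The function names and data are hypothetical, and the sketch assumes k is smaller than the number of objects.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_scores(points, k):
    """Outlier score of each object = distance to its k-th nearest neighbour."""
    scores = []
    for i, x in enumerate(points):          # outer loop: every object
        distances = sorted(                 # inner loop: sequential scan
            euclidean(x, y) for j, y in enumerate(points) if j != i
        )
        scores.append(distances[k - 1])     # assumes k <= len(points) - 1
    return scores

def top_n_outliers(points, k, n):
    """The n objects with the highest k-NN distance, as (index, score) pairs."""
    scores = knn_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:n]]

# A unit square of four points plus two increasingly isolated points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0), (3.0, 3.0)]
top = top_n_outliers(points, k=2, n=2)
print([index for index, _ in top])  # [4, 5]
```

Unlike the two-parameter rule, the scores give a ranking, so the user only has to decide how many of the top-ranked objects to inspect. The nested loop compares every pair of objects, so its cost grows quadratically with the number of objects.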

70 Further reading The following book can be downloaded electronically from the Unimelb library: Outlier Analysis, Charu Aggarwal, Springer 2013. Read pages 1-9. The following book is available online electronically from the Unimelb library: Data Mining Concepts and Techniques, Han et al, 3rd edition. Read section 12.1 (background on outlier detection). Read sections 10.1 and 10.2 (background on clustering, in particular the k-means style approach).

71 Acknowledgements Material partly adapted from Data Mining Concepts and Techniques, Han et al, 2nd edition. References (for interest, not compulsory reading): Distance based outlier detection: Knorr, E.M. and Ng, R.T. (1999). Finding intensional knowledge of distance-based outliers. In Proc. Int. Conf. on Very Large Data Bases (VLDB), Edinburgh, Scotland. K-NN outlier detection: Ramaswamy, S., Rastogi, R. and Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD), Dallas, TX.

72 Plan 8 April: Visualisation and clustering. 11 April: Guest lecture by Scott Thomson from Google. Data Google: BigQuery, DeepMind, AlphaGo, Driverless Cars & Data driven innovation. 15 and 18 April: NoSQL, distributed and cloud. Workshop this week is on clustering. Project Phase 1 is due 12pm today; we are aiming to have it marked this week.
