Knowledge Discovery in Databases. Part III Clustering


1 Knowledge Discovery in Databases Erich Schubert Winter Semester 2017/18 Part III Clustering

2 Clustering : / Overview Basic Concepts and Applications Distance & Similarity Clustering Approaches Overview Hierarchical Methods Partitioning Methods Density-Based Methods Biclustering Methods Other Clustering Methods Evaluation Database Acceleration Model Optimization Methods

3 Clustering Basic Concepts and Applications Suggested reading: V. Estivill-Castro. Why so many clustering algorithms: A Position Paper. In: SIGKDD Explorations 4.1 (2002), pp. 65-75.
What is Clustering? Cluster analysis (clustering, segmentation, quantization, ...) is the data mining core task to find clusters. But what is a cluster? [Est02] It cannot be precisely defined: many different principles and models have been defined, and even more algorithms, with very different results. When is a result valid? Results are subjective, "in the eye of the beholder"; no specific definition seems best in the general case [Bon64]. Common themes found in definition attempts: more homogeneous, more similar, cohesive.

4 Clustering Basic Concepts and Applications What is Clustering? / Cluster analysis (clustering, segmentation, quantization, ...) is the data mining core task to divide the data into clusters such that: similar (related) objects should be in the same cluster; dissimilar (unrelated) objects should be in different clusters; clusters are not defined beforehand (otherwise: use classification). Clusters have (statistical, geometric, ...) properties such as: connectivity, separation, least squared deviation, density. Clustering algorithms have different cluster models ("what is a cluster for this algorithm?") and induction principles ("how does the algorithm find clusters?"). [Figure: example point sets labeled "Separated", "Cohesive?", "Similar?", "Connected".]

5 Clustering Basic Concepts and Applications : / Applications of Clustering We can use clustering stand-alone for data exploration, or as preprocessing step for other algorithms. Usage examples: Customer segmentation: Optimize ad targeting or product design for different focus groups. Web visitor segmentation: Optimize web page navigation for different user segments. Data aggregation / reduction: Represent many data points with a single (representative) example. E.g., reduce color palette of an image to k colors Text collection organization: Group text documents into (previously unknown) topics.

6 Clustering Basic Concepts and Applications : / Applications of Clustering / Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus, and species Information retrieval: document clustering Land use: identification of areas of similar land use in an Earth observation database Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs City-planning: Identifying groups of houses according to their house type, value, and geographical location Earthquake studies: observed Earthquake epicenters should be clustered along continent faults Climate: understanding Earth climate, find patterns of atmospheric and oceanic phenomena Economic Science: market research

7 Clustering Basic Concepts and Applications Applications of Clustering / Construction of thematic maps from satellite imagery. [Figure: raster image of California over several wave bands (Earth's surface), mapped into a multi-dimensional feature space.] Different surfaces of the Earth have different properties in terms of reflections and emissions; cluster satellite imagery into different land use (forest, agriculture, desert, buildings, ...).

8 Clustering Basic Concepts and Applications : 6 / Basic Steps to Develop a Clustering Task Feature selection Select information (about objects) concerning the task of interest Aim at minimal information redundancy Weighting of information Proximity measure Similarity of two feature vectors Clustering criterion Expressed via a cost function or some rules Clustering algorithms Choice of algorithms Validation of the results Validation test Interpretation of the results Integration with applications

9 Clustering Distance & Similarity Distances, Metrics and Similarities In mathematics, a metric or distance function is defined as a function dist: X × X → R that satisfies the conditions
(N) dist(x, y) ≥ 0 (non-negativity)
(I) dist(x, y) = 0 ⇔ x = y (identity of indiscernibles)
(S) dist(x, y) = dist(y, x) (symmetry)
(T) dist(x, y) ≤ dist(x, o) + dist(o, y) (triangle inequality)
In data mining, we have a less perfect world. For (I), we require only x = y ⇒ dist(x, y) = 0, because we often have duplicate records. We often do not have (T). We often allow the distance to be ∞. Common terminology used in data mining: metric: satisfies (N), (S), and (T). distance: satisfies (N) and (S). dissimilarity: usually (but not always) satisfies (N) and (S).

10 Clustering Distance & Similarity Distances, Metrics and Similarities / A similarity is a function s: X × X → R ∪ {±∞} such that low values indicate dissimilar objects, and large values indicate similarity. Similarities can be: Symmetric: s(x, y) = s(y, x). Non-negative: s(x, y) ≥ 0. Normalized: s(x, y) ∈ [0; 1] and s(x, x) = 1. If dist(x, y) is a distance function, then it implies a non-negative, symmetric, normalized similarity: $s(x, y) := \frac{1}{1 + \text{dist}(x, y)} \in [0; 1]$. If s(x, y) is a normalized and symmetric similarity, then dist(x, y) = 1 − s(x, y) is a distance (with at least (N) and (S), but usually not a metric).

11 Clustering Distance & Similarity Distance Functions There are many different distance functions. The standard distance on $R^d$ is the Euclidean distance: $d_{\text{Euclidean}}(x, y) = \sqrt{\sum_d (x_d - y_d)^2} = \|x - y\|_2$. The Manhattan distance (city block metric): $d_{\text{Manhattan}}(x, y) = \sum_d |x_d - y_d| = \|x - y\|_1$. Both are special cases of the Minkowski norms ($L_p$ distances, $p \geq 1$): $d_{L_p}(x, y) = \left(\sum_d |x_d - y_d|^p\right)^{1/p} = \|x - y\|_p$. In the limit $p \to \infty$ we also get the maximum norm: $d_{\text{Maximum}}(x, y) = \max_d |x_d - y_d|$. $\|x\|$ without index usually refers to the Euclidean norm. Many more distance functions: see the Encyclopedia of Distances [DD09]!
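As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of the Minkowski family; the function names are our own.

```python
import numpy as np

def minkowski(x, y, p=2.0):
    """L_p distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def maximum_distance(x, y):
    """Limit p -> infinity: the maximum (Chebyshev) norm."""
    return np.max(np.abs(x - y))

x, y = np.array([1.0, 2.0, 5.0]), np.array([2.0, 0.0, 5.0])
print(minkowski(x, y, 1), minkowski(x, y, 2), maximum_distance(x, y))
# 3.0, ~2.236, 2.0
```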

12 Clustering Distance & Similarity Distance Functions / Many more variants: Squared Euclidean distance ("sum of squares"): $d^2_{\text{Euclidean}}(x, y) = \sum_d (x_d - y_d)^2 = \|x - y\|_2^2$. Minimum distance (limit $p \to -\infty$; not metric): $d_{\text{Minimum}}(x, y) = \min_d |x_d - y_d|$. Weighted Minkowski norms: $d_{L_p}(x, y; \omega) = \left(\sum_d \omega_d |x_d - y_d|^p\right)^{1/p}$. Canberra distance: $d_{\text{Canberra}}(x, y) = \sum_d \frac{|x_d - y_d|}{|x_d| + |y_d|}$. Many more distance functions: see the Encyclopedia of Distances [DD09]!

13 Clustering Distance & Similarity Similarity Functions Instead of distances, we can often also use similarities: Cosine similarity: $s_{\cos}(x, y) := \frac{\langle x, y\rangle}{\|x\| \cdot \|y\|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\sqrt{\sum_i y_i^2}}$. On data normalized to unit $L_2$ length ($\|x\| = \|y\| = 1$), this simplifies to $s_{\cos}(x, y) := \langle x, y\rangle = \sum_i x_i y_i$. Popular transformations to distances: $d_{\cos}(x, y) = 1 - s_{\cos}(x, y)$ or $d_{\cos}(x, y) = \arccos(s_{\cos}(x, y))$. Careful: with similarities, large values are better; with distances, small values are better.
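A minimal sketch (ours, not from the slides) of cosine similarity and the two distance transformations mentioned above:

```python
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0])
s = cosine_similarity(x, y)
print(1.0 - s)        # cosine distance d_cos = 1 - s
print(np.arccos(s))   # angular variant d_cos = arccos(s)
```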

14 Clustering Distance & Similarity Distances for Binary Data Binary attributes: two objects $o_1, o_2 \in \{0, 1\}^d$ having only binary attributes. Determine the following quantities: $f_{01}$ = the number of attributes where $o_1$ has 0 and $o_2$ has 1; $f_{10}$ = the number of attributes where $o_1$ has 1 and $o_2$ has 0; $f_{00}$ = the number of attributes where $o_1$ has 0 and $o_2$ has 0; $f_{11}$ = the number of attributes where $o_1$ has 1 and $o_2$ has 1. Simple matching coefficient (SMC): $s_{\text{SMC}}(o_1, o_2) = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}} = \frac{f_{00} + f_{11}}{d}$, used for symmetric attributes where each state (0, 1) is equally important. Simple matching distance (SMD): $d_{\text{SMC}}(o_1, o_2) = 1 - s_{\text{SMC}}(o_1, o_2) = \frac{f_{01} + f_{10}}{d} = \frac{d_{\text{Hamming}}(o_1, o_2)}{d}$.

15 Clustering Distance & Similarity Jaccard Coefficient for Sets Jaccard coefficient for sets A and B (if $|A| = |B| = 0$, we use similarity 1): $s_{\text{Jaccard}}(A, B) := \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \in [0; 1]$. Jaccard distance for sets A and B: $d_{\text{Jaccard}}(A, B) := 1 - s_{\text{Jaccard}}(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} \in [0; 1]$. If we encode the sets as binary vectors, we get: $s_{\text{Jaccard}}(o_1, o_2) = \frac{f_{11}}{f_{11} + f_{01} + f_{10}}$ and $d_{\text{Jaccard}}(o_1, o_2) = \frac{f_{01} + f_{10}}{f_{11} + f_{01} + f_{10}}$.
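A small sketch (assuming the inputs are 0/1 NumPy vectors; the names are ours) computing SMC and Jaccard from the counts $f_{11}, f_{10}, f_{01}, f_{00}$ defined above:

```python
import numpy as np

def smc_and_jaccard(o1, o2):
    o1, o2 = np.asarray(o1, dtype=bool), np.asarray(o2, dtype=bool)
    f11 = np.sum(o1 & o2)      # both 1
    f00 = np.sum(~o1 & ~o2)    # both 0
    f10 = np.sum(o1 & ~o2)     # 1 in o1, 0 in o2
    f01 = np.sum(~o1 & o2)     # 0 in o1, 1 in o2
    smc = (f00 + f11) / (f00 + f01 + f10 + f11)
    denom = f11 + f01 + f10
    jaccard = f11 / denom if denom > 0 else 1.0   # empty sets: similarity 1
    return smc, jaccard

print(smc_and_jaccard([1, 0, 1, 0, 0], [1, 1, 0, 0, 0]))  # (0.6, ~0.333)
```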

16 Clustering Distance & Similarity Example Distances for Categorical Data Categorical data: two tuples $x = (x_1, x_2, \ldots, x_d)$ and $y = (y_1, y_2, \ldots, y_d)$ with each $x_i$ and $y_i$ being categorical (nominal) attributes. Hamming distance: count the number of differences: $\text{dist}_{\text{Hamming}}(x, y) = \sum_{i=1}^d \delta(x_i, y_i)$ with $\delta(x_i, y_i) = 0$ if $x_i = y_i$ and $1$ if $x_i \neq y_i$. Jaccard: by taking the position into account. Gower's distance for mixed interval, ordinal, and categorical data: $\text{dist}_{\text{Gower}}(x, y) = \sum_{i=1}^d d_i(x_i, y_i)$ with $d_i(x_i, y_i) = 0$ if $X_i$ is categorical and $x_i = y_i$; $1$ if $X_i$ is categorical and $x_i \neq y_i$; $\frac{|x_i - y_i|}{R_i}$ if $X_i$ is interval scale; $\frac{|\text{rank}(x_i) - \text{rank}(y_i)|}{n - 1}$ if $X_i$ is ordinal scale; where $R_i = \max_x x_i - \min_x x_i$ is the range, and $\text{rank}(x) \in [1, \ldots, n]$ is the rank. Intuition: each attribute has the same maximum distance contribution.

17 Clustering Distance & Similarity Mahalanobis Distance Given a data set with variables $X_1, \ldots, X_d$, and $n$ vectors $x_1 \ldots x_n$, $x_i = (x_{i1}, \ldots, x_{id})^T$. The $d \times d$ covariance matrix is computed as: $\Sigma_{ij} := S_{X_i X_j} = \frac{1}{n} \sum_{k=1}^n (x_{ki} - \overline{X_i})(x_{kj} - \overline{X_j})$. The Mahalanobis distance is then defined on two vectors $x, y \in R^d$ as: $d_{\text{Mah}}(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$, where $\Sigma^{-1}$ is the inverse of the covariance matrix of the data. This is a generalization of the Euclidean distance. It takes correlation between attributes into account. The attributes can have different ranges of values.
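A minimal NumPy sketch (ours) of the Mahalanobis distance; note that np.cov uses the 1/(n−1) estimator rather than the 1/n shown above, which does not change the qualitative behaviour.

```python
import numpy as np

def mahalanobis(x, y, data):
    """Mahalanobis distance between x and y, using the covariance of `data` (n x d)."""
    cov = np.cov(data, rowvar=False)     # d x d covariance matrix (1/(n-1) estimator)
    inv = np.linalg.pinv(cov)            # pseudo-inverse: robust to constant attributes
    diff = np.asarray(x) - np.asarray(y)
    return np.sqrt(diff @ inv @ diff)

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 0.1])  # very different scales
print(mahalanobis(data[0], data[1], data))
```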

18 Clustering Distance & Similarity Scaling & Normalization Gower's distance and Mahalanobis distance scale the attributes. This can be beneficial (but also detrimental) for other distances, too! Popular normalization approaches: To unit range: $x_i' = \frac{x_i - \min X_i}{\max X_i - \min X_i} \in [0; 1]$. To unit variance: $x_i' = \frac{x_i - \overline{X_i}}{S_{X_i}}$, such that $\overline{X_i'} = 0$ and $S_{X_i'} = 1$. Principal component analysis (PCA), such that $S_{X_i' X_j'} = 0$ for $i \neq j$. Many more... Scaling is implicit feature weighting!
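Two of these normalizations as a short sketch (ours); constant attributes are left unchanged to avoid division by zero.

```python
import numpy as np

def minmax_scale(X):
    """Scale every column to the unit range [0, 1]."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / np.where(mx > mn, mx - mn, 1.0)

def standardize(X):
    """Scale every column to mean 0 and unit variance."""
    sd = X.std(axis=0)
    return (X - X.mean(axis=0)) / np.where(sd > 0, sd, 1.0)
```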

19 Clustering Distance & Similarity : 7 / To Scale, or not to Scale? Person Age [years] Height [cm] Height [feet] A 9 6. B 6.8 C 9 6. D 6.8 Which persons are more similar, A and B, or A and C? Height [cm] Height [ft] 9 A C B D Age [years] Here, scaling often improves the results! 6 A B C D 6 8 Age [years]

20 Clustering Distance & Similarity : 8 / To Scale, or not to Scale? / Object X Y A.7 8. B.. C D To scale or not to scale? or

21 Clustering Distance & Similarity To Scale, or not to Scale? / Object Longitude Latitude Palermo Venice San Francisco Portland, OR. To scale, or not to scale? Don't scale! We need to know what the attributes mean!

22 Clustering Clustering Approaches Overview : 9 / Different Kinds of Clustering Algorithms There are many clustering algorithms, with different Paradigms: Distance Similarity Variance Density Connectivity Probability Subgraph... Properties: Partitions: strict, hierarchical, overlapping Models: Prototypes, Patterns, Gaussians, Subgraphs,... Outliers / noise, or complete clustering Hard (binary) assignment or soft (fuzzy) assignment Full dimensional or subspace clustering Rectangular, spherical, or correlated clusters Iterative, or single-pass, or streaming...

23 Clustering Clustering Approaches Overview : / Quality: What Is Good Clustering? A good clustering method will produce high quality clusters high intra-class similarity: cohesive within clusters low inter-class similarity: distinctive between clusters The quality of a clustering method depends on the similarity measure used by the method, its implementation, and its ability to discover some or all of the hidden patterns Quality of clustering There is usually a separate quality function that measures the goodness of a cluster or a clustering It is hard to define similar enough or good enough, i.e., the answer is typically highly subjective and specific to application domain

24 Clustering Clustering Approaches Overview : / Requirements and Challenges Scalability Clustering all the data instead of only sample data Ability to deal with different types of attributes Numeric, binary, categorical, ordinal, linked, and mixture of these Constraint-based clustering User may give input as constraints Use domain knowledge to determine input parameters Interpretability and usability Others Discovery of clusters with arbitrary shape Ability to deal with noisy data Incremental clustering and insensitivity to input order High dimensionality

25 Clustering Clustering Approaches Overview : / Major Clustering Approaches Partitioning approaches Construct partitions that optimize some criterion, e.g., minimizing the sum of squared errors Typical methods: k-means, PAM (k-medoids), CLARA, CLARANS Density-based approaches Based on connectivity, density, and distance functions Typical methods: DBSCAN, OPTICS, DenClue, HDBSCAN* Hierarchical approaches Create a hierarchical decomposition of the set of data (or objects) using some criterion Typical methods: DIANA, AGNES, BIRCH Model-based Optimize the fit of a hypothesized model to the data Typical methods: Gaussian Mixture Modeling (GMM, EM), SOM, COBWEB

26 Clustering Clustering Approaches Overview : / Major Clustering Approaches / Grid-based approach based on a multiple-level granularity structure Typical methods: STING, WaveCluster, CLIQUE Frequent pattern-based Based on the analysis of frequent patterns Typical methods: p-cluster User-guided or constraint-based Clustering by considering user-specified or application-specific constraints Typical methods: COD (obstacles), constrained clustering Link-based clustering Objects are often linked together in various ways Massive links can be used to cluster objects: SimRank, LinkClus

27 Clustering Hierarchical Methods Hierarchical Agglomerative Clustering One of the earliest clustering methods [Sne57; Sib73; Har75; KR90]: 1. Initially, every object is a cluster. 2. Find the two most similar clusters, and merge them. 3. Repeat (2) until only one cluster remains. 4. Plot the tree ("dendrogram"), and choose interesting subtrees. Many variations that differ by: distance / similarity measure of objects; distance measure of clusters ("linkage"); optimizations.

28 Clustering Hierarchical Methods Distance of Clusters Single-linkage: minimum distance (= maximum similarity): $d_{\text{single}}(A, B) := \min_{a \in A, b \in B} d(a, b)$ (similarity: $\max_{a \in A, b \in B} s(a, b)$). Complete-linkage: maximum distance (= minimum similarity): $d_{\text{complete}}(A, B) := \max_{a \in A, b \in B} d(a, b)$ (similarity: $\min_{a \in A, b \in B} s(a, b)$). Average-linkage (UPGMA): average distance = average similarity: $d_{\text{average}}(A, B) := \frac{1}{|A| \cdot |B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$. Centroid-linkage: distance of the cluster centers (Euclidean only): $d_{\text{centroid}}(A, B) := \|\mu_A - \mu_B\|$.

29 Clustering Hierarchical Methods Distance of Clusters / McQuitty (WPGMA): average of the previous sub-clusters. Defined recursively, e.g., via the Lance-Williams equation: the average distance to the previous two clusters. Median-linkage (Euclidean only): distance from the midpoint. Defined recursively, e.g., via the Lance-Williams equation: the median is the halfway point of the previous merge. Ward-linkage (Euclidean only): minimum increase of squared error: $d_{\text{Ward}}(A, B) := \frac{|A| \cdot |B|}{|A| + |B|} \|\mu_A - \mu_B\|^2$. Mini-Max-linkage: best maximum distance (best minimum similarity): $d_{\text{minimax}}(A, B) := \min_{c \in A \cup B} \max_{p \in A \cup B} d(c, p)$ (similarity: $\max_{c \in A \cup B} \min_{p \in A \cup B} s(c, p)$).

30 Clustering Hierarchical Methods AGNES Agglomerative Nesting [KR90] AGNES, using the Lance-Williams equations [LW67]: 1. Compute the pairwise distance matrix of the objects. 2. Find the position of the minimum distance d(i, j) (for similarity: the maximum similarity s(i, j)). 3. Combine the rows and columns of i and j into one, using the Lance-Williams update equations d(A ∪ B, C) = LanceWilliams(d(A, C), d(B, C), d(A, B)), using only the stored, known distances d(A, C), d(B, C), d(A, B). 4. Repeat from (2) until only one entry remains. 5. Return the dendrogram tree. This avoids computing d(A ∪ B, C) from the original objects directly, as that is expensive.

31 Clustering Hierarchical Methods AGNES Agglomerative Nesting [KR90] / The Lance-Williams update equations have the general form: $d(A \cup B, C) = \alpha_1 d(A, C) + \alpha_2 d(B, C) + \beta\, d(A, B) + \gamma\, |d(A, C) - d(B, C)|$. Several (but not all) linkages can be expressed in this form (for distances):
Single-linkage: $\alpha_1 = \alpha_2 = \tfrac{1}{2}$, $\beta = 0$, $\gamma = -\tfrac{1}{2}$.
Complete-linkage: $\alpha_1 = \alpha_2 = \tfrac{1}{2}$, $\beta = 0$, $\gamma = +\tfrac{1}{2}$.
Average-group-linkage (UPGMA): $\alpha_1 = \tfrac{|A|}{|A|+|B|}$, $\alpha_2 = \tfrac{|B|}{|A|+|B|}$, $\beta = 0$, $\gamma = 0$.
McQuitty (WPGMA): $\alpha_1 = \alpha_2 = \tfrac{1}{2}$, $\beta = 0$, $\gamma = 0$.
Centroid-linkage (UPGMC): $\alpha_1 = \tfrac{|A|}{|A|+|B|}$, $\alpha_2 = \tfrac{|B|}{|A|+|B|}$, $\beta = -\tfrac{|A| \cdot |B|}{(|A|+|B|)^2}$, $\gamma = 0$.
Median-linkage (WPGMC): $\alpha_1 = \alpha_2 = \tfrac{1}{2}$, $\beta = -\tfrac{1}{4}$, $\gamma = 0$.
Ward: $\alpha_1 = \tfrac{|A|+|C|}{|A|+|B|+|C|}$, $\alpha_2 = \tfrac{|B|+|C|}{|A|+|B|+|C|}$, $\beta = -\tfrac{|C|}{|A|+|B|+|C|}$, $\gamma = 0$.
MiniMax linkage cannot be computed with Lance-Williams updates; instead we need to find the best cluster representative (in $O(n^2)$). A small code sketch of AGNES with these updates follows below.
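To make the update rule concrete, here is a naive O(n^3) sketch (ours, not the slides' implementation) of AGNES on a precomputed distance matrix, using the Lance-Williams coefficients above; for Ward, the input matrix is assumed to contain squared Euclidean distances.

```python
import numpy as np

# Lance-Williams coefficients (alpha1, alpha2, beta, gamma) as functions of |A|, |B|, |C|.
LINKAGES = {
    "single":   lambda na, nb, nc: (0.5, 0.5, 0.0, -0.5),
    "complete": lambda na, nb, nc: (0.5, 0.5, 0.0, +0.5),
    "average":  lambda na, nb, nc: (na / (na + nb), nb / (na + nb), 0.0, 0.0),
    "ward":     lambda na, nb, nc: ((na + nc) / (na + nb + nc),
                                    (nb + nc) / (na + nb + nc),
                                    -nc / (na + nb + nc), 0.0),
}

def agnes(D, linkage="complete"):
    """Return the merge sequence [(i, j, distance), ...] for a distance matrix D."""
    D = np.array(D, dtype=float)
    n = len(D)
    np.fill_diagonal(D, np.inf)
    size = {i: 1 for i in range(n)}
    active = set(range(n))
    merges = []
    while len(active) > 1:
        idx = sorted(active)                       # find the closest pair of clusters
        sub = D[np.ix_(idx, idx)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = idx[a], idx[b]
        merges.append((a, b, D[a, b]))
        for c in active - {a, b}:                  # Lance-Williams update: a absorbs b
            a1, a2, beta, gamma = LINKAGES[linkage](size[a], size[b], size[c])
            D[a, c] = D[c, a] = (a1 * D[a, c] + a2 * D[b, c]
                                 + beta * D[a, b] + gamma * abs(D[a, c] - D[b, c]))
        size[a] += size[b]
        active.remove(b)
    return merges
```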

32-45 Clustering Hierarchical Methods Example: AGNES Example with complete linkage (= maximum of the distances). [Figure sequence: scatter plot of points A-F, the pairwise distance matrix, and the dendrogram, updated after every step.] The algorithm repeats three steps: find the minimum distance in the matrix; merge the two clusters, and update the dendrogram; update the distance matrix (here: keep the maximum in each row/column, except the diagonal). Note that we may need to merge non-adjacent rows, and we do not know the optimum label positions of the dendrogram in advance.

46-59 Clustering Hierarchical Methods Example: AGNES / Example with single linkage (= minimum of the distances). [Figure sequence: scatter plot, distance matrix, and dendrogram, updated after every step, analogous to the complete-linkage example; here the matrix update keeps the minimum in each row/column, except the diagonal.] In this very simple example, single and complete linkage behave very similarly: they produce the same clusters, but this is usually not the case.

60 Clustering Hierarchical Methods : / Extracting Clusters from a Dendrogram At this point, we have the dendrogram but not yet clusters. Various strategies have been discussed: Visually inspect the dendrogram, choose interesting branches Stop when k clusters remain (may be necessary to ignore noise [Sch+7b]) Stop at a certain distance (e.g., maximum cluster diameter with complete-link) Change in cluster sizes or density [Cam+] Constraints satisfied (semi-supervised) [Pou+]: Certain objects are labeled as must or should not be in the same clusters.

61 Clustering Hierarchical Methods Complexity of Hierarchical Clustering Complexity analysis of AGNES: 1. Computing the distance matrix: O(n²) time and memory. 2. Finding the minimum distance / maximum similarity: O(n²) per iteration. 3. Updating the matrix: O(n) per iteration (with Lance-Williams) or O(n²) per iteration. 4. Number of iterations: i = O(n). Total: O(n³) time and O(n²) memory! Better algorithms can run in guaranteed O(n²) time [Sib73; Def77]; with priority queues we get O(n² log n) [Mur83; DE84; Epp98; Mül11], or usually about n² time with caching [And73]. Instead of agglomerative (bottom-up), we can also begin with one cluster and divide it (DIANA). But there are O(2ⁿ) possible splits, so we need heuristics (i.e., other clustering algorithms) to split. Hierarchical clustering does not scale to large data; code optimization matters [KSZ16].

62 Clustering Hierarchical Methods Benefits and Limitations of HAC Benefits: Very general: any distance / similarity (for text: cosine!). Easy to understand and interpret. Hierarchical result. Dendrogram visualization often useful (for small data). Number of clusters does not need to be fixed beforehand. Many variants. Limitations: Scalability is the main problem (in particular, the O(n²) memory). In many cases, users want a flat partitioning. Unbalanced cluster sizes (i.e., number of points). Outliers.

63 Clustering Partitioning Methods k-means Clustering The k-means problem: Divide the data into k subsets (k is a parameter). Subsets are represented by their arithmetic mean $\mu_C$. Optimize the least squared error $\text{SSQ} := \sum_C \sum_{x_i \in C} \sum_d (x_{i,d} - \mu_{C,d})^2$. Important properties: Squared errors put more weight on larger deviations. The arithmetic mean is the maximum likelihood estimator of centrality. The data is split into Voronoi cells. k-means is a good choice if we have k signals and normally distributed measurement error. Suggested reading: the history of least squares estimation (Legendre, Gauss).

64 Clustering Partitioning Methods The Sum of Squares Objective The sum-of-squares objective: $\text{SSQ} := \underbrace{\textstyle\sum_C}_{\text{every cluster}} \underbrace{\textstyle\sum_d}_{\text{every dimension}} \underbrace{\textstyle\sum_{x_i \in C}}_{\text{every point}} \underbrace{(x_{i,d} - \mu_{C,d})^2}_{\text{squared deviation from mean}}$. We can rearrange these sums because of commutativity. For every cluster C and dimension d, $\sum_{x_i \in C}(x_{i,d} - \mu_{C,d})^2$ is minimized by the arithmetic mean $\mu_{C,d} = \frac{1}{|C|} \sum_{x_i \in C} x_{i,d}$. Assigning every point $x_i$ to its least-squares closest cluster C usually reduces SSQ, too. Note: sum of squares = squared Euclidean distance: $\sum_d (x_{i,d} - \mu_{C,d})^2 = \|x_i - \mu_C\|^2 = d^2_{\text{Euclidean}}(x_i, \mu_C)$. We can therefore say that every point is assigned to the closest cluster, but we cannot use arbitrary other distance functions in k-means (because the arithmetic mean only minimizes SSQ). This is not always optimal: the change in the mean can increase the SSQ of the new cluster. But this difference is commonly ignored in algorithms and textbooks.

65 Clustering Partitioning Methods The Standard Algorithm (Lloyd's Algorithm) The standard algorithm for k-means [Ste56; For65; Llo82]: Algorithm: Lloyd-Forgy algorithm for k-means 1 Choose k points randomly as initial centers 2 repeat 3 Assign every point to the least-squares closest center 4 stop if no cluster assignment changed 5 Update the centers with the arithmetic mean. Lines 3 and 5 both improve SSQ. This is not the most efficient algorithm (despite everybody teaching this variant). ELKI [Sch+] contains faster variants (e.g., Sort-Means [Phi02]; benchmarks in [KSZ16]). There is little reason to still use this variant in practice! The name "k-means" was first used by MacQueen for a slightly different algorithm [Mac67]. k-means was invented several times, and has an interesting history [Boc07].
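For illustration only (the slides recommend the faster variants available in ELKI), here is a minimal NumPy sketch of the Lloyd-Forgy loop:

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Standard k-means: X is an (n, d) array; returns centers and assignment."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignment = None
    for _ in range(max_iter):
        # assign every point to the least-squares closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assignment = d2.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break                                  # no cluster assignment changed
        assignment = new_assignment
        for j in range(k):                         # update centers with the arithmetic mean
            members = X[assignment == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assignment
```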

66-83 Clustering Partitioning Methods Example: k-means Clustering with Lloyd's algorithm [Figure sequence: three runs with different randomly chosen initial centroids; each run alternates "reassign points" and "recompute centroids" until no assignment changes. The runs converge after two or three iterations, but to different local optima with different SSQ values.]

84 Clustering Partitioning Methods Non-determinism & non-optimality Most k-means algorithms do not guarantee to find the global optimum (that would be NP-hard, too expensive) and give different local optima, depending on the starting point. In practical use: the data is never exact or complete, and the optimum result is not necessarily the most useful. Usually, we gain little by finding the true optimum; it is usually good enough to try multiple random initializations and keep the best. (More precisely: a stationary point; the standard algorithm may even fail to find a local minimum [HW79]. "Best" here means least squares, i.e., lowest SSQ; this does not mean it will actually give the most insight.)

85 Clustering Partitioning Methods Initialization Ideally, each initial center is from a different cluster. But the chance of randomly drawing one centroid from each cluster is small: assuming (for simplification) that all clusters are the same size n, then p = (number of ways to select one centroid from each cluster) / (number of ways to select k centroids) = $\frac{k!\,n^k}{(kn)^k} = \frac{k!}{k^k}$. For example, if k = 10, then the probability is 10!/10¹⁰ = 0.00036. We can run k-means multiple times and keep the best solution, but we still have a low chance of getting one center from each cluster (for k = 10 even 100 tries only give a chance of about 0.036), so many attempts are needed. On the other hand, even if we choose suboptimal initial centroids, the algorithm still can converge to a good solution. And even with one initial centroid from each cluster, it can converge to a bad solution!

86 Clustering Partitioning Methods Initialization / Several strategies for initializing k-means exist: Initial centers given by a domain expert. Randomly assign points to partitions 1...k (not very good) [For65]. Randomly generate k centers from a uniform distribution (not very good). Randomly choose k data points as centers (uniform from the data) [For65]. First k data points [Mac67, incremental k-means]. Choose a point randomly, then always take the farthest point to get k initial points (often works quite well, but the initial centers are often outliers, and it gives similar results when run repeatedly). Weighting points by their distance [AV07, k-means++]: points are chosen randomly with $p \propto \min_c \|x_i - c\|^2$ (c ranges over all current seeds); a code sketch follows below. Run a few k-means iterations on a sample, then use the centroids from the sample result. ...
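A sketch (ours) of the k-means++ seeding rule from the list above: each further center is drawn with probability proportional to the squared distance to the nearest center chosen so far.

```python
import numpy as np

def kmeans_plusplus_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                  # first center: uniform at random
    for _ in range(1, k):
        C = np.array(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                            # p proportional to min_c ||x - c||^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```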

87 Clustering Partitioning Methods Complexity of k-means Clustering In the standard algorithm: 1. Initialization is usually cheap, O(k) (k-means++: O(N k d) [AV07]). 2. Reassignment is O(N k d) per iteration. 3. Mean computation is O(N d) per iteration. 4. The number of iterations can be as bad as $2^{\Omega(\sqrt{N})}$ [AV06] (but fortunately, usually i ≪ N). Total: O(N k d i). The worst case is superpolynomial, but in practice the method usually runs much faster. We can force a limit on the number of iterations, usually with little loss in quality. In practice, it is often the fastest clustering algorithm we use. Improved algorithms primarily reduce the number of computations for reassignment, but usually with no theoretical guarantees (so still O(N k d i) worst-case).

88-103 Clustering Partitioning Methods Cluster Changes are Increasingly Incremental Convergence on the Mouse data set: [Figure sequence: the initial centers, then one plot per iteration until convergence; the changes to the clusters become smaller and smaller with each iteration.]

104 Clustering Partitioning Methods : / Benefits and Drawbacks of k-means Benefits: Very fast algorithm (O(k d N), if we limit the number of iterations) Convenient centroid vector for every cluster (We can analyze this vector to get a topic ) Can be run multiple times to get different results Limitations: Difficult to choose the number of clusters, k Cannot be used with arbitrary distances Sensitive to scaling requires careful preprocessing Does not produce the same result every time Sensitive to outliers (squared errors emphasize outliers) Cluster sizes can be quite unbalanced (e.g., one-element outlier clusters)

105 Clustering Partitioning Methods Choosing the Optimum k for k-means A key challenge of k-means is choosing k: Trivial to prove: $\text{SSQ}_{\text{optimum},k} \geq \text{SSQ}_{\text{optimum},k+1}$. Avoid comparing SSQ for different k or different data (including normalization). $\text{SSQ}_{k=N} = 0$: a perfect solution? No: useless. SSQ over k may exhibit an "elbow" or "knee": initially it improves fast, then much slower. Use alternate criteria such as the Silhouette [Rou87], AIC [Aka77], BIC [Sch78; ZXF08]. Computing the Silhouette is O(n²): more expensive than k-means. AIC and BIC try to reduce overfitting by penalizing model complexity (= high k). More details will come in the evaluation section. Nevertheless, these measures are heuristics; other k can be better in practice! Methods such as X-means [PM00] split clusters as long as a quality criterion improves.

106-110 Clustering Partitioning Methods Example: Choosing the Optimum k Toy mouse data set: this toy data set is difficult for k-means because the clusters have very different sizes (i.e., extent). [Figures: the data set; the best SSQ (a typical curve, with a possible "knee") and the Silhouette (min/mean/max) over the number of clusters k for many initializations; and the best k-means results for the preferred values of k.] All tested measures prefer one of the same two (small) values of k.

111-113 Clustering Partitioning Methods k-means and Distances Could we use a different distance in k-means, $\text{SSQ}' := \sum_C \sum_{x_i \in C} \text{dist}(x_i, \mu_C)$? We can still assign every point to the closest center. But: is the mean the optimum cluster center? The mean is a least-squares estimator in R, i.e., it minimizes $\sum_i (x_i - \mu)^2$. That does not imply the mean minimizes other distances! Counter example (one-dimensional, with linear Euclidean distance $\sum_i |x_i - m|$, no square): for the values {0, 0, 0, 10}, the mean m = 2.5 gives an average distance of 3.75, while the median m = 0 gives only 2.5. So the mean does not minimize linear Euclidean distance, Manhattan distance, ... The k-means algorithm minimizes Bregman divergences [Ban+05]. For other distances, we need to replace the mean in k-means.

114 Clustering Partitioning Methods k-means Variations for Other Distances k-medians: instead of the mean, use the per-dimension median [BMS96]. For use with the Manhattan norm $\sum_j |x_{i,j} - m_j|$. k-modes: use the mode instead of the mean [Hua98]. For categorical data, using Hamming distance. k-prototypes: mean on continuous variables, mode on categorical [Hua98]. For mixed data, using squared Euclidean respectively Hamming distance. k-medoids: use the medoid (the element with the smallest distance sum). For arbitrary distance functions. Spherical k-means: use the mean, but normalized to unit length. For cosine distance. Gaussian Mixture Modeling: use mean and covariance. For use with Mahalanobis distance.

115 Clustering Partitioning Methods k-means for Text Clustering Cosine similarity is closely connected to squared Euclidean distance. Spherical k-means [DM01] uses: Input data normalized to have $\|x_i\| = 1$. At each iteration, the new centers are normalized to unit length: $\mu_C' := \mu_C / \|\mu_C\|$. $\mu_C'$ maximizes the average cosine similarity $\frac{1}{|C|} \sum_{x_i \in C} \langle x_i, \mu_C' \rangle$ [DM01]. Sparse nearest-centroid computations take time proportional to the number of non-zero values. The result is similar to an SVD factorization of the document-term matrix [DM01].

116 Clustering Partitioning Methods Pre-processing and Post-processing Pre-processing and post-processing are commonly used with k-means: Pre-processing: Scale / normalize continuous attributes to [0; 1] or to unit variance. Encode categorical attributes as binary attributes. Eliminate outliers. Post-processing: Eliminate clusters with few elements (probably outliers). Split "loose" clusters, i.e., clusters with relatively high SSE. Merge clusters that are close and that have relatively low SSE. These steps can also be used during the clustering process, e.g., in the ISODATA algorithm [BH65], X-means [PM00], and G-means [HE03].

117-126 Clustering Partitioning Methods Limitations of k-means k-means has problems when clusters are of differing sizes or densities, or have non-spherical shape. k-means also has problems when the data contains outliers. [Figure sequence: examples of a badly chosen k, clusters of different diameter, the effect of scaling, clusters of different densities, and non-spherical cluster shapes, each shown as the original data next to the k-means result.]

127 Clustering Partitioning Methods k-medoids Clustering This approach tries to improve over two weaknesses of k-means: k-means is sensitive to outliers (because of the squared errors), and k-means cannot be used with arbitrary distances. Idea: the medoid of a set is the object with the least distance to all others, i.e., the most central, most representative object. k-medoids objective function: the absolute-error criterion $\text{TD} = \sum_{i=1}^k \sum_{x_j \in C_i} \text{dist}(x_j, m_i)$, where $m_i$ is the medoid of cluster $C_i$. As with k-means, the k-medoids problem is NP-hard. The algorithm Partitioning Around Medoids (PAM) guesses a result, then uses iterative refinement, similar to k-means.

128 Clustering Partitioning Methods : / Partitioning Around Medoids Partitioning Around Medoids (PAM, [KR9]) choose a good initial set of medoids iteratively improve the clustering by doing the best (m i, o j ) swap, where m i is a medoid, and o j is a non-medoid. if we cannot find a swap that improves TD, the algorithm has converged. good for small data, but not very scalable

129-130 Clustering Partitioning Methods Algorithm: Partitioning Around Medoids PAM consists of two parts: PAM BUILD, to find an initial clustering: Algorithm: PAM BUILD: Find initial cluster centers 1 m_1 ← the point with the smallest distance sum TD to all other points 2 for i = 2...k do 3 m_i ← the point which reduces TD most 4 return TD, {m_1, ..., m_k}. This needs O(n²k) time, and is best implemented with O(n²) memory. We could use this to seed k-means, too, but it is usually slower than a complete k-means run.

131-132 Clustering Partitioning Methods Algorithm: Partitioning Around Medoids / PAM SWAP, to improve the clustering: Algorithm: PAM SWAP: Improve the clustering 1 repeat 2 for each medoid m_i ∈ {m_1, ..., m_k} do 3 for each non-medoid c_j ∉ {m_1, ..., m_k} do 4 TD' ← TD with c_j as medoid instead of m_i 5 remember (m_i, c_j, TD') for the best TD' 6 stop if the best TD' did not improve the clustering 7 swap (m_i, c_j) of the best TD' 8 return TD, M, C. This needs O(k (n − k)²) time for each iteration. The authors of PAM assumed that only few iterations will be needed, because the initial centers are supposed to be good already.

133-134 Clustering Partitioning Methods Algorithm: CLARA (Clustering Large Applications) [KR90] Algorithm: CLARA: Clustering Large Applications 1 TD_best, M_best, C_best ← ∞, ∅, ∅ 2 for i = 1...5 do 3 S_i ← sample of 40 + 2k objects 4 M_i ← medoids from running PAM(S_i) 5 TD_i, C_i ← compute total deviation and cluster assignment using M_i 6 if TD_i < TD_best then 7 TD_best, M_best, C_best ← TD_i, M_i, C_i 8 return TD_best, M_best, C_best. A sampling-based method: apply PAM to random samples of the whole data set; build clusterings from multiple random samples and return the best clustering as output. Runtime complexity: O(k s² + k(n − k)) per sample, with sample size s = 40 + 2k. Applicable to larger data sets, but the result is only based on a small sample: samples need to be representative, and medoids are only chosen from the best sample.

135 Clustering Partitioning Methods Algorithm: CLARANS [NH02] Algorithm: CLARANS: Clustering Large Applications based on RANdomized Search 1 TD_best, M_best, C_best ← ∞, ∅, ∅ 2 for l = 1...numlocal do // Number of times to restart 3 M_l ← k random medoids 4 TD_l, C_l ← compute total deviation and cluster assignment using M_l 5 for i = 1...maxneighbor do // Attempt to improve the medoids 6 m_j, o_k ← randomly select a medoid m_j, and a non-medoid o_k 7 TD'_l, C'_l ← total deviation when replacing m_j with o_k 8 if TD'_l < TD_l then // the swap improves the result 9 m_j, TD_l, C_l ← o_k, TD'_l, C'_l // accept the improvement 10 i ← 0 // restart the inner loop 11 if TD_l < TD_best then // keep the overall best result 12 TD_best, M_best, C_best ← TD_l, M_l, C_l 13 return TD_best, M_best, C_best

136 Clustering Partitioning Methods Algorithm: CLARANS [NH02] / Clustering Large Applications based on RANdomized Search (CLARANS) [NH02]: considers at most maxneighbor randomly selected pairs to find one improvement; replaces a medoid as soon as a better candidate is found, rather than testing all k·(n − k) alternatives for the best improvement possible; the search for k optimal medoids is repeated numlocal times (c.f. restarting multiple times with k-means). Complexity: O(numlocal · maxneighbor · swap · n). Good results only if we choose maxneighbor large enough; in practice the typical runtime of CLARANS is similar to O(n²).

137 Clustering Partitioning Methods k-medoids, Lloyd style We can adopt Lloyd's approach also for k-medoids: Algorithm: k-medoids with alternating optimization [Ach+] 1 {m_1, ..., m_k} ← choose k random initial medoids 2 repeat 3 {C_1, ..., C_k} ← assign every point to the nearest medoid's cluster 4 stop if no cluster assignment has changed 5 foreach cluster C_i do 6 m_i ← the object with the smallest distance sum to all others within C_i 7 return TD, M, C. Similar to k-means, we alternate between (1) optimizing the cluster assignment and (2) choosing the best medoids; all medoids can be updated in each iteration. Complexity: O(kn + n²) per iteration, comparable to PAM. A minimal code sketch of this alternating scheme follows below.
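A minimal sketch (ours; not the exact [Ach+] formulation) of this alternating k-medoids scheme on a precomputed distance matrix:

```python
import numpy as np

def kmedoids_alternating(D, k, max_iter=100, seed=0):
    """Lloyd-style k-medoids: D is an (n, n) distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    assignment = None
    for _ in range(max_iter):
        new_assignment = np.argmin(D[:, medoids], axis=1)   # nearest medoid per point
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break                                            # converged
        assignment = new_assignment
        for j in range(k):
            members = np.where(assignment == j)[0]
            if len(members) > 0:
                # new medoid: the member with the smallest distance sum within the cluster
                medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
    return medoids, assignment
```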

138 Clustering Partitioning Methods From k-means to Gaussian EM Clustering k-means cannot handle clusters with different radius well. [Figure: toy mouse data set and the best k-means results for two values of k.] Could we estimate mean and radius? Model the data with multivariate Gaussian distributions.

139-141 Clustering Model Optimization Methods Expectation-Maximization Clustering EM (Expectation-Maximization) is the underlying principle in Lloyd's k-means: 1. Choose initial model parameters θ. 2. Expect the latent variables (e.g., the cluster assignment) from θ and the data. 3. Update θ to maximize the likelihood of observing the data. 4. Repeat (2)-(3) until a stopping condition holds.
Recall Lloyd's k-means: 1. Choose k centers randomly (θ: random centers). 2. Expect the cluster labels by choosing the nearest center as label. 3. Update the cluster centers with the maximum-likelihood estimation of centrality. 4. Repeat (2)-(3) until the change is 0. Objective: optimize SSQ.
Gaussian Mixture Modeling (GMM) [DLR77]: 1. Choose k centers randomly, unit covariance, and uniform weight: θ = (µ_1, Σ_1, w_1, µ_2, Σ_2, w_2, ..., µ_k, Σ_k, w_k). 2. Expect the cluster labels based on the Gaussian distribution density. 3. Update the Gaussians with the weighted mean and covariance matrix. 4. Repeat (2)-(3) until the change < ε. Objective: optimize the log-likelihood $\log L(\theta) := \sum_x \log P(x \mid \theta)$.

142 Clustering Model Optimization Methods Gaussian Mixture Modeling & EM The multivariate Gaussian density with center $\mu_i$ and covariance matrix $\Sigma_i$ is: $P(x \mid C_i) = \text{pdf}(x, \mu_i, \Sigma_i) := \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left(-\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)$. For the Expectation Step, we use Bayes' rule: $P(C_i \mid x) = \frac{\text{pdf}(x, \mu_i, \Sigma_i) P(C_i)}{P(x)}$; $P(C_i \mid x)$ is the relative responsibility of $C_i$ for the point x, proportional to $w_i \cdot \text{pdf}_i$. Using $P(C_i) = w_i$ and the law of total probability: $P(C_i \mid x) = \frac{w_i\, \text{pdf}(x, \mu_i, \Sigma_i)}{\sum_j w_j\, \text{pdf}(x, \mu_j, \Sigma_j)}$. For the Maximization Step, use the weighted mean and the weighted covariance to recompute the cluster model (i = cluster; j, k = dimensions): $\mu_{i,j} = \frac{\sum_x P(C_i \mid x)\, x_j}{\sum_x P(C_i \mid x)}$ (weighted mean), $\Sigma_{i,j,k} = \frac{\sum_x P(C_i \mid x)(x_j - \mu_{i,j})(x_k - \mu_{i,k})}{\sum_x P(C_i \mid x)}$ (weighted covariance), $w_i = P(C_i) = \frac{1}{|X|} \sum_x P(C_i \mid x)$.
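To make the two steps concrete, here is a NumPy sketch (ours) of one EM iteration for a Gaussian mixture; a real implementation would add log-space computation and covariance regularization, which are omitted here.

```python
import numpy as np

def gaussian_pdf(X, mu, cov):
    """Multivariate normal density, evaluated for every row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))

def em_step(X, weights, means, covs):
    """One E- and M-step of Gaussian mixture EM (sketch, no numerical safeguards)."""
    k = len(weights)
    # E-step: responsibilities P(C_i | x), proportional to w_i * pdf_i(x)
    resp = np.column_stack([weights[i] * gaussian_pdf(X, means[i], covs[i]) for i in range(k)])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted means, weighted covariances, and new cluster weights
    nk = resp.sum(axis=0)
    new_means = (resp.T @ X) / nk[:, None]
    new_covs = []
    for i in range(k):
        diff = X - new_means[i]
        new_covs.append((resp[:, i, None] * diff).T @ diff / nk[i])
    return nk / len(X), new_means, new_covs
```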

143 Clustering Model Optimization Methods Fitting Multiple Gaussian Distributions Probability density function of a multivariate Gaussian: $\text{pdf}(x, \mu, \Sigma) := \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$. If we constrain Σ, we can control the cluster shape; we always want a symmetric and positive semi-definite Σ: Full covariance matrix: rotated ellipsoid (A). Diagonal Σ ("variance matrix"): axis-parallel ellipsoid (B). Σ = scaled unit matrix: spherical (C). Either the same Σ for all clusters, or a different $\Sigma_i$ for each $\mu_i$. [Figure: example clusters A, B, C with the three covariance models.]

144-150 Clustering Model Optimization Methods Understanding the Gaussian Density Multivariate normal distribution: $\text{pdf}(x, \mu, \Sigma) := \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$. 1-dimensional normal distribution: $\text{pdf}(x, \mu, \sigma) := \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$. Both consist of a normalization (to a total volume of 1) and the squared deviation from the center. Compare this to the (squared) Mahalanobis distance: $d^2_{\text{Mahalanobis}}(x, \mu, \Sigma) := (x - \mu)^T \Sigma^{-1} (x - \mu)$. The inverse covariance matrix $\Sigma^{-1}$ plays the central role here; the remainder is squared Euclidean distance!

151 Clustering Model Optimization Methods : 66 / Decomposing the Inverse Covariance Matrix Σ^{−1}
Covariance matrices are symmetric, non-negative on the diagonal, and can be inverted.
(This may need a robust numerical implementation; ignore constant attributes.)
We can use the eigendecomposition to factorize Σ into:
    Σ = V Λ V^T    and, therefore,    Σ^{−1} = V Λ^{−1} V^T
where V is orthonormal and contains the eigenvectors (= rotation)
and Λ is diagonal and contains the eigenvalues (= squared scaling).
(You may want to refresh your linear algebra knowledge.)

152 Clustering Model Optimization Methods : 67 / Decomposing the Inverse Covariance Matrix Σ^{−1} /
Based on this decomposition Σ = V Λ V^T, let λ_i be the diagonal entries of Λ.
Build the diagonal matrix Ω using ω_i = sqrt(λ_i^{−1}) = 1/sqrt(λ_i). Then Ω^T = Ω (because it is diagonal) and Λ^{−1} = Ω^T Ω.
    Σ^{−1} = V Λ^{−1} V^T = V Ω^T Ω V^T = (Ω V^T)^T Ω V^T
    d_Mahalanobis(x, µ, Σ)² = (x − µ)^T Σ^{−1} (x − µ)
                            = (x − µ)^T (Ω V^T)^T Ω V^T (x − µ)
                            = ⟨Ω V^T (x − µ), Ω V^T (x − µ)⟩
                            = ‖Ω V^T (x − µ)‖²
Mahalanobis distance = Euclidean distance in a rotated, scaled space:
    V^T: rotation that rotates the eigenvectors onto the coordinate axes
    Ω: scales the coordinates such that the eigenvalues become unit
After this transformation, the new covariance matrix is the unit matrix!

157 Clustering Model Optimization Methods : 68 / Decomposing the Inverse Covariance Matrix Σ^{−1} /
Example (figure): projecting the data with Ω V^T, i.e., computing Ω V^T (X − µ).
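
As a small illustration of this decomposition, the following NumPy sketch (my own example, not from the slides) checks that the Mahalanobis distance equals the Euclidean distance after projecting with Ω V^T:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[1.0, 2.0], cov=[[3.0, 1.2], [1.2, 1.0]], size=500)

    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)

    # Eigendecomposition Sigma = V diag(lam) V^T (eigh, since Sigma is symmetric)
    lam, V = np.linalg.eigh(Sigma)
    Omega = np.diag(1.0 / np.sqrt(lam))           # omega_i = 1 / sqrt(lambda_i)
    W = Omega @ V.T                               # the transform Omega V^T

    x = X[0]
    d2_mahalanobis = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
    d2_projected = np.sum((W @ (x - mu)) ** 2)    # squared Euclidean distance after projection
    print(np.isclose(d2_mahalanobis, d2_projected))   # True (up to floating-point error)

    # After the transformation, the covariance of the projected data is (approximately) the unit matrix
    print(np.round(np.cov((X - mu) @ W.T, rowvar=False), 2))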

158 Clustering Model Optimization Methods : 69 / Algorithm: EM Clustering
The generic EM clustering algorithm is:
Algorithm: EM Clustering
1   θ ← choose initial model
2   log L ← log-likelihood of the initial model
3   repeat
4       P(C_i | x) ← expect probabilities for all x           // Expectation Step
5       θ ← maximize new model parameters                      // Maximization Step
6       log L_previous ← log L
7       log L ← log-likelihood of the new model θ              // Model Evaluation
8   stop if log L − log L_previous < ε
The log-likelihood log L := Σ_x log P(x | θ) = Σ_x log (Σ_i w_i P(x | C_i)) is improved by both the E-step and the M-step.
For Gaussian Mixture Modeling, θ = (µ_1, Σ_1, w_1, µ_2, Σ_2, w_2, ..., µ_k, Σ_k, w_k).
In contrast to k-means, we need to use a stopping threshold, because the change will not become 0.
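
In practice, a library implementation realizes exactly this loop; a minimal usage sketch with scikit-learn's GaussianMixture (the toy data is my own example; the tol parameter corresponds to the stopping threshold ε on the per-iteration likelihood improvement):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    # Three Gaussian blobs as toy data
    X = np.vstack([
        rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
        rng.normal(loc=(4, 0), scale=0.8, size=(100, 2)),
        rng.normal(loc=(2, 3), scale=0.3, size=(100, 2)),
    ])

    # tol is the convergence threshold on the likelihood improvement (the epsilon above)
    gmm = GaussianMixture(n_components=3, covariance_type="full", tol=1e-4,
                          max_iter=200, random_state=0)
    labels = gmm.fit_predict(X)          # hard assignment: argmax of the responsibilities
    resp = gmm.predict_proba(X)          # soft assignment: P(C_i | x)
    print(gmm.means_, gmm.weights_, gmm.converged_, gmm.n_iter_)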

159 Clustering Model Optimization Methods : 7 / Gaussian Mixture Modeling Example
Clustering the mouse data set with k = 3 (figures): the slides show the fitted mixture model after each EM iteration, over roughly 20 iterations until convergence.

178 Clustering Model Optimization Methods : 7 / GMM with the EM Algorithm: Discussion
Complexity of Gaussian Mixture Modeling: about O(n·k·d² + k·d³) for each iteration (with a simpler diagonal Σ model: O(n·k·d))
    in general, the number of iterations is quite high
    numerical issues require a careful implementation
    for covariance estimation, we need a lot of data; it should be n > k·d
As is the case for k-means and k-medoids, result and runtime depend very much on
    the initial choice of the model parameters θ
    in particular, a good choice of the parameter k
    the data needs to contain Gaussians
Modification for a partitioning of the data points into k disjoint clusters: assign every point only to the cluster to which it belongs with the highest probability.

179 Clustering Model Optimization Methods : 7 / Clustering Other Data with EM
We cannot use Gaussian EM on text:
    text is not Gaussian distributed; text is discrete and sparse, Gaussians are continuous
    covariance matrices have O(d²) entries: memory requirements (text has a very high dimensionality d)
    data requirements (to reliably estimate the parameters, we need very many data points)
    matrix inversion is even O(d³)
But the general EM principle can be used with other distributions. For example, we can use a mixture of Bernoulli or multinomial distributions.
    PLSI/PLSA uses multinomial distributions [Hof99]
    clickstreams can be modeled with a mixture of Markov models [Cad+; YH]
    fuzzy c-means is a soft k-means (but without a parametric model) [Dun7; Bez8]
    many more

180 Clustering Model Optimization Methods : 7 / Limitations of Gaussian Mixture Modeling
Gaussian Mixture Modeling works much better than k-means if the data consists of Gaussians, but it otherwise shares many of its limitations.
The example figures illustrate typical failure cases: a badly chosen k, clusters of different diameter, unscaled attributes, clusters of different densities, and non-Gaussian cluster shapes.

191 Clustering Density-Based Methods : 7 / Density-based Clustering: Core Idea
Density-connected regions form clusters (illustrated step by step in the figures).

196 Clustering Density-Based Methods : 7 / Density-based Clustering: Foundations
Idea:
    clusters are regions with high object density
    clusters are separated by regions with lower density
    clusters can be of arbitrary shape (not just spherical)
Core assumptions for density-based clusters:
    objects of the cluster have high density (e.g., above a specific threshold)
    a cluster consists of a connected region of high density

197 Clustering Density-Based Methods : 76 / Density-based Clustering: Foundations /
Dense objects ("core objects"):
    the parameters ε > 0 and minPts specify a density threshold
    an object p ∈ O (O being the set of objects) is a core object if |N_ε(p)| ≥ minPts,
    where N_ε(p) = {q ∈ O | dist(p, q) ≤ ε}
Direct density reachability:
    an object q ∈ O is direct-density-reachable from p (denoted q ← p or p → q)
    if and only if p is a core point and dist(p, q) ≤ ε
Example (figure): n is not core, c is a core point, b is not core; together, the arrows form the direct density reachability graph.
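
These definitions translate almost directly into code; a small NumPy sketch (my own illustration, brute-force O(n²) neighborhoods) that computes the core points and the direct-density-reachability relation:

    import numpy as np

    def core_points_and_reachability(X, eps, min_pts):
        """Return a boolean core-point mask and the direct-density-reachability relation.

        reach[p, q] is True iff q is direct-density-reachable from p (p core, dist(p, q) <= eps)."""
        # Pairwise Euclidean distances (brute force, O(n^2))
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        neighbors = dist <= eps                      # N_eps(p), includes the point itself
        is_core = neighbors.sum(axis=1) >= min_pts   # |N_eps(p)| >= minPts
        reach = is_core[:, None] & neighbors         # q <- p requires p core and dist <= eps
        return is_core, reach

    rng = np.random.default_rng(0)
    X = rng.random((50, 2))
    is_core, reach = core_points_and_reachability(X, eps=0.15, min_pts=4)
    print(is_core.sum(), "core points")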

204 Clustering Density-Based Methods : 77 / Density-reachability and Density-connectivity
Based on direct-density-reachability for a given ε and minPts, we define:
Density-reachability: q is density-reachable from p (q ⇐ p or p ⇒ q) if and only if there exists a chain of direct-density-reachable objects p = o_1 → o_2 → ... → o_k = q (density-reachability is the transitive closure of direct density reachability).
Density-connectivity: p and q are density-connected if and only if there exists an object r such that both p and q are density-reachable from r.
We use the following notation:
    p → q        q is direct-density-reachable from p
    p ⇒ q        q is density-reachable from p
    p ↔ q        p and q are mutually direct-density-reachable
    p ⇔ q        p and q are mutually density-reachable
    p ⇐ r ⇒ q    p and q are density-connected (via r); p and q can both be border points, all others must be core

205 Clustering Density-Based Methods : 78 / Density-reachability
Density-based cluster: a non-empty subset C ⊆ O is a density-based cluster, given ε and minPts, if:
(1) Maximality: ∀ p, q ∈ O: if p ∈ C and q is density-reachable from p, then q ∈ C
        formally: ∀ p, q ∈ O : p ∈ C ∧ p ⇒ q  implies  q ∈ C
(2) Connectivity: ∀ p, q ∈ C: p and q are density-connected (w.r.t. ε and minPts)
(The accompanying figures illustrate both conditions on the example points a, b, c, n.)

209 Clustering Density-Based Methods : 79 / Clustering Approach
Border and noise points: a point b that is not a core point, but that is density-reachable, is called a border point. A point n that is not density-reachable is called a noise point.
    Type(p) = Core     if |N(p, ε)| ≥ minPts
               Border  if |N(p, ε)| < minPts, but p is density-reachable from some core point q
               Noise   if |N(p, ε)| < minPts and p is not density-reachable from any core point q
Density-based clustering: a density-based clustering C of a set of objects O, given parameters ε and minPts, is a complete set of density-based clusters.
Noise_C is the set of all objects in O that do not belong to any density-based cluster C_i ∈ C, i.e., Noise_C = O \ (C_1 ∪ ... ∪ C_k).

210 Clustering Density-Based Methods : 8 / Abstract DBSCAN Algorithm
Density-Based Clustering of Applications with Noise:
Algorithm: Abstract DBSCAN algorithm [Sch+7a]
1   Compute the neighbors of each point and identify core points   // Identify core points
2   Join neighboring core points into clusters                      // Assign core points
3   foreach non-core point do
4       Add to a neighboring core point if possible                 // Assign border points
5       Otherwise, add to noise                                     // Assign noise points
This abstract algorithm finds the connected components of core points under mutual direct reachability p ↔ q (lines 1 + 2), plus their border-point neighbors; the remaining points are noise.
The original DBSCAN algorithm [Est+96] is a database-oriented, iterative algorithm to solve this. The successor algorithm HDBSCAN* [CMS] only uses core points (p ↔ q); all non-core points are noise.

211 Clustering Density-Based Methods : 8 / DBSCAN Algorithm
DBSCAN (KDD'96) high-level view:
1. begin with any yet unlabeled core point p (if it is not core, label it as noise and repeat)
2. begin a new cluster C_i, and continue with its neighbors N = N(p, ε)
3. for each neighbor n ∈ N:
   3.1 if n is already labeled by some cluster, continue with the next n
   3.2 label n as cluster C_i
   3.3 if n is core, N ← N ∪ N(n, ε)
4. when we have processed all (transitive) neighbors N, the cluster is complete
5. continue until there is no unlabeled point
Since the initial point is core, for all its neighbors n we have p → n. For each n that is core, and each of its neighbors n′, we have n → n′, and therefore p ⇒ n′.
It is easy to see that this constructs density-connected clusters according to the definition.
Non-core points can be noise, or can be reachable from more than one cluster. In this case we only keep one label, although we could allow multi-assignment.

212 Clustering Density-Based Methods : 8 / DBSCAN Algorithm /
Algorithm: Pseudocode of original sequential DBSCAN algorithm [Est+96; Sch+7a]
Data: label: point labels, initially undefined
1   foreach point p in database DB do                      // Iterate over every point
2       if label(p) ≠ undefined then continue              // Skip processed points
3       Neighbors N ← RangeQuery(DB, dist, p, ε)           // Find initial neighbors
4       if |N| < minPts then                               // Non-core points are noise
5           label(p) ← Noise
6           continue
7       c ← next cluster label                             // Start a new cluster
8       label(p) ← c
9       Seed set S ← N \ {p}                               // Expand neighborhood
10      foreach q in S do
11          if label(q) = Noise then label(q) ← c
12          if label(q) ≠ undefined then continue
13          Neighbors N ← RangeQuery(DB, dist, q, ε)
14          label(q) ← c
15          if |N| < minPts then continue                  // Core-point check
16          S ← S ∪ N
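
A direct Python translation of this pseudocode (my own sketch; it uses a brute-force linear scan as RangeQuery with Euclidean distance, so it runs in O(n²)):

    import numpy as np

    UNDEFINED, NOISE = -2, -1

    def range_query(X, p, eps):
        """Brute-force RangeQuery: indices of all points within distance eps of X[p]."""
        return np.flatnonzero(np.linalg.norm(X - X[p], axis=1) <= eps)

    def dbscan(X, eps, min_pts):
        label = np.full(len(X), UNDEFINED)
        c = -1                                   # current cluster label
        for p in range(len(X)):                  # iterate over every point
            if label[p] != UNDEFINED:
                continue                         # skip processed points
            neighbors = range_query(X, p, eps)   # find initial neighbors
            if len(neighbors) < min_pts:         # non-core points are (preliminary) noise
                label[p] = NOISE
                continue
            c += 1                               # start a new cluster
            label[p] = c
            seeds = list(set(neighbors) - {p})   # expand the neighborhood
            i = 0
            while i < len(seeds):
                q = seeds[i]; i += 1
                if label[q] == NOISE:
                    label[q] = c                 # noise becomes a border point
                if label[q] != UNDEFINED:
                    continue
                label[q] = c
                n2 = range_query(X, q, eps)
                if len(n2) >= min_pts:           # core-point check
                    seeds.extend(set(n2) - set(seeds))
        return label

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.2, (100, 2)), rng.normal(2, 0.2, (100, 2))])
    print(np.unique(dbscan(X, eps=0.3, min_pts=5), return_counts=True))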

213 Clustering Density-Based Methods : 8 / DBSCAN Algorithm /
Properties of the algorithm:
    on core points and noise points, the result is exactly according to the definitions
    border points are only in one cluster, from which they are reachable
    (multi-assignment could trivially be allowed, but is not desirable in practice!)
    complexity: O(n · T), with T the complexity of finding the ε-neighbors
    in many applications T ≈ log n with indexing, but this cannot be guaranteed;
    typical indexed behavior then is n log n, but the worst case is O(n²) distance computations
Challenges:
    how to choose ε and minPts?
    which distance function to use?
    which index for acceleration?
    how to evaluate the result?

214 Clustering Density-Based Methods : 8 / DBSCAN example
Step-by-step run (figures) on the example points A-T with ε = .7 and minPts as shown: the seed list on the slides first grows to A, B, C, D, then (for the second cluster) to F, G, H, I, J, and finally to M, N, O, P, Q, R, S, T, as three clusters are expanded one after another.

256 Clustering Density-Based Methods : 8 / Choosing DBSCAN parameters
Choosing minPts: usually the easier choice
    heuristic: based on the data dimensionality [Est+96]
    choose larger values for large and noisy data sets [Sch+7a]
Choosing ε:
    clusters too large: reduce the radius ε [Sch+7a]
    too much noise: increase the radius ε [Sch+7a]
    first guess: based on the distances to the (minPts − 1)-nearest neighbor [Est+96]
Detecting bad parameters: [Sch+7a]
    ε range query results are too large: slow execution
    almost all points end up in the same cluster (usually, the largest cluster should only contain a modest fraction of the data)
    too much noise (the acceptable noise fraction depends on the application)

257 Clustering Density-Based Methods : 86 / Choosing DBSCAN parameters /
Example (figures): for the chosen minPts, plot the (minPts − 1)-nearest-neighbor distance of every data point.
Sort the distances and look for a "knee", or choose a small quantile; the distance at the first knee is a suggested value of ε for this minPts.
With an index, computing these distances takes about n log n time, O(n²) without. On large data, we can query only a subset to make this faster.
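
A minimal NumPy/matplotlib sketch of this heuristic (my own illustration; brute-force distances, so only suitable for small data):

    import numpy as np
    import matplotlib.pyplot as plt

    def sorted_kdist(X, k):
        """Sorted distances to the k-th nearest neighbor (excluding the point itself)."""
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # O(n^2) pairwise distances
        kdist = np.sort(dist, axis=1)[:, k]    # column 0 is the point itself (distance 0)
        return np.sort(kdist)[::-1]            # descending order, as in the usual k-dist plot

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (200, 2)),
                   rng.normal(3, 0.3, (200, 2)),
                   rng.uniform(-2, 5, (40, 2))])

    min_pts = 5
    plt.plot(sorted_kdist(X, min_pts - 1))     # (minPts - 1)-nearest-neighbor distances
    plt.xlabel("points sorted by k-NN distance")
    plt.ylabel(f"{min_pts - 1}-NN distance")
    plt.show()                                 # choose eps near the "knee" of this curve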

270 Clustering Density-Based Methods : 87 / Choosing DBSCAN parameters /
The figures show this heuristic on the mouse data set and on the nested clusters data set (k-NN distance plots with candidate "knees" and the resulting ε).
Caveats: this is only a heuristic, and there are no "correct" parameters (in particular with hierarchies); on Gaussian distributions with different densities we often get a smooth curve; sometimes we see multiple knees, sometimes none.

274 Clustering Density-Based Methods : 88 / Generalized Density-based Clustering [San+98]
Motivation: traditional clustering approaches are designed for point coordinate objects; apply clustering to other types of data:
    geospatial data such as polygons
    data with a notion of similarity, but not a distance
    graph and network data
    pixel data
    ...
Generalize the notion of density in DBSCAN:
    ε threshold dist(p, q) ≤ ε                    →  neighbor predicate NPred(p, q)
    ε range query N(p) := {q | dist(p, q) ≤ ε}    →  neighborhood N(p) := {q | NPred(p, q)}
    minPts threshold |N(o)| ≥ minPts              →  core predicate IsCore(N(o))
Example neighbor predicates: polygon overlap, pixel adjacency, similarity, ...
Example core predicates: count, total polygon area, homogeneity, ...
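
To illustrate the generalization, here is a sketch (my own, with a hypothetical interface) where the neighborhood and the core predicate are passed in as functions; plugging in an ε-range query and a minPts count recovers plain DBSCAN:

    import math

    def generalized_core_objects(objects, neighborhood, is_core):
        """Return the set of 'core' objects under user-supplied predicates.

        neighborhood(o) -> iterable of neighbors of o (generalizes the eps-range query)
        is_core(neigh)  -> bool                        (generalizes |N(o)| >= minPts)"""
        return {o for o in objects if is_core(list(neighborhood(o)))}

    # Plain DBSCAN predicates as a special case (Euclidean distance on 2-d tuples):
    points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
    eps, min_pts = 0.5, 3
    neighborhood = lambda p: [q for q in points if math.dist(p, q) <= eps]
    is_core = lambda neigh: len(neigh) >= min_pts
    print(generalized_core_objects(points, neighborhood, is_core))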

275 Clustering Density-Based Methods : 89 / Accelerated DBSCAN Variations
Methods for finding the same result, but faster:
Grid-based Accelerated DBSCAN [MM8]
    partition the data into grid cells, overlapping by ε
    run DBSCAN on each cell
    merge the results into a single clustering
    for Minkowski L_p distances
Anytime Density-Based Clustering (AnyDBC) [MAS6]
    avoids doing an ε-range-query for every point
    cover the data with a non-transitive set of primitive clusters
    infer the core property also by counting how often a point is found
    query only points that may cause primitive clusters to merge
    some points will remain "maybe-core" (border or core)

276 Clustering Density-Based Methods : 9 / Improved DBSCAN Variations
Methods that aim at improving the results or usability of DBSCAN:
Locally Scaled Density Based Clustering (LSDBC) [BY7]
    estimate the density of each point
    always continue DBSCAN with the most dense unlabeled point
    stop expanding when the density drops below α times the density of the first point
    can find clusters of different densities
Hierarchical DBSCAN* (HDBSCAN*) [CMS]
    clusters consist only of core points, no border points (nicer theory; hence the star * in the name)
    perform a hierarchical agglomerative clustering based on density reachability
    runtime O(n²), adds the challenge of extracting clusters from the dendrogram
    inspired by OPTICS / continuation of OPTICS (next slides...)
Many, many more... in 2017, people still publish new DBSCAN variations.

277 Clustering Density-Based Methods : 9 / Density-based Hierarchical Clustering
The k-nearest-neighbor distance k-dist(p) of a point p is defined as the unique distance with:
    (i) for at least k objects o ∈ D, o ≠ p: dist(p, o) ≤ k-dist(p)
    (ii) for at most k − 1 objects o ∈ D, o ≠ p: dist(p, o) < k-dist(p)
The k-nearest neighbors kNN(p) of a point p are defined as: kNN(p) = {o ∈ D | o ≠ p ∧ dist(o, p) ≤ k-dist(p)}
Note: a point p is a DBSCAN core point if and only if its (minPts − 1)-dist(p) ≤ ε.⁶
Note: a point p is a DBSCAN core point for all ε ≥ (minPts − 1)-dist(p).
Monotonicity: any DBSCAN cluster at ε is a subset of some DBSCAN cluster at ε′ > ε.
Idea: process points based on their (minPts − 1)-dist. Can we get DBSCAN results for all ε at once?
⁶ (minPts − 1)-dist is k-dist with k = minPts − 1.

281 Clustering Density-Based Methods : 9 / Density-based Hierarchical Clustering /
For performance reasons, OPTICS uses ε-range queries to find neighbors (discussed later).
Core-distance of an object p:
    CoreDist_{ε,minPts}(p) = (minPts − 1)-dist(p)   if |N_ε(p)| ≥ minPts
                              ∞ (undefined)          otherwise
with (minPts − 1)-dist(p) being the smallest distance for which p is a core object.
Reachability-distance⁷ of object p from object o:
    ReachDist_{ε,minPts}(p ← o) = max{ CoreDist_{ε,minPts}(o), dist(o, p) }
(Figures: the ε-query radius, CoreDist(o), and the reachability distances ReachDist(q ← o) and ReachDist(p ← o) are illustrated on example points.)
⁷ This is not a distance by our definitions, because it is not symmetric! We use p ← o to emphasize this.
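
These two quantities are easy to compute directly; a small NumPy sketch (my own illustration, brute-force distances):

    import numpy as np

    def core_dist(X, o, eps, min_pts):
        """CoreDist_{eps,minPts}(o): the (minPts-1)-NN distance of X[o], or inf if o is not core."""
        d = np.sort(np.linalg.norm(X - X[o], axis=1))   # d[0] == 0 (the point itself)
        if np.sum(d <= eps) < min_pts:                  # |N_eps(o)| < minPts (counting o itself)
            return np.inf
        return d[min_pts - 1]                           # the (minPts-1)-nearest-neighbor distance

    def reach_dist(X, p, o, eps, min_pts):
        """ReachDist_{eps,minPts}(p <- o) = max(CoreDist(o), dist(o, p))."""
        return max(core_dist(X, o, eps, min_pts), np.linalg.norm(X[p] - X[o]))

    rng = np.random.default_rng(0)
    X = rng.random((100, 2))
    print(core_dist(X, 0, eps=0.2, min_pts=5), reach_dist(X, 1, 0, eps=0.2, min_pts=5))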

287 Clustering Density-Based Methods : 9 / OPTICS Clustering [Ank+99]
OPTICS high-level view: process points in a priority queue, ordered by lowest reachability.
1. get the minimum-reachability point p from the priority queue
   (if the queue is empty, choose any unprocessed point, with its reachability set to ∞)
2. query the neighbors N_ε(p) and determine CoreDist_{ε,minPts}(p)
3. add p to the output, with its current minimum reachability and its core distance
4. mark p as processed
5. add each unprocessed neighbor n to the queue, using ReachDist_{ε,minPts}(n ← p)
   (but only improving the reachability if the point is already in the queue)
6. repeat until all points are processed
Note: the reachability of a processed point is never updated.
Complexity: assuming queries run in log n, we get n log n. Worst case: O(n²).

289 Clustering Density-Based Methods : 9 / Cluster Order
OPTICS does not directly produce a (hierarchical) clustering, but a cluster ordering.
The cluster order with respect to ε and minPts
    begins with an arbitrary object
    the next object has the minimum reachability distance among all unprocessed objects discovered so far
Reachability plot (figure): the points A-H in cluster order on the x-axis, their reachability as bar height.
The core distance can be plotted similarly, and can be used to identify noise.

290 Clustering Density-Based Methods : 9 / OPTICS Algorithm [Ank+99]
Algorithm: OPTICS Clustering Algorithm [Ank+99]
1   SeedList ← empty priority queue
2   ClusterOrder ← empty list
3   repeat
4       if SeedList.empty() then
5           if no unprocessed points then stop
6           (r, o) ← (∞, next unprocessed point)                   // Get an unprocessed point
7       else (r, o) ← SeedList.deleteMin()                          // Next point from the seed list
8       N ← RangeQuery(o, dist, ε)                                  // Get the neighborhood
9       ClusterOrder.add( (r, o, CoreDist(o, N)) )                  // Add the point to the cluster order
10      Mark o as processed
11      if o is core then foreach n ∈ N do                          // Explore the neighborhood
12          if n is processed then continue
13          p ← ReachDist(n ← o)                                    // Compute the reachability
14          if n ∈ SeedList then SeedList.decreaseKey( (p, n) )     // Update the priority queue
15          else SeedList.insert( (p, n) )
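
A compact Python sketch of this algorithm (my own illustration; it uses brute-force ε-range queries and Python's heapq with lazy updates instead of a real decrease-key, so stale queue entries are simply skipped):

    import heapq
    import numpy as np

    def optics(X, eps, min_pts):
        """Return the cluster order as a list of (reachability, point index, core distance)."""
        n = len(X)
        processed = np.zeros(n, dtype=bool)
        reach = np.full(n, np.inf)
        order = []

        def neighbors_and_coredist(o):
            d = np.linalg.norm(X - X[o], axis=1)
            nbrs = np.flatnonzero(d <= eps)
            core = np.sort(d)[min_pts - 1] if len(nbrs) >= min_pts else np.inf
            return nbrs, d, core

        for start in range(n):
            if processed[start]:
                continue
            heap = [(np.inf, start)]                  # seed list, ordered by reachability
            while heap:
                r, o = heapq.heappop(heap)
                if processed[o]:
                    continue                          # stale entry (lazy decrease-key)
                nbrs, d, core = neighbors_and_coredist(o)
                order.append((r, o, core))            # add to the cluster order
                processed[o] = True
                if np.isfinite(core):                 # only core points expand the seed list
                    for q in nbrs:
                        if processed[q]:
                            continue
                        new_r = max(core, d[q])       # ReachDist(q <- o)
                        if new_r < reach[q]:
                            reach[q] = new_r
                            heapq.heappush(heap, (new_r, q))
        return order

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.2, (100, 2)), rng.normal(2, 0.2, (100, 2))])
    print(optics(X, eps=1.0, min_pts=5)[:5])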

291 Clustering Density-Based Methods : 96 / Priority Queues
A priority queue is a data structure storing objects associated with a priority, designed for efficiently finding the element with the lowest priority.
Several operations can be defined on priority queues; we need the following:
    DeleteMin: remove and return the smallest element by priority
    Insert: add an element to the heap
    DecreaseKey: decrease the key of an object on the heap
Trivial implementation: unsorted list and linear scanning (Insert in O(1), all other operations in O(n)).
Better implementation: binary heap (figure); refresh your HeapSort knowledge.

292 Clustering Density-Based Methods : 97 / Priority Queues /
DeleteMin: remove the first element and replace it by the last, use Heapify-Down to repair, and return the removed element.
Insert: append the new element, use Heapify-Up to repair the heap.
DecreaseKey: update the existing element, use Heapify-Up to repair the heap.
Algorithm: Heapify-Down(array, i, x)
1   while 2i + 1 < len(array) do
2       c ← 2i + 1                                                          // Left child index
3       if c + 1 < len(array) and array[c + 1] ≤ array[c] then c ← c + 1    // Right child index
4       if x ≤ array[c] then break
5       array[i] ← array[c]                                                 // Pull child up
6       i ← c                                                               // Move down
7   array[i] ← x
Algorithm: Heapify-Up(array, i, x)
1   while i > 0 do
2       j ← ⌊(i − 1)/2⌋                                                     // Parent position
3       if array[j] ≤ x then break
4       array[i] ← array[j]                                                 // Pull parent down
5       i ← j                                                               // Move upward
6   array[i] ← x
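
Python's standard heapq module provides a binary min-heap but no DecreaseKey; a common workaround for the OPTICS seed list (used in the sketch above, shown here in isolation as my own example) is lazy deletion: push a new, smaller entry and skip stale entries when popping.

    import heapq

    heap = []
    best = {}                                   # current best (smallest) priority per object

    def insert_or_decrease(obj, priority):
        if priority < best.get(obj, float("inf")):
            best[obj] = priority
            heapq.heappush(heap, (priority, obj))    # may leave a stale, larger entry behind

    def delete_min():
        while heap:
            priority, obj = heapq.heappop(heap)
            if best.get(obj) == priority:       # skip stale entries
                del best[obj]
                return priority, obj
        raise IndexError("empty priority queue")

    insert_or_decrease("q", 3.0)
    insert_or_decrease("q", 1.5)                # "decrease-key" for q
    insert_or_decrease("p", 2.0)
    print(delete_min())                         # (1.5, 'q')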

293 Clustering Density-Based Methods : 98 / OPTICS Example
Example run (figures) on the points A-T with Manhattan distance and the ε, minPts values given on the slide: the seed list is shown at every step, and the reachability plot is built up point by point. The resulting cluster order is L, K, M, P, O, R, Q, S, N, T, I, J, G, H, F, E, B, D, A, C.

347 Clustering Density-Based Methods : 99 / OPTICS Reachability Plots
Reachability plots for example data sets (figures): on the mouse data set, for instance, the left ear and the right ear show up as separate valleys in the plot.

351 Clustering Density-Based Methods : / Extracting Clusters from OPTICS Reachability Plots
Horizontal cut: merge points with reachability ≤ ε′ to their predecessor
    the result is like DBSCAN with ε = the height of the cut (but only needs O(n) to extract)
ξ method for hierarchical clusters: [Ank+99]
    identify "steep" points, where the reachability changes by a factor of ξ
    merge neighboring steep points into a steep area
    find matching pairs of steep-down and steep-up areas
    in particular, the end of a cluster needs to be refined carefully
    clusters must be at least minPts points apart
    clusters can be nested in longer intervals, yielding a hierarchical clustering
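
The horizontal cut is simple enough to show in a few lines; a sketch (my own illustration) that turns a cluster order, e.g. the output of the optics sketch above, into DBSCAN-like labels for a cut height eps_cut:

    def extract_horizontal_cut(cluster_order, eps_cut):
        """Assign cluster labels from an OPTICS cluster order via a horizontal cut.

        cluster_order: list of (reachability, point, core_distance) in OPTICS order.
        Points with reachability > eps_cut start a new cluster if their core distance
        is <= eps_cut (they would be core at that radius), otherwise they are noise (-1)."""
        labels = {}
        current = -1
        for reach, point, core in cluster_order:
            if reach > eps_cut:                 # not reachable from the predecessor at this cut
                if core <= eps_cut:             # ...but core at eps_cut: start a new cluster
                    current += 1
                    labels[point] = current
                else:
                    labels[point] = -1          # noise at this cut
            else:
                labels[point] = current         # merge with the predecessor's cluster
        return labels

    # Toy cluster order: two "valleys" separated by a high reachability
    order = [(float("inf"), "a", 0.2), (0.2, "b", 0.2), (0.3, "c", 0.3),
             (2.0, "d", 0.2), (0.2, "e", 0.3), (5.0, "f", float("inf"))]
    print(extract_horizontal_cut(order, eps_cut=1.0))
    # {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': -1}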

355 Clustering Density-Based Methods : / Role of the Parameters ε and minPts
Properties of OPTICS and its cluster order:
    the cluster order is not unique, but very much order dependent, because of the starting point(s) and many identical reachability distances
    the recorded reachabilities are r_p = min{ ReachDist(p ← n) | n comes before p in the order, dist(p, n) ≤ ε }
    ε serves as a cut-off, and improves performance from O(n²) to possibly n log n with indexes
    a too small ε causes many points with reachability = ∞ (disconnected components)
    heuristic for ε: the largest minPts-distance, based on a sample of the data
    a larger minPts produces a smoother plot, because ReachDist = CoreDist more often
    for minPts = 2, the result is equivalent to single-link clustering
    a too small minPts can cause chaining effects similar to single-link
    rule of thumb: moderate minPts values often work well

356 Clustering Density-Based Methods : / Other Density-based Clustering Algorithms
Mean-Shift: [FH7]
    replace each point with the weighted average of its ε-neighbors
    points that move to the same location (± a small threshold) form a cluster
    runtime: O(n²) per iteration, many iterations
DenClue: [HK98]
    use kernel density estimation
    use gradients to find density maxima
    approximate the data using a grid [HG7]
Density-Peak Clustering: [RL]
    estimate the density of each point
    every point connects to the nearest neighbor of higher density
    runtime: O(n²); needs user input to select the clusters

357 Clustering Biclustering Methods : / Biclustering & Subspace Clustering Popular in gene expression analysis: every row is a gene, every column is a sample (or transposed); only a few genes are relevant; there is no semantic ordering of rows or columns; some samples may be contaminated; numerical values may be unreliable, only high vs. low is meaningful. Key idea of biclustering: Cheng and Church [CC] Find a subset of rows and columns (a submatrix, after permutation), such that all values are high/low or exhibit some pattern.

358 Clustering Biclustering Methods : / Bicluster Patterns [CC] Some examples of bicluster patterns: constant values, constant rows, constant columns, additive, multiplicative. Clusters may overlap in rows and columns; real patterns will never be this ideal, but noisy! Many algorithms focus on the constant pattern type only, as there are O(2^N · 2^d) possible submatrices.
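As one concrete quality score, Cheng and Church rate a candidate submatrix by its mean squared residue, which is 0 for perfect constant and additive patterns; this is a small array-based sketch (names are mine):

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Mean squared residue of the submatrix A[rows, cols]:
    0 for ideal constant/additive biclusters, large for unstructured ones."""
    S = np.asarray(A, dtype=float)[np.ix_(rows, cols)]
    residue = (S - S.mean(axis=1, keepdims=True)
                 - S.mean(axis=0, keepdims=True) + S.mean())
    return float((residue ** 2).mean())
```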

359 Clustering Biclustering Methods : / Density-based Subspace Clustering Subspace clusters may be visible in one projection, but not in another. (Figure: matrix of pairwise Column-vs-Column scatter plots.)

360 Clustering Biclustering Methods : / Density-based Subspace Clustering Subspace clusters may be visible in one projection, but not in another. Popular key idea: find dense areas in 1-dimensional projections, and combine subspaces as long as the cluster remains dense. Examples: CLIQUE [Agr+98], PROCLUS [Agg+99], SUBCLU [KKK]. There also exists correlation clustering, for rotated subspaces.

361 Clustering Other Clustering Methods : 6 / BIRCH Clustering [ZRL96] Cluster Features and the CF-tree: idea: summarize/compress the data into a tree structure. Phase 1: build a tree which, for every node, stores a cluster feature CF = (n, LS, SS): number of points n, linear sum LS := Σᵢ xᵢ, sum of squares SS := Σᵢ xᵢ². Points within a threshold criterion are added to an existing node, otherwise a new node is added; when running out of memory, rebuild the tree with a larger threshold, inserting all leaf cluster features (not the raw data) into a new tree. Phase 2: build a more compact tree. Phase 3: run a clustering algorithm on the aggregated cluster features, e.g., k-means, hierarchical clustering, CLARANS. Phase 4: re-scan the data, and map all data points to the nearest cluster. Extensions: data bubbles for DBSCAN [Bre+], two-step clustering for categorical data [Chi+].

362 Clustering Other Clustering Methods : 7 / BIRCH Clustering [ZRL96] / Cluster Feature properties: CF additivity theorem: we can merge two cluster features CF₁ + CF₂ := (n₁ + n₂, LS₁ + LS₂, SS₁ + SS₂). We can derive the following aggregates from a cluster feature CF (footnote 8): Centroid: µ = (Σᵢ xᵢ)/n = LS/n. Radius: R = sqrt( (1/n) Σᵢ ‖xᵢ − µ‖² ) = sqrt( SS/n − ‖µ‖² ). Diameter: D = sqrt( Σᵢ,ⱼ ‖xᵢ − xⱼ‖² / (n(n−1)) ) = sqrt( 2(n·SS − ‖LS‖²) / (n(n−1)) ). We can compute several distances of two CF as: centroid Euclidean D0(CF₁, CF₂) := ‖µ₁ − µ₂‖; centroid Manhattan D1(CF₁, CF₂) := Σ_d |µ₁,d − µ₂,d|; average inter-cluster distance D2(CF₁, CF₂) := sqrt( (n₂·SS₁ + n₁·SS₂ − 2·LS₁·LS₂) / (n₁·n₂) ); average intra-cluster distance D3(CF₁, CF₂) := ...; variance increase distance D4(CF₁, CF₂) := ... Footnote 8: Note that the radius is rather the average distance from the mean, and the diameter is the average pairwise distance. Beware that this suffers from numerical instability with floating point math!
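A minimal cluster-feature sketch illustrating the additivity theorem and the derived centroid and radius (class and method names are mine, not from the BIRCH paper):

```python
import numpy as np

class ClusterFeature:
    """BIRCH cluster feature CF = (n, LS, SS) for d-dimensional points."""
    def __init__(self, x):
        x = np.asarray(x, dtype=float)
        self.n = 1
        self.ls = x.copy()            # linear sum of the points
        self.ss = float(x @ x)        # sum of squared norms

    def merge(self, other):
        """CF additivity: CF1 + CF2 = (n1+n2, LS1+LS2, SS1+SS2)."""
        self.n += other.n
        self.ls = self.ls + other.ls
        self.ss += other.ss

    @property
    def centroid(self):
        return self.ls / self.n

    @property
    def radius(self):
        mu = self.centroid
        # clamp at 0 to guard against the numerical instability noted above
        return float(np.sqrt(max(self.ss / self.n - mu @ mu, 0.0)))
```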

363 Clustering Other Clustering Methods : 8 / CURE Clustering [GRS98] Clustering Using REpresentatives (CURE): draw a random sample of s ≪ n points; split the sample into p partitions of size s/p; cluster each partition into s/(p·q) clusters; cluster the resulting s/q representative points of all partitions; label all n data points with the cluster of the nearest representative point. Using a modified hierarchical clustering algorithm: use a k-d-tree to accelerate neighbor search, and keep candidates in a priority queue; always merge the closest two clusters; for the merged cluster, find the c farthest points, but shrink them towards the mean by a factor of α.

364 Clustering Other Clustering Methods : 9 / ROCK Clustering [GRS99] RObust Clustering using links (ROCK) for categorical data: uses Jaccard as primary similarity: J(x, y) := |x ∩ y| / |x ∪ y|; a link is an object z ≠ x, y with J(x, z) ≥ θ and J(z, y) ≥ θ; let link(x, y) := |{ z | J(x, z) ≥ θ ∧ J(z, y) ≥ θ }| be the number of links. Try to maximize E_l = Σᵢ nᵢ · Σ_{x,y ∈ Cᵢ} link(x, y) / nᵢ^(1+2f(θ)); merge clusters with the best goodness g(Cᵢ, Cⱼ) = Σ_{x ∈ Cᵢ, y ∈ Cⱼ} link(x, y) / ( (nᵢ + nⱼ)^(1+2f(θ)) − nᵢ^(1+2f(θ)) − nⱼ^(1+2f(θ)) ). f(θ) is application and data dependent; example in market basket analysis: f(θ) = (1 − θ) / (1 + θ).
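A brute-force sketch of the link computation on transactions given as Python sets; the matrix product counts common neighbors, matching the link definition above (names are mine):

```python
import numpy as np

def rock_links(transactions, theta):
    """Return the matrix of ROCK link counts (numbers of common neighbors)
    for a list of transactions (sets), with Jaccard >= theta as neighborhood."""
    n = len(transactions)
    nbr = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            union = transactions[i] | transactions[j]
            jac = len(transactions[i] & transactions[j]) / len(union) if union else 0.0
            nbr[i, j] = int(jac >= theta)
    return nbr @ nbr   # entry (i, j) = |{z : z is a neighbor of both i and j}|
```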

365 Clustering Other Clustering Methods : / CHAMELEON [KHK99] CHAMELEON: Clustering Using Dynamic Modeling. Generate a sparse graph, e.g., using the k-nearest neighbors only. Try to minimize the edge cut EC, the weight of the edges cut when partitioning the graph. Partition the graph using hMETIS (footnote 9) into relatively small sub-clusters. Merge sub-clusters back into clusters based on relative interconnectivity (EC of merging the cluster, normalized by the average of a fictional partition of the two child clusters) and relative closeness (the average length of an edge between the clusters, compared to the weighted average length of an edge within the clusters). Footnote 9: It is not published how this really works; it is proprietary software for circuit design. Implementations of CHAMELEON usually require installing the original hMETIS binary to do the partitioning.

366 Clustering Other Clustering Methods : / Spectral Clustering [SM; MS; NJW] Based on graph theory, use the spectrum (eigenvalues) of the similarity matrix for clustering. Compute the similarity matrix S = (s_ij) (with s_ij ≥ 0); popular similarities include s_ij = 1 if i, j are k-nearest neighbors or ε-neighbors, 0 otherwise. Compute the diagonal matrix D = (d_ij) with d_ii := Σ_j s_ij. Compute a graph Laplacian matrix such as: non-normalized: L = D − S; random-walk normalized (Shi-Malik normalized cut): L_rw = D^(−1) L = I − D^(−1) S [SM]; symmetrically normalized: L_sym = D^(−1/2) L D^(−1/2) = I − D^(−1/2) S D^(−1/2) [NJW]. Compute the smallest eigenvalues and eigenvectors: each eigenvalue 0 indicates a connected component. We can project onto the eigenvectors of the smallest eigenvalues: well connected points will be close, disconnected points will be far. Finally, cluster in the projected space, e.g., using k-means.
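A compact sketch using the symmetrically normalized Laplacian and a bare-bones Lloyd iteration in the embedded space; this is an illustration under those choices, not the exact procedure of any of the cited papers:

```python
import numpy as np

def spectral_clustering(S, k, seed=None):
    """Spectral clustering sketch: normalized Laplacian, k smallest
    eigenvectors, then a simple k-means on the embedded rows.
    S must be a symmetric, nonnegative similarity matrix."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(L_sym)          # eigenvalues in ascending order
    U = evecs[:, :k]                              # k smallest eigenvectors
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    rng = np.random.default_rng(seed)             # tiny Lloyd iteration on U
    centers = U[rng.choice(len(U), k, replace=False)]
    for _ in range(100):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        new_centers = np.array([U[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```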

367 Clustering Other Clustering Methods : / Affinity Propagation Clustering [FD7] Based on the idea of message passing: each object communicates to others which object they would prefer to be a cluster representative, and how much responsibility they received. Compute a similarity matrix, e.g., s(x, y) = −‖x − y‖²; set the diagonal to the median of s(x, y) (adjust to control the number of clusters). Iteratively send responsibility r(i, k) and availability a(i, k) messages: r(i, k) ← s(i, k) − max_{j≠k} { a(i, j) + s(i, j) }; a(i, k) ← min{ 0, r(k, k) + Σ_{j∉{i,k}} max(0, r(j, k)) } for i ≠ k, and a(k, k) ← Σ_{j≠k} max(0, r(j, k)). Objects i are assigned to the object k which maximizes a(i, k) + r(i, k); if i = k (and the value is positive), then they are cluster representatives and form a cluster. The number of clusters is decided indirectly, by s(i, i). Runtime complexity: O(n²) per iteration, and it needs O(n²) memory to store (s_ij), (r_ij), (a_ij).
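A vectorized sketch of exactly these update rules, with the usual damping added for stability; the damping factor and iteration count are arbitrary choices and convergence checks are omitted:

```python
import numpy as np

def affinity_propagation(S, damping=0.9, iters=200):
    """Simplified affinity propagation. S: similarity matrix whose diagonal
    holds the preferences; returns the exemplar index chosen by each point."""
    n = len(S)
    R = np.zeros((n, n))   # responsibilities r(i, k)
    A = np.zeros((n, n))   # availabilities  a(i, k)
    for _ in range(iters):
        # r(i,k) = s(i,k) - max_{j != k} (a(i,j) + s(i,j))
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second_max = AS.max(axis=1)
        R_new = S - first_max[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second_max
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum_{j not in {i,k}} max(0, r(j,k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())          # keep r(k,k) itself
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()              # a(k,k) = sum_{j != k} max(0, r(j,k))
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new
    return np.argmax(A + R, axis=1)                 # exemplar of each point
```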

368 Clustering Other Clustering Methods : / Further Clustering Approaches We only briefly touched many subdomains, such as: biclustering, e.g., Cheng and Church,...; projected and subspace clustering, e.g., CLIQUE, PROCLUS, PreDeCon, SUBCLU, P3C, StatPC,... Some notable subtypes not covered at all in this class: correlation clustering, e.g., ORCLUS, LMCLUS, COPAC, 4C, ERiC, CASH,...; online and streaming algorithms, e.g., StreamKM++, CluStream, DenStream, D-Stream,...; graph clustering; topic modeling, e.g., pLSI, LDA,...

369 Clustering Evaluation : / Evaluation of Clustering We can distinguish between four different kinds of evaluation: Unsupervised evaluation (usually based on distance / deviation) Statistics such as SSQ, Silhouette, Davies-Bouldin,... Used as: heuristics for choosing model parameters such as k Supervised evaluation based on labels Indexes such as Adjusted Rand Index, Normalized Mutual Information, Purity,... Used as: similarity with a reference solution Indirect evaluation Based on the performance of some other (usually supervised) algorithm in a later stage E.g., how much does classification improve, if we use the clustering as feature Expert evaluation Manual evaluation by a domain expert

370 Clustering Evaluation : / Unsupervised Evaluation of Clusterings Recall the basic idea of distance-based clustering algorithms: items in the same cluster should be more similar, items in other clusters should be less similar. Compare the distance within the same cluster to the distances to other clusters. Some simple approaches (N points, d dimensions): MeanDistance(C) := (1/N) Σᵢ Σ_{x∈Cᵢ} dist(x, µ_Cᵢ); MeanSquaredDistance(C) := (1/N) Σᵢ Σ_{x∈Cᵢ} dist(x, µ_Cᵢ)²; RMSSTD(C) := sqrt( Σᵢ Σ_{x∈Cᵢ} ‖x − µ_Cᵢ‖² / (d · Σᵢ (|Cᵢ| − 1)) ). How can we combine this with the distance to other clusters?

371 Clustering Evaluation : 6 / R²: Coefficient of Determination The total sum of squares (TSS) of a data set X is: TSS = Σ_{x∈X} ‖x − µ_X‖² = SSE(X). This can be decomposed into: TSS = Σᵢ Σ_{x∈Cᵢ} ( ‖x − µ_Cᵢ‖² + ‖µ_Cᵢ − µ_X‖² ) = Σᵢ SSE(Cᵢ) + Σᵢ |Cᵢ| · ‖µ_Cᵢ − µ_X‖², i.e., TSS = WCSS + BCSS, where TSS = Total Sum of Squares, WCSS = Within-Cluster Sum of Squares, BCSS = Between-Cluster Sum of Squares. Because the total sum of squares TSS is constant, minimizing WCSS = maximizing BCSS. Explained variance R² := BCSS / TSS = (TSS − WCSS) / TSS ∈ [0, 1].
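A direct computation of the decomposition and of R² (a minimal sketch; the function name is mine):

```python
import numpy as np

def r_squared(X, labels):
    """Explained variance R^2 = BCSS / TSS for a hard clustering.
    X: (n, d) data array, labels: cluster assignment per point."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    mu = X.mean(axis=0)
    tss = ((X - mu) ** 2).sum()
    wcss = bcss = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        wcss += ((Xc - mc) ** 2).sum()
        bcss += len(Xc) * ((mc - mu) ** 2).sum()
    assert np.isclose(tss, wcss + bcss)   # the decomposition TSS = WCSS + BCSS
    return bcss / tss
```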

372 Clustering Evaluation : 7 / Variance-Ratio-Criterion [CH7] The Calinski-Harabasz Variance-Ratio-Criterion (VRC) is defined as: VRC := (BCSS / (k − 1)) / (WCSS / (N − k)) = ((N − k) / (k − 1)) · (BCSS / WCSS). Increasing k increases BCSS and decreases WCSS; here, both terms get a penalty for increasing k! Connected to statistics: the one-way analysis of variance (ANOVA) test ("F-test"). Null hypothesis: all samples are drawn from distributions with the same mean; a large F-value indicates that the difference in means is significant. Beware: you must not use the same data to cluster and to do this test! The proper test scenario is, e.g., to test whether a categorical attribute exhibits a significant difference in the means of a numerical attribute (which is only used in the test).
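The VRC in code, computed from the same WCSS and BCSS quantities as above (a standalone sketch, names are mine):

```python
import numpy as np

def variance_ratio_criterion(X, labels):
    """Calinski-Harabasz VRC = (BCSS/(k-1)) / (WCSS/(N-k)); larger is better."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    mu = X.mean(axis=0)
    ks = np.unique(labels)
    wcss = bcss = 0.0
    for c in ks:
        Xc = X[labels == c]
        wcss += ((Xc - Xc.mean(axis=0)) ** 2).sum()
        bcss += len(Xc) * ((Xc.mean(axis=0) - mu) ** 2).sum()
    n, k = len(X), len(ks)
    return (bcss / (k - 1)) / (wcss / (n - k))
```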

373 Clustering Evaluation : 8 / Silhouette [Rou87] Define the mean distance of x to its own cluster, and to the closest other cluster: a(x | Cᵢ) := mean_{y∈Cᵢ, y≠x} d(x, y); b(x | Cᵢ) := min_{Cⱼ≠Cᵢ} mean_{y∈Cⱼ} d(x, y). The silhouette width of a point x (can be used for plotting) then is: s(x | Cᵢ) := (b(x) − a(x)) / max{a(x), b(x)} if |Cᵢ| > 1, and 0 otherwise. The silhouette of a clustering C then is: Silhouette(C) := mean_x s(x). Silhouette of a point: ≈ 1 if a ≪ b (well assigned point), ≈ 0 if a ≈ b (point is between the two clusters), and ≈ −1 if a ≫ b (poorly assigned point). An average silhouette of > 0.25 is considered weak, > 0.5 is reasonable, and > 0.7 is strong.
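A naive O(n²) implementation of these definitions, including the s(x) = 0 convention for singleton clusters (a sketch, names are mine):

```python
import numpy as np

def silhouette(X, labels):
    """Average silhouette width; O(n^2) distance computations."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    assert len(np.unique(labels)) >= 2, "silhouette needs at least two clusters"
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        if own.sum() <= 1:
            continue                                    # singleton cluster: s(x) = 0
        a = D[i, own & (np.arange(n) != i)].mean()      # mean distance to own cluster
        b = min(D[i, labels == c].mean()                # closest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```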

374 Clustering Evaluation : 9 / Silhouette [Rou87] / Challenges with using the Silhouette criterion: the complexity of Silhouette is O(n²), so it does not scale to large data sets. Simplified Silhouette: use the distances to the cluster centers instead of average distances, O(n·k). When a cluster has a single point, a(x) is not defined; Rousseeuw [Rou87] suggests to use s(x) = 0 then (see the definition of s(x)). Which distance should we use, e.g., with k-means: Euclidean, or squared Euclidean? In high-dimensional data, a(x) ≈ b(x) due to the curse of dimensionality, and then s(x) ≈ 0.

375 Clustering Evaluation : / Davies-Bouldin Index [DB79] Let the scatter of a cluster Cᵢ be (recall L_p norms): Sᵢ := ( (1/|Cᵢ|) Σ_{x∈Cᵢ} ‖x − µ_Cᵢ‖^p )^(1/p). Let the separation of clusters be: M_ij := ‖µ_Cᵢ − µ_Cⱼ‖_p = ( Σ_d |µ_Cᵢ,d − µ_Cⱼ,d|^p )^(1/p). The similarity of two clusters then is defined as: R_ij := (Sᵢ + Sⱼ) / M_ij. Clustering quality is the average maximum similarity: DB(C) := mean_Cᵢ max_{Cⱼ≠Cᵢ} R_ij. For p = 2, Sᵢ is the standard deviation and M_ij is the Euclidean distance! A small Davies-Bouldin index is better, i.e., Sᵢ + Sⱼ ≪ M_ij: scatter ≪ distance.
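The same quantities in code, with p = 2 (standard deviation as scatter, Euclidean centroid distance as separation); a minimal sketch:

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index for p = 2; smaller values are better."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    ks = np.unique(labels)
    cents = np.array([X[labels == c].mean(axis=0) for c in ks])
    scatter = np.array([np.sqrt(((X[labels == c] - cents[i]) ** 2).sum(axis=1).mean())
                        for i, c in enumerate(ks)])
    total = 0.0
    for i in range(len(ks)):
        total += max((scatter[i] + scatter[j]) / np.linalg.norm(cents[i] - cents[j])
                     for j in range(len(ks)) if j != i)
    return total / len(ks)
```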

376 Clustering Evaluation : / Other Internal Clustering Indexes There have been many more indexes proposed over time: Dunn index: cluster distance / maximum cluster diameter [Dun7; Dun7] Gamma and Tau: P (within-cluster distance < between-cluster distance ) [BH7] C-Index: sum of within-cluster distances / same number of smallest distances [HL76] Xie-Beni index for fuzzy clustering [XB9] Gap statistic: compare to results on generated random data [TWH] I-Index: maximum between cluster centers / sum of distances to the cluster center [MB] S_Dbw: separation and density based [HV; HBV] PBM Index: distance to cluster center / distance to total center [PBM] DBCV: density based cluster validation [Mou+]...

377-386 Clustering Evaluation : / Examples: Mouse data Revisiting k-means on the mouse data. (Figures: one plot per internal index, namely sum of squares, k-means SSQ, Variance Ratio Criteria, Silhouette, Davies-Bouldin Index, C-Index, PBM-Index, Gamma, and Tau; each is shown as min/mean/max over the k-means runs, plus the best-SSQ run, against the number of clusters k; the final slide collects all plots side by side.)

387 Clustering Evaluation : / Examples: Mouse data Revisiting k-means on the mouse data. (Figures: the toy mouse data set, and the least-SSQ k-means solutions for two different values of k.)

389 Clustering Evaluation : / Supervised Cluster Evaluation External evaluation measures assume we know the true clustering. In the following, every point has a cluster C(x) and a true class K(x); the raw data (e.g., vectors) will not be used by these measures. In the literature, this is popular for comparing the potential of algorithms; often, classification data is used, and it is assumed that good clusters = the classes. On real data this will often not be possible: no labels are available, and there may be more than one meaningful clustering of the same data! Sometimes we can at least label some data, or treat some properties as potential labels, and then choose the clustering that has the best agreement on the labeled part of the data.

392 Clustering Evaluation : / Supervised Cluster Evaluation / The matching problem: clusters C are usually enumerated 1, 2, ..., k; true classes K are usually labeled with meaningful classes. Which cluster C is which class K? Clustering is not classification; we cannot evaluate it the same way. What if there are more clusters than classes? What if a cluster contains two classes? What if a class contains two clusters? To overcome this: choose the best (C, K) matching with, e.g., the Hungarian algorithm (uncommon), or compare every cluster C to every class K.

393 Clustering Evaluation : / Purity, Precision and Recall A simple measure popular in text clustering (but not much in other clustering): Purity(Cᵢ, K) := max_j |Cᵢ ∩ Kⱼ| / |Cᵢ|; Purity(C, K) := (1/N) Σᵢ |Cᵢ| · Purity(Cᵢ, K) = (1/N) Σᵢ max_j |Cᵢ ∩ Kⱼ|. A cluster which only contains elements of class Kⱼ has purity 1; this is similar to precision in classification and information retrieval. But: if every document is its own cluster, we get purity 1; is this really optimal? We could also define a recall equivalent: Recall(C, Kⱼ) := max_i |Cᵢ ∩ Kⱼ| / |Kⱼ|; Recall(C, K) := (1/N) Σⱼ |Kⱼ| · Recall(C, Kⱼ) = (1/N) Σⱼ max_i |Cᵢ ∩ Kⱼ|. But this is even more misleading: if we put all objects into one cluster, we get recall 1!
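Purity computed directly from two label arrays (a sketch; any hashable label values work):

```python
import numpy as np

def purity(clusters, classes):
    """Purity(C, K) = (1/N) * sum_i max_j |C_i intersect K_j|."""
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    total = 0
    for c in np.unique(clusters):
        members = classes[clusters == c]
        total += max(np.sum(members == k) for k in np.unique(members))
    return total / len(clusters)
```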

394 Clustering Evaluation : 6 / Pair-counting Evaluation Many cluster evaluation measures are based on pair-counting. If (and only if) i and j are in the same cluster, then (i, j) is a pair. This gives us a binary, classification-like problem: C(i) = C(j) and K(i) = K(j): true positive (a); C(i) = C(j) and K(i) ≠ K(j): false positive (b); C(i) ≠ C(j) and K(i) = K(j): false negative (c); C(i) ≠ C(j) and K(i) ≠ K(j): true negative (d). These can be computed using a = Σᵢ,ⱼ (|Cᵢ ∩ Kⱼ| choose 2), b = Σᵢ (|Cᵢ| choose 2) − a, c = Σⱼ (|Kⱼ| choose 2) − a, d = (N choose 2) − a − b − c, where (m choose 2) = m(m−1)/2 is the number of pairs. Objects are a pair iff they are related ("should be together"); we are predicting which objects are related (pairs), and which are not.

395 Clustering Evaluation : 7 / Pair-counting Cluster Evaluation Measures With the pair counts a (true positives), b (false positives), c (false negatives), d (true negatives): Precision = a / (a + b); Recall = a / (a + c); Rand index [Ran7] = (a + d) / (a + b + c + d) = (a + d) / (N choose 2) = pair-counting accuracy; Fowlkes-Mallows [FM8] = sqrt(Precision · Recall) = a / sqrt((a + b)(a + c)); Jaccard = a / (a + b + c); F1-Measure = 2 · Precision · Recall / (Precision + Recall) = 2a / (2a + b + c); Fβ-Measure = (1 + β²) · Precision · Recall / (β² · Precision + Recall); Adjusted Rand Index [HA8] = (Rand index − E[Rand index]) / (optimal Rand index − E[Rand index]) = ( (N choose 2)(a + d) − ((a+b)(a+c) + (c+d)(b+d)) ) / ( (N choose 2)² − ((a+b)(a+c) + (c+d)(b+d)) ).
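A straightforward computation of the pair counts and the derived indexes (a sketch; math.comb needs Python 3.8 or later, and the names are mine):

```python
from math import comb
import numpy as np

def pair_counting(clusters, classes):
    """Compute (a, b, c, d) and the pair-counting indexes listed above."""
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    n = len(clusters)
    a = sum(comb(int(np.sum((clusters == ci) & (classes == kj))), 2)
            for ci in np.unique(clusters) for kj in np.unique(classes))
    b = sum(comb(int(np.sum(clusters == ci)), 2) for ci in np.unique(clusters)) - a
    c = sum(comb(int(np.sum(classes == kj)), 2) for kj in np.unique(classes)) - a
    m = comb(n, 2)
    d = m - a - b - c
    expected = (a + b) * (a + c) + (c + d) * (b + d)
    return {
        "precision": a / (a + b), "recall": a / (a + c),
        "rand": (a + d) / m, "jaccard": a / (a + b + c),
        "fowlkes_mallows": a / (((a + b) * (a + c)) ** 0.5),
        "ari": (m * (a + d) - expected) / (m * m - expected),
    }
```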

396 Clustering Evaluation : 8 / Mutual Information [Mei; Mei; Mei] Evaluation by Mutual Information: I(C, K) = Σᵢ Σⱼ P(Cᵢ ∩ Kⱼ) · log( P(Cᵢ ∩ Kⱼ) / (P(Cᵢ) · P(Kⱼ)) ) = Σᵢ Σⱼ (|Cᵢ ∩ Kⱼ| / N) · log( N·|Cᵢ ∩ Kⱼ| / (|Cᵢ|·|Kⱼ|) ). Entropy: H(C) = −Σᵢ P(Cᵢ) log P(Cᵢ) = −Σᵢ (|Cᵢ|/N) log(|Cᵢ|/N) = I(C, C). Normalized Mutual Information (NMI): NMI(C, K) = I(C, K) / ((H(C) + H(K)) / 2) = (I(C, K) + I(K, C)) / (I(C, C) + I(K, K)). I and H are the usual Shannon mutual information and entropy; using −log(a/b) = log(b/a). There exist variations of normalized mutual information [VEB].
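NMI computed from the contingency counts exactly as defined above (a sketch using the natural logarithm; names are mine):

```python
import numpy as np

def nmi(clusters, classes):
    """NMI(C, K) = 2 * I(C, K) / (H(C) + H(K))."""
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    n = len(clusters)

    def entropy(lab):
        p = np.array([np.sum(lab == v) for v in np.unique(lab)]) / n
        return -np.sum(p * np.log(p))

    mi = 0.0
    for c in np.unique(clusters):
        for k in np.unique(classes):
            nij = np.sum((clusters == c) & (classes == k))
            if nij > 0:
                mi += (nij / n) * np.log(n * nij /
                      (np.sum(clusters == c) * np.sum(classes == k)))
    return 2 * mi / (entropy(clusters) + entropy(classes))
```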

397 Clustering Evaluation : 9 / Further Supervised Cluster Evaluation Measures Some further evaluation measures: adjustment for chance is a general principle, Adjusted Index = (Index − E[Index]) / (optimal Index − E[Index]), for example the Adjusted Rand Index [HA8] or Adjusted Mutual Information [VEB]; B-Cubed evaluation [BB98]; set-matching purity [ZK] and F-measure [SKK]; edit distance [PL]; visual comparison of multiple clusterings [Ach+]; Gini-based evaluation [Sch+];...

398-400 Clustering Evaluation : / Supervised Evaluation of the Mouse Data Set Revisiting k-means on the mouse data. (Figures: Jaccard, ARI, and joint NMI, each shown as min/mean/max over the k-means runs plus the best-SSQ run, against the number of clusters k.)

402 Clustering Evaluation : / Supervised Evaluation of the Mouse Data Set Revisiting k-means on the mouse data. (Figures: Jaccard, ARI, and joint NMI against the number of clusters k, with min/mean/max and the best-SSQ run, next to two clustering results.) On this toy data set, most unsupervised methods prefer k = …; NMI prefers the right solution with k = 6; SSQ prefers the left (worse?) solution.

403 Clustering Database Acceleration : / Spatial Index Structures Key idea of spatial search: aggregate the data into spatial partitions identify partitions that are relevant for our search search within relevant partitions only do this recursively Requirements for efficient search: the search result is only a small subset we have a compact summary of the partition filtering rules can prune non-relevant partitions partitions do not overlap too much partitions do not vary in the number of objects too much

404 Clustering Database Acceleration : / Spatial Index Structures / Observations index structures already realize some (very rough) clustering similar objects are usually nearby in the index can usually be constructed fairly fast (with bulk-loading techniques etc.) can be re-used and shared for multiple applications can store complex objects such as polygons, often using bounding rectangles objects and partitions overlap no universal index needs to be carefully set up (distances, queries,...) if they work, speedups are often substantial

410 Clustering Database Acceleration : / Spatial Data Search: ε-range Queries Example: summarize partitions using Minimum Bounding Rectangles (MBR). Instead of computing the distances to each point, compute the minimum distance MinDist(q, MBR) to the MBR. If MinDist(q, MBR) > ε, then no point in the MBR can be closer than ε. If MinDist(q, MBR) ≤ ε, we need to process the contents of the MBR.
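A flat (tree-less) sketch of this filter-refine pattern; pages are (lower corner, upper corner, points) triples and all names are mine:

```python
import numpy as np

def mindist_point_mbr(q, lo, hi):
    """Minimum Euclidean distance from point q to the axis-parallel box [lo, hi]."""
    q = np.asarray(q, dtype=float)
    gap = np.maximum(np.maximum(np.asarray(lo, dtype=float) - q, 0.0),
                     q - np.asarray(hi, dtype=float))
    return float(np.linalg.norm(gap))

def range_query(q, eps, pages):
    """eps-range query: prune every page whose MinDist exceeds eps,
    refine the remaining pages by exact distance computations."""
    q = np.asarray(q, dtype=float)
    result = []
    for lo, hi, points in pages:
        if mindist_point_mbr(q, lo, hi) > eps:
            continue                      # no point inside can be within eps
        for x in points:                  # refine: exact distances
            if np.linalg.norm(np.asarray(x, dtype=float) - q) <= eps:
                result.append(x)
    return result
```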

412 Clustering Database Acceleration : / Spatial Data Search: k-Nearest Neighbors Example: ball cover. (Figure: query q and balls with centers p_i and radii r_i.) Bounding sphere: for all child points c_ij of p_i, we have dist(p_i, c_ij) ≤ r_i. By the triangle inequality, we have dist(q, c_ij) + dist(c_ij, p_i) ≥ dist(q, p_i); by using r_i, we get the bound dist(q, c_ij) ≥ dist(q, p_i) − dist(c_ij, p_i) ≥ dist(q, p_i) − r_i. Put nodes p_i into a priority queue, ordered by MinDist(q, p_i) := dist(q, p_i) − r_i. If we have k results that are better than the remaining queue entries, we can stop. Balls will usually contain nested balls in a tree structure.
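A best-first k-nearest-neighbor sketch over a flat list of balls, using exactly the MinDist bound derived above (a priority queue instead of a real tree; names are mine):

```python
import heapq
import numpy as np

def knn_ball_search(q, k, balls):
    """balls: list of (center, radius, points). Processes balls in MinDist
    order and stops once k candidates beat the best remaining lower bound."""
    q = np.asarray(q, dtype=float)
    queue = []
    for idx, (center, radius, points) in enumerate(balls):
        mind = max(np.linalg.norm(q - np.asarray(center, dtype=float)) - radius, 0.0)
        heapq.heappush(queue, (mind, idx, points))
    best = []                                  # max-heap of (-dist, point) pairs
    while queue:
        mind, _, points = heapq.heappop(queue)
        if len(best) == k and mind >= -best[0][0]:
            break                              # no remaining ball can improve the kNN
        for x in points:
            d = float(np.linalg.norm(q - np.asarray(x, dtype=float)))
            if len(best) < k:
                heapq.heappush(best, (-d, tuple(x)))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, tuple(x)))
    return sorted((-negd, list(x)) for negd, x in best)
```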

416 Clustering Database Acceleration : / Spatial Index Structures Many different strategies for partitioning data exist: grids: Quadtree [FB7], Gridfile [NHS8],...; binary splits: k-d-tree [Ben7],...; minimal bounding boxes: R-tree [Gut8], R⁺-tree [SRF87], R*-tree [Bec+9],...; ball covers (ball and outside): VP-tree [Uhl9; Yia9], Ball-tree [Omo89], M-tree [CPZ97], iDistance [Yu+], Cover-tree [BKL6],... Most use either rectangles (coordinate ranges) or balls (triangle inequality). A simpler partition description = less data to read / process / update /...

417 Clustering Database Acceleration : 6 / R-trees [Gut8] Objective: efficient management and querying of spatial objects (points, lines, polygons,...). Common index queries: point query: all objects that contain a given point; region query: all objects that are fully contained in (overlap with) a given query rectangle; range query: all objects within a certain radius; knn query: find the k nearest neighbors. Approach: management, storage, and search of axis-parallel rectangles; objects are represented by their minimum bounding rectangles in R^d; search queries operate on minimum bounding rectangles.

418 Clustering Database Acceleration : 7 / R-trees [Gut8] / R-tree: generalization of the B⁺-tree to multiple dimensions (balanced tree, page fill,...); stores d-dimensional minimum bounding rectangles (MBR); objects (e.g., polygons) are represented by their MBR; objects are stored in leaf nodes only; objects and nodes are not disjoint, but may overlap; inner nodes store the MBRs of their child nodes; the root node is the MBR of the entire data set. R⁺-tree [SRF87]: no overlap allowed; requires splitting and clipping of spatial objects, which leads to some redundancy; guarantees a unique path from root to leaf node. R*-tree [Bec+9]: improved split heuristic for pages, aimed at minimizing both coverage and overlap.

419 Clustering Database Acceleration : 8 / R-trees [Gut8] / Example: R-tree for the mouse data set, bulk-loaded. (Figures: the leaves plus inner nodes, and the root node.)

420 Clustering Database Acceleration : 9 / R-trees [Gut8] / Further examples. (Figures: Gaussians, and Wikidata coordinates for Germany.)


Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Clustering algorithms

Clustering algorithms Clustering algorithms Machine Learning Hamid Beigy Sharif University of Technology Fall 1393 Hamid Beigy (Sharif University of Technology) Clustering algorithms Fall 1393 1 / 22 Table of contents 1 Supervised

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC

Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC Clustering Idea Given a set of data can we find a natural grouping? Essential R commands: D =rnorm(12,0,1)

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

K-Means. Oct Youn-Hee Han

K-Means. Oct Youn-Hee Han K-Means Oct. 2015 Youn-Hee Han http://link.koreatech.ac.kr ²K-Means algorithm An unsupervised clustering algorithm K stands for number of clusters. It is typically a user input to the algorithm Some criteria

More information

Clustering Tips and Tricks in 45 minutes (maybe more :)

Clustering Tips and Tricks in 45 minutes (maybe more :) Clustering Tips and Tricks in 45 minutes (maybe more :) Olfa Nasraoui, University of Louisville Tutorial for the Data Science for Social Good Fellowship 2015 cohort @DSSG2015@University of Chicago https://www.researchgate.net/profile/olfa_nasraoui

More information