Hierarchy
- An arrangement or classification of things according to inclusiveness
- A natural way of abstraction, summarization, compression, and simplification for understanding
- Typical setting: organize a given set of objects into a hierarchy
  - No or very little supervision
  - Some heuristic guidance on the quality of the hierarchy
Hierarchical Clustering
- Group data objects into a tree of clusters
- Top-down versus bottom-up
[Figure: objects a, b, c, d, e merged step by step (steps 0-4) by agglomerative clustering (AGNES), and split in the reverse order (steps 4-0) by divisive clustering (DIANA).]
AGNES (Agglomerative Nesting)
- Initially, each object is a cluster
- Step-by-step cluster merging, until all objects form one cluster
- Single-link approach
  - Each cluster is represented by all of the objects in the cluster
  - The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
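A minimal sketch of AGNES-style single-link clustering using SciPy (the toy array X is made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy 2-D data: two visually separate groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])

# Single-link agglomerative clustering: the distance between two
# clusters is the distance between their closest pair of points
Z = linkage(X, method='single')
print(Z)  # each row: [cluster_i, cluster_j, merge distance, new cluster size]
```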
Dendrogram
- Shows how clusters are merged hierarchically
- Decomposes data objects into a multilevel nested partitioning (a tree of clusters)
- A clustering of the data objects: cut the dendrogram at the desired level
  - Each connected component forms a cluster
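Continuing the sketch above, cutting the dendrogram at a desired level yields a flat clustering; SciPy's fcluster performs the cut (again on made-up toy data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
Z = linkage(X, method='single')  # build the dendrogram

# Cut the dendrogram so that exactly 2 clusters remain; each
# connected component below the cut becomes one cluster
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 1 2 2 2]
```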
DIANA (Divisive Analysis)
- Initially, all objects are in one cluster
- Step-by-step splitting of clusters until each cluster contains only one object
[Figure: three scatter plots showing one cluster being progressively split into smaller clusters.]
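A minimal sketch of one DIANA split, assuming Euclidean distance and the classic splinter-group heuristic (seed a splinter group with the object most dissimilar to the rest, then keep moving over objects that are closer to the splinter group); the toy data is invented:

```python
import numpy as np

def diana_split(X):
    """One DIANA split of a cluster X (an n x d array) into two groups,
    using the splinter-group heuristic with Euclidean distance."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    rest = set(range(n))
    # Seed the splinter group with the object most dissimilar to the rest
    seed = max(rest, key=lambda i: D[i, list(rest - {i})].mean())
    splinter, rest = {seed}, rest - {seed}
    moved = True
    while moved:
        moved = False
        for i in list(rest):
            if len(rest) < 2:
                break
            # Move i over if it is closer, on average, to the splinter group
            if D[i, list(splinter)].mean() < D[i, list(rest - {i})].mean():
                splinter.add(i)
                rest.remove(i)
                moved = True
    return sorted(splinter), sorted(rest)

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1]])
print(diana_split(X))  # ([3, 4], [0, 1, 2])
```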
Distance Measures
- Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i, q \in C_j} |p - q|$
- Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i, q \in C_j} |p - q|$
- Mean distance: $d_{mean}(C_i, C_j) = |m_i - m_j|$
- Average distance: $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} |p - q|$
- Notation: $C_i$ is a cluster, $m_i$ is the mean of cluster $C_i$, and $n_i$ is the number of objects in $C_i$
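A small sketch computing all four measures with NumPy for two toy clusters (the arrays Ci and Cj are made up):

```python
import numpy as np

def cluster_distances(Ci, Cj):
    """Compute the four inter-cluster distance measures for two
    clusters given as arrays of points."""
    # All pairwise point-to-point distances between the two clusters
    D = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1)
    return {
        'min':  D.min(),                                  # single link
        'max':  D.max(),                                  # complete link
        'mean': np.linalg.norm(Ci.mean(0) - Cj.mean(0)),  # distance between means
        'avg':  D.mean(),                                 # average over all pairs
    }

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [5.0, 0.0]])
print(cluster_distances(Ci, Cj))  # min=3, max=5, mean=4, avg=4
```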
Challenges
- Hard to choose merge/split points
  - Merging/splitting can never be undone
  - Merging/splitting decisions are critical
- High complexity: O(n^2)
- Integrating hierarchical clustering with other techniques: BIRCH, CURE, CHAMELEON, ROCK
BIRCH
- Balanced Iterative Reducing and Clustering using Hierarchies
- CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
- Clustering objects → clustering the leaf nodes of the CF tree
Clustering Feature Vector
- Clustering Feature: CF = (N, LS, SS)
  - N: number of data points
  - LS: linear sum, $LS = \sum_{i=1}^{N} o_i$
  - SS: square sum, $SS = \sum_{i=1}^{N} o_i^2$
- Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) have CF = (5, (16,30), (54,190))
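The slide's CF can be verified directly; a minimal sketch (SS is kept per dimension here, matching the slide's (54, 190); some formulations use a scalar sum over all dimensions):

```python
import numpy as np

# The five points from the slide's example
pts = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)

N  = len(pts)                 # number of data points
LS = pts.sum(axis=0)          # linear sum per dimension
SS = (pts ** 2).sum(axis=0)   # square sum per dimension

print(N, LS, SS)  # 5 [16. 30.] [54. 190.] -- i.e., CF = (5, (16,30), (54,190))
```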
CF-tree in BIRCH
- Clustering feature:
  - Summarizes the statistics for a cluster
  - Many cluster quality measures (e.g., radius, distance) can be derived
  - Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
- A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
  - A non-leaf node in the tree has descendants (children)
  - Non-leaf nodes store the sums of the CFs of their children
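A small sketch of CF additivity and of deriving a cluster's centroid and radius from its CF alone (cf2 is a made-up CF for illustration; radius here is the root-mean-square distance of points to the centroid):

```python
import numpy as np

def merge_cf(cf1, cf2):
    """CF additivity: merging two clusters just adds their CFs."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2, ls1 + ls2, ss1 + ss2)

def centroid_and_radius(cf):
    """Derive centroid and radius from a CF: centroid = LS/N and
    R^2 = sum(SS)/N - ||LS/N||^2 (mean squared distance to centroid)."""
    n, ls, ss = cf
    c = ls / n
    r = np.sqrt(max(ss.sum() / n - (c ** 2).sum(), 0.0))
    return c, r

cf1 = (5, np.array([16.0, 30.0]), np.array([54.0, 190.0]))  # slide's example
cf2 = (3, np.array([9.0, 12.0]), np.array([29.0, 50.0]))    # hypothetical CF
print(merge_cf(cf1, cf2))
print(centroid_and_radius(cf1))  # centroid (3.2, 6.0), radius 1.6
```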
CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes store entries CF1, CF2, ..., each with a child pointer; leaf nodes store CF entries and are chained together with prev/next pointers.]
Parameters of a CF-tree
- Branching factor: the maximum number of children per node
- Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
BIRCH Clustering
- Phase 1: scan the database to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree
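As a usage sketch, scikit-learn's Birch implements both phases; note that its threshold parameter bounds the radius (not the diameter) of leaf sub-clusters, and the toy blobs below are invented:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Two Gaussian blobs as toy data
X = np.vstack([rng.normal(0, 0.5, (100, 2)),
               rng.normal(5, 0.5, (100, 2))])

# Phase 1: build the in-memory CF tree while scanning the data.
# Phase 2: cluster the leaf entries into n_clusters global clusters.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(labels[:5], labels[-5:])  # e.g. [0 0 0 0 0] [1 1 1 1 1]
```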
Pros & Cons of BIRCH
- Linear scalability
  - Good clustering with a single scan
  - Quality can be further improved by a few additional scans
- Can handle only numeric data
- Sensitive to the order of the data records
Drawbacks of Square-Error-Based Methods
- One representative per cluster
  - Good only for convex-shaped clusters of similar size and density
- k, the number of clusters, is a parameter
  - Good only if k can be reasonably estimated
CURE: the Ideas
- Each cluster has c representatives
  - Choose c well-scattered points in the cluster
  - Shrink them toward the mean of the cluster by a fraction α
  - The representatives capture the physical shape and geometry of the cluster
- Merge the two closest clusters
  - Distance between two clusters: the distance between their two closest representatives
CURE: the Algorithm
- Draw a random sample S
- Partition the sample into p partitions
- Partially cluster each partition
- Eliminate outliers
  - Random sampling + removal of clusters growing too slowly
- Cluster the partial clusters until only k clusters are left
- Shrink the representatives of each cluster toward the cluster center (see the sketch below)
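A minimal sketch of the representative-selection and shrinking step, assuming Euclidean distance and a greedy farthest-point heuristic for scattering (the function name and toy data are mine, not CURE's published code):

```python
import numpy as np

def cure_representatives(points, c=4, alpha=0.3):
    """Pick c well-scattered representatives of one cluster, then
    shrink them toward the cluster mean by a fraction alpha."""
    mean = points.mean(axis=0)
    # Greedy farthest-point selection: start with the point farthest
    # from the mean, then repeatedly add the point farthest from all
    # representatives chosen so far
    reps = [points[np.argmax(np.linalg.norm(points - mean, axis=1))]]
    while len(reps) < min(c, len(points)):
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (mean - reps)  # shrink toward the mean

rng = np.random.default_rng(1)
cluster = rng.normal(size=(50, 2))
print(cure_representatives(cluster))  # 4 shrunken representative points
```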
Data Partitioning and Clustering
[Figure: a sequence of five scatter plots (x vs. y) illustrating data partitioning and partial clustering of each partition.]
Shrinking Representative Points
- Shrink the multiple representative points toward the gravity center by a fraction α
- The representatives capture the shape of the cluster
[Figure: two scatter plots (before → after) showing representative points shrunk toward the cluster center.]
Clustering Categorical Data: ROCK
- ROCK: Robust Clustering using Links
  - Uses links (the number of common neighbors between two points) to measure similarity/proximity
  - Not distance-based
  - Complexity: $O(n^2 + n m_m m_a + n^2 \log n)$, where $m_m$ is the maximum number of neighbors and $m_a$ is the average number of neighbors
- Basic ideas: similarity function and neighbors
  - $Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
  - Example: let $T_1 = \{1,2,3\}$ and $T_2 = \{3,4,5\}$; then $Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2$
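A small sketch of ROCK's building blocks: Jaccard similarity between transactions, and link counts (numbers of common neighbors) with theta as the neighbor threshold; the extra toy transactions extend the slide's example:

```python
def jaccard(t1, t2):
    """ROCK's similarity for transactions: |T1 ∩ T2| / |T1 ∪ T2|."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

def links(transactions, theta=0.5):
    """link(p, q) = number of common neighbors of p and q, where two
    transactions are neighbors iff their similarity is >= theta."""
    n = len(transactions)
    nbr = [{j for j in range(n) if j != i
            and jaccard(transactions[i], transactions[j]) >= theta}
           for i in range(n)]
    return {(i, j): len(nbr[i] & nbr[j])
            for i in range(n) for j in range(i + 1, n)}

T1, T2 = {1, 2, 3}, {3, 4, 5}
print(jaccard(T1, T2))                     # 1/5 = 0.2, as on the slide
print(links([T1, T2, {1, 2, 4}, {1, 2}]))  # link counts for all pairs
```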
Limitations
- Merging decisions are based on static modeling
  - No special characteristics of the clusters themselves are considered
[Figure: two pairs of elongated clusters; CURE and BIRCH would merge one pair, although the other pair is more appropriate for merging.]
Chameleon
- Hierarchical clustering using dynamic modeling
- Measures similarity based on a dynamic model
  - Compares the interconnectivity and closeness (proximity) between two clusters against the internal interconnectivity of the clusters and the closeness of items within the clusters
- A two-phase algorithm
  - Phase 1: use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
  - Phase 2: find the genuine clusters by repeatedly combining sub-clusters
Overall Framework of CHAMELEON
[Figure: Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters.]
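A sketch of phase 1's input: the sparse k-nearest-neighbor graph, built here with scikit-learn (k = 10 and the random data are arbitrary choices; the actual graph partitioner, e.g. hMETIS, is out of scope):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Each object is connected only to its k most similar objects; the
# resulting sparse graph is then cut into many small sub-clusters by
# a graph partitioner before the dynamic merging phase.
G = kneighbors_graph(X, n_neighbors=10, mode='distance')
print(G.shape, G.nnz)  # (200, 200) sparse matrix with 2000 edges
```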
To-Do List
- Read Chapter 10.3
- (For thesis-based graduate students only) Read the paper "BIRCH: An Efficient Data Clustering Method for Very Large Databases"