Clustering fundamentals

Elena Baralis, Tania Cerquitelli
Politecnico di Torino
DataBase and Data Mining Group

What is Cluster Analysis?

Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
- Intra-cluster distances are minimized
- Inter-cluster distances are maximized

Applications of Cluster Analysis

- Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
- Summarization: reduce the size of large data sets

Discovered clusters, listed by industry group:
1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

[Figure: clustering precipitation in Australia]

Notion of a Cluster can be Ambiguous

How many clusters?
[Figure: the same points grouped as six clusters, two clusters, and four clusters]

Types of Clusterings

- A clustering is a set of clusters
- Important distinction between hierarchical and partitional sets of clusters
- Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree

Partitional Clustering
[Figure: a partitional clustering]

Hierarchical Clustering
[Figure: traditional hierarchical clustering and traditional dendrogram over points p1 to p4]
[Figure: non-traditional hierarchical clustering and non-traditional dendrogram over points p1 to p4]

Clustering Algorithms

- K-means and its variants
- Hierarchical clustering
- Density-based clustering

K-means Clustering

- Partitional clustering approach
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- Number of clusters, K, must be specified
- The basic algorithm is very simple

Two different K-means Clusterings
[Figure: an optimal clustering and a sub-optimal clustering of the same points]
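The basic K-means algorithm is not spelled out in the text, so the following is only a minimal sketch of the usual assign-and-update loop, using the Python standard library. The function name `kmeans`, the random choice of initial centroids, the `max_iter` cap, and the toy data are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k points sampled at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the centroids are stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = kmeans(points, k=2)
```

On data this well separated the result does not depend on the starting centroids; on harder data a different `seed` can lead to a different, possibly sub-optimal, clustering.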

Importance of Choosing Initial Centroids
[Figure: K-means iterations 1 to 6 for one choice of initial centroids]

Importance of Choosing Initial Centroids
[Figure: K-means iterations 1 to 5 for a different choice of initial centroids]

Evaluating K-means Clusters

- Most common measure is Sum of Squared Error (SSE)
  - For each point, the error is the distance to the nearest cluster centroid
  - To get SSE, we square these errors and sum them:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

  - x is a data point in cluster C_i and m_i is the representative point for cluster C_i; one can show that m_i corresponds to the center (mean) of the cluster
  - Given two clusterings, we can choose the one with the smallest error
  - One easy way to reduce SSE is to increase K, the number of clusters
  - A good clustering with smaller K can have a lower SSE than a poor clustering with higher K

Solutions to Initial Centroids Problem

- Multiple runs
  - Helps, but probability is not on our side
- Sample and use hierarchical clustering to determine initial centroids
- Select more than k initial centroids and then select among these initial centroids
  - Select most widely separated
- Postprocessing
- Bisecting K-means
  - Not as susceptible to initialization issues
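As a small worked example of the SSE measure, the sketch below computes it for three hand-made clusterings of four points on a line. The helper names `sse` and `centroid` and the data are assumptions chosen for illustration; each centroid is taken as the mean of its cluster, as stated above.

```python
import math


def sse(points, labels, centroids):
    """Sum over all points of the squared distance to the assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def centroid(cluster):
    """Mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


# Hypothetical data: four points on a line.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

# Clustering A (K=2) groups neighbouring points: {0, 2} and {10, 12}.
labels_a = [0, 0, 1, 1]
cents_a = [centroid(points[:2]), centroid(points[2:])]

# Clustering B (K=2) is a poor split: {0} and {2, 10, 12}.
labels_b = [0, 1, 1, 1]
cents_b = [centroid(points[:1]), centroid(points[1:])]

# Clustering C (K=3) is also poor: {0, 12}, {2}, {10}.
labels_c = [0, 1, 2, 0]
cents_c = [centroid([points[0], points[3]]), points[1], points[2]]

sse_a = sse(points, labels_a, cents_a)
sse_b = sse(points, labels_b, cents_b)
sse_c = sse(points, labels_c, cents_c)
```

Here clustering A has the lowest SSE even though C uses more clusters, which is one instance of the remark that a good clustering with smaller K can beat a poor one with higher K.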

Pre-processing and Post-processing

- Pre-processing
  - Normalize the data
  - Eliminate outliers
- Post-processing
  - Eliminate small clusters that may represent outliers
  - Split loose clusters, i.e., clusters with relatively high SSE
  - Merge clusters that are close and that have relatively low SSE
- Can use these steps during the clustering process

From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006

Limitations of K-means

- K-means has problems when clusters are of differing
  - Sizes
  - Densities
  - Non-globular shapes
- K-means has problems when the data contains outliers.
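The text does not say which normalization to apply before clustering. As one common option, and purely as an assumption here, the sketch below rescales every attribute to zero mean and unit standard deviation (z-score), so that no attribute dominates the distance computation just because of its units. The function name `zscore_columns` and the data are hypothetical.

```python
import statistics


def zscore_columns(rows):
    """Rescale each attribute (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; it is 0 for a constant column.
    stds = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column is mapped to 0.0 to avoid dividing by zero.
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds))
        for row in rows
    ]


# Hypothetical data: the second attribute is on a much larger scale than the first.
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 5000.0)]
scaled = zscore_columns(rows)
```

After rescaling, both attributes of this toy data take the same values, so they contribute equally to Euclidean distances.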

Limitations of K-means: Differing Sizes
[Figure: K-means clusters on data with clusters of differing sizes]

Limitations of K-means: Differing Density
[Figure: K-means clusters on data with clusters of differing density]

Limitations of K-means: Non-globular Shapes
[Figure: K-means clusters on data with non-globular clusters]

Overcoming K-means Limitations
[Figure: K-means clusters]

- One solution is to use many clusters.
- K-means then finds parts of clusters, which still need to be put together.

Overcoming K-means Limitations
[Figures: K-means clusters for two further examples]

Hierarchical Clustering

- Produces a set of nested clusters organized as a hierarchical tree
- Can be visualized as a dendrogram
  - A tree-like diagram that records the sequences of merges or splits
[Figure: nested clusters and the corresponding dendrogram]

Strengths of Hierarchical Clustering

- Do not have to assume any particular number of clusters
  - Any desired number of clusters can be obtained by cutting the dendrogram at the proper level
- They may correspond to meaningful taxonomies
  - Examples in biological sciences (e.g., animal kingdom, phylogeny reconstruction)

Hierarchical Clustering

- Two main types of hierarchical clustering
  - Agglomerative:
    - Start with the points as individual clusters
    - At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  - Divisive:
    - Start with one, all-inclusive cluster
    - At each step, split a cluster until each cluster contains a single point (or there are k clusters)
- Traditional hierarchical algorithms use a similarity or distance matrix
  - Merge or split one cluster at a time

Agglomerative Clustering Algorithm

- More popular hierarchical clustering technique
- Basic algorithm is straightforward
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4. Merge the two closest clusters
  5. Update the proximity matrix
  6. Until only a single cluster remains
- Key operation is the computation of the proximity of two clusters
  - Different approaches to defining the distance between clusters distinguish the different algorithms
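The agglomerative steps listed above can be sketched in a few lines. This is a naive illustration, not an efficient implementation: the name `agglomerative`, the choice of MIN (single link) as cluster proximity, and the toy data are assumptions, and instead of updating a cluster-level proximity matrix the sketch recomputes cluster proximities from the point-level matrix at every step.

```python
import math


def agglomerative(points, k=1):
    """Naive agglomerative clustering; returns clusters as lists of point indices."""
    # Step 1: proximity matrix of pairwise Euclidean distances between points.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Step 2: every point starts as its own cluster.
    clusters = [[i] for i in range(len(points))]

    def proximity(a, b):
        # Assumption: MIN (single link), the smallest point-to-point distance.
        return min(dist[i][j] for i in a for j in b)

    # Steps 3 to 6: merge the two closest clusters until k clusters remain.
    while len(clusters) > max(k, 1):
        pairs = [
            (a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))
        ]
        a, b = min(pairs, key=lambda ab: proximity(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical toy data: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
three = agglomerative(points, k=3)
two = agglomerative(points, k=2)
```

Stopping at `k` clusters plays the same role as cutting a dendrogram at the level that leaves `k` branches.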

How to Define Inter-Cluster Similarity
[Figure: two clusters of points and the proximity matrix over points p1, p2, p3, p4, p5, ...]

- MIN
- MAX
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function
  - Ward's Method uses squared error

Hierarchical Clustering: Comparison
[Figure: clusterings of the same points produced by MIN, MAX, Ward's Method, and Group Average]
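The listed proximity definitions differ only in how point-to-point distances are combined. The sketch below shows one possible reading of each for two small clusters, using Euclidean distance; the function names and the data are assumptions. For Ward's Method, the sketch uses the standard result that merging two clusters increases the total squared error by |A||B| / (|A| + |B|) times the squared distance between their centroids.

```python
import math
from itertools import product


def centroid(cluster):
    """Mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def min_link(a, b):
    """MIN: distance between the two closest points of the clusters."""
    return min(math.dist(p, q) for p, q in product(a, b))


def max_link(a, b):
    """MAX: distance between the two farthest points of the clusters."""
    return max(math.dist(p, q) for p, q in product(a, b))


def group_average(a, b):
    """Group average: mean of all pairwise distances between the clusters."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_distance(a, b):
    """Distance between the cluster centroids."""
    return math.dist(centroid(a), centroid(b))


def ward_increase(a, b):
    """Increase in total squared error caused by merging the two clusters."""
    return len(a) * len(b) / (len(a) + len(b)) * centroid_distance(a, b) ** 2


# Hypothetical clusters: two vertical pairs, three units apart.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
results = {
    "MIN": min_link(a, b),
    "MAX": max_link(a, b),
    "group average": group_average(a, b),
    "centroids": centroid_distance(a, b),
    "Ward": ward_increase(a, b),
}
```

Even on this tiny example the definitions disagree on how far apart the two clusters are, which is why they can produce different hierarchies on the same data.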

DBSCAN

- DBSCAN is a density-based algorithm.
  - Density = number of points within a specified radius (Eps)
  - A point is a core point if it has more than a specified number of points (MinPts) within Eps
    - These are points that are at the interior of a cluster
  - A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
  - A noise point is any point that is not a core point or a border point.

DBSCAN: Core, Border, and Noise Points
[Figure: core, border, and noise points]
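The three point types can be illustrated with a short classification routine. This sketch covers only the labelling of points, not the full DBSCAN cluster-growing procedure. The wording above leaves open what happens when a point has exactly MinPts neighbours, so the sketch assumes a common convention: a point is a core point if at least `min_pts` points, counting itself, lie within `eps`. The function name and the data are hypothetical.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' in the DBSCAN sense."""
    # Eps-neighbourhood of each point, as indices; a point counts itself.
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps] for p in points
    ]
    # Assumption: core means at least min_pts points within eps, itself included.
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels


# Hypothetical data: a unit square, one point just off its corner, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.2, 1.0), (8.0, 8.0)]
labels = classify_points(points, eps=1.5, min_pts=4)
```

With these settings the four square corners are core points, the point at (2.2, 1.0) is a border point because it lies within `eps` of the core point (1.0, 1.0), and the distant point is noise.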

DBSCAN: Core, Border, and Noise Points
[Figure: point types (core, border and noise) for MinPts = 4; the Eps value is missing in the source]

When DBSCAN Works Well
[Figure: clusters found by DBSCAN]

- Resistant to noise
- Can handle clusters of different shapes and sizes

When DBSCAN Does NOT Work Well

- Varying densities
- High-dimensional data
[Figure: DBSCAN results for (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.6)]

Measures of Cluster Validity

- The validation of clustering structures is the most difficult task
- To evaluate the goodness of the resulting clusters, some numerical measures can be exploited
- Numerical measures are classified into two main classes
  - External Index: used to measure the extent to which cluster labels match externally supplied class labels, e.g., entropy, purity, Rand index, adjusted Rand index
  - Internal Index: used to measure the goodness of a clustering structure without respect to external information, e.g., Sum of Squared Error (SSE), cluster cohesion, cluster separation
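Entropy and purity are named above only as examples of external measures. The sketch below uses their usual definitions, which are an assumption here since the text does not give formulas: for each cluster, purity is the fraction of its points that belong to its majority class and entropy is the Shannon entropy (base 2) of its class distribution, and both are averaged over clusters weighted by cluster size. The function name and the labels are hypothetical.

```python
import math
from collections import Counter


def purity_and_entropy(cluster_labels, class_labels):
    """Size-weighted purity and entropy of a clustering against known classes."""
    n = len(class_labels)
    by_cluster = {}
    for c, y in zip(cluster_labels, class_labels):
        by_cluster.setdefault(c, []).append(y)
    purity = 0.0
    entropy = 0.0
    for members in by_cluster.values():
        counts = Counter(members)
        size = len(members)
        # Majority-class share of the cluster, weighted by size / n.
        purity += max(counts.values()) / n
        # Entropy of the class distribution inside the cluster.
        e = -sum((m / size) * math.log2(m / size) for m in counts.values())
        entropy += (size / n) * e
    return purity, entropy


# Hypothetical labels: cluster 0 mixes two classes, cluster 1 is pure.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
classes = ["a", "a", "a", "b", "b", "b", "b", "b"]
purity, entropy = purity_and_entropy(clusters, classes)
```

Higher purity and lower entropy both indicate closer agreement between the clusters and the supplied class labels.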

External Measures of Cluster Validity: Entropy and Purity
[Figure: entropy and purity example]

Internal Measures: Cohesion and Separation

- A proximity graph based approach can also be used for cohesion and separation.
  - Cluster cohesion is the sum of the weights of all links within a cluster.
  - Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.
[Figure: cohesion and separation]
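The graph-based definitions of cohesion and separation translate directly into sums over a weight matrix. In the sketch below the proximity graph is assumed to be given as a symmetric matrix `weights`, where `weights[a][b]` is the weight of the link between nodes `a` and `b` and 0 means no link; the matrix values and function names are hypothetical.

```python
def cohesion(weights, cluster):
    """Sum of the weights of all links with both endpoints inside the cluster."""
    members = sorted(cluster)
    return sum(
        weights[a][b] for i, a in enumerate(members) for b in members[i + 1:]
    )


def separation(weights, cluster):
    """Sum of the weights of links between cluster nodes and outside nodes."""
    outside = [b for b in range(len(weights)) if b not in cluster]
    return sum(weights[a][b] for a in cluster for b in outside)


# Hypothetical symmetric proximity graph over four nodes.
weights = [
    [0, 5, 1, 0],
    [5, 0, 2, 1],
    [1, 2, 0, 4],
    [0, 1, 4, 0],
]
left, right = {0, 1}, {2, 3}
coh_left, sep_left = cohesion(weights, left), separation(weights, left)
coh_right, sep_right = cohesion(weights, right), separation(weights, right)
```

Each within-cluster link is counted once in `cohesion`; with only two clusters the separation of one equals the separation of the other, since both sum the same crossing links.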

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes