Types of general clustering methods: clustering algorithms for general similarity measures, and similarity between clusters

Types of general clustering methods

Clustering algorithms for general similarity measures divide into agglomerative versus divisive algorithms:
agglomerative = bottom-up: build up clusters from single objects
divisive = top-down: break up the cluster containing all objects into smaller clusters
Both agglomerative and divisive algorithms give hierarchies. The hierarchy can be trivial, splitting off one object per level:
1  ( . . ) . . .
2  ( ( . . ) . ) . .
3  ( ( ( . . ) . ) . ) .
4  ( ( ( ( . . ) . ) . ) . )

Similarity between clusters

Possible definitions:
I. The similarity between the most similar pair of objects, with one object in each cluster; called single link.
II. The similarity between the least similar pair of objects, one from each cluster; called complete linkage.
III. The average of the pairwise similarities over all pairs of objects, one from each cluster; more computation.
Generally there is no representative point for a cluster (compare K-means). If using Euclidean distance as the metric, a centroid or a bounding box can represent a cluster.

General agglomerative algorithm

Uses any computable cluster similarity measure sim(C_i, C_j).
For n objects v_1, ..., v_n, assign each to a singleton cluster C_i = {v_i}.
repeat {
  identify the two most similar clusters C_j and C_k (there could be ties; choose one pair)
  delete C_j and C_k and add (C_j ∪ C_k) to the set of clusters
} until only one cluster remains
Dendrograms diagram the sequence of cluster merges. (A sketch of this template follows the remarks below.)

Agglomerative: remarks

Introduction to Information Retrieval discusses the cluster similarity measures in great detail: single link, complete link, average over all pairs, and centroid. Using priority queues, the time complexity is O(n^2 log n * (time to compute cluster similarity)): one priority queue per cluster, holding the similarities to all other clusters plus bookkeeping information. More precisely, the time complexity is O(n^2 * (time to compute an object-object similarity) + n^2 log n * (time to compute sim(C_z, C_j ∪ C_k) given sim(C_z, C_j) and sim(C_z, C_k))).
Problem with the priority queue?
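To make the template concrete, here is a minimal sketch in Python. It assumes only a user-supplied object-object similarity function; the linkage helpers and all names are illustrative, not from the lecture, and the quadratic rescan of cluster pairs ignores the priority-queue speedup discussed above.

```python
# Sketch of the general agglomerative template (illustrative, not the
# lecture's code). The linkage functions realize definitions I-III above.
import itertools

def single_link(cross):    return max(cross)               # I: most similar pair
def complete_link(cross):  return min(cross)               # II: least similar pair
def average_link(cross):   return sum(cross) / len(cross)  # III: average of pairs

def agglomerate(objects, sim, linkage=single_link):
    """Merge singleton clusters until one remains; return the merge sequence."""
    clusters = [frozenset([i]) for i in range(len(objects))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in itertools.combinations(clusters, 2):
            s = linkage([sim(objects[i], objects[j]) for i in a for j in b])
            if best is None or s > best[0]:
                best = (s, a, b)            # the most similar pair of clusters
        _, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))               # a dendrogram diagrams this sequence
    return merges

# Example: five points on a line, similarity = negative distance.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
print(agglomerate(points, lambda x, y: -abs(x - y)))
```

Each iteration rescans all cluster pairs, so this favors clarity over the priority-queue bound above.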

Single-pass agglomerative-like algorithm

Given an arbitrary order of the objects to cluster, v_1, ..., v_n, and a threshold τ:
Put v_1 in cluster C_1 by itself.
For i = 2 to n {
  for all existing clusters C_j, calculate sim(v_i, C_j); record the cluster most similar to v_i as C_max(i)
  if sim(v_i, C_max(i)) > τ, add v_i to C_max(i)
  else create the new cluster {v_i}
}
Issues: v_i is placed in a cluster after seeing only v_1, ..., v_{i-1}; the result is not hierarchical; the algorithm tends to produce large clusters; and the outcome depends on τ and on the order of the v_i. (A sketch appears below.)

Alternate perspective on the single-link algorithm

Build a minimum spanning tree (MST), a graph algorithm, with the pairwise similarities as edge weights; since we work in terms of similarities rather than distances, we really want a maximum spanning tree. For some threshold τ, remove all edges of similarity < τ. The tree falls into pieces, and the pieces are the clusters. This is not hierarchical, but a hierarchy is obtained from a sequence of values of τ. (A sketch follows the single-pass sketch below.)
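Here is a minimal sketch of the single-pass procedure, assuming the average similarity of v_i to a cluster's members stands in for sim(v_i, C_j); the threshold value and the data are illustrative.

```python
# Sketch of the single-pass algorithm; names and threshold are illustrative.

def single_pass(objects, sim, tau):
    """Assign each object, in the order given, to its most similar cluster,
    or to a fresh singleton when no cluster is similar enough."""
    clusters = [[0]]                                    # v_1 starts C_1 alone
    for i in range(1, len(objects)):
        def cluster_sim(c):                             # average over members
            return sum(sim(objects[i], objects[j]) for j in c) / len(c)
        best = max(clusters, key=cluster_sim)
        if cluster_sim(best) > tau:
            best.append(i)                              # join closest cluster
        else:
            clusters.append([i])                        # new cluster {v_i}
    return clusters

points = [0.0, 0.1, 0.2, 5.0, 5.1]
print(single_pass(points, lambda x, y: -abs(x - y), tau=-1.0))
# [[0, 1, 2], [3, 4]]; a different object order or tau can change the result
```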
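And a sketch of the MST perspective, assuming Kruskal's algorithm run on the similarities in decreasing order (in effect a maximum spanning tree) and stopping at the threshold, so the resulting forest's components are the clusters.

```python
# Sketch of single link via a maximum spanning tree: Kruskal's algorithm on
# similarities in decreasing order; edges below tau are never added.
import itertools

def mst_single_link(objects, sim, tau):
    parent = list(range(len(objects)))
    def find(x):                                    # union-find with compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    edges = sorted(itertools.combinations(range(len(objects)), 2),
                   key=lambda e: sim(objects[e[0]], objects[e[1]]),
                   reverse=True)                    # maximum spanning tree order
    for i, j in edges:
        if sim(objects[i], objects[j]) < tau:
            break                                   # remaining edges are all cut
        parent[find(i)] = find(j)                   # union the two components
    groups = {}
    for i in range(len(objects)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

points = [0.0, 0.1, 0.2, 5.0, 5.1]
print(mst_single_link(points, lambda x, y: -abs(x - y), tau=-1.0))
# [[0, 1, 2], [3, 4]]; rerunning over a sequence of tau values gives a hierarchy
```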

Hierarchical divisive: template

1. Put all objects in one cluster.
2. Repeat until all clusters are singletons:
   a) choose a cluster to split (what criterion?)
   b) replace the chosen cluster with the sub-clusters (split into how many? how to split?)
Reversing an agglomerative merge means splitting in two. For the cutting operation, cut-based measures seem a natural choice: focus on the similarity across the cut, i.e. the similarity lost. It is not necessary to use a cut-based measure, however.

An Example
[Figure: an example graph to be partitioned.]

An Example: 1st cut
[Figure: the example graph after the first cut.]

An Example: 2nd cut
[Figure: the example graph after the second cut.]

An Example: stop at 3 clusters
[Figure: the example graph partitioned into three clusters.]

Compare the k-means result
[Figure: a k-means clustering of the same example.]

Cut-based optimization

Weaken the connection between objects in different clusters rather than strengthening the connection between objects within a cluster. There are many cut-based measures; we will look at one.

Inter/intra cluster costs

Given V = {v_1, ..., v_n}, the set of all objects, and a partitioning clustering C_1, C_2, ..., C_k of the objects, V = C_1 ∪ ... ∪ C_k, define:
cutcost(C_p) = Σ_{v_i ∈ C_p, v_j ∈ V − C_p} sim(v_i, v_j)
intracost(C_p) = Σ_{v_i, v_j ∈ C_p} sim(v_i, v_j), each unordered pair counted once

Cost of a clustering

total relative cut cost(C_1, ..., C_k) = Σ_{p=1}^{k} cutcost(C_p) / intracost(C_p)
The contribution of each cluster is the ratio of its external similarity to its internal similarity.
Optimization: find the clustering C_1, ..., C_k that minimizes total relative cut cost(C_1, ..., C_k). (A sketch computing this cost appears below.)
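A minimal sketch of these definitions, with the similarities given as a matrix; the two-triangle graph in the usage example is illustrative, not the lecture's figure. Note that a singleton cluster has intracost 0, so its ratio is undefined (compare choice 1 in the simple example that follows).

```python
# Sketch of cutcost, intracost, and total relative cut cost.

def cutcost(S, cluster):
    outside = [j for j in range(len(S)) if j not in cluster]
    return sum(S[i][j] for i in cluster for j in outside)

def intracost(S, cluster):
    return sum(S[i][j] for i in cluster for j in cluster if i < j)

def total_relative_cut_cost(S, clustering):
    return sum(cutcost(S, c) / intracost(S, c) for c in clustering)

# Illustrative graph: two triangles {0,1,2} and {3,4,5} joined by edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
S = [[0] * 6 for _ in range(6)]
for i, j in edges:
    S[i][j] = S[j][i] = 1
print(total_relative_cut_cost(S, [{0, 1, 2}, {3, 4, 5}]))  # 1/3 + 1/3 ~ 0.667
```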

Simple example

Six objects, with similarity 1 if an edge is shown and similarity 0 otherwise.
[Figure: three candidate two-way partitions of the six-object graph.]
choice 1: cost = UNDEFINED + 1/4 (a singleton cluster has intracost 0)
choice 2: cost = 1/1 + 1/3 = 4/3
choice 3: cost = 1/2 + 1/2 = 1 (preferred: the measure favors balance)

Hierarchical divisive revisited

One of the cut-based algorithms can be used to split a cluster. How to choose the cluster to split next? If building the entire tree, it doesn't matter. If stopping at a certain point, choose the next cluster based on the measure being optimized; e.g., for total relative cut cost, choose the C_i with the largest cutcost(C_i) / intracost(C_i).

Divisive algorithm: iterative improvement (no hierarchy)

1. Choose an initial partition C_1, ..., C_k.
2. repeat {
     unlock all vertices
     repeat {
       choose some C_i at random
       choose an unlocked vertex v_j in C_i
       move v_j to that cluster, if any, such that the move gives the maximum decrease in cost
       lock vertex v_j
     } until all vertices are locked
   } until converged

Observations on the algorithm

The algorithm is a heuristic and uses randomness. Convergence usually means the improvement between outer-loop iterations falls below some chosen threshold. Vertex locking ensures that all vertices are examined before any vertex is examined twice. There are many variations of the algorithm. It can be used, with k = 2, at each division of a hierarchical divisive algorithm; this is more computation than an agglomerative merge. (A sketch follows.)
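A minimal sketch of the iterative-improvement loop, using total relative cut cost as the cost function; the convergence test (stop when a full round makes no improving move) and all names are illustrative simplifications.

```python
# Sketch of the iterative-improvement divisive algorithm; illustrative.
import random

def cost(S, clustering):
    """Total relative cut cost; infinite when some cluster has intracost 0."""
    total = 0.0
    for c in clustering:
        cut = sum(S[i][j] for i in c for j in range(len(S)) if j not in c)
        intra = sum(S[i][j] for i in c for j in c if i < j)
        if intra == 0:
            return float("inf")            # undefined, as in choice 1 above
        total += cut / intra
    return total

def improve_partition(S, clustering, max_rounds=10):
    for _ in range(max_rounds):
        improved, locked = False, set()
        while len(locked) < len(S):
            c_from = random.choice([c for c in clustering if c - locked])
            v = random.choice(sorted(c_from - locked))
            best_cost, best_to = cost(S, clustering), None
            for c_to in clustering:        # try every destination cluster for v
                if c_to is c_from:
                    continue
                c_from.discard(v); c_to.add(v)
                if cost(S, clustering) < best_cost:
                    best_cost, best_to = cost(S, clustering), c_to
                c_to.discard(v); c_from.add(v)
            if best_to is not None:        # keep the best strictly improving move
                c_from.discard(v); best_to.add(v)
                improved = True
            locked.add(v)                  # examined once before any repeat
        if not improved:
            break                          # converged: a full round changed nothing
    return clustering

# Two triangles joined by one edge, as in the cut-cost sketch; a poor initial
# split is typically repaired to the two triangles, cost 1/3 + 1/3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
S = [[0] * 6 for _ in range(6)]
for i, j in edges:
    S[i][j] = S[j][i] = 1
random.seed(0)
print(improve_partition(S, [{0, 3, 4}, {1, 2, 5}]))
```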

Compare to k-means

Similarities: the number of clusters, k, is chosen in advance; an initial clustering is chosen (possibly at random); iterative improvement is used to improve the clustering.
Important difference: the divisive algorithm can minimize a cut-based cost. Total relative cut cost uses both external and internal measures; k-means maximizes only the similarity within a cluster and ignores the cost of cuts.

Eigenvalues and clustering

There is a general class of techniques for clustering a graph using the eigenvectors of its adjacency matrix (or a similar matrix), called spectral clustering and first described in 1973.

Spectral clustering: brief overview

Given: k, the number of clusters, and an n×n object-object similarity matrix S of non-negative values.
Compute:
1. Derive a matrix L from S (a straightforward computation), e.g. the Laplacian; there are variations in the definition.
2. Find the eigenvectors corresponding to the k smallest eigenvalues of L.
3. Use the eigenvectors to define clusters. There is a variety of ways to do this; all involve another, simpler, clustering, e.g. of points on a line.
Spectral clustering optimizes a cut measure similar to total relative cut cost. (A sketch appears after the wrap-up below.)

Comparing clusterings

Define an external measure that compares two clusterings for similarity. If one clustering is correct and the other was produced by an algorithm, the measure gauges how well the algorithm is doing. Refer to the correct clusters as classes (the gold standard) and to the computed clusters as clusters. An external measure is independent of the cost function optimized by the algorithm.

One measure: motivated by the F-score in IR

Given a set of classes S_1, ..., S_k of the objects (used to define relevance) and a computed clustering C_1, ..., C_k of the objects (used to define retrieval), consider pairs of objects:
a pair in the same class is called similar, and is relevant
a pair in different classes is irrelevant
a pair in the same cluster is retrieved
a pair in different clusters is not retrieved
Use these to define precision and recall.

Clustering F-score

precision of the clustering w.r.t. the gold standard = (# similar pairs in the same cluster) / (# pairs in the same cluster)
recall of the clustering w.r.t. the gold standard = (# similar pairs in the same cluster) / (# similar pairs)
F-score of the clustering w.r.t. the gold standard = 2 · precision · recall / (precision + recall)

Properties of the cluster F-score

Always at most 1: a perfect match of the computed clusters to the classes gives F-score = 1.
Symmetric: for two clusterings {C_i} and {K_j}, neither of them a gold standard, treat {C_i} as if they were classes and compute the F-score of {K_j} w.r.t. {C_i}, written F-score_{C_i}({K_j}); likewise compute F-score_{K_j}({C_i}). Then F-score_{C_i}({K_j}) = F-score_{K_j}({C_i}). (A sketch also follows the wrap-up.)

Clustering: wrap-up

There are many applications, and the application determines the similarity between objects. There is a menu of choices: cost functions to optimize, similarity measures between clusters, types of algorithms (flat versus hierarchical, constructive versus iterative), and algorithms within a type.
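To close, a minimal sketch of the spectral recipe for k = 2, assuming the unnormalized Laplacian L = D - S (one of the variant definitions mentioned above) and, for step 3, the simplest "points on a line" clustering: splitting the second eigenvector at zero.

```python
# Sketch of spectral bipartitioning: Laplacian, eigenvectors, sign split.
import numpy as np

def spectral_bipartition(S):
    S = np.asarray(S, dtype=float)
    L = np.diag(S.sum(axis=1)) - S           # step 1: L = D - S (one variant)
    vals, vecs = np.linalg.eigh(L)           # step 2: eigenvalues, ascending
    fiedler = vecs[:, 1]                     # 2nd smallest; the 1st is trivial
    left = {i for i in range(len(S)) if fiedler[i] < 0}
    return left, set(range(len(S))) - left   # step 3: cluster points on a line

# The two-triangle graph from the earlier sketches splits into its triangles.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
S = [[0] * 6 for _ in range(6)]
for i, j in edges:
    S[i][j] = S[j][i] = 1
print(spectral_bipartition(S))               # ({0, 1, 2}, {3, 4, 5}) or swapped
```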
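And a minimal sketch of the pairwise clustering F-score, with classes and clusters given as label lists (an illustrative encoding; the example labels are made up).

```python
# Sketch of the clustering F-score over pairs of objects.
import itertools

def clustering_f_score(classes, clusters):
    pairs = list(itertools.combinations(range(len(classes)), 2))
    similar   = {p for p in pairs if classes[p[0]]  == classes[p[1]]}   # relevant
    retrieved = {p for p in pairs if clusters[p[0]] == clusters[p[1]]}  # same cluster
    precision = len(similar & retrieved) / len(retrieved)
    recall    = len(similar & retrieved) / len(similar)
    return 2 * precision * recall / (precision + recall)

gold     = [0, 0, 0, 1, 1, 1]                # classes (the gold standard)
computed = [0, 0, 1, 1, 1, 1]                # computed clusters
print(clustering_f_score(gold, computed))    # 8/13: imperfect match, below 1
print(clustering_f_score(computed, gold))    # the same value: symmetry
```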