Clustering Algorithms for general similarity measures

Types of general clustering methods

Clustering algorithms for general similarity measures
- a general similarity measure is specified by an object x object similarity matrix

Constructive algorithms: agglomerative versus divisive construction
- agglomerative = bottom-up: build up clusters from single objects
- divisive = top-down: break up the cluster containing all objects into smaller clusters
- both agglomerative and divisive construction give hierarchies
- a hierarchy can be trivial, adding one object at a time:
    1. (. .) . . .
    2. ((. .) .) . .
    3. (((. .) .) .) .
    4. ((((. .) .) .) .)

Similarity between clusters

Possible definitions:
I.  similarity between the most similar pair of objects, one in each cluster: called single link
II. similarity between the least similar pair of objects, one from each cluster: called complete linkage

Similarity between clusters, cont.

Possible definitions:
III. average of the pairwise similarities between all pairs of objects, one from each cluster: centroid similarity
IV.  average of the pairwise similarities between all pairs of distinct objects, including pairs within the same cluster: group average similarity
Generally there is no representative point for a cluster (compare K-means).
If using Euclidean distance as the metric: centroid, bounding box.

General Agglomerative

Uses any computable cluster similarity measure sim(Ci, Cj).
For n objects v1, ..., vn, assign each to a singleton cluster Ci = {vi}.
repeat {
    identify the two most similar clusters Cj and Ck (there could be ties: choose one pair)
    delete Cj and Ck and add (Cj U Ck) to the set of clusters
} until only one cluster remains
Dendrograms diagram the sequence of cluster merges.

Agglomerative: remarks

Introduction to Information Retrieval (IIR) discusses this in great detail for the cluster similarities single-link, complete-link, group average and centroid.
Uses priority queues to get time complexity O((n^2 log n) * (time to compute cluster similarity))
- one priority queue for each cluster: contains similarities to all other clusters plus bookkeeping info
Time complexity, more precisely:
    O( n^2 * (time to compute object-object similarity)
       + (n^2 log n) * (time to compute sim(Cz, Cj U Ck), knowing sim(Cz, Cj) and sim(Cz, Ck)) )
Problem with the priority queue?
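To make the general agglomerative scheme concrete, here is a minimal Python sketch (mine, not from the lecture) with a pluggable cluster-similarity function; single_link and complete_link correspond to definitions I and II above. It naively recomputes cluster similarities at every merge, so it is only meant for small n, unlike the priority-queue version discussed in the remarks; all function names are illustrative.

```python
import numpy as np

def single_link(sim, A, B):
    # definition I: similarity of the most similar pair, one object from each cluster
    return max(sim[i][j] for i in A for j in B)

def complete_link(sim, A, B):
    # definition II: similarity of the least similar pair, one object from each cluster
    return min(sim[i][j] for i in A for j in B)

def agglomerative(sim, cluster_sim=single_link):
    """sim: n x n object-object similarity matrix.
    Returns the sequence of merges, i.e. the information a dendrogram diagrams."""
    n = len(sim)
    clusters = [frozenset([i]) for i in range(n)]        # start from singleton clusters
    merges = []
    while len(clusters) > 1:
        # identify the two most similar clusters (ties broken arbitrarily by max)
        a, b = max(((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
                   key=lambda pair: cluster_sim(sim, pair[0], pair[1]))
        clusters = [c for c in clusters if c is not a and c is not b] + [a | b]
        merges.append((set(a), set(b)))
    return merges

# toy usage: four objects, higher value = more similar
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.3, 0.1],
              [0.1, 0.3, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
print(agglomerative(S))   # merges {0},{1} and {2},{3} first, then the two pairs
```

For practical use, library routines such as scipy.cluster.hierarchy.linkage implement single, complete and group-average linkage efficiently.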

Example applications in search

Query evaluation: cluster pruning (IIR Section 7.1.6)
- cluster all documents
- choose a representative for each cluster
- evaluate the query w.r.t. the cluster representatives
- evaluate the query for the documents in the cluster(s) having the most similar cluster representative(s)
Results presentation: labeled clusters
- cluster only the query results
- e.g. Yippy.com (metasearch)

Single pass, agglomerative-like

Given an arbitrary order of the objects to cluster, v1, ..., vn, and a threshold τ:
Put v1 in cluster C1 by itself
For i = 2 to n {
    for all existing clusters Cj, calculate sim(vi, Cj); record the cluster most similar to vi as Cmax(i)
    if sim(vi, Cmax(i)) > τ, add vi to Cmax(i)
    else create the new cluster {vi}
}
ISSUES? hard / soft? flat / hierarchical? (a sketch follows after these slides)

Issues
- vi is put in a cluster after seeing only v1, ..., v(i-1)
- not hierarchical
- tends to produce large clusters
- depends on τ
- depends on the order of the vi

Alternate perspective for the single-link algorithm

Build a minimum spanning tree (MST): a graph algorithm
- edge weights are the pair-wise similarities
- since we work in terms of similarities, not distances, we really want a maximum spanning tree
For some threshold τ, remove all edges of similarity < τ
- the tree falls into pieces => clusters
Not hierarchical, but we get a hierarchy for a sequence of τ values.
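A minimal sketch of the single-pass procedure above. The lecture leaves sim(vi, Cj) unspecified; here I assume it is the average similarity of vi to the members of Cj, and both the threshold value and the function name are illustrative.

```python
import numpy as np

def single_pass(sim, tau):
    """sim: n x n object-object similarity matrix; tau: similarity threshold.
    Objects are processed in index order, so the result depends on that order."""
    n = len(sim)
    clusters = [[0]]                                   # put v_0 in a cluster by itself
    for i in range(1, n):
        # sim(v_i, C_j): here, the average similarity of v_i to the members of C_j
        scores = [np.mean([sim[i][j] for j in c]) for c in clusters]
        best = int(np.argmax(scores))                  # most similar existing cluster
        if scores[best] > tau:
            clusters[best].append(i)                   # add v_i to that cluster
        else:
            clusters.append([i])                       # otherwise start a new cluster
    return clusters

S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.3, 0.1],
              [0.1, 0.3, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
print(single_pass(S, tau=0.5))                         # [[0, 1], [2, 3]]
```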

Hierarchical Divisive: Template

1. Put all objects in one cluster
2. Repeat until all clusters are singletons
   a) choose a cluster to split (what criterion?)
   b) replace the chosen cluster with its sub-clusters (split into how many? how to split?)
Reversing agglomerative => split in two.
For the cutting operation, cut-based measures seem a natural choice:
- focus on the similarity across the cut, i.e. the similarity that is lost
- it is not necessary to use a cut-based measure

An Example  [figure: a small graph to be split]
An Example: 1st cut  [figure]
An Example: result of 1st cut  [figure]
An Example: 2nd cut  [figure]
An Example: stop at 3 clusters  [figure]
Compare with the k-means result  [figure]

Cut-based optimization

Focus on weak connections between objects in different clusters, rather than strong connections between objects within a cluster.
There are many cut-based measures; we will look at two.

Inter / intra cluster costs

Given:
- V = {v1, ..., vn}, the set of all objects
- a partitioning clustering C1, C2, ..., Ck of the objects: V = C1 U ... U Ck
Define:
    cutcost(Cp)   = sum of sim(vi, vj) over vi in Cp and vj in V - Cp
    intracost(Cp) = sum of sim(vi, vj) over pairs (vi, vj) in Cp
Cost of a clustering:
    total relative cut cost(C1, ..., Ck) = sum over p = 1, ..., k of cutcost(Cp) / intracost(Cp)
The contribution of each cluster is the ratio of its external similarity to its internal similarity.

Optimization

Find the clustering C1, ..., Ck that minimizes total relative cut cost(C1, ..., Ck).

Simple example

Six objects; similarity 1 if an edge is shown, similarity 0 otherwise.  [figure: three candidate two-way partitions of the graph]
- choice 1: cost = UNDEFINED + 1/4
- choice 2: cost = 1/1 + 1/3 = 4/3
- choice 3: cost = 1/2 + 1/2 = 1   (*prefer balance)

Second cut-based measure: conductance

Model the data as a graph, with similarities as edge weights.
Define: s_degree(Cp) = cutcost(Cp) + 2*intracost(Cp)
- s_degree is the sum, over all vertices in Cp, of the weights of the edges touching that vertex
    conductance(Cp) = cutcost(Cp) / min{ s_degree(Cp), s_degree(V - Cp) }

Optimization using conductance

Choices:
- minimize the sum over p = 1, ..., k of conductance(Cp)
- minimize the maximum over p = 1, ..., k of conductance(Cp)
Observations:
- conductance(Cp) = conductance(V - Cp)
- finding a cut (C, V - C) with minimum conductance is NP-hard

Simple example

Six objects; similarity 1 if an edge is shown, similarity 0 otherwise.  [figure: the same three partitions]
- choice 1: conductance = 1/min(1, 9) = 1
- choice 2: conductance = 1/min(3, 7) = 1/3
- choice 3: conductance = 1/min(5, 5) = 1/5   (*prefer balance)
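Both cost measures can be computed directly from the similarity matrix. The six-object graph in the slides' figures is not reproduced in the transcript, so the example below uses a graph of my own (two triangles joined by a single edge); only the definitions of cutcost, intracost, total relative cut cost and conductance follow the slides.

```python
import numpy as np

def cutcost(sim, C, V):
    """Sum of similarities sim(vi, vj) with vi in C and vj in V - C."""
    return sum(sim[i][j] for i in C for j in V - C)

def intracost(sim, C):
    """Sum of similarities over unordered pairs of objects inside C."""
    return sum(sim[i][j] for i in C for j in C if i < j)

def total_relative_cut_cost(sim, clusters):
    V = set().union(*clusters)
    return sum(cutcost(sim, C, V) / intracost(sim, C) for C in clusters)

def conductance(sim, C, V):
    cut = cutcost(sim, C, V)
    # s_degree(C) = cutcost(C) + 2 * intracost(C): total edge weight touching C's vertices
    s_deg_C = cut + 2 * intracost(sim, C)
    s_deg_rest = cut + 2 * intracost(sim, V - C)
    return cut / min(s_deg_C, s_deg_rest)

# hypothetical six-object graph: two triangles {0,1,2} and {3,4,5} joined by edge (2,3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
S = np.zeros((6, 6))
for i, j in edges:
    S[i][j] = S[j][i] = 1.0

V = set(range(6))
balanced = [{0, 1, 2}, {3, 4, 5}]
print(total_relative_cut_cost(S, balanced))   # 1/3 + 1/3
print(conductance(S, {0, 1, 2}, V))           # 1 / min(7, 7) = 1/7
```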

Hierarchical divisive revisited

- can use one of the cut-based algorithms to split a cluster
- how to choose which cluster to split next?
  - if building the entire tree, it doesn't matter
  - if stopping at a certain point, choose the next cluster based on the measure being optimized,
    e.g. for total relative cut cost, choose the Ci with the largest cutcost(Ci) / intracost(Ci)

Divisive algorithm: iterative improvement; no hierarchy

1. Choose an initial partition C1, ..., Ck
2. repeat {
       unlock all vertices
       repeat {
           choose some Ci at random
           choose an unlocked vertex vj in Ci
           move vj to that cluster, if any, such that the move gives the maximum decrease in cost
           lock vertex vj
       } until all vertices are locked
   } until convergence
(a sketch follows after these slides)

Observations on the algorithm

- a heuristic; uses randomness
- convergence is usually taken to mean "improvement < some chosen threshold" between outer-loop iterations
- vertex locking ensures that all vertices are examined before any vertex is examined twice
- there are many variations of this algorithm
- it can be used, with k = 2, at each division of the hierarchical divisive algorithm; this is more computation than an agglomerative merge

Compare to k-means

Similarities:
- the number of clusters, k, is chosen in advance
- an initial clustering is chosen (possibly at random)
- iterative improvement is used to improve the clustering
Important difference:
- the divisive algorithm can minimize a cut-based cost (total relative cut cost, conductance), which uses external and internal measures
- k-means maximizes only the similarity within a cluster and ignores the cost of the cuts
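A sketch of the iterative-improvement loop above, using total relative cut cost as the cost to decrease. The vertex-locking structure follows the slide; the convergence test (a fixed iteration budget here), the guard against emptying a cluster, and the function names are my own choices.

```python
import random
import numpy as np

def trcc(sim, clusters):
    """Total relative cut cost of a partition given as a list of sets of vertex ids."""
    V = set().union(*clusters)
    total = 0.0
    for C in clusters:
        cut = sum(sim[i][j] for i in C for j in V - C)
        intra = sum(sim[i][j] for i in C for j in C if i < j)
        total += cut / intra if intra > 0 else float('inf')
    return total

def iterative_improvement(sim, clusters, outer_iters=10):
    clusters = [set(C) for C in clusters]
    cost = trcc(sim, clusters)
    for _ in range(outer_iters):                       # outer loop (fixed budget here)
        locked = set()
        n = sum(len(C) for C in clusters)
        while len(locked) < n:                         # inner loop: until all vertices locked
            Ci = random.choice([C for C in clusters if C - locked])
            v = random.choice(sorted(Ci - locked))
            best_move, best_cost = None, cost
            for Cj in clusters:                        # try moving v to every other cluster
                if Cj is Ci or len(Ci) == 1:           # don't empty a cluster
                    continue
                Ci.remove(v); Cj.add(v)
                trial = trcc(sim, clusters)
                Cj.remove(v); Ci.add(v)                # undo the trial move
                if trial < best_cost:
                    best_move, best_cost = Cj, trial
            if best_move is not None:                  # keep the best cost-decreasing move
                Ci.remove(v); best_move.add(v)
                cost = best_cost
            locked.add(v)                              # lock v for this pass
    return clusters, cost

# usage on the two-triangle graph from the earlier sketch, starting from a poor split
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
S = np.zeros((6, 6))
for i, j in edges:
    S[i][j] = S[j][i] = 1.0
print(iterative_improvement(S, [{0, 1, 3}, {2, 4, 5}]))
```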

Eigenvalues and clustering

A general class of techniques clusters a graph using eigenvectors of its adjacency matrix (or a similar matrix); these techniques are called spectral clustering and were first described in 1973.
The spectrum of a graph is the list of eigenvalues, with multiplicity, of its adjacency matrix.

Spectral clustering: brief overview

Given:
- k: the number of clusters
- an n x n object-object similarity matrix S of non-negative values
Compute:
1. Derive a matrix L from S (a straightforward computation); there is a variety of definitions of L, e.g. the Laplacian L = I - E if similarity is edge / no edge
2. Find the eigenvectors corresponding to the k smallest eigenvalues of L
3. Use the eigenvectors to define clusters; there is a variety of ways to do this, all involving another, simpler clustering, e.g. of points on a line
Spectral clustering optimizes a cut measure similar to total relative cut cost.
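A minimal spectral-clustering sketch along the lines of the overview above. It uses the unnormalized Laplacian L = D - S (one of several possible definitions of L, not necessarily the one the slide intends) and k-means on the rows of the eigenvector matrix as the "simpler clustering" in step 3; the function name is mine.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clusters(S, k):
    """S: n x n symmetric, non-negative similarity matrix; k: number of clusters."""
    D = np.diag(S.sum(axis=1))
    L = D - S                                  # unnormalized graph Laplacian (one choice of L)
    eigvals, eigvecs = np.linalg.eigh(L)       # eigh returns eigenvalues in ascending order
    X = eigvecs[:, :k]                         # eigenvectors of the k smallest eigenvalues
    # step 3: each object is now a point in R^k; cluster these points with a simpler method
    _, labels = kmeans2(X, k, minit='++')
    return labels

# usage on the two-triangle graph: the two triangles should be separated
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
S = np.zeros((6, 6))
for i, j in edges:
    S[i][j] = S[j][i] = 1.0
print(spectral_clusters(S, 2))                 # e.g. [0 0 0 1 1 1] (label names arbitrary)
```

For k = 2 the final step can be even simpler: split the objects according to the sign of the entries of the eigenvector for the second-smallest eigenvalue.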
Comparing clusterings

Define an external measure that compares two clusterings as to similarity.
- if one clustering is correct and one is produced by an algorithm, this measures how well the algorithm is doing
- refer to the correct clusters as classes: the gold standard
- refer to the computed clusters as clusters
An external measure is independent of the cost function optimized by the algorithm.

One measure: motivated by the F-score in IR

Given:
- a set of classes S1, ..., Sk of the objects: used to define relevance
- a computed clustering C1, ..., Ck of the objects: used to define retrieval
Consider pairs of objects:
- a pair in the same class: call it similar; the pair is "relevant"
- a pair in different classes: "irrelevant"
- a pair in the same cluster: "retrieved"
- a pair in different clusters: "not retrieved"
Use these to define precision and recall.

Clustering F-score

precision of the clustering w.r.t. the gold standard
    = (# similar pairs in the same cluster) / (# pairs in the same cluster)
recall of the clustering w.r.t. the gold standard
    = (# similar pairs in the same cluster) / (# similar pairs)
F-score of the clustering w.r.t. the gold standard
    = 2 * precision * recall / (precision + recall)

Properties of the clustering F-score

- always <= 1
- a perfect match of the computed clusters to the classes gives F-score = 1
- symmetric: for two clusterings {Ci} and {Kj}, neither of which is a gold standard,
  treat {Ci} as if they were classes and compute the F-score of {Kj} w.r.t. {Ci}: F-score_{Ci}({Kj})
  treat {Kj} as if they were classes and compute the F-score of {Ci} w.r.t. {Kj}: F-score_{Kj}({Ci})
  then F-score_{Ci}({Kj}) = F-score_{Kj}({Ci})

Another related external measure: the Rand index

Rand index = (# similar pairs in the same cluster + # dissimilar pairs in different clusters) / (N(N-1)/2)
i.e. the percentage of pairs that are handled correctly.

Clustering: wrap-up

- many applications; the application determines the similarity between objects
- a menu of cost functions to optimize and of similarity measures between clusters
- types of algorithms: flat / hierarchical, constructive / iterative, and many algorithms within each type
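A sketch of the pair-counting measures above. Here classes is the gold standard and clusters is the computed clustering, both given as label lists over the same N objects; the function names are mine.

```python
from itertools import combinations

def pair_counts(classes, clusters):
    """Count object pairs by (same class?, same cluster?)."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]        # "similar" pair (relevant)
        same_cluster = clusters[i] == clusters[j]    # pair "retrieved"
        if same_class and same_cluster:
            ss += 1
        elif same_class:
            sd += 1
        elif same_cluster:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def clustering_fscore(classes, clusters):
    ss, sd, ds, dd = pair_counts(classes, clusters)
    precision = ss / (ss + ds)     # similar pairs in same cluster / pairs in same cluster
    recall = ss / (ss + sd)        # similar pairs in same cluster / similar pairs
    return 2 * precision * recall / (precision + recall)

def rand_index(classes, clusters):
    ss, sd, ds, dd = pair_counts(classes, clusters)
    return (ss + dd) / (ss + sd + ds + dd)           # fraction of pairs handled correctly

gold     = [0, 0, 0, 1, 1, 1]    # classes (gold standard), as labels per object
computed = [0, 0, 1, 1, 1, 1]    # computed clustering, as labels per object
print(clustering_fscore(gold, computed), rand_index(gold, computed))
```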