INF4820, Algorithms for AI and NLP: Hierarchical Clustering

Erik Velldal
University of Oslo
Sept. 25, 2012

Agenda

Topics we covered last week:
Evaluating classifiers: accuracy, precision, recall and F-score.
Unsupervised machine learning for class discovery: clustering.
Flat vs. hierarchical clustering.
Example of flat / partitional clustering: k-means clustering.
Clustering as an optimization problem, minimizing an objective cost function.

Topics for today:
More on clustering.
Agglomerative clustering: bottom-up hierarchical clustering.
How to measure inter-cluster similarity (linkage criteria).
Divisive clustering.

Agglomerative clustering

Initially: regard each object as its own singleton cluster.
Iteratively agglomerate (merge) the groups in a bottom-up fashion.
Each merge defines a binary branch in the tree.
Terminate when only one cluster remains (the root).

    parameters: {o_1, o_2, ..., o_n}, sim
    C = {{o_1}, {o_2}, ..., {o_n}}
    T = []
    for i = 1 to n - 1 do
        {c_j, c_k} ← argmax over {c_j, c_k} ⊆ C, j ≠ k of sim(c_j, c_k)
        C ← C \ {c_j, c_k}
        C ← C ∪ {c_j ∪ c_k}
        T[i] ← {c_j, c_k}

At each stage, we merge the pair of clusters that are most similar, as defined by some measure of inter-cluster similarity, sim. Plugging in a different sim gives us a different sequence of merges T.
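A minimal Python sketch of this loop (an illustration, not the course's reference code), assuming objects are vectors and that sim is a pluggable linkage function over two clusters, such as the linkage functions sketched under the slides that follow. It is deliberately naive, roughly O(n^3):

    def agglomerative(objects, sim):
        """Naive bottom-up clustering. `objects` is a list of vectors and
        sim(c_i, c_j) scores the similarity of two clusters (lists of vectors)."""
        clusters = [[o] for o in objects]           # start from singleton clusters
        merges = []                                 # the merge sequence T
        while len(clusters) > 1:
            # find the most similar pair of clusters under the given linkage
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    s = sim(clusters[i], clusters[j])
                    if best is None or s > best[0]:
                        best = (s, i, j)
            s, i, j = best
            merges.append((clusters[i], clusters[j], s))
            merged = clusters[i] + clusters[j]      # replace the pair by their union
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
        return merges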

Dendrograms

A hierarchical clustering is often visualized as a binary tree structure known as a dendrogram. A merge is shown as a horizontal line connecting two clusters. The y-axis coordinate of the line corresponds to the similarity of the merged clusters. We here assume dot-products of normalized vectors (self-similarity = 1).
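As an illustration (not from the slides), SciPy can build and draw such a dendrogram directly. The sketch below assumes row-normalized toy vectors, so that cosine distance equals 1 minus the dot-product similarity used above:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram
    import matplotlib.pyplot as plt

    # toy data: 5 objects with 3 features, rows normalized to unit length
    X = np.random.rand(5, 3)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)

    # SciPy works with distances; for unit vectors, cosine distance = 1 - dot product
    Z = linkage(X, method='average', metric='cosine')

    dendrogram(Z, labels=['o1', 'o2', 'o3', 'o4', 'o5'])
    plt.ylabel('cosine distance at merge')
    plt.show()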

Definitions of inter-cluster similarity

So far we've looked at ways to define the similarity between pairs of objects, and between objects and a class. Now we'll look at ways to define the similarity between collections of objects. In agglomerative clustering, a measure of cluster similarity sim(c_i, c_j) is usually referred to as a linkage criterion:

Single-linkage
Complete-linkage
Centroid-linkage
Average-linkage

The linkage criterion determines which pair of clusters we merge into a new cluster in each step.

Single-linkage

Merge the two clusters with the minimum distance between any two members ("nearest neighbors"). Can be computed efficiently by taking advantage of the fact that it is best-merge persistent: let the nearest neighbor of cluster c_k be in either c_i or c_j; if we merge c_i ∪ c_j = c_l, the nearest neighbor of c_k will be in c_l. The distance of the two closest members is a local property that is not affected by merging. Undesirable chaining effect: a tendency to produce stretched and straggly clusters.
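A direct (if inefficient) way to plug single-linkage into the sketch above, assuming clusters are lists of unit-length NumPy vectors and similarity is the dot product:

    import numpy as np

    def single_link_sim(ci, cj):
        """Similarity of the two closest members: the maximum pairwise dot product."""
        return max(float(np.dot(x, y)) for x in ci for y in cj)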

Complete-linkage

Merge the two clusters where the maximum distance between any two members is smallest ("farthest neighbors"). Amounts to merging the two clusters whose merger has the smallest diameter. Preference for compact clusters with small diameters, but sensitive to outliers. Not best-merge persistent: distance defined as the diameter of a merge is a non-local property that can change during merging.
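Complete-linkage in the same illustrative style, scoring a pair of clusters by their two most distant members, so that the merge with the smallest diameter wins:

    import numpy as np

    def complete_link_sim(ci, cj):
        """Similarity of the two most distant members: the minimum pairwise dot product."""
        return min(float(np.dot(x, y)) for x in ci for y in cj)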

Centroid-linkage

The similarity of two clusters c_i and c_j is defined as the similarity between their cluster centroids µ_i and µ_j (the mean vectors). Equivalent to the average pairwise similarity between objects from different clusters:

    sim(c_i, c_j) = µ_i · µ_j = (1 / (|c_i| |c_j|)) Σ_{x ∈ c_i} Σ_{y ∈ c_j} x · y

Like complete-link, not best-merge persistent. Not monotonic, subject to inversions: the combination similarity can increase during the clustering.
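A sketch of centroid-linkage under the same assumptions (clusters as lists of unit-length NumPy vectors, dot-product similarity):

    import numpy as np

    def centroid_link_sim(ci, cj):
        """Dot product of the two cluster centroids (mean vectors)."""
        mu_i = np.mean(ci, axis=0)
        mu_j = np.mean(cj, axis=0)
        return float(np.dot(mu_i, mu_j))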

Inversions: a problem with centroid-linkage

We usually assume that a clustering is monotonic, i.e. that the combination similarity is guaranteed to decrease between iterations. For a non-monotonic clustering criterion (e.g. centroid-linkage), inversions show up in the dendrogram as crossing lines: a horizontal merge bar ends up lower than the bar of a previous merge. This violates the fundamental assumption that small clusters are more coherent than large clusters.
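Given the merge sequence returned by the agglomerative sketch earlier (a list of (cluster, cluster, similarity) triples), an inversion is simply a merge whose similarity is higher than that of an earlier merge. A hypothetical check:

    def has_inversion(merges):
        """True if the combination similarity ever increases between iterations."""
        sims = [s for _, _, s in merges]
        return any(later > earlier for earlier, later in zip(sims, sims[1:]))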

Average-linkage (1:2)

AKA group-average agglomerative clustering. Merge the two clusters with the highest average pairwise similarity in their union. Aims to maximize the coherence of the merged cluster by considering all pairwise similarities between the objects in the clusters. Let c_i ∪ c_j = c_k, and define sim(c_i, c_j) = W(c_i ∪ c_j) = W(c_k):

    W(c_k) = (1 / (|c_k| (|c_k| − 1))) Σ_{x ∈ c_k} Σ_{y ∈ c_k, y ≠ x} x · y

Self-similarities are excluded in order not to penalize large clusters (which have a lower ratio of self-similarities).
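A direct transcription of this definition into the same Python setting (clusters as lists of unit-length NumPy vectors); the next slide shows a much cheaper equivalent:

    import numpy as np

    def group_average_sim_naive(ci, cj):
        """Average pairwise similarity within c_i ∪ c_j, self-pairs excluded."""
        ck = list(ci) + list(cj)
        n = len(ck)
        total = sum(float(np.dot(x, y))
                    for a, x in enumerate(ck)
                    for b, y in enumerate(ck) if a != b)
        return total / (n * (n - 1))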

Average-linkage (2:2)

Monotonic but not best-merge persistent. A compromise between complete- and single-linkage. Commonly considered the best default linkage criterion for agglomerative clustering. Can be computed very efficiently if we assume (i) the dot-product for computing similarity and (ii) normalized feature vectors:

    W(c_k) = (1 / (|c_k| (|c_k| − 1))) ( ‖ Σ_{x ∈ c_k} x ‖² − |c_k| )

That is, the sum of all pairwise dot-products equals the squared norm of the cluster's sum vector, and subtracting |c_k| removes the self-similarities (each equal to 1).
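A sketch of this trick under the same assumptions (unit-length NumPy vectors): keeping each cluster's sum vector lets us score a candidate merge from a single dot product instead of summing over all pairs:

    import numpy as np

    def group_average_sim(ci, cj):
        """W(c_i ∪ c_j) computed from sum vectors, assuming unit-length rows."""
        s = np.sum(ci, axis=0) + np.sum(cj, axis=0)   # sum vector of the union
        n = len(ci) + len(cj)                          # size of the union
        return float(np.dot(s, s) - n) / (n * (n - 1))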

Cutting the tree

Hierarchical methods actually produce several partitions, one for each level of the tree. However, for many applications we will want to extract a set of disjoint clusters. In order to turn the nested partitions into a single flat partitioning, we cut the dendrogram. A cutting criterion can be defined as a threshold on e.g. combination similarity, relative drop in the similarity, number of root nodes, etc.
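As one possible illustration (not the slides' own tooling), SciPy's fcluster performs such a cut on a linkage matrix. Here the threshold is a cosine distance of 0.2, which for normalized vectors corresponds to a combination similarity of 0.8:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 5)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize rows
    Z = linkage(X, method='average', metric='cosine')

    # cut where the cosine distance of a merge exceeds 0.2,
    # i.e. where the combination similarity drops below 0.8
    flat_labels = fcluster(Z, t=0.2, criterion='distance')

    # alternatively, cut so that exactly 3 clusters (root nodes) remain
    three_clusters = fcluster(Z, t=3, criterion='maxclust')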

Divisive hierarchical clustering

Generates the nested partitions top-down:
Start by considering all objects part of the same cluster (the root).
Split the cluster using a flat clustering algorithm (e.g. by applying k-means with k = 2).
Recursively split the clusters until only singleton clusters remain (or some specified number of levels is reached).

Flat methods such as k-means are generally very efficient (linear in the number of objects). Divisive methods are thereby also generally more efficient than agglomerative methods, which are at least quadratic (single-link). Divisive clustering also has the advantage of initially considering the global distribution of the data, while agglomerative methods must commit to early decisions based on local patterns.
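A minimal sketch of this top-down scheme, using scikit-learn's KMeans as the flat splitter (an illustrative choice, not specified by the slides); it recursively bisects until singleton clusters remain:

    import numpy as np
    from sklearn.cluster import KMeans

    def divisive(X, indices=None):
        """Recursively bisect the rows of X with 2-means; returns a nested
        tuple of index lists mirroring the top-down tree."""
        if indices is None:
            indices = np.arange(len(X))
        if len(indices) <= 1:                      # singleton: stop splitting
            return list(indices)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[indices])
        left = indices[labels == 0]
        right = indices[labels == 1]
        if len(left) == 0 or len(right) == 0:      # degenerate split: stop
            return list(indices)
        return (divisive(X, left), divisive(X, right))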

The proximity matrix

For a given set of objects {o_1, ..., o_m}, the proximity matrix R is a square m × m matrix where an element R_ij gives the proximity (or distance) between o_i and o_j. Alternatively known as a distance matrix. For our word space, R_ij would give the dot-product of the normalized feature vectors x_i and x_j representing the words o_i and o_j. Note that if our similarity measure is symmetric, i.e. sim(x, y) = sim(y, x), then R will also be symmetric, i.e. R_ij = R_ji. Computing all the pairwise similarities once and then storing them in R can save time in many applications, and R provides the input to many clustering methods. By sorting the row elements of R, we get access to an important type of similarity relation: nearest neighbors.
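For the dot-product case the whole matrix is a single matrix product; a sketch assuming the word vectors are the rows of a NumPy array:

    import numpy as np

    def proximity_matrix(X):
        """R[i, j] = dot product of normalized rows i and j of X."""
        X = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize rows
        return X @ X.T                                      # all pairwise similarities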

Neighbor relations in the proximity matrix

Examples of neighbor relations extracted from the word space created for the 3rd assignment (20,000 sentences from Brown, sentence-level BoW, no stemming, dot-product similarity on normalized vectors).

kNN: k nearest neighbors. The k objects closest to a given target in the space.
RNN: Reciprocal nearest neighbors. Each object is the other's nearest neighbor (or within some specified k for kNN).

10-NNs for "salt":

     k   Word      Sim.
     1   pepper    0.56
     2   mustard   0.48
     3   sauce     0.47
     4   butter    0.42
     5   water     0.37
     6   eggs      0.35
     7   rice      0.24
     8   milk      0.20
     9   center    0.19
    10   coffee    0.19

Examples of RNNs:

    salt ↔ pepper (0.56)
    college ↔ school (0.68)
    russia ↔ china (0.51)
    john ↔ james (0.70)
    edward ↔ robert (0.74)
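A sketch of extracting these relations from a proximity matrix R as computed above; the vocabulary list and function names are placeholders for illustration:

    import numpy as np

    def k_nearest(R, words, target, k=10):
        """Return the k most similar words to `target`, excluding itself."""
        i = words.index(target)
        order = np.argsort(-R[i])                       # most similar first
        return [(words[j], float(R[i, j])) for j in order if j != i][:k]

    def reciprocal_nn(R, words):
        """Pairs of words that are each other's single nearest neighbor."""
        R = R.copy()
        np.fill_diagonal(R, -np.inf)                    # ignore self-similarity
        nn = np.argmax(R, axis=1)                       # nearest neighbor of each word
        return [(words[i], words[j], float(R[i, j]))
                for i, j in enumerate(nn) if nn[j] == i and i < j]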

Looking back, looking ahead

So far we've only modeled pointwise observations, or single objects. In our case, the objects happened to be different word types. We've looked at how to model relations between the objects and how to predict class labels for them. Next we will start to deal with sequences of observations. In our case, the sequences will be strings of characters or words, but abstractly they could be anything: gene or protein sequences, acoustic signals (e.g. speech), gestures, stock transactions, etc. We will look at, e.g., how to validate and compute the likelihood of different sequences of observations, and how to predict class labels for them (sequence classification), using tools such as regular expressions, finite state automata, hidden Markov models and Viterbi decoding.