INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012
Agenda
Topics we covered last week:
- Evaluating classifiers: accuracy, precision, recall and F-score.
- Unsupervised machine learning for class discovery: clustering.
- Flat vs. hierarchical clustering.
- Example of flat / partitional clustering: k-means clustering.
- Clustering as an optimization problem, minimizing an objective cost function.
Topics for today: more on clustering
- Agglomerative clustering: bottom-up hierarchical clustering.
- How to measure inter-cluster similarity (linkage criteria).
- Divisive clustering.
Agglomerative clustering
- Initially, regard each object as its own singleton cluster.
- Iteratively agglomerate (merge) the groups in a bottom-up fashion.
- Each merge defines a binary branch in the tree.
- Terminate when only one cluster remains (the root).

Parameters: objects {o_1, o_2, ..., o_n} and a similarity measure sim.

C <- {{o_1}, {o_2}, ..., {o_n}}
T <- []
for i = 1 to n-1 do
    {c_j, c_k} <- argmax over pairs {c_j, c_k} in C, j != k, of sim(c_j, c_k)
    C <- C \ {c_j, c_k}
    C <- C u {c_j u c_k}
    T[i] <- {c_j, c_k}

At each step we merge the pair of clusters that are most similar, as defined by some measure of inter-cluster similarity sim. Plugging in a different sim gives us a different sequence of merges T.
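For concreteness, here is a minimal Python sketch of the pseudocode above. The function names (agglomerate, sim) and the choice of representing clusters as tuples of objects are illustrative assumptions, not part of the original algorithm description:

from itertools import combinations

def agglomerate(objects, sim):
    # Bottom-up clustering: repeatedly merge the two most similar clusters.
    # objects: a list of data points; sim: a function scoring two clusters.
    # Returns T, the sequence of merges (one per iteration).
    C = [(o,) for o in objects]        # start from singleton clusters
    T = []
    while len(C) > 1:
        # pick the pair of distinct clusters with maximal similarity
        ci, cj = max(combinations(C, 2), key=lambda pair: sim(*pair))
        C.remove(ci)
        C.remove(cj)
        C.append(ci + cj)              # add the merged cluster
        T.append((ci, cj))
    return T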
Dendrograms
A hierarchical clustering is often visualized as a binary tree structure known as a dendrogram. A merge is shown as a horizontal line connecting two clusters. The y-axis coordinate of the line corresponds to the similarity of the merged clusters. We here assume dot-products of normalized vectors (self-similarity = 1).
Definitions of inter-cluster similarity
So far we've looked at ways to define the similarity between pairs of objects, and between objects and a class. Now we'll look at ways to define the similarity between collections of objects. In agglomerative clustering, a measure of cluster similarity sim(c_i, c_j) is usually referred to as a linkage criterion:
- Single-linkage
- Complete-linkage
- Centroid-linkage
- Average-linkage
The linkage criterion determines which pair of clusters we merge into a new cluster in each step.
Single-linkage
Merge the two clusters with the minimum distance between any two members (nearest neighbors). Can be computed efficiently by taking advantage of the fact that it is best-merge persistent: let the nearest neighbor of cluster c_k be in either c_i or c_j; if we merge c_i u c_j = c_l, the nearest neighbor of c_k will be in c_l. The distance of the two closest members is a local property that is not affected by merging. Undesirable chaining effect: a tendency to produce stretched and straggly clusters.
Complete-linkage
Merge the two clusters where the maximum distance between any two members is smallest (farthest neighbors). Amounts to merging the two clusters whose merger has the smallest diameter. Gives a preference for compact clusters with small diameters, but is sensitive to outliers. Not best-merge persistent: the distance defined as the diameter of a merge is a non-local property that can change during merging.
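Single- and complete-linkage can be plugged into the agglomerate sketch above as the sim argument. The following is an illustration under the assumption that the objects are L2-normalized NumPy vectors and similarity is the dot-product, so single-linkage takes the maximum similarity over cross-cluster pairs and complete-linkage the minimum:

import numpy as np

def single_link(ci, cj):
    # similarity of the closest pair of members across the two clusters
    return max(np.dot(x, y) for x in ci for y in cj)

def complete_link(ci, cj):
    # similarity of the most distant pair, i.e. the diameter of the merge
    return min(np.dot(x, y) for x in ci for y in cj)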
Centroid-linkage
The similarity of two clusters c_i and c_j is defined as the similarity between their cluster centroids µ_i and µ_j (the mean vectors). Equivalent to the average pairwise similarity between objects from different clusters:

sim(c_i, c_j) = \mu_i \cdot \mu_j = \frac{1}{|c_i|\,|c_j|} \sum_{x \in c_i} \sum_{y \in c_j} x \cdot y

Like complete-linkage, not best-merge persistent. Not monotonic, and therefore subject to inversions: the combination similarity can increase during the clustering.
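The equivalence between the centroid similarity and the average cross-cluster pairwise similarity is easy to check numerically. A small illustration, again assuming dot-product similarity and clusters given as sequences of NumPy vectors (the function names are illustrative):

import numpy as np

def centroid_link(ci, cj):
    # dot product of the two cluster centroids (mean vectors)
    return np.dot(np.mean(ci, axis=0), np.mean(cj, axis=0))

def average_cross_similarity(ci, cj):
    # average pairwise similarity between members of different clusters;
    # for dot-product similarity this equals centroid_link(ci, cj)
    return np.mean([np.dot(x, y) for x in ci for y in cj])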
Inversions: a problem with centroid-linkage
We usually assume that a clustering is monotonic, i.e. that the combination similarity is guaranteed to decrease between iterations. For a non-monotonic clustering criterion (e.g. centroid-linkage), inversions show up in the dendrogram as crossing lines: the horizontal merge bar is lower than the bar of a previous merge. This violates the fundamental assumption that small clusters are more coherent than large clusters.
Average-linkage (1:2)
AKA group-average agglomerative clustering. Merge the two clusters with the highest average pairwise similarity in their union. Aims to maximize the coherence of the merged cluster by considering all pairwise similarities between the objects in the clusters. Let c_i u c_j = c_k, and sim(c_i, c_j) = W(c_i u c_j) = W(c_k):

W(c_k) = \frac{1}{|c_k|(|c_k|-1)} \sum_{x \in c_k} \sum_{\substack{y \in c_k \\ y \neq x}} x \cdot y

Self-similarities are excluded in order to not penalize large clusters (which have a lower ratio of self-similarities).
Average-linkage (2:2)
Monotonic, but not best-merge persistent. A compromise between complete- and single-linkage. Commonly considered the best default linkage criterion for agglomerative clustering. Can be computed very efficiently if we assume (i) the dot-product for computing the similarity of (ii) normalized feature vectors:

W(c_k) = \frac{1}{|c_k|(|c_k|-1)} \Bigl( \Bigl\| \sum_{x \in c_k} x \Bigr\|^2 - |c_k| \Bigr)
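A minimal sketch of this shortcut, assuming each member of c_k is an L2-normalized NumPy vector (so every self-similarity x · x equals 1, which is what the subtracted |c_k| term accounts for):

import numpy as np

def group_average(ck):
    # W(c_k): average pairwise similarity within c_k, self-similarities excluded.
    # With normalized vectors only the sum vector s is needed, not all pairs.
    n = len(ck)
    s = np.sum(ck, axis=0)
    return (np.dot(s, s) - n) / (n * (n - 1))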
Cutting the tree
Hierarchical methods actually produce several partitions, one for each level of the tree. However, for many applications we will want to extract a set of disjoint clusters. In order to turn the nested partitions into a single flat partitioning, we cut the dendrogram. A cutting criterion can be defined as a threshold on e.g. the combination similarity, the relative drop in similarity, the number of root nodes (i.e. resulting clusters), etc.
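For reference, off-the-shelf libraries follow the same build-then-cut pattern. A hedged sketch using SciPy, assuming X is a matrix of feature vectors with one row per object; note that SciPy works with distances rather than similarities, and the threshold 0.5 is an arbitrary illustration:

from scipy.cluster.hierarchy import linkage, fcluster

# Z encodes the full sequence of merges, i.e. the dendrogram.
Z = linkage(X, method='average', metric='cosine')

# Cut the dendrogram at a distance threshold; the result is a flat
# partitioning given as an array of cluster labels, one per object.
labels = fcluster(Z, t=0.5, criterion='distance')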
Divisive hierarchical clustering
Generates the nested partitions top-down:
- Start by considering all objects part of the same cluster (the root).
- Split the cluster using a flat clustering algorithm (e.g. by applying k-means with k = 2).
- Recursively split the clusters until only singleton clusters remain (or some specified number of levels is reached).
Flat methods such as k-means are generally very efficient; linear in the number of objects. Divisive methods are thereby also generally more efficient than agglomerative methods, which are at least quadratic (single-linkage). Divisive clustering also has the advantage of initially considering the global distribution of the data, while the agglomerative methods must commit to early decisions based on local patterns.
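A minimal sketch of such a bisecting strategy, using scikit-learn's k-means with k = 2; the function name bisect and the max_depth stopping criterion are illustrative assumptions, not something specified on the slides:

import numpy as np
from sklearn.cluster import KMeans

def bisect(X, depth=0, max_depth=3):
    # Recursively split the rows of X with 2-means until the clusters are
    # singletons or the specified number of levels is reached.
    if len(X) < 2 or depth >= max_depth:
        return [X]
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    left, right = X[labels == 0], X[labels == 1]
    return bisect(left, depth + 1, max_depth) + bisect(right, depth + 1, max_depth)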
The proximity matrix
For a given set of objects {o_1, ..., o_m}, the proximity matrix R is a square m x m matrix where an element R_ij gives the proximity (or distance) between o_i and o_j. Alternatively known as a distance matrix. For our word space, R_ij would give the dot-product of the normalized feature vectors x_i and x_j representing the words o_i and o_j. Note that, if our similarity measure is symmetric, i.e. sim(x, y) = sim(y, x), then R will also be symmetric, i.e. R_ij = R_ji. Computing all the pairwise similarities once and then storing them in R can help save time in many applications. R will provide the input to many clustering methods. By sorting the row elements of R, we get access to an important type of similarity relation: nearest neighbors.
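A small sketch of how such a matrix could be computed for our word space, assuming X holds one raw feature vector per word (row) and similarity is the dot-product of normalized vectors:

import numpy as np

def proximity_matrix(X):
    # R[i, j] = dot product of the L2-normalized rows x_i and x_j
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T          # symmetric, with 1.0 on the diagonal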
Neighbor relations in the proximity matrix
Examples of neighbor relations extracted from the word space created for the 3rd assignment (20,000 sentences from Brown, sentence-level BoW, no stemming, dot-product similarity on normalized vectors).
kNN: k nearest neighbors; the k objects closest to a given target in the space.
RNN: reciprocal nearest neighbors; each object is the other's nearest neighbor (or within some specified k for kNN).

10-NNs for "salt":
 k   Word     Sim.
 1   pepper   0.56
 2   mustard  0.48
 3   sauce    0.47
 4   butter   0.42
 5   water    0.37
 6   eggs     0.35
 7   rice     0.24
 8   milk     0.20
 9   center   0.19
 10  coffee   0.19

Examples of RNNs:
 salt / pepper (0.56)
 college / school (0.68)
 russia / china (0.51)
 john / james (0.70)
 edward / robert (0.74)
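Given a proximity matrix R like the one above, both relations are straightforward to read off. A brief sketch; the helper names are illustrative:

import numpy as np

def k_nearest(R, i, k):
    # indices of the k objects most similar to object i, excluding i itself
    order = np.argsort(-R[i])
    return [j for j in order if j != i][:k]

def reciprocal_nns(R):
    # pairs (i, j) where i and j are each other's single nearest neighbor
    nn = [k_nearest(R, i, 1)[0] for i in range(len(R))]
    return [(i, j) for i, j in enumerate(nn) if nn[j] == i and i < j]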
Looking back, looking ahead
So far we've only modeled pointwise observations, or single objects; in our case, the objects happened to be different word types. We've looked at how to model relations between the objects and how to predict class labels for them. Next we will start to deal with sequences of observations. In our case, the sequences will be strings of characters or words, but abstractly they could be anything: gene or protein sequences, acoustic signals (e.g. speech), gestures, stock transactions, etc. We will look at e.g. how to validate and compute the likelihood of different sequences of observations, and how to predict class labels for them (sequence classification), using tools such as regular expressions, finite-state automata, hidden Markov models and Viterbi decoding.