Iterated Consensus Clustering: A Technique We Can All Agree On

Mindy Hong, Robert Pearce, Kevin Valakuzhy, Carl Meyer, Shaina Race

Abstract

Cluster analysis is a field of data mining used to extract underlying patterns from unclassified data. Many existing clustering algorithms are inadequate in that they require prior knowledge of how many clusters exist in the data (the value k), and their underlying assumptions make them ineffective in certain situations. The method of consensus clustering seeks to rectify the latter problem by incorporating the results of multiple clustering algorithms to achieve one final grouping. We investigate a novel method, Iterated Consensus Clustering (ICC), which addresses both issues. Iterating the consensus clustering technique widens the eigengap associated with the Perron cluster, giving a more definitive, and more accurate, estimate of the number of clusters, k.

Contents

1 Background Information
2 Problems in Clustering
  1 Curse of Dimensionality
  2 The Fundamental Problem of Clustering / Finding k
  3 Distance Metrics
  4 Fisher's Iris Data
3 Algorithms
  1 Hierarchical Clustering
  2 Principal Direction Divisive Partitioning
  3 k-Means
  4 DBSCAN
  5 Expectation Maximization
4 Dimension Reduction
  1 Principal Component Analysis
  2 Singular Value Decomposition
  3 Non-Negative Matrix Factorization
5 Consensus Clustering
  1 Random Walks
  2 Reversibility
  3 Uncoupled and Nearly-uncoupled Markov Chains
6 Iterated Consensus Clustering
7 Results
8 Future Work
9 Acknowledgments
References

1 Background Information

Patterns are to data what pictures are to words. Just as the saying goes, "a picture is worth a thousand words," patterns can be used to meaningfully interpret unimaginably large datasets. These patterns can be more useful than the data itself, since patterns can explain the underlying behavior of the data and predict future events.

The need for such patterns appears in a wide range of areas, from predicting the weather [4] to the outcome of labor negotiations [6]. Now, with the rapid advance of technology, new data is obtained faster than it can be interpreted. This makes patterns, and the methods to find them, even more valuable.

Data mining is the process of finding these patterns. By understanding them, we can extract meaning from data. Google's use of the PageRank algorithm is a particularly famous example of using data mining to find patterns that optimize web searches [1]. In this paper we dive into clustering, a field of data mining whose goal is to group data based on some measure of similarity, in the hope that these groupings will reveal underlying patterns in the data.

The difference between clustering and other data mining techniques, such as classification, is that clustering is an unsupervised learning approach. Supervised learning techniques, such as classification, have a correct answer for the groupings: they use sample data sets to train their algorithms on patterns in the data before classifying new observations. In unsupervised learning, the fitness of a clustering is judged by some combination of how similar the points within a cluster are and how dissimilar points in different clusters are; unsupervised techniques seek to optimize these quantities in order to create meaningful clusters. In this paper, we discuss the background knowledge needed to understand clustering, introduce some of the algorithms that we tested, and finally summarize the results of a new clustering technique.

2 Problems in Clustering

1 Curse of Dimensionality

The Curse of Dimensionality is a phenomenon that arises when analyzing high-dimensional spaces: the higher the dimensionality of the space, the less informative distance measures become. In other words, as dimensionality increases, the volume of the space grows so quickly that the available data becomes sparse. This affects methods that require statistical significance, because the amount of data needed to support an accurate result typically grows exponentially with dimensionality. Furthermore, objects in high-dimensional data tend to be dissimilar in several ways, preventing efficient data organization.

To illustrate this, three data sets, each with 300 random points drawn from a [0,1] uniform distribution, are generated with dimensions d = 2, 4, and 40. For each data set, the pairwise distances between all of the points are calculated, and a histogram is plotted to show their distribution (Figure 1).

Figure 1: Histograms of pairwise distances for dimensions d = 2, 4, and 40
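
The experiment is straightforward to reproduce; the sketch below (an illustration, not the code used to generate Figure 1) draws 300 uniform points for d = 2, 4, and 40, computes all pairwise Euclidean distances, and plots the histograms. As d grows, the distances concentrate in a narrower band relative to their size.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, d in zip(axes, (2, 4, 40)):
    points = rng.uniform(0.0, 1.0, size=(300, d))  # 300 random points in [0,1]^d
    distances = pdist(points)                      # all pairwise Euclidean distances
    ax.hist(distances, bins=40)
    ax.set_title(f"d = {d}")
    ax.set_xlabel("pairwise distance")
plt.tight_layout()
plt.show()
```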

2 The Fundamental Problem of Clustering / Finding k

There are numerous algorithms used in cluster analysis, each with a different interpretation of what constitutes a cluster and what method is most efficient in the clustering process. However, as Jacob Kogan stated, "the fundamental problem of clustering is that there does not exist a best method, that is, one which is superior to all other methods" [7]. Each clustering algorithm has its strengths and weaknesses, varying in speed and accuracy; no clustering algorithm surpasses all others for every data set.

Another essential problem of clustering is determining an accurate value of k, the number of clusters in a data set. Because each algorithm differs in its criteria for what makes up a cluster, k tends to vary depending on each algorithm's attributes [8]. As a result, one of the goals of cluster analysis is to determine the most reasonable value for k.

3 Distance Metrics

Distance metrics can be thought of as measures of dissimilarity: they indicate how different two pieces of data are, and they make use of the geometric interpretation of our data. The main distance metric we used was the Euclidean distance, a specific form of the Minkowski distance

d(x, y) = \left( \sum_{j=1}^{d} |x_j - y_j|^r \right)^{1/r}.

The parameter r determines how much weight is given to each coordinate-wise difference. When r = 1, this function gives the Manhattan distance, which sums the absolute differences between each coordinate of x and y; the effect is that distances are measured as if one walked from point A to point B on a grid, moving only parallel to the axes. Taking r = 2 gives the Euclidean distance, the square root of the sum of squared coordinate differences, which corresponds to the length of the straight-line path between the points in the plane or in space. Letting r approach infinity gives the Chebyshev (or maximum) distance, which reports only the largest difference in any single coordinate, so no other coordinates contribute to the distance.
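
To make the formula concrete, the snippet below evaluates the Minkowski distance for r = 1, r = 2, and the r → ∞ limit on one pair of points (the vectors are arbitrary values chosen for illustration).

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance: the r-th root of the sum of |x_j - y_j|^r."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])

print(minkowski(x, y, 1))         # Manhattan distance: 3 + 2 + 0.5 = 5.5
print(minkowski(x, y, 2))         # Euclidean distance: sqrt(9 + 4 + 0.25) ~ 3.64
print(np.max(np.abs(x - y)))      # Chebyshev distance (r -> infinity): 3.0
```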

4 Fisher's Iris Data

This dataset is one of the most well-known real-world datasets used for testing clustering algorithms [4, 8, 12], due to its high level of difficulty and small size. The data consists of 4 measurements from 150 irises: the petal length, petal width, sepal length, and sepal width. The observations are split equally among three species: Iris setosa, Iris virginica, and Iris versicolor. Upon inspection, however, only two clusters are evident in the data, because the virginica and versicolor groups overlap slightly in almost every pair of measurements. Separating these two groups is the main challenge posed by this dataset.

Figure 2: A graph depicting 2 of the 4 dimensions of Fisher's Iris dataset. The set includes 150 observations spread evenly among three species of iris: Setosa, Virginica, and Versicolor. The observations are the petal length, petal width, sepal length, and sepal width of the flowers. This dataset is popular in data mining due to its small size and the difficulty of separating the Virginica and Versicolor groups.
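
A plot like Figure 2 can be reproduced from scikit-learn's copy of the Iris data; the sketch below (an illustration, not the code behind the figure) plots petal length against petal width, where the Setosa group separates cleanly while Virginica and Versicolor overlap.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target                        # 150 x 4 measurements, species labels 0-2
for label, name in enumerate(iris.target_names):
    mask = (y == label)
    plt.scatter(X[mask, 2], X[mask, 3], label=name)  # petal length vs. petal width
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.legend()
plt.show()
```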

3 Algorithms

1 Hierarchical Clustering

Hierarchical clustering groups data by generating a cluster tree, or dendrogram. The tree is a multilevel hierarchy in which clusters at one level are joined into clusters at the next level. There are generally two approaches to hierarchical clustering: agglomerative and divisive. Agglomerative clustering is more common; it starts with each point as an individual cluster and then merges the closest pair of clusters step by step until one overall cluster remains. Divisive clustering is, in a sense, the reverse: the algorithm starts with one overall cluster and then splits clusters step by step until there is one cluster per point.

To define proximity between clusters in agglomerative clustering, three methods are most commonly used: complete linkage (max), single linkage (min), and group average. With complete linkage, the maximum distance, that is, the distance between the two farthest points in different clusters, is taken as the cluster proximity. Single linkage (min) takes the distance between the two closest points in different clusters as the proximity, and the group average technique takes the average pairwise proximity over all pairs of points in the two clusters.

There are several benefits to hierarchical clustering. The algorithm lets the user decide what level of clustering is most appropriate for the application, allows any valid measure of distance, and makes relatively good local decisions when combining two clusters. On the other hand, one of its main disadvantages is that it decides locally which clusters should be merged; once a decision is made to merge two clusters, it cannot be undone.

Hierarchical clustering is illustrated in the first panel of Figure 3, which uses Fisher's Iris data. The algorithm calculates the Euclidean distance between observations (using pdist to compute the distances and linkage to group the objects into a hierarchical cluster tree), then plots the hierarchy as a dendrogram. Looking at the bottom of the dendrogram, we can see 3-4 clusters in this data set. The middle panel zooms in so that only 15 of the observations are shown, providing a clearer view of the dendrogram. Finally, the last panel shows a scatter plot of the data set with each cluster indicated by a different color; the plot suggests that there are 3 clusters in Fisher's Iris data.

Figure 3: Hierarchical Clustering, Fisher's Iris Data Set
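
The figures were produced in MATLAB with pdist and linkage; SciPy exposes functions of the same names, so a rough Python equivalent (a sketch for illustration, not the code used to generate the figures) looks like this. fcluster then cuts the tree into three flat clusters.

```python
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import load_iris

X = load_iris().data
distances = pdist(X, metric="euclidean")       # pairwise distances between the 150 irises
tree = linkage(distances, method="average")    # agglomerative merging using group average
dendrogram(tree, truncate_mode="lastp", p=15)  # show only the last 15 merged clusters
plt.show()

labels = fcluster(tree, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters
print(labels[:10])
```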

2 Principal Direction Divisive Partitioning

Another commonly used clustering algorithm is PDDP, which stands for Principal Direction Divisive Partitioning. PDDP resembles divisive hierarchical clustering, which begins with one overall cluster and then splits clusters at each step until there are n clusters for n individual data points. PDDP is essentially a clustering algorithm based on principal directions: it divides the entire collection of data points into two clusters using the principal direction, each of the two clusters is then further divided into two subclusters by the same process, and the division stops when the specified number of clusters is reached. The result is a hierarchical structure of divisions arranged in a binary tree, where each partition is either a leaf node (meaning it has not been separated) or has been split into two subgroups.

PDDP works as follows: the algorithm takes the data set as input and finds the direction of maximal variance. The observations are then projected onto the line through the origin in that direction. To choose which cluster to split next, PDDP selects the cluster with the highest scatter (a measure of the non-cohesiveness of a cluster) or variance. The algorithm continues splitting clusters until the specified number of clusters, k, is reached. The results can be represented in a tree diagram in which the number of leaf nodes equals the number of clusters in the data set.

The primary advantage of PDDP is that it is extremely fast, and it can therefore be very efficient for clustering particularly large data sets. On the other hand, cluster decisions cannot be undone once made, and the algorithm tends to perform poorly if the initial clusters are not well separated. The accuracy of PDDP can, however, be improved by using dimension reduction techniques such as the singular value decomposition (SVD).
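
PDDP is not part of standard libraries; the following minimal sketch (an illustration of the description above, with hypothetical function and variable names, not the implementation used for the experiments) picks the cluster with the largest scatter, finds its principal direction from the SVD of the centered points, and splits by the sign of the projection.

```python
import numpy as np

def pddp(X, k):
    """Principal Direction Divisive Partitioning: split until k clusters remain."""
    clusters = [np.arange(len(X))]                 # start with one cluster holding every point
    while len(clusters) < k:
        # pick the cluster with the largest scatter (sum of squared deviations from its mean)
        scatters = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(scatters)))
        centered = X[idx] - X[idx].mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        proj = centered @ vt[0]                    # project onto the principal direction
        clusters.append(idx[proj <= 0])            # split by the sign of the projection
        clusters.append(idx[proj > 0])
    labels = np.empty(len(X), dtype=int)
    for label, idx in enumerate(clusters):
        labels[idx] = label
    return labels
```

For example, pddp(load_iris().data, 3) splits the 150 Iris observations into three groups.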

Figures 4 and 5 show dendrograms produced by the PDDP algorithm: the first for Fisher's Iris data set, and the second for the TwentyNewsGroups data set.

Figure 4: PDDP Dendrogram for Fisher's Iris data set

Figure 5: PDDP Dendrogram for TwentyNewsGroups data set

3 k-Means

The k-Means algorithm is an iterative process that attempts to locate the centroids of the groups in the data, allowing it to assign each data point to the closest centroid [5]. Its relative simplicity and ease of implementation make it a popular clustering algorithm. The algorithm is as follows (a minimal sketch appears after the list):

1. Choose k locations as estimates of the centroids of the final groupings. Call these points pseudocentroids, or pcentroids.
2. Group each data point with the closest pcentroid.
3. Move each pcentroid to the mean of all the data points in its group.
4. Repeat Steps 2 and 3 until the pcentroids stop moving.
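
A minimal sketch of the four steps, with the pcentroids initialized from randomly chosen data points (an illustrative choice; the sketch also assumes no group empties out during the iteration):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    pcentroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: initial guesses
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - pcentroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                           # step 2: nearest pcentroid
        moved = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # step 3: group means
        if np.allclose(moved, pcentroids):                      # step 4: stop when nothing moves
            break
        pcentroids = moved
    return labels, pcentroids
```

As discussed below, the initialization matters; seeding the pcentroids with the centroids of a PDDP clustering is one of the strategies used later in this paper.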

k-Means successively minimizes the total sum of distances between the points and the centroids of the groups they are associated with. The process converges, since each step reduces this sum and there are only finitely many groupings of a finite set of points.

The first problem with this algorithm is local minima: there is no assurance that the sum of distances produced by a given set of pcentroids is the true minimum. Depending on the initialization, the final result can differ according to which local minimum the algorithm falls into. Related to this is the issue of initialization itself. Without prior knowledge of the dataset, it is hard to make an educated guess at the centroid locations that minimize the sum of distances. Some implementations pick random points in the space containing the data, while others use randomly chosen data points as the initial guesses. Another technique is to take the group means produced by another algorithm and use those points to initialize k-Means.

4 DBSCAN

DBSCAN is a clustering algorithm that clusters based on distance and density. If a point has at least the minimum number of points, p, within its ε-neighborhood, we call it dense. A dense point initializes a cluster, and the points in its ε-neighborhood are added to that cluster. If a point within the cluster is also dense, the points within its ε-neighborhood are added as well, and if those points already belong to another cluster, the two clusters merge.
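
A brief example of the density-based idea using scikit-learn's DBSCAN; the radius ε and the minimum point count below are arbitrary values for illustration, not settings from the experiments.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data
# eps is the ε-neighborhood radius, min_samples the point count needed to be "dense"
model = DBSCAN(eps=0.6, min_samples=5).fit(X)
labels = model.labels_                        # -1 marks points assigned to no dense cluster (noise)
print(np.unique(labels, return_counts=True))
```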

5 Expectation Maximization

Expectation maximization (EM) is a broad term covering many techniques that iteratively compute maximum likelihood estimates of the parameters of statistical models. In clustering, EM typically uses multivariate normal distributions and adjusts their parameters until every point is associated with a particular distribution, or cluster. The algorithm initially guesses the parameters of the normal distributions and then uses the log likelihood under those distributions to determine which distribution each data point most likely belongs to. After determining the clusters, the algorithm uses maximum likelihood again to re-estimate the parameters of each distribution, and it iterates between computing parameters and clusters until the solution converges.
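
For the multivariate-normal case described above, scikit-learn's GaussianMixture is a convenient stand-in (a sketch for illustration, not the implementation used here); it alternates the expectation and maximization steps internally until the log likelihood converges.

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)     # E- and M-steps are iterated internally until convergence
print(gmm.means_)               # fitted means of the three normal components
print(labels[:10])
```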

4 Dimension Reduction

Using clustering algorithms alone is not always enough. Sometimes the size of a dataset limits which algorithms can be used, because the computation time would be too long. In other cases the data may be too noisy, or it may have so many dimensions that clustering becomes much more difficult due to the curse of dimensionality. In these situations, dimension reduction techniques are used. They compress information by combining attributes and keeping only those that contribute most to the data at hand.

1 Principal Component Analysis

PCA generates a new set of dimensions that better capture the variability of the data. We first center the data by subtracting the mean of each dimension from every point in that dimension, and then find the eigenvectors of the covariance matrix to create a matrix that transforms the original vectors into new vectors with respect to an orthogonal new basis. The new basis is more useful because the variance is greatest for the first component and decreases as the eigenvalue of each successive basis vector decreases. The covariance between the new axes is also zero, so to reduce the dimension of the data we can simply omit the coordinates associated with low variance, assuming they are mostly noise.

2 Singular Value Decomposition

The SVD factors a data matrix of column vectors, X, as

X = U S V^T,

where U and V are orthogonal matrices and S is diagonal; keeping only the r largest singular values gives the closest rank-r approximation of X. The diagonal of S holds the singular values of X in decreasing order (their squares correspond, up to scaling, to the eigenvalues of the covariance matrix of X), and the columns of U hold the corresponding left singular vectors, which form the new PCA basis for X. The rows of V, scaled by the singular values, are the dimensionally reduced vectors in this new basis.
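
The two decompositions are closely related in practice: centering the data and taking its SVD yields the PCA basis. The sketch below (an illustration; note that its rows, not columns, are observations) reduces the Iris data to r = 2 dimensions both ways and checks that the results agree up to the sign of each component.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
Xc = X - X.mean(axis=0)                      # center each dimension
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
r = 2
reduced_svd = Xc @ Vt[:r].T                  # coordinates in the top-r principal directions
reduced_pca = PCA(n_components=r).fit_transform(X)
# the two reductions agree up to the sign of each component
print(np.allclose(np.abs(reduced_svd), np.abs(reduced_pca)))
```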

3 Non-Negative Matrix Factorization

Non-negative matrix factorization (NMF) decomposes a nonnegative data matrix X into the product of W, a matrix of basis or topic vectors, and a reduced matrix H, both nonnegative, so that X ≈ W H. MATLAB computes this by varying the matrices W and H to minimize the Frobenius norm of the residual, X − W H. Clustering done on the W matrix is meaningful because of underlying patterns in the matrix that mimic the results of other clustering algorithms [3].

5 Consensus Clustering

If we were to cluster the columns of our data matrix, X, using any single clustering algorithm, we could display the result as the adjacency matrix of an undirected graph. Each observation (or document) in X would be represented by a vertex, and an edge would be drawn between two vertices if those two observations were placed in the same cluster. We do not include self-loops (edges from a vertex to itself) in this graph. Let E be the set of edges of this graph. Then the adjacency matrix, A, of the resulting graph is simply

A_{ij} = \begin{cases} 1 & (i, j) \in E \\ 0 & \text{otherwise.} \end{cases}

The absence of self-loops ensures the diagonal entries of A are zero.

Now suppose we were to use N different clustering algorithms on the same data matrix, X. The result would be N different adjacency matrices A_1, A_2, ..., A_N. Summing these adjacency matrices, we form our consensus matrix,

C = \sum_{i=1}^{N} A_i.

The entry C_{ij} is then the number of times that document (or observation) i was clustered with document (or observation) j. Again, we will think of C as an undirected graph, this time with weighted edges. It is important to note that the N different clusterings do not have to assume the same number of clusters, k, in the data. In fact, we can vary the number of clusters, k, for each algorithm and combine all of the results in C. This very approach will be taken in an attempt to determine the number of clusters in our data.

1 Random Walks

We consider a random walk on the undirected graph defined by our consensus matrix, C. Let D be the diagonal matrix whose diagonal entries are the corresponding row sums of C, D = diag(Ce), where e is the vector of all ones. We prevent singleton clusters, ensuring that the diagonal entries of D are nonzero. The transition probability matrix of this Markov chain, P, and its stationary distribution, π^T, are given by

P = D^{-1} C,  (1)
\pi^T = \frac{e^T D}{e^T D e}.  (2)
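
The construction is mechanical to code. In the sketch below, several k-means runs at different values of k stand in for the full mix of algorithms (a simplification for illustration); their co-clustering adjacency matrices are summed into C, from which P = D^{-1}C and π follow.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

def adjacency(labels):
    """A_ij = 1 when observations i and j share a cluster (no self-loops)."""
    A = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

# stand-in ensemble: k-means run with several cluster counts k = 2, ..., 6
clusterings = [KMeans(n_clusters=k, n_init=10, random_state=k).fit_predict(X)
               for k in range(2, 7)]
C = sum(adjacency(labels) for labels in clusterings)   # consensus matrix

d = C.sum(axis=1)      # row sums of C (the diagonal of D); nonzero when singleton clusters are prevented
P = C / d[:, None]     # transition matrix P = D^{-1} C
pi = d / d.sum()       # stationary distribution pi^T = e^T D / (e^T D e)
print(P.shape, pi.sum())
```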

2 Reversibility

It is well known that random walks on connected, undirected graphs are reversible Markov chains [11], and thus satisfy the so-called detailed balance equations,

\mathrm{diag}(\pi)\, P\, \mathrm{diag}(\pi)^{-1} = P^T.

Equivalently, we can write

\mathrm{diag}(\pi)^{1/2}\, P\, \mathrm{diag}(\pi)^{-1/2} = \mathrm{diag}(\pi)^{-1/2}\, P^T\, \mathrm{diag}(\pi)^{1/2},

which shows that P is similar to a symmetric matrix and thus has real eigenvalues.

In an ideal scenario, the graph created by the consensus matrix would consist of k disconnected components, each strongly connected within itself but disconnected from the others. In this case, we would essentially have a collection of k reversible Markov chains, each with a transition matrix having real eigenvalues, ensuring that the collective transition matrix also has real eigenvalues. Such a scenario is unrealistic in practice, but it motivates the discussion that follows.

3 Uncoupled and Nearly-uncoupled Markov Chains

Definition. A Markov chain with transition probability matrix P is called uncoupled if there exists a permutation matrix Q such that Q P Q^T is block diagonal:

Q P Q^T = \begin{pmatrix} P_{11} & 0 & \cdots & 0 \\ 0 & P_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P_{kk} \end{pmatrix},

where each P_{ii} is square.

If we consider our data observations as vertices of a graph, where edges exist only between observations that belong to the same cluster, then each P_{ii} can be thought of as defining the transition probabilities of a random walk on the vertices of one cluster. If each P_{ii} is irreducible, primitive, and reversible, which is guaranteed by our connected undirected graph components, then the algebraic multiplicity of the dominant eigenvalue, the so-called Perron root, λ_1(P) = 1, is exactly k.

While it is generally unrealistic to expect our clustering algorithms to produce a consensus matrix that yields such an uncoupled transition matrix, we do expect to get something close. In fact, if we have faith in our clustering algorithms, we expect graph edges within a cluster to have much higher weight than the errant edges that connect vertices of different clusters. This gives us a so-called nearly uncoupled Markov chain. Such a graph and the associated transition probability matrix are depicted in Figure 10, where the magnitude of an edge weight is indicated by the thickness of the edge and the three clusters are labelled A, B, and C. The Markov chain in Figure 10 is such that P_A e ≈ e, P_B e ≈ e, and P_C e ≈ e; hence we call these submatrices nearly stochastic.

P = \begin{pmatrix} P_A & \varepsilon_{AB} & \varepsilon_{AC} \\ \varepsilon_{AB} & P_B & \varepsilon_{BC} \\ \varepsilon_{AC} & \varepsilon_{BC} & P_C \end{pmatrix}

Figure 10: A Nearly Uncoupled Markov System

It has been shown that a nearly uncoupled Markov chain, containing k diagonal blocks which are nearly stochastic and sporadic off-diagonal elements of small magnitude, has k eigenvalues close to 1 [9, 10]. If there is no further decomposition (or uncoupling) of the blocks P_{ii}, then there are no more than k eigenvalues close to 1. This group of eigenvalues close to 1 is called the Perron cluster [10, 9].

Definition. Let P be an n × n stochastic matrix from a reversible Markov chain with eigenvalues, counted with multiplicity,

1 = λ_1 ≥ λ_2 ≥ λ_3 ≥ ... ≥ λ_n.

If the largest difference between consecutive eigenvalues occurs between λ_k and λ_{k+1}, then the set {λ_1, ..., λ_k} is called the Perron cluster of P, and the difference is called the eigengap.

Others have used the number of eigenvalues in the Perron cluster to identify the number of clusters in various applications [10, 9, 2]. This is a natural approach to the problem; however, when applied to the consensus matrix, the resulting Perron cluster is often unclear or uninformative. We will show that iterating the consensus process widens the eigengap between λ_k and λ_{k+1} and better estimates the number of clusters.
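
Estimating k from the Perron cluster then amounts to sorting the eigenvalues of P and finding the largest gap between consecutive values. The helper below (a hypothetical illustration, reusing the consensus matrix C from the previous sketch) does exactly that; for a reversible chain the eigenvalues are real, so the negligible imaginary parts introduced by floating-point arithmetic are discarded.

```python
import numpy as np

def estimate_k(C):
    """Estimate the number of clusters from the eigengap of P = D^{-1} C."""
    P = C / C.sum(axis=1, keepdims=True)                 # row-normalize the consensus matrix
    lam = np.sort(np.real(np.linalg.eigvals(P)))[::-1]   # eigenvalues in decreasing order
    gaps = lam[:-1] - lam[1:]                            # gap between consecutive eigenvalues
    k = int(np.argmax(gaps)) + 1                         # largest gap after lambda_k: k eigenvalues in the Perron cluster
    return k, lam

# k_hat, spectrum = estimate_k(C)   # C as built in the previous sketch
```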

6 Iterated Consensus Clustering

We propose a new method that seeks to refine the results obtained through consensus clustering in order to reach a true consensus among algorithms. This is done by taking the consensus matrix produced by one run of consensus clustering and treating it as a new data set, running it through the same consensus clustering procedure again. Between iterations, entries that fall below a certain threshold, called the drop tolerance, are dropped. We do this under the assumption that the algorithms we use will rarely make similar mistakes: each algorithm will make incorrect connections between points, but rarely will multiple algorithms make the same error. By keeping only the strongest connections between points, determined by the percentage of algorithms that agree, we obtain a clustering that reflects the consensus of all of the algorithms.
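
Putting the pieces together, one pass of ICC might look like the simplified sketch below. It is a stand-in rather than the MATLAB implementation used for the experiments: a handful of k-means runs replaces the full ensemble of algorithms (applied, after the first pass, to the rows of the current consensus matrix), the drop tolerance is applied as a fraction of the largest entry of C (one plausible reading of the thresholding step), and the eigengap of the resulting random walk tracks the estimate of k from iteration to iteration.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus(data, k_values, seed=0):
    """Sum the co-clustering adjacency matrices of a small k-means ensemble."""
    n = len(data)
    C = np.zeros((n, n))
    for i, k in enumerate(k_values):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed + i).fit_predict(data)
        A = (labels[:, None] == labels[None, :]).astype(float)
        np.fill_diagonal(A, 0.0)
        C += A
    return C

def iterated_consensus(X, k_values=range(2, 11), n_iter=3, drop_tol=0.2):
    data = np.asarray(X, dtype=float)
    for _ in range(n_iter):
        C = consensus(data, k_values)
        C[C < drop_tol * C.max()] = 0.0                  # drop weak agreements (the drop tolerance)
        # assumes no row is zeroed out entirely (singleton clusters are prevented)
        P = C / C.sum(axis=1, keepdims=True)             # random walk on the consensus graph
        lam = np.sort(np.real(np.linalg.eigvals(P)))[::-1]
        k_hat = int(np.argmax(lam[:-1] - lam[1:])) + 1   # size of the Perron cluster
        data = C                                         # the consensus matrix becomes the next round's data
    return k_hat, C
```

A call like iterated_consensus(load_iris().data) exercises the same loop on the Iris data, though with a far smaller ensemble than the one described in the next section.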

7 Results

After experimenting with the individual clustering algorithms, we decided to use k-Means, PDDP, and Expectation Maximization in our ICC code. We also included a second k-Means run with its centroids initialized to the centroids of the final PDDP clustering of the data. There were five different types of input data: the raw data, or the output of one of four dimension reduction techniques: PCA, SVD, NMF, and PCA applied to the output of NMF. Everything was repeated for each value of k that we wanted to test, and every algorithm, except the four that ran on raw data, was repeated for each number of dimensions that we wanted to use, our r values. We extracted the number of dimensions required to capture 60%, 75%, and 90% of the variance in the data and used those three dimension counts as our r values.

To evaluate the performance of our Iterated Consensus Clustering algorithm, we ran it against several data sets. For each data set, we ran ICC to determine the number of clusters and to measure agreement using the correct value of k for that data set. For each data set, we generated the first 20 iterations using drop tolerances of p = 0, 0.1, and 0.2 to see which setting produced the best, most accurate results. Seeing that p = 0.2 produced somewhat better results than the others, we generated the eigengap graphs using a 0.2 drop tolerance. To simulate an attempt to find k, we took runs of all the algorithms at values of k from a wide range around the true value and added the consensus matrices to obtain the final result. Furthermore, we computed the accuracies for each data set (using the last column of the agreement matrix), normalizing either the rows or the columns of the matrix depending on the size of the data. Finally, we generated the eigengap images for each data set, taking the graphs from the first, second, and third iterations.

Figure 11: Iris Dataset ICC Eigenvalues (1st Iteration)

Figure 12: Iris Dataset ICC Eigenvalues (2nd Iteration)

Figure 13: Iris Dataset ICC Eigenvalues (3rd Iteration)

These figures show the eigengaps for the Iris data set. By the third iteration a significant eigengap has appeared, and its location tells us how many clusters are in the data set, which is three for this dataset. In calculating these graphs we used a k range from 2 to 10.

Below, we show the accuracy results of running the ICC algorithm on the Iris data. The average accuracy for each iteration is relatively high, ranging from 88% to 96%.

More importantly, though, the algorithms eventually agree upon a solution that is more accurate than the average of each of the algorithms used alone.

Figure 14: ICC Accuracy Table on Iris Dataset

8 Future Work

Our research has opened the door to the uncharted territory of ICC. Iterating the consensus matrix can be applied to other methods for finding k, such as the gap statistic, a method that estimates the number of clusters from the change in within-cluster variance. We primarily tested ICC on smaller data sets, largely because of the algorithm's long computation time. Future work could include refining the code to shorten computation time and testing ICC on larger data sets. It would also be helpful to test ICC with different algorithms and dimension reduction techniques to determine which work best together.

9 Acknowledgments

We would like to thank Dr. Carl Meyer for his guidance and advice throughout this research. We would also like to thank NCSU and the NSF for organizing and funding this REU program.

References

[1] Amy N. Langville and Carl D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.

[2] D. Fritzsche, V. Mehrmann, D. B. Szyld, and E. Virnik. An SVD approach to identifying metastable states of Markov chains. Electronic Transactions on Numerical Analysis, 29:46-69, 2008.

[3] Chris Ding, Xiaofeng He, and Horst D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proc. SIAM Data Mining Conf., pages 606-610, 2005.

[4] Robert L. Grossman. Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, 2001.

[5] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100-108, 1979.

[6] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, 2011.

[7] Jacob Kogan. Introduction to Clustering Large and High-Dimensional Data. Cambridge University Press, Cambridge, New York, 2007.

[8] Ravi Kothari and Dax Pitts. On finding the number of clusters. Pattern Recognition Letters, 20(4):405-416, 1999.

[9] C. D. Meyer and C. D. Wessell. Stochastic data clustering. ArXiv e-prints, August 2010.

[10] P. Deuflhard, W. Huisinga, A. Fischer, and Ch. Schütte. Identification of almost invariant aggregates in reversible nearly uncoupled Markov chains. Linear Algebra and its Applications, 315:39-59, 2000.

[11] William J. Stewart. Probability, Markov Chains, Queues, and Simulation: The Mathematical Basis of Performance Modeling. Princeton University Press, 2009.

[12] Rui Xu and Donald Wunsch II. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645-678, May 2005.