Iterated Consensus Clustering: A Technique We Can All Agree On

Mindy Hong, Robert Pearce, Kevin Valakuzhy, Carl Meyer, Shaina Race

Abstract

Cluster analysis is a field of data mining used to extract underlying patterns from unclassified data. Many existing clustering algorithms are inadequate in two ways: they require advance knowledge of how many clusters, k, exist in the data, and their underlying assumptions make them ineffective in certain situations. Consensus clustering addresses the latter problem by combining the results of multiple clustering algorithms into one final grouping. We investigate a novel method, Iterated Consensus Clustering (ICC), which addresses both issues. Iterating the consensus clustering step widens the eigengap associated with the Perron cluster, giving a more definitive, and more accurate, estimate of the number of clusters, k.

Contents

1 Background Information
2 Problems in Clustering
  2.1 Curse of Dimensionality
  2.2 The Fundamental Problem of Clustering / Finding k
  2.3 Distance Metrics
  2.4 Fisher's Iris Data
3 Algorithms
  3.1 Hierarchical Clustering
  3.2 Principal Direction Divisive Partitioning
  3.3 k-Means
  3.4 DBSCAN
  3.5 Expectation Maximization
4 Dimension Reduction
  4.1 Principal Component Analysis
  4.2 Singular Value Decomposition
  4.3 Non-Negative Matrix Factorization
5 Consensus Clustering
  5.1 Random Walks
  5.2 Reversibility
  5.3 Uncoupled and Nearly-uncoupled Markov Chains
6 Iterated Consensus Clustering
7 Results
8 Future Work
9 Acknowledgments
References

1 Background Information

Patterns are to data what pictures are to words. Just as "a picture is worth a thousand words," patterns can be used to meaningfully interpret unimaginably large datasets. These patterns can be more useful than the data itself, since they explain the underlying behavior of the data and predict future events. The need for such patterns appears in a wide range of areas, from predicting the weather [4] to the outcome of labor negotiations [6].

Now, with the rapid advance of technology, new data is obtained faster than it can be interpreted, which makes patterns, and the methods to find them, even more valuable. Data mining is the process of finding these patterns; by understanding them, we can extract meaning from data. Google's use of the PageRank algorithm is a particularly famous example of using data mining to obtain patterns that optimize web searches [1].

In this paper we dive into clustering, a field of data mining whose goal is to group data based on some measure of similarity, in the hope that these groupings will manifest underlying patterns in the data. The difference between clustering and other data mining techniques, such as classification, is that clustering is an unsupervised learning approach. Supervised learning techniques, such as classification, have a correct answer for the groupings: they use labeled sample data to train their algorithms before classifying new observations. In unsupervised learning, the fitness of a clustering is judged by some combination of how similar the points within a cluster are and how dissimilar points in different clusters are, and unsupervised techniques seek to optimize those quantities in order to create meaningful clusters. In this paper, we discuss the background knowledge needed to understand clustering, introduce some of the algorithms that we tested, and finally summarize the results of a new clustering technique.

2 Problems in Clustering

2.1 Curse of Dimensionality

The Curse of Dimensionality is a phenomenon that occurs when analyzing high-dimensional spaces: the higher the dimensionality of the space, the less informative distance measures become. When dimensionality increases, the volume of the space grows so quickly that the available data becomes sparse. This affects methods that require statistical significance, because the amount of data needed to support an accurate result typically grows exponentially with dimensionality. Furthermore, objects in high-dimensional data tend to be dissimilar in several ways, preventing efficient data organization.

To illustrate, three data sets, each with 300 random points drawn from a [0,1] uniform distribution, are generated with dimensions d = 2, 4, and 40. For each data set, the pairwise distances between all of the points are calculated and a histogram is plotted to show their distribution.

Figure 1: Histograms of pairwise distances for dimensions d = 2, 4, and 40.
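This experiment is easy to reproduce. Below is a minimal sketch in Python with NumPy and SciPy; the paper does not show its code, so the tooling here is an assumption, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 4, 40):
    X = rng.random((300, d))        # 300 points drawn uniformly from [0, 1]^d
    dists = pdist(X)                # all pairwise Euclidean distances
    # As d grows the pairwise distances concentrate around their mean,
    # which is what the histograms in Figure 1 illustrate.
    print(f"d={d:2d}  mean={dists.mean():.3f}  std/mean={dists.std() / dists.mean():.3f}")
```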

2.2 The Fundamental Problem of Clustering / Finding k

There are numerous algorithms used in cluster analysis, each with a different interpretation of what constitutes a cluster and of what method is most efficient in the clustering process. However, as Jacob Kogan stated, "the fundamental problem of clustering is that there does not exist a best method, that is, one which is superior to all other methods" [7]. Each clustering algorithm has its strengths and weaknesses, varying in speed and accuracy, so no clustering algorithm surpasses all others on every data set. Another essential problem of clustering is determining an accurate value of k, the number of clusters in a data set. Because each algorithm differs in its criteria of what makes up a cluster, k tends to vary depending on each algorithm's attributes [8]. As a result, one of the goals of cluster analysis is to determine the most reasonable value for k.

2.3 Distance Metrics

Distance metrics can be thought of as measures of dissimilarity: they indicate how different two pieces of data are, and they rely on a geometric interpretation of the data. The main distance metric we used was the Euclidean distance, a specific case of the Minkowski distance,

d(x, y) = \left( \sum_{j=1}^{d} |x_j - y_j|^r \right)^{1/r}.

The parameter r determines how the coordinate-wise differences are combined. When r = 1, this function gives the Manhattan distance, which sums the absolute differences between the coordinates of x and y; the effect is that distances are measured as if one walked from point A to point B on a grid, moving only parallel to an axis. Taking r = 2 gives the Euclidean distance, the straight-line distance between the two points in the plane or in space. Letting r approach infinity gives the Chebyshev distance, or maximum distance, which takes the largest difference in any single coordinate as the distance, so no other coordinate contributes to the calculation.
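The three special cases of the Minkowski distance can be computed directly; a small sketch (the helper minkowski is hypothetical, not from the paper):

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance between vectors x and y: r=1 gives Manhattan, r=2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])

manhattan = minkowski(x, y, 1)       # walk parallel to the axes
euclidean = minkowski(x, y, 2)       # straight-line distance
chebyshev = np.max(np.abs(x - y))    # limit r -> infinity: largest single coordinate difference
print(manhattan, euclidean, chebyshev)
```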

2.4 Fisher's Iris Data

This dataset is one of the most well-known real-world datasets used for testing clustering algorithms [4, 8, 12], due to its high level of difficulty and its small size. The data consists of four measurements taken from each of 150 irises: the petal length, petal width, sepal length, and sepal width. The observations are split equally among three species: Iris setosa, Iris virginica, and Iris versicolor. Upon inspection, however, only two clusters are evident in the data, since the virginica and versicolor groups slightly overlap each other in almost every pair of measurements. Separating these two clusters is the main challenge posed by this dataset.

Figure 2: A graph depicting 2 of the 4 dimensions of Fisher's Iris dataset. The set includes 150 observations spread evenly among 3 types of iris: Setosa, Virginica, and Versicolor. The observations are the petal length, petal width, sepal length, and sepal width of the flowers. This dataset is popular in data mining due to its small size and the difficulty of separating the Virginica and Versicolor groups.
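For readers who want to reproduce the experiments, a copy of Fisher's iris data ships with scikit-learn; using it is an assumption about tooling, since the paper does not say how the data were loaded.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target            # 150 x 4 measurements, 3 species labels
print(X.shape, np.bincount(y))           # (150, 4) [50 50 50]
# Plotting petal length against petal width (columns 2 and 3) shows one
# well-separated group (setosa) and two overlapping ones (virginica, versicolor).
```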

3 Algorithms

3.1 Hierarchical Clustering

Hierarchical clustering groups data by generating a cluster tree, or dendrogram. The tree is a multilevel hierarchy in which clusters at one level are joined to form clusters at the next level. There are generally two approaches to hierarchical clustering: agglomerative and divisive. Agglomerative clustering is more common; it starts with each point as an individual cluster and then merges the closest pair of clusters step by step until one overall cluster remains. Divisive clustering is, in a sense, the reverse: the algorithm starts with one overall cluster and then splits clusters step by step until there is one cluster per point.

To define proximity between clusters in agglomerative clustering, the following three methods are most commonly used: complete linkage (max), single linkage (min), and group average. With complete linkage, the maximum distance, i.e., the distance between the two farthest points in different clusters, is taken to be the cluster proximity. Single linkage (min) takes the distance between the two closest points in different clusters as the proximity, and the group average technique takes the average pairwise proximity over all pairs of points in different clusters.

Hierarchical clustering has several benefits. The algorithm allows the user to decide what level of clustering is most appropriate for his or her application, it allows the use of any valid measure of distance, and it makes relatively good local decisions when combining two clusters. On the other hand, one of its main disadvantages is that it decides locally which clusters should be merged, and once a decision to merge is made, it cannot be undone.

Hierarchical clustering is illustrated in the figure below, which uses Fisher's Iris data set. The algorithm calculates the Euclidean distance between observations (using pdist to compute distances and linkage to group the objects into a hierarchical cluster tree), then plots the hierarchy as a dendrogram. By looking at the bottom of the dendrogram, we can see that there are 3-4 clusters in this data set. The middle figure zooms in so that only 15 of the observations are shown, providing a clearer view of the dendrogram. Finally, the last figure shows a scatter plot of the data set, with each cluster indicated by a different color; the plot suggests that there are 3 clusters in Fisher's Iris data set.

Figure 3: Hierarchical clustering of Fisher's Iris data set.
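The workflow just described maps closely onto SciPy's hierarchical clustering tools. The sketch below assumes Python/SciPy rather than the authors' MATLAB calls to pdist and linkage.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.datasets import load_iris

X = load_iris().data
D = pdist(X)                                     # pairwise Euclidean distances
Z = linkage(D, method='average')                 # agglomerative merge tree, group-average linkage
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw a cluster tree like the one in Figure 3.
print(labels[:10])
```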

3.2 Principal Direction Divisive Partitioning

Another algorithm commonly used in clustering is PDDP, which stands for Principal Direction Divisive Partitioning. PDDP is much like divisive hierarchical clustering, which begins with one overall cluster and then splits clusters at each step until there are n clusters for n individual data points. PDDP is essentially a clustering algorithm based on principal directions: it divides the entire collection of data points into two clusters using the principal direction, then each of the two clusters is further divided into two subclusters by the same process, and this division stops when the specified number of clusters is reached. The result is a hierarchical structure of divisions arranged in a binary tree, where each partition is either a leaf node (meaning it has not been separated) or has been split into two more subgroups.

PDDP works as follows: the algorithm takes the data set as input and finds the direction of maximal variance. The observations are then projected onto the line through the origin in that direction. PDDP selects the cluster with the highest scatter (a measure of the non-cohesiveness of a cluster) or variance in order to choose which cluster to split next, and it continues to split clusters until the specified number of clusters, k, is reached. The results can be represented in a tree diagram, in which the number of leaf nodes indicates the number of clusters in the data set.

The primary advantage of PDDP is that it is an extremely fast algorithm, and it can therefore be very efficient for clustering particularly large data sets. On the other hand, cluster decisions cannot be undone once made, and the algorithm tends to perform poorly if the initial clusters are not well separated. However, the accuracy of PDDP can be improved by using dimension reduction techniques such as the singular value decomposition (SVD).
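A minimal sketch of a single PDDP split, assuming the principal direction is taken from the SVD of the centered data; the function pddp_split is a hypothetical illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.datasets import load_iris

def pddp_split(X):
    """One PDDP step: split X by the sign of its projection onto the principal direction."""
    centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)  # Vt[0] = direction of maximal variance
    return centered @ Vt[0] >= 0                              # boolean mask defining the two subclusters

X = load_iris().data
mask = pddp_split(X)
print(mask.sum(), (~mask).sum())   # sizes of the two subclusters after the first division
# A full PDDP would repeat this on whichever cluster has the largest scatter until k leaves remain.
```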

Below are two figures showing dendrograms produced by the PDDP algorithm. The first is a dendrogram for Fisher's Iris data set, and the second shows clusters from the TwentyNewsGroups data set.

Figure 4: PDDP dendrogram for Fisher's Iris data set.

Figure 5: PDDP dendrogram for the TwentyNewsGroups data set.

3.3 k-Means

The k-Means algorithm is an iterative process that attempts to locate the centroids of the groups in the data, thereby allowing it to group each data point with the closest centroid [5]. Its relative simplicity and ease of implementation make it a popular clustering algorithm. The algorithm is as follows (a minimal sketch of these steps appears after the list):

1. Choose k locations as estimates of the centroids of the final groupings. Call these points pseudocentroids, or pcentroids.
2. Group each data point with the closest pcentroid.
3. Move each pcentroid to the mean value of all the data points in its group.
4. Repeat Steps 2 and 3 until the pcentroids stop moving.
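A minimal NumPy sketch of the four steps above; the kmeans function and its empty-cluster guard are illustrative assumptions, not the authors' code.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means following the four steps above, using random data points as initial pcentroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]            # step 1
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                    # step 2: nearest pcentroid
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])  # step 3 (empty clusters stay put)
        if np.allclose(new_centroids, centroids):                        # step 4: stop when nothing moves
            return labels, centroids
        centroids = new_centroids
    return labels, centroids
```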

This procedure successively minimizes the total sum of distances between the points and the centroids of the groups they are associated with. It is apparent that the process converges, as each step reduces this sum and there are a finite number of groupings for a finite set of points.

The first problem with this algorithm is the issue of local minima: there is no assurance that the final sum of distances between the data points and the pcentroids is the true minimum, and depending on the initialization, the final result can differ according to which local minimum the algorithm falls into. Related to this is the issue of initialization itself. Without prior knowledge of the nature of the dataset, it is hard to make an educated guess at the centroid locations that minimize the sum of distances. Some implementations pick a random point in the space containing the data, while others use a randomly chosen data point as the initial guess. Another technique is to take the centroids of the groupings produced by another algorithm and use those points as the initialization for k-Means.

3.4 DBSCAN

DBSCAN is a clustering algorithm that clusters based on distance and density. If a point has at least the minimum number of points, p, in its ε-neighborhood, we call it dense. If a point is dense, it is initialized as a cluster. If a point within the cluster is also dense, then the points in its ε-neighborhood are added to the cluster as well, and if any of those points belong to another cluster, the clusters merge.

Figure 6
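A short sketch using scikit-learn's DBSCAN implementation; the eps and min_samples values are illustrative guesses, not the paper's settings.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data
# eps is the epsilon-neighborhood radius and min_samples the density threshold p.
labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks points DBSCAN leaves unclustered as noise
```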

3.5 Expectation Maximization

Expectation maximization is a broad term covering many techniques that involve iterative maximum likelihood estimation of the parameters of statistical models. In clustering, this typically uses multivariate normal distributions, modifying the parameters until every point is associated with a particular distribution, or cluster. The algorithm initially guesses the parameters of the normal distributions and then uses the log likelihood under those distributions to determine which distribution each data point belongs to. After determining the clusters, the algorithm uses maximum likelihood to re-estimate the parameters of each distribution, and it iterates between computing parameters and clusters until the solution converges.

Figure 7
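In practice this corresponds to fitting a Gaussian mixture model by EM. A sketch assuming scikit-learn's GaussianMixture, not the authors' implementation:

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
# Fit a mixture of three multivariate normals by expectation maximization,
# then assign each observation to its most likely component.
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)
labels = gmm.predict(X)
```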

4 Dimension Reduction

Clustering algorithms alone are sometimes not enough. The size of a dataset can limit which algorithms can be used, or the computation time may be too long. In other cases the data may be too noisy, or it may have so many dimensions that clustering becomes much more difficult due to the curse of dimensionality. In these situations, dimension reduction techniques are used. They compress information by combining the original attributes and keeping only the combinations that contribute most to the data at hand.

4.1 Principal Component Analysis

PCA generates a new set of dimensions that better capture the variability of the data. We first center the data by subtracting the mean of each dimension from every point in that dimension, and then find the eigenvectors of the covariance matrix to create a matrix that transforms the original vectors into coordinates with respect to the new orthogonal basis. The new basis is more useful because the variance is greatest along the first component and decreases with the eigenvalue of each successive basis vector. The covariance between the new axes is also zero, so to reduce the dimensions of the data we can simply omit the coordinates associated with the lowest variance, assuming they are mostly noise.

Figure 8

4.2 Singular Value Decomposition

The SVD is used to give the closest rank-r approximation of a data matrix of column vectors, X, written X = U S V^T, where U and V have orthonormal columns and S is a diagonal matrix of rank r. The diagonal of S holds the singular values of X in decreasing order (their squares are, up to scaling, the eigenvalues of the covariance matrix when the data are centered), and the columns of U contain the corresponding eigenvectors, which form the new PCA basis for X. The rows of V are the dimensionally reduced vectors in this new basis.

Figure 9

4.3 Non-Negative Matrix Factorization

Non-negative matrix factorization decomposes a nonnegative data matrix X into the product of W, a matrix of basis or topic vectors, and a reduced matrix H, both nonnegative, so that X = WH. MATLAB does this by varying the matrices W and H to minimize the Frobenius norm of the residual, X - WH. Clustering the W matrix is meaningful because underlying patterns in it mimic the results of other clustering algorithms [3].
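A sketch of the three reductions applied to the iris data, assuming scikit-learn and NumPy; parameter choices such as n_components=2 are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, NMF

X = load_iris().data

# PCA: project onto the top 2 principal components (directions of greatest variance)
X_pca = PCA(n_components=2).fit_transform(X)

# SVD of the centered data yields the same components directly
U, S, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)

# NMF: X is approximately W @ H with W, H nonnegative (the iris measurements are nonnegative)
W = NMF(n_components=2, init='nndsvda', max_iter=500, random_state=0).fit_transform(X)
```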

5 Consensus Clustering

If we were to cluster the columns of our data matrix, X, using any single clustering algorithm, we could display the result in an adjacency matrix for an undirected graph. Each observation (or document) in X would be represented by a vertex, and an edge would be drawn between two vertices if those two observations were placed in the same cluster. We do not include self-loops, i.e., edges from a vertex to itself. Let E be the set of edges of this graph. The adjacency matrix, A, of the resulting graph is simply

A_{ij} = \begin{cases} 1 & (i, j) \in E \\ 0 & \text{otherwise.} \end{cases}

The absence of self-loops ensures the diagonal entries of A are zero. Now suppose we were to use N different clustering algorithms on the same data matrix, X. The result would be N different adjacency matrices A_1, A_2, ..., A_N. Summing these adjacency matrices, we form our consensus matrix, C,

C = \sum_{i=1}^{N} A_i.

C_{ij} is then the number of times that document (or observation) i was clustered with document (or observation) j. Again, we think of C as an undirected graph, this time with weighted edges. It is important to note that the N different clusterings do not have to assume the same number of clusters, k, in the data. In fact, we can vary the number of clusters, k, for each algorithm and combine all of the results in C. This very approach will be taken in an attempt to determine the number of clusters in our data.

5.1 Random Walks

We consider a random walk on the undirected graph defined by our consensus matrix, C. Let D be a diagonal matrix whose diagonal entries are the corresponding row sums of C, D = diag(Ce), where e is a vector of all ones. We prevent singleton clusters, ensuring that the diagonal entries of D are nonzero. The transition probability matrix for this Markov chain, P, and the stationary distribution, π^T, are given by

P = D^{-1} C,            (1)

\pi^T = \frac{e^T D}{e^T D e}.            (2)
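A sketch of how the consensus matrix and the random-walk quantities in (1) and (2) can be computed; the helper names are hypothetical.

```python
import numpy as np

def consensus_matrix(labelings):
    """Sum the co-clustering adjacency matrices of several labelings (no self-loops)."""
    n = len(labelings[0])
    C = np.zeros((n, n))
    for labels in labelings:
        labels = np.asarray(labels)
        A = (labels[:, None] == labels[None, :]).astype(float)   # A_ij = 1 if i and j share a cluster
        np.fill_diagonal(A, 0.0)                                  # no self-loops
        C += A
    return C

def random_walk(C):
    """Transition matrix P = D^{-1} C and stationary distribution pi, as in (1) and (2)."""
    d = C.sum(axis=1)            # row sums; assumed nonzero (no singleton clusters)
    P = C / d[:, None]
    pi = d / d.sum()
    return P, pi
```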

5.2 Reversibility

It is well known that random walks on connected, undirected graphs are reversible Markov chains [11], and thus satisfy the so-called detailed balance equations,

\mathrm{diag}(\pi) P \,\mathrm{diag}(\pi)^{-1} = P^T.

Equivalently, we can write

\mathrm{diag}(\pi)^{1/2} P \,\mathrm{diag}(\pi)^{-1/2} = \mathrm{diag}(\pi)^{-1/2} P^T \,\mathrm{diag}(\pi)^{1/2},

which shows that P is similar to a symmetric matrix and thus has real eigenvalues. In an ideal scenario, the graph created by the consensus matrix would actually show k disconnected components, each component strongly connected in and of itself but disconnected from the others. In this case we would essentially have a collection of k reversible Markov chains, each with a transition matrix with real eigenvalues, ensuring that the collective transition matrix also has real eigenvalues. Such a scenario is impractical in practice, but it motivates our continued discussion.

5.3 Uncoupled and Nearly-uncoupled Markov Chains

Definition. A Markov chain with transition probability matrix P is called uncoupled if there exists a permutation matrix Q such that QPQ^T is block diagonal:

QPQ^T = \begin{pmatrix} P_{11} & & 0 \\ & \ddots & \\ 0 & & P_{kk} \end{pmatrix},

where each P_{ii} is square.

If we consider our data observations as vertices on a graph, where edges exist only between observations that belong in the same cluster, then each P_{ii} can be thought of as defining the transition probabilities of a random walk on the vertices of one cluster. If each P_{ii} is irreducible, primitive, and reversible, which is guaranteed by our connected undirected graph components, then the algebraic multiplicity of the dominant eigenvalue, the so-called Perron root, λ_1(P) = 1, is exactly k.

While it is generally unrealistic to expect our clustering algorithms to produce a consensus matrix which would provide such an uncoupled transition matrix, we do expect to get something close. In fact, if we have faith in our clustering algorithms, we expect graph edges within a cluster to have much higher weight than those errant edges that connect vertices from different clusters. This provides us with a so-called nearly uncoupled Markov chain. Such a graph and the associated probability transition matrix are depicted in Figure 10. The magnitude of the edge weights is indicated by the thickness of the edges, and the three clusters are labelled A, B, and C. The Markov chain in Figure 10 is such that P_A e ≈ e, P_B e ≈ e, and P_C e ≈ e; hence we call these submatrices nearly stochastic.

P = \begin{pmatrix} P_A & \epsilon_{AB} & \epsilon_{AC} \\ \epsilon_{AB} & P_B & \epsilon_{BC} \\ \epsilon_{AC} & \epsilon_{BC} & P_C \end{pmatrix}

Figure 10: A nearly uncoupled Markov system.

It has been shown that a nearly uncoupled Markov chain, containing k diagonal blocks which are nearly stochastic and sporadic off-diagonal elements of small magnitude, has k eigenvalues that are close to 1 [9, 10]. If there is no further decomposition (or uncoupling) of the blocks P_{ii}, then there are no more than k eigenvalues close to 1. This group of eigenvalues close to 1 is called the Perron cluster [9, 10].

Definition. Let P be an n x n stochastic matrix from a reversible Markov chain with eigenvalues, including multiplicities, 1 = λ_1 ≥ λ_2 ≥ λ_3 ≥ ... ≥ λ_n. If the largest difference between consecutive eigenvalues occurs between λ_k and λ_{k+1}, then the set {1, ..., λ_k} is called the Perron cluster of P, and that difference is called the eigengap.

Others have used the number of eigenvalues in the Perron cluster to identify the number of clusters in various applications [2, 9, 10]. This is a natural approach to the problem; however, when applied to the consensus matrix, the resulting Perron cluster is often unclear or uninformative. We will show that iterating the consensus process widens the eigengap between λ_k and λ_{k+1} and better estimates the number of clusters.
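A sketch of estimating k from the Perron cluster, i.e., locating the largest gap among the leading eigenvalues of P; the function estimate_k is a hypothetical helper, not the authors' code.

```python
import numpy as np

def estimate_k(P, k_max=10):
    """Estimate k as the size of the Perron cluster: the largest gap among the leading eigenvalues of P."""
    eigvals = np.sort(np.linalg.eigvals(P).real)[::-1]   # reversibility makes the spectrum (numerically) real
    gaps = eigvals[:k_max - 1] - eigvals[1:k_max]        # entry k-1 is lambda_k - lambda_{k+1}
    return int(np.argmax(gaps)) + 1                      # k at which the eigengap is largest
```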

6 Iterated Consensus Clustering

We propose a new method that seeks to refine the results gained through consensus clustering in order to reach a true consensus between algorithms. This is done by taking the consensus matrix produced by a run of consensus clustering and treating it as a new piece of data, running it through the same consensus clustering procedure again. Between iterations, values that fall below a certain threshold, called the drop tolerance, are dropped. We do this under the assumption that the algorithms we use will rarely make similar mistakes: each algorithm will make incorrect connections between points, but rarely will multiple algorithms make the same error. By keeping only the strongest connections between points, determined by the percentage of algorithms that agree, we obtain a clustering that reflects the consensus of all of the algorithms.
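A sketch of one way to implement the iteration with a drop tolerance; the helpers and the cluster_fns interface are assumptions, not the authors' code.

```python
import numpy as np

def consensus(labelings):
    """Co-clustering counts across labelings (same construction as the consensus matrix above)."""
    labelings = [np.asarray(l) for l in labelings]
    n = len(labelings[0])
    C = np.zeros((n, n))
    for l in labelings:
        A = (l[:, None] == l[None, :]).astype(float)
        np.fill_diagonal(A, 0.0)
        C += A
    return C

def icc(X, cluster_fns, n_iter=3, drop_tol=0.2):
    """Iterated consensus clustering: cluster, build C, apply the drop tolerance, feed C back in as data."""
    data = X
    for _ in range(n_iter):
        C = consensus([fn(data) for fn in cluster_fns])   # each fn maps a data matrix to cluster labels
        C = C / C.max()                                   # fraction of algorithms agreeing on each pair
        C[C < drop_tol] = 0.0                             # drop weak agreements
        data = C                                          # treat the consensus matrix as the next input
    return C
```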

7 Results

Upon experimenting with the individual clustering algorithms, we decided to use k-Means, PDDP, and Expectation Maximization in our ICC code. We also included another k-Means run, with the centroids initialized to the centroids of the final PDDP clustering of the data. There were five different types of input data: the raw data, or the output of one of four dimension reduction techniques (PCA, SVD, NMF, and PCA applied to the output of NMF). Everything was repeated for each value of k that we wanted to test, and every algorithm, except the four that ran on raw data, was repeated for each number of dimensions that we wanted to use, also known as our r values. We used code to extract the number of dimensions required to capture 60%, 75%, and 90% of the variance in the data, and used those three dimension counts as our r values.

To evaluate the performance of our Iterated Consensus Clustering algorithm, we ran it against several data sets. For each data set, we ran the ICC algorithm to determine the number of clusters and to measure agreement using the correct value of k. For each data set, we generated the first 20 iterations using drop tolerances of p = 0, 0.1, and 0.2 to see which setting produced the best, most accurate results. Seeing that p = 0.2 generated somewhat better results than the others, we generated the eigengap graphs using a 0.2 drop tolerance. To simulate an attempt to find k, we took runs of all the algorithms at values of k from a wide range around the true value and added the consensus matrices to get a final result. Furthermore, we computed the accuracies for each data set (using the last column of the agreement matrix), normalizing either the rows or the columns of the matrix, depending on the size of the data.

We then generated the eigengap images for each data set, taking the graphs from the first, second, and third iterations.

Figure 11: Iris dataset ICC eigenvalues (1st iteration)

Figure 12: Iris dataset ICC eigenvalues (2nd iteration)

Figure 13: Iris dataset ICC eigenvalues (3rd iteration)

These figures show the eigengaps for the Iris data set. We can see that by the third iteration a significant eigengap has appeared, and its location tells us how many clusters are in the data set, which is three for this dataset. In calculating these graphs we used a k range from 2 to 10.

Below, we show the accuracy results of running the ICC algorithm on the Iris data. The average accuracy for each iteration is relatively high, ranging from 88% to 96%. More importantly, though, the algorithms eventually agree upon a solution that is more accurate than the average of the individual algorithms used alone.

Figure 14: ICC accuracy table for the Iris dataset.

8 Future Work

Our research has opened the door to the uncharted territory of ICC. Iterating the consensus matrix can be applied to other methods for finding k, such as the gap statistic, a method that determines the number of clusters from the change in their variances. We primarily tested ICC on smaller data sets, largely due to the long computation time of the algorithm. Future work could include refining the code to shorten computation time and testing ICC on larger data sets. It would also be helpful to test ICC with different algorithms and dimension reductions to determine which work best together.

9 Acknowledgments

We would like to thank Dr. Carl Meyer for his guidance and advice throughout this research. We would also like to thank NCSU and the NSF for organizing and funding this REU program.

References

[1] Amy N. Langville and Carl D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press.

[2] D. Fritzsche, V. Mehrmann, D. B. Szyld, and E. Virnik. An SVD approach to identifying metastable states of Markov chains. Electronic Transactions on Numerical Analysis, 29:46-69.

[3] Chris Ding, Xiaofeng He, and Horst D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proc. SIAM Data Mining Conf.

[4] Robert L. Grossman. Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers.

[5] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1).

[6] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers.

[7] Jacob Kogan. Introduction to Clustering Large and High-Dimensional Data. Cambridge University Press, Cambridge, New York.

[8] Ravi Kothari and Dax Pitts. On finding the number of clusters. Pattern Recognition Letters, 20(4).

[9] C. D. Meyer and C. D. Wessell. Stochastic data clustering. ArXiv e-prints, August.

[10] P. Deuflhard, W. Huisinga, A. Fischer, and Ch. Schütte. Identification of almost invariant aggregates in reversible nearly uncoupled Markov chains. Linear Algebra and its Applications, 315:39-59.

[11] William J. Stewart. Probability, Markov Chains, Queues, and Simulation: The Mathematical Basis of Performance Modeling. Princeton University Press.

[12] Rui Xu and D. Wunsch II. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3), May.


More information

Collaborative filtering based on a random walk model on a graph

Collaborative filtering based on a random walk model on a graph Collaborative filtering based on a random walk model on a graph Marco Saerens, Francois Fouss, Alain Pirotte, Luh Yen, Pierre Dupont (UCL) Jean-Michel Renders (Xerox Research Europe) Some recent methods:

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

CSE 40171: Artificial Intelligence. Learning from Data: Unsupervised Learning

CSE 40171: Artificial Intelligence. Learning from Data: Unsupervised Learning CSE 40171: Artificial Intelligence Learning from Data: Unsupervised Learning 32 Homework #6 has been released. It is due at 11:59PM on 11/7. 33 CSE Seminar: 11/1 Amy Reibman Purdue University 3:30pm DBART

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups

More information

Hierarchical Clustering

Hierarchical Clustering What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Image Processing. Image Features

Image Processing. Image Features Image Processing Image Features Preliminaries 2 What are Image Features? Anything. What they are used for? Some statements about image fragments (patches) recognition Search for similar patches matching

More information

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..

More information

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1 Clustering images Set of Images [Goldberger et al.] Carlos Guestrin 2005-2014

More information