Cluster Analysis. Angela Montanari and Laura Anderlucci
1 Introduction

Clustering a set of n objects into k groups is usually driven by the aim of identifying internally homogeneous groups according to a specific set of variables. In order to accomplish this objective, the most common starting point is computing a matrix, called the dissimilarity matrix, which contains information about the dissimilarity of the observed units. According to the nature of the observed variables (quantitative, qualitative, binary or mixed-type variables), we can define and use different measures of dissimilarity.

1.1 Dissimilarity measures

Let us consider the $n \times p$ data matrix, where n is the number of observed units and p is the number of observed variables; let us indicate with i and j two generic objects and with $x_{ik}$ the value of the k-th variable for the i-th unit.

1.1.1 Quantitative variables

If the observed variables are all quantitative, each unit can be identified with a point in the p-dimensional space and the dissimilarity between two objects i and j can be measured through:

a) Euclidean distance:
$$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$
b) City-block or Manhattan distance:
$$d_{ij} = \sum_{k=1}^{p} |x_{ik} - x_{jk}|$$

Both of these distances are specific cases, respectively for $l = 2$ and $l = 1$, of a generic distance family known as the

c) Minkowski metric:
$$d_{ij} = \left\{ \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{l} \right\}^{1/l}$$

The larger $l$, the greater the weight given to large differences $|x_{ik} - x_{jk}|$. For all three distances the properties of a distance function hold:

1. $d_{ij} \geq 0$
2. $d_{ii} = 0$
3. $d_{ij} = d_{ji}$
4. $d_{ij} \leq d_{it} + d_{tj}$ (triangle inequality)
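As an illustration, the following minimal sketch computes these distances for two rows of a data matrix and builds the full dissimilarity matrix. It assumes NumPy is available; the function and variable names are chosen here for illustration and are not part of the notes.

```python
import numpy as np

def minkowski(x_i, x_j, l=2):
    """Minkowski distance between two p-dimensional units; l = 2 gives
    the Euclidean distance, l = 1 the Manhattan (city-block) distance."""
    return np.sum(np.abs(x_i - x_j) ** l) ** (1.0 / l)

# Toy n x p data matrix (n = 3 units, p = 2 quantitative variables).
X = np.array([[1.0, 2.0],
              [4.0, 6.0],
              [0.0, 1.0]])

d_euclid = minkowski(X[0], X[1], l=2)   # sqrt((1-4)^2 + (2-6)^2) = 5.0
d_manhat = minkowski(X[0], X[1], l=1)   # |1-4| + |2-6| = 7.0

# Full n x n dissimilarity matrix based on the Euclidean distance.
D = np.array([[minkowski(x_i, x_j) for x_j in X] for x_i in X])
print(d_euclid, d_manhat)
print(D)
```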
1.1.2 Binary variables

When dealing with binary or categorical variables it is not possible to define true distance measures, but one can use dissimilarity measures which:

- are still close to zero when i and j are similar to each other
- increase when the differences between two units are larger
- are symmetric
- are equal to zero when considering the dissimilarity of an object with itself

Nevertheless, these dissimilarity measures generally do not satisfy the triangle inequality.

Consider $x_{ik}$ that can assume only two values, coded as 0 and 1. In order to calculate the dissimilarity between two units, the relevant information is represented in the following $2 \times 2$ table:

                object j
                  1    0
    object i  1   a    b
              0   c    d

where a is the number of variables that take value 1 for both objects i and j, b is the number of variables that assume value 1 in object i and value 0 in object j, c is the number of times the opposite occurs, and d is the number of variables that assume value 0 for both objects i and j. Furthermore, $a + b + c + d = p$, with p equal to the number of observed variables.

Some of the best known dissimilarity measures are:

a) Simple dissimilarity coefficient
$$d_{ij} = \frac{b + c}{a + b + c + d}$$

b) Jaccard coefficient
$$d_{ij} = \frac{b + c}{a + b + c}$$

Jaccard's coefficient is recommended when binary variables indicate the presence/absence of a specific characteristic. In this case the agreement 1/1 can have a different meaning than the agreement 0/0; indeed, the absence of a certain characteristic in both objects i and j does not contribute to improving the similarity between them. Hence, it can be more useful to divide by $a + b + c$ rather than by $a + b + c + d$.

1.1.3 Categorical variables

When the observed variables are categorical with more than two levels, a simple measure of the degree of dissimilarity between two units is given by

a) $d_{ij} = 1 - c_{ij}$

where $c_{ij}$ is the ratio between the number of variables that assume the same value in both units i and j and the total number of observed variables p.
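A minimal sketch of these coefficients, again assuming NumPy and using illustrative function names:

```python
import numpy as np

def binary_dissimilarity(x_i, x_j, jaccard=False):
    """Simple dissimilarity (b+c)/(a+b+c+d) or Jaccard (b+c)/(a+b+c)
    for two 0/1 coded vectors of equal length p."""
    x_i, x_j = np.asarray(x_i), np.asarray(x_j)
    a = np.sum((x_i == 1) & (x_j == 1))
    b = np.sum((x_i == 1) & (x_j == 0))
    c = np.sum((x_i == 0) & (x_j == 1))
    d = np.sum((x_i == 0) & (x_j == 0))
    denom = a + b + c if jaccard else a + b + c + d
    return (b + c) / denom

def categorical_dissimilarity(x_i, x_j):
    """d_ij = 1 - c_ij, where c_ij is the proportion of variables
    taking the same value in both units."""
    x_i, x_j = np.asarray(x_i), np.asarray(x_j)
    return 1.0 - np.mean(x_i == x_j)

# Example: p = 5 binary variables and p = 4 categorical variables.
print(binary_dissimilarity([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))                # (1+1)/5 = 0.4
print(binary_dissimilarity([1, 0, 1, 1, 0], [1, 1, 0, 1, 0], jaccard=True))  # (1+1)/4 = 0.5
print(categorical_dissimilarity(["a", "b", "c", "a"], ["a", "b", "d", "b"])) # 1 - 2/4 = 0.5
```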
1.1.4 Mixed-type variables

When the observed variables have a different nature, a dissimilarity measure that is able to deal with their specific characteristics is:

a) Gower index
$$d_{ij} = 1 - s_{ij}, \qquad s_{ij} = \frac{\sum_{k=1}^{p} \delta_{ijk}\, s_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}}$$

where, for quantitative variables,
$$s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{\text{range of variable } x_k}$$

For qualitative or binary variables, $s_{ijk}$ is equal to 1 if i and j agree with respect to variable k and equal to 0 if they disagree. The indicator $\delta_{ijk}$ is equal to 1 if both quantities $x_{ik}$ and $x_{jk}$ are recorded and equal to 0 otherwise; furthermore, it is equal to 0 for asymmetric binary variables when the agreement 0/0 is observed.
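A sketch of the Gower index for a pair of units with mixed variables follows. NumPy is assumed, names are illustrative, missing values are coded as None, and the asymmetric-binary 0/0 rule is omitted for brevity.

```python
def gower_dissimilarity(x_i, x_j, is_quantitative, ranges):
    """Gower index d_ij = 1 - s_ij for two units with mixed variables.
    is_quantitative[k] marks quantitative variables, ranges[k] is the
    observed range of variable k; missing values are coded as None."""
    num, den = 0.0, 0.0
    for k, (a, b) in enumerate(zip(x_i, x_j)):
        if a is None or b is None:          # delta_ijk = 0: skip this variable
            continue
        if is_quantitative[k]:
            s = 1.0 - abs(a - b) / ranges[k]
        else:
            s = 1.0 if a == b else 0.0      # agreement score for qualitative variables
        num += s
        den += 1.0                          # delta_ijk = 1
    return 1.0 - num / den

# Two units: one quantitative variable (range 10), one categorical, one missing.
x_i = [2.0, "red", None]
x_j = [7.0, "red", "yes"]
print(gower_dissimilarity(x_i, x_j,
                          is_quantitative=[True, False, False],
                          ranges=[10.0, None, None]))   # 1 - (0.5 + 1)/2 = 0.25
```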
2 Agglomerative Hierarchical Clustering Methods

Hierarchical clustering methods are criteria aimed at creating nested partitions that allow investigations with respect to different within-cluster homogeneity levels. Those relationships are often represented by dendrograms; by cutting the graph at a certain level of dissimilarity we obtain a partition of the objects into different groups. Each partition into k groups is optimal in a step-wise sense: it is the best partition into k groups one can obtain given the partition into k+1 groups defined at the previous step, for agglomerative methods, or the best partition given the partition into k-1 groups yielded at the previous step, for divisive methods.

Notice that for agglomerative hierarchical clustering methods, when two clusters are merged, they will remain so until the end of the hierarchy; a mistake committed in the initial phase of the classification can never be corrected. This is both the strength and the weakness of this kind of method; the larger the number of units, the higher the probability of misclassifying a unit.

If the aim is to find a number of clusters much lower than the number of units (k << n), divisive methods have a lower risk of committing a mistake, since the number of steps needed to reach the partition into k groups is smaller. On the other hand, divisive methods have a high computational complexity, and this is the reason why agglomerative approaches are preferred.

Starting from the dissimilarity matrix, agglomerative hierarchical clustering methods initially consider each unit as a separate cluster, and they subsequently merge into one cluster those units that have the lowest dissimilarity. Then they evaluate the dissimilarities between this new cluster and the others, merging again the two groups that have the lowest value in the dissimilarity matrix. The different clustering methods in this class differ in the way the dissimilarity between clusters is defined and evaluated.

2.1 Single Linkage (Nearest Neighbour) method

The distance between two clusters C and G is defined as the dissimilarity between the two closest points:
$$d_{CG} = \min_{i \in C,\, j \in G} d_{ij}$$
i.e. it is the shortest distance between any element i of C and any element j of G. Clusters obtained with the Single Linkage method tend to have a long and narrow shape and to be internally poorly homogeneous, since only two similar observations from two different clusters are needed in order to merge the two clusters (chaining effect). For this reason, this method is not able to identify overlapping clusters, but it works well when clusters are well separated, because in every cluster yielded by the procedure any unit is more similar to the units belonging to the same cluster than to units of different groups.

2.2 Complete Linkage (Furthest Neighbour) method

The distance between two clusters C and G is defined as the dissimilarity between the two furthest-away points:
$$d_{CG} = \max_{i \in C,\, j \in G} d_{ij}$$
i.e. it is the longest distance between any element i of C and any element j of G. In other words, $d_{CG}$ is the diameter of the smallest sphere that can wrap the cluster obtained by merging C and G. Clusters yielded by this procedure tend to be spherical and compact; nothing can be said about their separation.
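For illustration, a minimal sketch (NumPy assumed; names are illustrative) of how the single-linkage and complete-linkage distances between two clusters are obtained from a precomputed dissimilarity matrix:

```python
import numpy as np

def cluster_distance(D, C, G, method="single"):
    """Distance between clusters C and G (lists of unit indices),
    given the n x n dissimilarity matrix D."""
    block = D[np.ix_(C, G)]        # all pairwise dissimilarities d_ij, i in C, j in G
    if method == "single":         # nearest neighbour: minimum over the pairs
        return block.min()
    if method == "complete":       # furthest neighbour: maximum over the pairs
        return block.max()
    raise ValueError("unknown method")

# Toy symmetric dissimilarity matrix for 4 units.
D = np.array([[0.0, 1.0, 4.0, 5.0],
              [1.0, 0.0, 3.0, 6.0],
              [4.0, 3.0, 0.0, 2.0],
              [5.0, 6.0, 2.0, 0.0]])

print(cluster_distance(D, [0, 1], [2, 3], "single"))    # min of {4, 5, 3, 6} = 3.0
print(cluster_distance(D, [0, 1], [2, 3], "complete"))  # max of {4, 5, 3, 6} = 6.0
```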
2.3 Average Linkage

The distance between two clusters C and G is defined as the average of all the dissimilarities between pairs of points, one from each cluster:
$$d_{CG} = \frac{1}{n_C n_G} \sum_{i \in C,\, j \in G} d_{ij}$$
where $n_C$ is the number of elements in cluster C and $n_G$ is the number of elements in cluster G.

2.4 Centroid Linkage

The distance between two clusters C and G is defined as the Euclidean distance between the two cluster centres:
$$d_{CG} = \| \bar{x}_C - \bar{x}_G \|$$
where $\bar{x}_C$ is the mean vector of the observed variables in cluster C and $\bar{x}_G$ is the mean vector of the observed variables in cluster G. This method is suitable for quantitative variables only. A problem that may be encountered with this approach is that the dissimilarity at which a merge occurs may be lower than the one measured at the previous merge (because of the centroid shift), and this makes the results less interpretable.

2.5 Ward method

The distance between two clusters C and G is defined as the deviance between the two clusters. Let $x_{ikg}$ be the value of variable k observed on unit i in cluster g and let $\bar{x}_{kg}$ be the average of variable k in cluster g; then
$$E_g = \sum_{k=1}^{p} \sum_{i=1}^{n_g} (x_{ikg} - \bar{x}_{kg})^2$$
$E_g$ is the deviance within the g-th cluster and $E = \sum_g E_g$ is the deviance within the whole partition. As usual, the classification process starts by considering each unit as a cluster; in this case the within deviance is 0. Each merge leads to an increase of the within deviance; the algorithm looks for the merge that implies the smallest increase of the within deviance of the whole partition, i.e. the chosen merge is the one that leaves the resulting partition with the largest between deviance.

Like centroid linkage, this method can only be applied to quantitative data, but differently from it, it does not have problems with the order of the dissimilarities in the different levels of the hierarchy. Both methods identify spherical clusters.
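As an end-to-end illustration, a sketch using SciPy's hierarchical clustering routines (assuming SciPy is installed; the data are made up for the example). Cutting the resulting hierarchy at a given number of groups yields one of the nested partitions described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two artificial groups of units in p = 2 dimensions.
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(3, 0.5, size=(20, 2))])

# Agglomerative hierarchies under different cluster-distance definitions
# (Euclidean dissimilarities are computed internally from X).
Z_single  = linkage(X, method="single")    # nearest neighbour
Z_average = linkage(X, method="average")   # average linkage
Z_ward    = linkage(X, method="ward")      # Ward method

# Cut the Ward hierarchy to obtain a partition into k = 2 groups.
labels = fcluster(Z_ward, t=2, criterion="maxclust")
print(labels)
```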
3 Partitioning Clustering Methods

The hierarchical methods described in the previous section generate n different nested classifications that go from n one-element clusters to a single cluster containing all the n observations. Partitioning methods instead give rise to a single partition with K groups, where K is specified in advance or is determined by the clustering method itself. In the partitioning approach, the number of clusters (say K) is decided a priori and the data are split into clusters in such a way that the within-cluster dissimilarity is minimized. The function to be minimized is
$$\sum_{g=1}^{K} \sum_{i:\, x_i \in C_g} (x_i - \bar{x}_{C_g})^2$$
The problem seems simple, but the numbers involved actually make it impossible to consider all possible partitions of n individuals into K groups. That is why several algorithms, designed to search for the minimum value of the clustering criterion, have been developed.

3.1 K-means Algorithm

A very popular partitioning method is the K-means algorithm. The classification begins by choosing the first K observations of the data matrix as centroids and carries on by assigning each of the remaining n - K units to the closest centroid. This first partition into K groups is then improved by taking the centroids of the K groups as new aggregation points and by associating each unit with the closest centroid. The resulting classification strongly depends on the order in which units are assigned to the different groups; nevertheless, it has been shown that this effect is of little relevance when clusters are clearly separated.
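A minimal sketch of the procedure just described (NumPy assumed; names are illustrative), using the first K observations as initial centroids and then iterating the assignment and recomputation steps:

```python
import numpy as np

def k_means(X, K, n_iter=10):
    """Simple K-means sketch: start from the first K rows as centroids,
    then alternately assign units to the closest centroid and recompute
    the centroids as the group means."""
    centroids = X[:K].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance of every unit from every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                 # closest centroid for each unit
        for g in range(K):
            if np.any(labels == g):                # keep the old centroid if a group empties
                centroids[g] = X[labels == g].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.4, size=(25, 2)),
               rng.normal(3, 0.4, size=(25, 2))])
labels, centroids = k_means(X, K=2)
print(centroids)
```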
4 How to choose the number of clusters

So far we have not discussed how to choose the appropriate number of clusters. There are plenty of methods in the literature designed to solve this problem, but up to now there is no unique solution. Here a few methods are presented.

4.1 The Average Silhouette Width (ASW)

For a partition of n units into K clusters $C_1, \ldots, C_K$, suppose object i has been assigned to cluster $C_h$. We indicate with $a(i)$ the average dissimilarity of i to all other objects of cluster $C_h$:
$$a(i) = \frac{1}{|C_h| - 1} \sum_{j \in C_h,\, j \neq i} d(i, j)$$
This expression makes sense only when $C_h$ contains objects other than i. Let us now consider any cluster $C_l$ different from $C_h$ and define the average dissimilarity of i to all objects of $C_l$:
$$d(i, C_l) = \frac{1}{|C_l|} \sum_{j \in C_l} d(i, j)$$
After computing $d(i, C_l)$ for all clusters $C_l$ different from $C_h$, we select the smallest of them:
$$b(i) = \min_{C_l:\, i \notin C_l} d(i, C_l)$$
The cluster for which this minimum is attained is called the neighbour of object i; it is like the second-best choice for clustering object i. The silhouette $s(i)$ is obtained by combining $a(i)$ and $b(i)$ as follows:
$$s(i) = \begin{cases} 1 - \dfrac{a(i)}{b(i)}, & \text{if } a(i) < b(i) \\[4pt] 0, & \text{if } a(i) = b(i) \\[4pt] \dfrac{b(i)}{a(i)} - 1, & \text{if } a(i) > b(i). \end{cases}$$
This can be rearranged into one formula:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
and it can easily be seen that $-1 \leq s(i) \leq 1$ for each object i.

When $s(i)$ is at its largest (that is, close to 1), the within dissimilarity $a(i)$ is much smaller than the smallest between dissimilarity $b(i)$. Therefore, we can say that i is well classified: the second-best choice is not nearly as close as the actual choice. When $s(i)$ is about zero, then $a(i)$ and $b(i)$ are approximately equal, and it is not clear whether i should have been assigned to $C_h$ or to $C_l$: it lies equally far away from both. The worst situation takes place when $s(i)$ is close to -1, that is, when $a(i)$ is actually much larger than $b(i)$; hence i lies on average closer to $C_l$ than to $C_h$, and it would therefore have seemed better to assign object i to its neighbour. The silhouette $s(i)$ hence measures how well unit i has been classified.

By averaging the $s(i)$ over all the observations $i = 1, \ldots, n$, we obtain the so-called average silhouette width (ASW):
$$\bar{s}(K) = \frac{1}{n} \sum_{i=1}^{n} s(i)$$
If K is not fixed and needs to be estimated, the ASW estimate $\hat{K}_{ASW}$ is obtained by maximizing $\bar{s}(K)$ over K. This criterion leads to a clustering that emphasises the separation between the clusters and their neighbouring clusters.

4.2 Pearson Gamma

If K is not fixed and needs to be estimated, the Pearson Gamma (PG) estimator $\hat{K}_{\Gamma}$ maximises the Pearson correlation $\rho(d, m)$ between the vector d of pairwise dissimilarities and the binary vector m that is 0 for every pair of observations in the same cluster and 1 for every pair of observations in different clusters. PG emphasises a good approximation of the dissimilarity structure by the clustering, in the sense that pairs of observations in different clusters should tend to have large dissimilarities.
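A sketch of how these two criteria can be used to choose K in practice. NumPy and SciPy are assumed, the hand-rolled ASW and PG functions below simply follow the formulas above, and the data and names are illustrative rather than taken from the notes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def average_silhouette_width(D, labels):
    """ASW of a partition, from the n x n dissimilarity matrix D."""
    n = len(labels)
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        if own.sum() == 1:               # singleton cluster: leave s(i) = 0
            continue
        a = D[i, own & (np.arange(n) != i)].mean()
        b = min(D[i, labels == l].mean() for l in set(labels) if l != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

def pearson_gamma(D, labels):
    """Correlation between pairwise dissimilarities and the 0/1 indicator
    of 'pair belongs to different clusters'."""
    iu = np.triu_indices(len(labels), k=1)
    d = D[iu]
    m = (labels[iu[0]] != labels[iu[1]]).astype(float)
    return np.corrcoef(d, m)[0, 1]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.4, size=(20, 2)) for c in (0, 3, 6)])
D = squareform(pdist(X))
Z = linkage(X, method="average")

for K in range(2, 6):
    labels = fcluster(Z, t=K, criterion="maxclust")
    print(K, round(average_silhouette_width(D, labels), 3),
             round(pearson_gamma(D, labels), 3))
# The K maximizing ASW (or PG) would be taken as the estimated number of clusters.
```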
Remark. When the number of clusters is known (hence there is no need to estimate it), one can use these indices as a measure of clustering quality.