Multivariate Analysis
1 Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com
2 Unsupervised Learning
Cluster analysis: natural grouping.
Patterns in the data are not known before the analysis is conducted: the number of clusters, the populations, and their interpretation.
3 System → Samples → Measurements → Similarities / Distances → Clusters
4 Shape and size of clusters may be very different: spherical, ellipsoidal, linear, crescent, ring, spiral.
5 The most important methods are
Partitioning methods: each object is assigned to exactly one group.
Hierarchical methods: tree-like dendrogram; optimal number of clusters.
Fuzzy clustering methods: each object is assigned by a membership coefficient to each of the found clusters.
Model-based clustering: the different clusters are supposed to follow a certain model, such as a multivariate normal distribution with a certain mean and variance.
6 The outcome of any cluster analysis procedure is an assignment of the objects to the clusters.
Usually we cannot expect a unique solution for cluster analysis. The result depends on the distance measure, the cluster algorithm, and the chosen parameters.
Applying unsupervised methods is often a recommendable first step in data evaluation, in order to gain insight into the data structure (to detect clusters or outliers).
7 Distance and Similarity Measures
Euclidean distance: the most widely used.
Manhattan distance: less dominated by far outlying objects, since it is based on absolute rather than squared distances.
Minkowski distance: allows adjusting the power of the distances along the coordinates.
None of these distance measures is scale invariant.
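The three distance measures above can be compared with base R's `dist()`. The sketch below uses two made-up objects (the values are illustrative assumptions, not data from the slides) and also shows the lack of scale invariance: rescaling one variable, as a change of units would, changes the Euclidean distance.

```r
# Two objects measured on two variables (illustrative values, not from the slides)
X <- rbind(a = c(1, 10), b = c(4, 50))

d_euclid    <- dist(X, method = "euclidean")         # sqrt(3^2 + 40^2)
d_manhattan <- dist(X, method = "manhattan")         # |3| + |40|
d_minkowski <- dist(X, method = "minkowski", p = 3)  # (3^3 + 40^3)^(1/3)

# Rescaling the second variable (e.g. a change of units) changes the distance
X_rescaled <- X
X_rescaled[, 2] <- X_rescaled[, 2] / 10
d_rescaled <- dist(X_rescaled, method = "euclidean") # sqrt(3^2 + 4^2)

print(c(euclidean = as.numeric(d_euclid),
        manhattan = as.numeric(d_manhattan),
        minkowski = as.numeric(d_minkowski),
        rescaled  = as.numeric(d_rescaled)))
```

Because of this dependence on units, the R examples in these slides autoscale the data with `scale()` before computing PCA scores.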
8 Most of the standard clustering algorithms can be directly used for clustering the variables.
Pearson correlation distance (for two variables $x_j$ and $x_k$):
$$d_{corr}(x_j, x_k) = 1 - |r_{jk}|$$
Binary vectors (chemical structures): each vector (values 0 or 1) indicates absence or presence of a certain substructure.
Tanimoto index (Jaccard similarity coefficient):
$$t_{AB} = \frac{\sum_j \mathrm{AND}(x_{Aj}, x_{Bj})}{\sum_j \mathrm{OR}(x_{Aj}, x_{Bj})}$$
The AND sum counts the variables with a 1 in both vectors, and the OR sum counts the variables with a 1 in at least one of the vectors.
$$d_{Tani} = 1 - t_{AB}$$
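As a small worked example of the Tanimoto index, the sketch below counts the AND and OR positions for two binary vectors. The function name `tanimoto` and the two example vectors are hypothetical, chosen only for illustration.

```r
# Tanimoto index for two binary (0/1) vectors; hypothetical helper for illustration
tanimoto <- function(a, b) {
  stopifnot(length(a) == length(b))
  n_and <- sum(a == 1 & b == 1)    # variables with a 1 in both vectors
  n_or  <- sum(a == 1 | b == 1)    # variables with a 1 in at least one vector
  if (n_or == 0) return(NA_real_)  # undefined when both vectors are all zero
  n_and / n_or
}

x_A <- c(1, 1, 0, 1, 0, 0)   # made-up substructure fingerprints
x_B <- c(1, 0, 0, 1, 1, 0)

t_AB   <- tanimoto(x_A, x_B)  # AND count = 2, OR count = 4
d_tani <- 1 - t_AB
print(c(t_AB = t_AB, d_tani = d_tani))
```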
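9 Partitioning Methods
Clustered objects:
$$x_1^{(1)}, \ldots, x_{n_1}^{(1)},\; x_1^{(2)}, \ldots, x_{n_2}^{(2)},\; \ldots,\; x_1^{(k)}, \ldots, x_{n_k}^{(k)}$$
The numbers of objects assigned to the clusters add up to n:
$$n_1 + \cdots + n_k = n$$
Often, the algorithms are adapted to the type of geometry of the data.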
10 k-means
The most widely known algorithm for partitioning; a data mining algorithm.
It uses pairwise distances between the objects.
It requires the desired number of clusters k as input.
Centroids (means) represent the center of each cluster:
$$c_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^{(j)}$$
The objective (function) of k-means is to minimize the total within-cluster sum of squares:
$$\sum_{j=1}^{k} \sum_{i=1}^{n_j} \lVert x_i^{(j)} - c_j \rVert^2 \rightarrow \min$$
where $\lVert x_i^{(j)} - c_j \rVert^2$ is the squared Euclidean distance between an object $x_i^{(j)}$ of cluster j and the cluster centroid $c_j$.
11 The most widely used algorithm for k-means works as follows
1. Select a number k of desired clusters and initialize k cluster centroids $c_j$, for example, by randomly selecting k different objects.
2. Assign each object to the cluster with the closest centroid, i.e., compute for each object the distance $\lVert x_i - c_j \rVert$ for j = 1, ..., k and assign $x_i$ to the cluster where the minimum distance to the cluster centroid appears.
3. Recompute the cluster centroids.
4. Repeat steps 2 and 3 until the centroids become stable.
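The four steps above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the implementation behind R's `kmeans()`: the function name `simple_kmeans`, the optional `init` argument (explicit starting centroids), and the simulated data are all hypothetical.

```r
# Minimal k-means sketch following steps 1 to 4 (hypothetical helper, for illustration)
simple_kmeans <- function(X, k, init = NULL, max_iter = 100) {
  X <- as.matrix(X)
  # Step 1: initialize centroids, by default with k randomly chosen different objects
  centroids <- if (is.null(init)) X[sample(nrow(X), k), , drop = FALSE] else as.matrix(init)
  cluster <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance of every object to every centroid (n x k)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")  # index of the closest centroid
    # Step 4: stop once the assignments (and hence the centroids) are stable
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 3: recompute each centroid as the mean of its assigned objects
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centroids[j, ] <- colMeans(X[members, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centroids)
}

# Two simulated, well-separated groups of 20 objects each (illustrative data)
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(X, k = 2)
print(table(cluster = fit$cluster, group = rep(1:2, each = 20)))
```

Like any k-means run, the result depends on the random initialization, which is why `kmeans()` offers the `nstart` argument to try several starts.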
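14 The k-means algorithm tends to find spherical clusters
Script: k-means_spherical
    library(cluster)
    library(chemometrics)   # provides the glass data and its group labels
    data("glass")
    data("glass.grp")
    # k-means with 2, 3, and 4 clusters
    k2 <- kmeans(glass, 2)
    k3 <- kmeans(glass, 3)
    k4 <- kmeans(glass, 4)
    # PCA scores of the autoscaled data via SVD
    Xma <- scale(glass, center = TRUE, scale = TRUE)
    b_pca <- svd(Xma)
    u_pca <- Xma %*% b_pca$v
    # first two PCA scores, colored by the original groups and by the k-means clusters
    par(mfrow = c(2, 2))
    plot(u_pca[, 1], u_pca[, 2], col = glass.grp)
    plot(u_pca[, 1], u_pca[, 2], col = k2$cluster)
    plot(u_pca[, 1], u_pca[, 2], col = k3$cluster)
    plot(u_pca[, 1], u_pca[, 2], col = k4$cluster)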
15 elliptical clusters
16 Silhouette
It is proposed for partitioning techniques.
Each cluster is represented by a so-called silhouette, which is based on the comparison of its tightness and separation.
This silhouette shows which objects lie well within their cluster, and which ones are merely somewhere in between clusters.
The entire clustering is displayed by combining the silhouettes into a single plot, allowing an appreciation of the relative quality of the clusters and an overview of the data configuration.
The average silhouette width provides an evaluation of clustering validity, and might be used to select an appropriate number of clusters.
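The slides do not spell out the formula, so the sketch below assumes the standard definition of the silhouette width, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance of object i to the other objects of its own cluster (tightness) and b(i) is the smallest mean distance to the objects of any other cluster (separation). The function name `silhouette_widths` and the four one-dimensional points are hypothetical.

```r
# Silhouette widths from a distance matrix and a cluster vector
# (hypothetical helper; assumes at least two clusters)
silhouette_widths <- function(D, cluster) {
  D <- as.matrix(D)
  ids <- sort(unique(cluster))
  sapply(seq_along(cluster), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)          # usual convention for singleton clusters
    a <- sum(D[i, own]) / (sum(own) - 1)  # tightness: D[i, i] = 0 adds nothing to the sum
    b <- min(sapply(setdiff(ids, cluster[i]),
                    function(g) mean(D[i, cluster == g])))  # separation
    (b - a) / max(a, b)
  })
}

# Four points on a line forming two obvious clusters (illustrative data)
x <- c(0, 1, 10, 11)
cl <- c(1, 1, 2, 2)
s <- silhouette_widths(dist(x), cl)
print(round(s, 3))
print(mean(s))   # average silhouette width of the clustering
```

For the first point, a = 1 and b = (10 + 11) / 2 = 10.5, so s = 9.5 / 10.5. In practice `silhouette()` from the cluster package computes the same quantity and also provides the plot.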
19 Honey data (k = 4)
Rape (ra): 8-10
Honeydew (hd):
Floral (of): 4-7
Acacia (ac):
27 samples and 11 parameters
20 honeydata.mat: X, 27x11
To get an idea of how well-separated the resulting clusters are, you can make a silhouette plot using the cluster indices output from k-means.
The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.
This measure ranges from +1, indicating points that are very distant from neighboring clusters, through 0, indicating points that are not distinctly in one cluster or another, to -1, indicating points that are probably assigned to the wrong cluster.
21 Autoscaling, k = 4
    idx4 = kmeans(x,4);
    [silh4,h] = silhouette(x,idx4);
Silhouette plot (axes: Silhouette Value, Cluster)
Large silhouette values, greater than 0.6, indicate that the cluster is somewhat separated from neighboring clusters.
Points with low silhouette values, and points with negative values, indicate that the cluster is not well separated.
22 k = 3?
    idx3 = kmeans(x,3);
    [silh3,h] = silhouette(x,idx3);
Silhouette plot (axes: Silhouette Value, Cluster)
23 k = 5?
    idx5 = kmeans(x,5);
    [silh5,h] = silhouette(x,idx5);
Silhouette plot (axes: Silhouette Value, Cluster)
24 A more quantitative way to compare the solutions is to look at the average silhouette values.
Values of k from 2 to 9 were tested. The best silhouette was obtained when k = 2.
    mean(silh2)
    ans =
Without some knowledge of how many clusters are really in the data, it is a good idea to experiment with a range of values for k.
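The same comparison can be sketched in R by looping over k and recording the average silhouette width of each k-means solution. The simulated data (three well-separated groups) and all variable names are assumptions for illustration; the honey data from the slides are not used here.

```r
library(cluster)   # silhouette()

# Three simulated groups of 30 objects each (illustrative data)
set.seed(42)
centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
X <- do.call(rbind, lapply(1:3, function(g)
  cbind(rnorm(30, centers[g, 1], 0.5), rnorm(30, centers[g, 2], 0.5))))
D <- dist(X)

# Average silhouette width for k = 2, ..., 9
k_values <- 2:9
avg_width <- sapply(k_values, function(k) {
  km <- kmeans(X, centers = k, nstart = 20)   # several random starts per k
  mean(silhouette(km$cluster, D)[, "sil_width"])
})
names(avg_width) <- k_values

print(round(avg_width, 3))
best_k <- k_values[which.max(avg_width)]
print(best_k)
```

The average width is a guide rather than a proof: for the tested range it points to the k with the tightest, best-separated clusters, which should still be checked against what is known about the samples.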
25 1H NMR spectra
Tested k = 2 to 9.
Best value: k = 3
    mean(silh3) =
Silhouette plot (axes: Silhouette Value, Cluster)
26 Script: k-means_silhouette
    library(cluster)
    library(fpc)                 # plotcluster()
    data("ruspini")
    disse <- daisy(ruspini)      # dissimilarity matrix
    de2 <- disse^2               # squared dissimilarities
    k3 <- kmeans(ruspini, 3)
    sk3 <- silhouette(k3$cluster, de2)
    summary(sk3)                 # the higher the mean, the better
    par(mfrow = c(1, 1))
    plot(sk3, col = c("red", "green", "blue"))
    k4 <- kmeans(ruspini, 4)
    sk4 <- silhouette(k4$cluster, de2)
    summary(sk4)                 # lower mean than k = 3
    plot(sk4, col = c("red", "green", "blue", "purple"))
    k5 <- kmeans(ruspini, 5)
    sk5 <- silhouette(k5$cluster, de2)
    summary(sk5)                 # lower mean than k = 3
    plot(sk5, col = c("red", "green", "blue", "purple", "yellow"))
    k6 <- kmeans(ruspini, 6)
    sk6 <- silhouette(k6$cluster, de2)
    summary(sk6)                 # lower mean than k = 3
    k7 <- kmeans(ruspini, 7)
    sk7 <- silhouette(k7$cluster, de2)
    summary(sk7)                 # lower mean than k = 3
    # discriminant projection plot and cluster plot for k = 3, 4, 5
    plotcluster(ruspini, k3$cluster)
    clusplot(ruspini, k3$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
    plotcluster(ruspini, k4$cluster)
    clusplot(ruspini, k4$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
    plotcluster(ruspini, k5$cluster)
    clusplot(ruspini, k5$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
33 Hierarchical Clustering Methods
Agglomerative methods: n clusters in the first level of the hierarchy. The two closest clusters are merged in the next level, and so on.
Divisive methods: one cluster in the first level. In the next level, this cluster is split into two smaller clusters, and so on.
34 Splitting or merging
Both rely on the similarity or distance between the clusters.
Cluster analysis (CA) searches for objects which are close together in the variable space.
The general distance is given by
$$d_{ij} = \left[ \sum_{k} |x_{ik} - x_{jk}|^{N} \right]^{1/N}$$
where the sum runs over the variables k.
For N = 2, this is the familiar n-space Euclidean distance.
Higher values of N give more weight to the larger coordinate differences.
35 The distance between two clusters j and l can be determined by various methods
Complete linkage (furthest neighbour):
$$\max_{i, i'} \lVert x_i^{(j)} - x_{i'}^{(l)} \rVert$$
It uses the distance to the farthest point.
It usually results in homogeneous clusters in the early stages of the agglomeration, but the resulting clusters will be small.
36 Single linkage:
$$\min_{i, i'} \lVert x_i^{(j)} - x_{i'}^{(l)} \rVert$$
It judges the nearness of a point to a cluster on the basis of the distance to the closest point in the cluster.
It is known for the chaining effect, because even quite homogeneous clusters can be linked just by chance as soon as two objects are very similar.
37 Average linkage:
$$\operatorname{average}_{i, i'} \lVert x_i^{(j)} - x_{i'}^{(l)} \rVert$$
38 Centroid method:
$$\lVert c_j - c_l \rVert$$
The distance of a point to the centre of gravity of the points in a cluster is used.
39 Ward's method:
$$\lVert c_j - c_l \rVert \sqrt{\frac{2 n_j n_l}{n_j + n_l}}$$
A correction to the centroid method, so that the merge distances keep increasing during the clustering procedure.
40 Agglomerative clustering algorithm
1. Define each object as a separate cluster and compute all pairwise distances.
2. Merge those clusters (= objects) with the smallest distance into a new cluster.
3. Compute the distances between all clusters using a chosen method.
4. Merge those clusters with the smallest distance from step 3.
5. Proceed with steps 3 and 4 until only one cluster remains.
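The five steps above can be written out as a deliberately naive loop in base R, with the linkage passed in as a function (`max` for complete, `min` for single, `mean` for average linkage). The function name `agglomerate` and the random test data are hypothetical; `hclust()` does the same job far more efficiently and is used here only to cross-check the merge distances.

```r
# Naive agglomerative clustering; returns the distance at which each merge happens
# (hypothetical helper for illustration)
agglomerate <- function(D, linkage = max) {
  D <- as.matrix(D)
  clusters <- as.list(seq_len(nrow(D)))   # step 1: every object is its own cluster
  heights <- numeric(0)
  while (length(clusters) > 1) {          # step 5: repeat until one cluster remains
    best <- c(1, 2)
    best_d <- Inf
    # step 3: distance between every pair of clusters under the chosen linkage
    for (a in seq_len(length(clusters) - 1)) {
      for (b in (a + 1):length(clusters)) {
        d_ab <- linkage(D[clusters[[a]], clusters[[b]]])
        if (d_ab < best_d) {
          best_d <- d_ab
          best <- c(a, b)
        }
      }
    }
    # steps 2 and 4: merge the two closest clusters
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
    heights <- c(heights, best_d)
  }
  heights
}

set.seed(7)
X <- matrix(rnorm(16), ncol = 2)   # 8 illustrative objects
D <- dist(X)
h_complete <- agglomerate(D, linkage = max)
h_single   <- agglomerate(D, linkage = min)
print(rbind(sketch = h_complete,
            hclust = hclust(D, method = "complete")$height))
```

The merge distances are the heights at which branches join in the dendrogram, so they agree with the `height` component returned by `hclust()` for the same linkage.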
41 The results of hierarchical clustering are usually displayed by a dendrogram.
Points are grouped together into clusters based on their nearness or similarity.
The nearness of points in n-space reflects the similarity of their properties.
Similarity values, $S_{ij}$, are calculated as
$$S_{ij} = 1 - \frac{d_{ij}}{d_{ij}^{\max}}$$
42 Dendrogram (axes: Observation, Distance/Similarity)
43 Complete and Single Linkage
    library(cluster)
    library(chemometrics)
    data("glass")
    # complete linkage
    cl <- hclust(dist(glass), method = "complete")
    plot(cl)
    # single linkage
    sl <- hclust(dist(glass), method = "single")
    plot(sl)
51 Fuzzy Clustering
Partitioning methods assign each object to exactly one cluster.
Fuzzy clustering, in contrast, allows for a fuzzy assignment: an observation is not assigned exclusively to one cluster but in some part to all clusters.
Membership coefficients $u_{ij}$ of each observation $x_i$ to each cluster take values in [0, 1]: 0 means no assignment; 1 means full assignment.
52 Fuzzy clustering membership Group 1 or Group 2? Group 3?
53 Fuzzy C-Means algorithm
Objective function:
$$\sum_{j=1}^{k} \sum_{i=1}^{n} u_{ij}^{2} \lVert x_i - c_j \rVert^2 \rightarrow \min$$
Similar to k-means, the number of clusters k has to be provided as an input.
Centroids:
$$c_j = \frac{\sum_i u_{ij}^{2} x_i}{\sum_i u_{ij}^{2}}$$
When using only memberships of 0 and 1, this algorithm reduces to k-means.
In each iteration step the membership coefficients are updated by
$$u_{ij} = \frac{1}{\sum_{l=1}^{k} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_l \rVert} \right)^{2}}$$
and the cluster centroids are recalculated.
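The alternating updates of memberships and centroids can be sketched in base R as follows. This is an illustration of the two update formulas with memberships squared in the objective, not the code of `cmeans()` from e1071; the function name `fuzzy_cmeans`, its `init` argument (explicit starting centroids), and the simulated data are hypothetical.

```r
# Minimal fuzzy c-means sketch (hypothetical helper for illustration)
fuzzy_cmeans <- function(X, k, init = NULL, max_iter = 100, tol = 1e-8) {
  X <- as.matrix(X)
  centers <- if (is.null(init)) X[sample(nrow(X), k), , drop = FALSE] else as.matrix(init)
  U <- NULL
  for (iter in seq_len(max_iter)) {
    # squared distances ||x_i - c_j||^2 as an n x k matrix
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    d2 <- pmax(d2, .Machine$double.eps)   # guard: an object sitting exactly on a centroid
    # membership update: u_ij = (1 / d2_ij) / sum_l (1 / d2_il)
    inv <- 1 / d2
    U <- inv / rowSums(inv)
    # centroid update: c_j = sum_i u_ij^2 x_i / sum_i u_ij^2
    W <- U^2
    new_centers <- (t(W) %*% X) / colSums(W)
    shift <- max(abs(new_centers - centers))
    centers <- new_centers
    if (shift < tol) break                # centroids have become stable
  }
  list(membership = U, centers = centers, cluster = max.col(U, ties.method = "first"))
}

# Two simulated groups of 20 objects each (illustrative data)
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fc <- fuzzy_cmeans(X, k = 2)
print(round(head(fc$membership), 3))   # each row sums to 1
```

The `cluster` component is the hard assignment obtained by taking, for each object, the cluster with the largest membership coefficient, which is how the fuzzy results are colored in the PCA score plots.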
54 Fuzzy clustering
    library(chemometrics)
    data("glass")
    data("glass.grp")
    library(e1071)               # cmeans()
    fc2 <- cmeans(glass, 2)
    fc3 <- cmeans(glass, 3)
    fc4 <- cmeans(glass, 4)
    # PCA scores of the autoscaled data via SVD
    Xma <- scale(glass, center = TRUE, scale = TRUE)
    b_pca <- svd(Xma)
    u_pca <- Xma %*% b_pca$v
    par(mfrow = c(2, 2))
    plot(u_pca[, 1], u_pca[, 2], col = glass.grp, main = "original classes")
    plot(u_pca[, 1], u_pca[, 2], col = fc2$cluster, main = "fuzzy clustering with 2 clusters")
    plot(u_pca[, 1], u_pca[, 2], col = fc3$cluster, main = "fuzzy clustering with 3 clusters")
    plot(u_pca[, 1], u_pca[, 2], col = fc4$cluster, main = "fuzzy clustering with 4 clusters")
    # fuzzy clustering on the PCA scores instead of the original variables
    par(mfrow = c(1, 2))
    fcpca4 <- cmeans(u_pca, 4)
    plot(u_pca[, 1], u_pca[, 2], col = fcpca4$cluster, main = "fuzzy clustering with 4 clusters using pca scores")
    plot(u_pca[, 1], u_pca[, 2], col = fc4$cluster, main = "fuzzy clustering with 4 clusters")
57 Model-Based Clustering
Model-based clustering assumes a statistical model of the clusters.
The simplest approach: multivariate normal distributions with different means but covariance matrices of the same form $\sigma^2 I$. This gives the same spherical cluster shape and size for all clusters.
A more complicated situation: $\sigma_j^2 I$, for j = 1, ..., k. The clusters are still spherical but have different sizes.
In a third type of cluster model, the covariance matrix does not have a diagonal form (as in the previous two model types). This gives elliptically symmetric clusters.
The most general form is clusters with different covariance matrices $\Sigma_j$.
58 Spherical, equal volume: diagonal covariance (same $\sigma^2$ for all clusters)
Spherical, unequal volume: diagonal covariance of different size ($\sigma_j^2$)
Ellipsoidal, general form: different covariance matrix for each cluster ($\Sigma_j$)
59 Model-Based Clustering
    library(chemometrics)
    data("glass")
    # PCA scores of the autoscaled data via SVD
    Xma <- scale(glass, center = TRUE, scale = TRUE)
    b_pca <- svd(Xma)
    u_pca <- Xma %*% b_pca$v
    library(mclust)
    # model-based clustering on the first two PCA scores
    mbc_pc12 <- Mclust(u_pca[, 1:2])
    plot(mbc_pc12)
60 Cluster Validity and Clustering Tendency Measures
Crucial point: the number of clusters k.
Quality criterion and optimal number.
Homogeneity (within the clusters):
$$W_j = \sum_{i=1}^{n_j} \lVert x_i^{(j)} - c_j \rVert^2$$
Heterogeneity (between the clusters):
$$B_{jl} = \lVert c_j - c_l \rVert^2$$
Cluster validity:
$$V(k) = \frac{\sum_{j=1}^{k} W_j}{\sum_{j<l}^{k} B_{jl}}$$
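The validity measure above is easy to compute by hand for a given partition. The sketch below is a direct transcription of the three formulas; the function name `cluster_validity` and the four example points are hypothetical, and this is not the code behind `clvalidity()` in the chemometrics package. Small values indicate homogeneous clusters that lie far apart.

```r
# V(k) = sum of within-cluster sums of squares / sum of squared centroid distances
# (hypothetical helper for illustration)
cluster_validity <- function(X, cluster) {
  X <- as.matrix(X)
  ids <- sort(unique(cluster))
  # centroid c_j of every cluster, one row per cluster
  centers <- do.call(rbind, lapply(ids, function(j)
    colMeans(X[cluster == j, , drop = FALSE])))
  # homogeneity W_j: squared distances of the cluster members to their centroid
  W <- sapply(seq_along(ids), function(j)
    sum(sweep(X[cluster == ids[j], , drop = FALSE], 2, centers[j, ])^2))
  # heterogeneity B_jl for all pairs j < l: squared distances between centroids
  B <- dist(centers)^2
  sum(W) / sum(B)
}

# Four points in two clusters (illustrative data)
X <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
cl <- c(1, 1, 2, 2)
V <- cluster_validity(X, cl)
print(V)   # W_1 = W_2 = 2 and B_12 = 100, so V = 4 / 100
```

To compare numbers of clusters, the function would be evaluated on the cluster assignments obtained for each candidate k.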
61
    library(chemometrics)
    clvalidity(glass)
Output: $validity (columns: kmeans, fuzzy, mclust)
Plot legend: kmeans, fuzzy, mclust
62 Hopkins Statistic
For points i = 1, ..., n:
$d_w$: distance of a data object to the nearest neighboring object.
$d_U$: distance of an arbitrary (artificial) point in the data space to the nearest neighboring object.
$$H = \frac{\sum_i d_{U,i}}{\sum_i d_{U,i} + \sum_i d_{w,i}}$$
Null hypothesis: the dataset is uniformly distributed (i.e., no meaningful clusters).
Alternative hypothesis: the dataset is not uniformly distributed (i.e., contains meaningful clusters).
With H as defined above, uniformly distributed data give values around 0.5 and clustered data give values close to 1. Some implementations report the complement 1 - H instead; this appears to be the case for `hopkins()` in older versions of clustertend, where a value close to zero lets us reject the null hypothesis and conclude that the dataset is significantly clusterable.
    library(clustertend)
    hopkins(glass, n = 100)
    $H
    [1]
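To make the definition concrete, the sketch below computes H in base R exactly as the formula is written, with the artificial-point distances in the numerator. Several details are assumptions not stated on the slide: the artificial points are drawn uniformly from the bounding box of the data, m objects are sampled rather than all n, and plain (unpowered) Euclidean distances are used. The function name `hopkins_stat` is hypothetical.

```r
# Hopkins statistic H = sum(d_U) / (sum(d_U) + sum(d_w))
# (hypothetical helper for illustration)
hopkins_stat <- function(X, m = 10) {
  X <- as.matrix(X)
  D <- as.matrix(dist(X))
  diag(D) <- Inf                      # an object is not its own neighbor
  idx <- sample(nrow(X), m)
  # d_w: distance from each sampled data object to its nearest neighboring object
  d_w <- apply(D[idx, , drop = FALSE], 1, min)
  # artificial points, uniform in the bounding box of the data (assumption)
  lo <- apply(X, 2, min)
  hi <- apply(X, 2, max)
  U <- matrix(sapply(seq_len(ncol(X)), function(j) runif(m, lo[j], hi[j])), nrow = m)
  # d_U: distance from each artificial point to the nearest data object
  d_u <- apply(U, 1, function(u) min(sqrt(colSums((t(X) - u)^2))))
  sum(d_u) / (sum(d_u) + sum(d_w))
}

set.seed(3)
clustered <- rbind(matrix(rnorm(100, mean = 0,  sd = 0.1), ncol = 2),
                   matrix(rnorm(100, mean = 10, sd = 0.1), ncol = 2))
uniform <- matrix(runif(200), ncol = 2)
H_clustered <- hopkins_stat(clustered, m = 20)
H_uniform   <- hopkins_stat(uniform, m = 20)
print(c(clustered = H_clustered, uniform = H_uniform))
```

Computed this way, strongly clustered data push H toward 1, because the data objects sit much closer to each other than random points sit to the data. Implementations that put the data-object distances in the numerator report the complement, so the direction of the decision rule should always be checked against the package documentation.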
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
Cluster Analysis. Angela Montanari and Laura Anderlucci
Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a
More informationUnsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster
More informationChapter 6 Continued: Partitioning Methods
Chapter 6 Continued: Partitioning Methods Partitioning methods fix the number of clusters k and seek the best possible partition for that k. The goal is to choose the partition which gives the optimal
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationCluster Analysis. Ying Shen, SSE, Tongji University
Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group
More informationHierarchical Clustering
What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering
More informationCluster Analysis: Agglomerate Hierarchical Clustering
Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)
More informationCluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010
Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,
More informationWorking with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan
Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using
More informationHierarchical Clustering
Hierarchical Clustering Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree-like diagram that records the sequences of merges
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning
More informationClustering Part 3. Hierarchical Clustering
Clustering Part Dr Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Hierarchical Clustering Two main types: Agglomerative Start with the points
More informationDistances, Clustering! Rafael Irizarry!
Distances, Clustering! Rafael Irizarry! Heatmaps! Distance! Clustering organizes things that are close into groups! What does it mean for two genes to be close?! What does it mean for two samples to
More informationCluster Analysis. Summer School on Geocomputation. 27 June July 2011 Vysoké Pole
Cluster Analysis Summer School on Geocomputation 27 June 2011 2 July 2011 Vysoké Pole Lecture delivered by: doc. Mgr. Radoslav Harman, PhD. Faculty of Mathematics, Physics and Informatics Comenius University,
More informationClustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York
Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity
More informationMATH5745 Multivariate Methods Lecture 13
MATH5745 Multivariate Methods Lecture 13 April 24, 2018 MATH5745 Multivariate Methods Lecture 13 April 24, 2018 1 / 33 Cluster analysis. Example: Fisher iris data Fisher (1936) 1 iris data consists of
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised
More informationData Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC
Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC Clustering Idea Given a set of data can we find a natural grouping? Essential R commands: D =rnorm(12,0,1)
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More informationCluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1
Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods
More informationHierarchical clustering
Hierarchical clustering Rebecca C. Steorts, Duke University STA 325, Chapter 10 ISL 1 / 63 Agenda K-means versus Hierarchical clustering Agglomerative vs divisive clustering Dendogram (tree) Hierarchical
More informationFinding Clusters 1 / 60
Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Slides From Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining
More informationClustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationLesson 3. Prof. Enza Messina
Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical
More informationMultivariate analyses in ecology. Cluster (part 2) Ordination (part 1 & 2)
Multivariate analyses in ecology Cluster (part 2) Ordination (part 1 & 2) 1 Exercise 9B - solut 2 Exercise 9B - solut 3 Exercise 9B - solut 4 Exercise 9B - solut 5 Multivariate analyses in ecology Cluster
More informationHierarchical Clustering
Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00
More informationCluster analysis. Agnieszka Nowak - Brzezinska
Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that
More informationClustering and Visualisation of Data
Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some
More informationLecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Hierarchical Clustering Produces a set
More informationCluster Analysis for Microarray Data
Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that
More informationPart I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a
Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering
More informationClustering Gene Expression Data: Acknowledgement: Elizabeth Garrett-Mayer; Shirley Liu; Robert Tibshirani; Guenther Walther; Trevor Hastie
Clustering Gene Expression Data: Acknowledgement: Elizabeth Garrett-Mayer; Shirley Liu; Robert Tibshirani; Guenther Walther; Trevor Hastie Data from Garber et al. PNAS (98), 2001. Clustering Clustering
More informationRoad map. Basic concepts
Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?
More informationOlmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.
Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)
More informationClustering part II 1
Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:
More informationUnsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi
Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which
More informationSupervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationMachine learning - HT Clustering
Machine learning - HT 2016 10. Clustering Varun Kanade University of Oxford March 4, 2016 Announcements Practical Next Week - No submission Final Exam: Pick up on Monday Material covered next week is not
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More informationMultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A
MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI
More informationChapter 6: Cluster Analysis
Chapter 6: Cluster Analysis The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values for the q variables measured on each
More informationWhat is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology
Clustering Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods Data Clustering 1 What is Clustering? Form o unsupervised learning - no inormation
More informationStatistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte
Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,
More informationINF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22
INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task
More informationCluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008
Cluster Analysis Jia Li Department of Statistics Penn State University Summer School in Statistics for Astronomers IV June 9-1, 8 1 Clustering A basic tool in data mining/pattern recognition: Divide a
More informationPAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods
Whatis Cluster Analysis? Clustering Types of Data in Cluster Analysis Clustering part II A Categorization of Major Clustering Methods Partitioning i Methods Hierarchical Methods Partitioning i i Algorithms:
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 2
Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical
More informationDATA CLASSIFICATORY TECHNIQUES
DATA CLASSIFICATORY TECHNIQUES AMRENDER KUMAR AND V.K.BHATIA Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 akjha@iasri.res.in 1. Introduction Rudimentary, exploratory
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1
More informationHierarchical clustering
Aprendizagem Automática Hierarchical clustering Ludwig Krippahl Hierarchical clustering Summary Hierarchical Clustering Agglomerative Clustering Divisive Clustering Clustering Features 1 Aprendizagem Automática