Forestry 531 -- Applied Multivariate Statistics: Cluster Analysis

Purpose: To group similar entities together based on their attributes. Entities can be variables or observations. [illustration in class]

Unlike Factor Analysis, we are concerned with grouping rather than with what the causes (factors) of the groups are; the objective and approach therefore differ from Factor Analysis. Unlike multidimensional scaling (MDS; see Manly for this tool), we are not interested in reducing the dimensions so that we can produce a map, but rather in grouping the data. Unlike Multivariate Discriminant Analysis (MDA), the groups to which the entities belong are unknown.

Procedure: Formal (mathematical) techniques vs. informal (based on inspection and judgement). For few dimensions (2 or 3), informal methods may be appropriate; plots of the variables can then aid in the clustering. For many dimensions, maps based on MDS may be used to reduce the number of dimensions and an informal approach applied. Alternatively, a formal method may be more appropriate.

Problems: (1) How is similarity (dissimilarity) measured? (2) If formal procedures are to be used, which one (of many)?

Two objectives: (1) Similarity within the groups. (2) Separation between the groups. The method to be selected will depend on the objective and on the type of clusters in the data (Figure 8-2, Jackson).

Figure 8-2a: Clear separation of clusters.
Figure 8-2b: Separation OK, but some points in cluster 1 are actually closer to points in cluster 2 than to other points in cluster 1.
Figure 8-2c: Well separated; nonhomogeneous.
Figure 8-2d: One cluster using either objective.
Figure 8-2e: Two groups using the similarity-within objective; not clear using separation-between.
Figure 8-2f: Noise between the two groups -- difficult to meet either objective.
Figure 8-2g: Similar to 8-2f.

Measures of Similarity/Distance [see Manly, p. 129]

Distance when grouping variables: separation in n-dimensional space based on the n observations of each variable.
Distance when grouping observations: separation in p-dimensional space based on the p measures for each observation.

Commonly used similarity/distance measures for ratio, interval, or ordinal scale variables [examples in class]:

Given a vector x = [x1 x2 x3 ...] and a vector y = [y1 y2 y3 ...], the distance between points x and y can be described by many different methods. For ordinal variables, ranks are assigned and it is assumed that there is equal distance among ranks.

1. Euclidean distance (usually the default in packages):
   Distance(x, y) = SQRT( sum (x_i - y_i)^2 )
   where the sum is over all the dimensions; the x_i are the elements of vector x and the y_i are the elements of vector y. This is the commonly used distance between two vectors, as already covered.
   Example: Observation 1 = [2 4 6], Observation 2 = [1 2 7].
   Distance = SQRT( (2-1)^2 + (4-2)^2 + (6-7)^2 ) = SQRT(6) ≈ 2.45.

2. Squared Euclidean distance:
   Distance(x, y) = sum (x_i - y_i)^2

3. City-block or Manhattan distance:
   Distance(x, y) = sum |x_i - y_i|
   NOTE: large differences are weighted less heavily than for the Euclidean or squared Euclidean distances.

4. Chebychev distance metric:
   Distance(x, y) = max |x_i - y_i|
   Example (same observations): Distance(x, y) = max(1, 2, 1) = 2.
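As an illustration, here is a minimal SAS sketch that computes a distance matrix for the two example observations using PROC DISTANCE (the procedure these notes reference later). The data set name obs and the variable names x1-x3 are made up for this example; swapping METHOD=EUCLID for SQEUCLID, CITYBLOCK, or CHEBYCHEV gives the other measures above.

   data obs;                    /* the two example observations */
      input x1 x2 x3;
      datalines;
   2 4 6
   1 2 7
   ;
   run;

   /* Euclidean distance; OUT= holds the resulting distance matrix */
   proc distance data=obs out=dist method=euclid;
      var interval(x1 x2 x3);   /* declare the variables as interval scale */
   run;

   proc print data=dist;
   run;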

Similarity measure for ratio scale variables only: cosine.
   Similarity(x, y) = sum( x_i * y_i ) / SQRT( sum(x_i^2) * sum(y_i^2) )

Commonly used similarity/distance measures for discrete nominal variables (can also be used for ordinal variables):

1. Matching coefficient: the fraction of all variables with matching values is a measure of similarity between the observations.
   Example (ordinal): Observation 1 = [3 1 2 5], Observation 2 = [3 1 1 2]
   (1 = strongly agree; 2 = agree; 3 = neither agree nor disagree; 4 = disagree; 5 = strongly disagree)
   Similarity: 2/4 of the values match.

   Example (nominal, converted to 0 = no, 1 = yes):

             sphere  sweet  sour  crunchy  purple
   Apple     Yes     Yes    Yes   Yes      No
   Banana    No      Yes    No    No       No

   Similarity: 2/5 match.

2. Jaccard (similarity) and DJaccard (distance or dissimilarity) coefficients: like the matching coefficient, but joint absences are not counted. Using the same apple/banana example, the purple characteristic is not included since neither fruit has it, leaving four variables of which only sweet matches.
   Similarity: 1/4 match.
   Dissimilarity: 3/4 do not match.

OTHERS: SAS lists many similarity/dissimilarity measures in the documentation for PROC DISTANCE.
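A sketch of the apple/banana example in SAS PROC DISTANCE follows. The data set name fruit is made up, and the METHOD= keywords (MATCH for the simple matching coefficient, JACCARD for the Jaccard coefficient) and the nominal/anominal variable levels are used here as described in the PROC DISTANCE documentation mentioned above; check that documentation for the full list of measures.

   data fruit;                  /* binary attributes coded 1 = yes, 0 = no */
      input name $ sphere sweet sour crunchy purple;
      datalines;
   Apple  1 1 1 1 0
   Banana 0 1 0 0 0
   ;
   run;

   /* Simple matching similarity: treats the variables as symmetric nominal */
   proc distance data=fruit out=match_sim method=match;
      var nominal(sphere sweet sour crunchy purple);
      id name;
   run;

   /* Jaccard similarity: asymmetric nominal, so joint absences are ignored */
   proc distance data=fruit out=jac_sim method=jaccard;
      var anominal(sphere sweet sour crunchy purple);
      id name;
   run;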

Agglomerative or Hierarchical Procedures

The general idea is to cluster points together until one large cluster results. Once a point is in a cluster, it cannot be removed.

1. Nearest-neighbour algorithm (also called single linkage). The steps are:
   a. Join the two closest points and treat them as an entity.
   b. Join the next two closest entities (points or the first group of two points). The distance between the entity from step a and any of the remaining points is defined as the least distance between the points in the entity and the remaining points.
   c. Continue to join the closest entities, point by point, until all points are in one group.
   See Figure 8-3 and Figure 8-8a, Jackson. A large jump in the tree diagram (dendrogram) shows that the last two groups are not naturally joined.
   Uses: for clearly separated natural groups (separation-between objective, see Figure 8-13, Jackson); badly affected by noise between groups; poor at finding homogeneous groups.

2. Farthest-neighbour algorithm (also called complete linkage). As the nearest-neighbour algorithm, except that the distance between an entity and the remaining points is defined as the maximum of the distances between the individual points of the entity and the remaining points.
   Uses: highlights lack of similarity within clusters, so suitable for the similarity-within objective; suitable for compact groups with similar variances within each group.

3. Minimum squared-error method. The notion of the centroid of the groups is used. Entities are linked, entity by entity, based on the criterion of reaching the smallest squared distance between points and the centroids of entities. (See Table 8-10, Jackson.)
   Uses: extremely reluctant to include outliers (a large squared error results), so good for the similarity-within objective and little affected by noise; not appropriate for clusters such as Figure 8-13 (Jackson).

Many other hierarchical methods are available, and SPSS and SAS have many options; a description (with references) of many procedures is given in the SAS documentation for PROC CLUSTER. A sketch of the three methods above in SAS follows.
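In the sketch below, the data set name mydata, the measurement variables x1-x4, and the label variable plotid are all hypothetical; Ward's minimum variance method is shown as the SAS option closest in spirit to the minimum squared-error idea described above.

   /* Single linkage (nearest neighbour); OUTTREE= saves the tree for plotting later */
   proc cluster data=mydata method=single outtree=tree_single;
      var x1 x2 x3 x4;
      id plotid;
   run;

   /* Complete linkage (farthest neighbour) */
   proc cluster data=mydata method=complete outtree=tree_complete;
      var x1 x2 x3 x4;
      id plotid;
   run;

   /* Ward's minimum variance method (a minimum squared-error type criterion) */
   proc cluster data=mydata method=ward outtree=tree_ward;
      var x1 x2 x3 x4;
      id plotid;
   run;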

Nonhierarchical Procedures

Nonhierarchical procedures may use hill-and-valley methods to assign points to groups (e.g., Quick Cluster in SPSS), or may start with all the data in one group and proceed to break the data down (divisive methods).

1. Hill-and-valley methods. The idea is to use the concept of the density of the points (hill) tapering to fewer points (valley). The measure of density is the closeness of the points to the other points in the vicinity. One could specify a particular near point, such as the 5th nearest point: if the region is very dense, this distance will be small; if sparse, it will be large. With the unimodal procedure the steps are:
   a. Define the nuclei of the clusters (specify how many there are) based on the density of the points.
   b. Classify the remaining points to a cluster based on density (specify which nearest neighbour the distance is measured to).
   Uses: presence of noise or clusters close together; there must be well-defined natural clusters. Suited for the similarity-within objective, especially in the presence of noise. The number of clusters at the end is determined by the number of nuclei defined at the beginning. (See Figure 8-14, Jackson.)

SAS Procedures

PROC CLUSTER: In SAS PROC CLUSTER, all observations start out as individual clusters, and clustering ends when all observations are in one cluster. First, based on the distances between points, one cluster is formed by joining the two closest points. The distances between this cluster and all other points are then calculated. Next, either the two closest remaining points are joined, or the newly formed cluster is joined to one of the remaining points, whichever distance is smallest. The distances between this new cluster and all other points (or other clusters) are calculated, and the process is repeated until all points are joined together.

Interpretation: If a very large distance is needed to join points (or clusters) together, then the number of natural clusters had been reached before this last joining. A tree diagram showing the clustering, and the distances at which clusters are joined, is often very useful in determining the number of clusters.
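To act on that interpretation, the sketch below continues the hypothetical tree_single data set from the earlier PROC CLUSTER sketch: PROC TREE draws the dendrogram, and a second call cuts the tree at a chosen number of clusters and saves the memberships. The choice of three clusters is arbitrary and only for illustration.

   /* Draw the dendrogram; look for a large jump in the joining distance */
   proc tree data=tree_single;
      id plotid;
   run;

   /* Cut the tree at a chosen number of clusters and save the assignments */
   proc tree data=tree_single nclusters=3 out=members noprint;
      id plotid;
   run;

   proc print data=members;    /* the CLUSTER variable gives each observation's group */
   run;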

Variations

1. The distances used in SAS (if numeric input is used) are either Euclidean or squared Euclidean; no other distances may be specified. However, an already-calculated distance matrix can be input instead of the raw data, so that any distance measure can be used in the clustering.
2. The distance from a cluster to a point (or to another cluster) can be defined by: average linkage; the centroid method; complete linkage; single linkage.
3. Other clustering methods include: density linkage; flexible-beta; maximum likelihood; McQuitty's similarity analysis; the median method; two-stage density linkage; Ward's minimum variance method. [See the SAS documentation online for PROC CLUSTER for descriptions of other methods.]

PROC FASTCLUS

This procedure can be used for fast clustering of very large data sets. FASTCLUS performs a disjoint cluster analysis on the basis of Euclidean distances computed from one or more quantitative variables. The observations are divided such that every observation belongs to exactly one group, and the clustering is based on minimizing the squared differences from the cluster means. The user specifies the maximum number of clusters allowed; FASTCLUS chooses initial cluster seeds, assigns each observation to the nearest seed, and updates the cluster means (a k-means-type approach). A minimal call is sketched after the references.

SPSS Procedures

Two procedures are available in SPSS. Cluster is for small data sets; a hierarchical method is used. Several similarity/distance measures can be used (squared Euclidean, Euclidean, cosine, Chebychev, city-block, power) if the data are numeric; proximity data can also be input. Methods include BAVERAGE, WAVERAGE, single linkage, complete linkage, centroid clustering, the median method, and Ward's method. Tree diagrams can be output. Quick Cluster is for larger data sets; a hill-and-valley method is used.

References

Jackson, B.B. 1983. Multivariate Data Analysis: An Introduction. Richard D. Irwin, Inc., Homewood, Illinois.
Manly, B.F.J. 2005. Multivariate Statistical Methods: A Primer. 3rd edition. Chapman & Hall/CRC Press, New York. Chapter 9.
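Sketch referenced in the PROC FASTCLUS paragraph above. The data set name mydata, the variables x1-x4, and the choice of MAXCLUSTERS=4 are hypothetical.

   /* Disjoint (k-means-type) clustering into at most 4 clusters */
   proc fastclus data=mydata maxclusters=4 out=clus4;
      var x1 x2 x3 x4;       /* quantitative variables used for the Euclidean distances */
   run;

   /* The OUT= data set adds CLUSTER (assigned cluster) and DISTANCE (distance to the seed) */
   proc print data=clus4;
   run;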