CLUSTER ANALYSIS
V. K. Bhatia
I.A.S.R.I., Library Avenue, New Delhi-110 012

In a multivariate situation, the primary interest of the experimenter is to examine and understand the relationships among the recorded traits. In other words, the simultaneous study of several variables is of paramount importance. The data on several traits may be classified broadly in the following two ways and can be studied by various statistical techniques.

Case I: The set of variables constitutes a mixture of dependent and independent variables. In this situation, the objective of examining the relationships among the variables can be addressed by:
1. Both the dependent and the independent variables are quantitative: Multivariate multiple regression; Canonical correlation.
2. The dependent variables are quantitative but the independent variables are qualitative: MANOVA.
3. A binary (or polytomous) dependent variable and a set of quantitative independent variables: Discriminant analysis; Logistic regression; Multiple logistic/logit models.

Case II: All the variables are of the same status and there is no distinction of dependent/independent or target variables. In such a situation, the objective of examining the structure among them can be addressed by:
1. Reduction of the number of variables: Principal components.
2. Discovery of natural affinity groups: Cluster analysis.
3. Identification of unobservable underlying factors: Factor analysis.

From the above description of multivariate techniques, it is clear that cluster analysis is a methodology used to find similar objects in a set based on several traits. There are various mathematical methods which help to sort objects into groups of similar objects, each group being called a cluster. Cluster analysis is used in diversified research fields.

In biology, cluster analysis is used to identify diseases and their stages. For example, by examining patients who are diagnosed as depressed, one may find that there are several distinct sub-groups of patients with different types of depression. In marketing, cluster analysis is used to identify persons with similar buying habits; by examining their characteristics it becomes possible to plan future marketing strategies more efficiently.

Although both cluster analysis and discriminant analysis classify objects into categories, discriminant analysis requires group membership to be known for the cases used to derive the classification rule, whereas in cluster analysis group membership is unknown for all cases. In addition to membership, the number of groups is also generally unknown. In cluster analysis the units within a cluster are similar to one another but different from units in other clusters. The grouping is done on the basis of some criterion, such as a similarity measure. Thus, in cluster analysis the inputs are similarity measures, or the data from which these can be computed. No generalisation about cluster analysis is possible, as a vast number of clustering methods have been developed in several different fields, with different definitions of clusters and similarity. There are many kinds of clusters, namely:
- Disjoint clusters, where every object appears in a single cluster.
- Hierarchical clusters, where one cluster can be completely contained in another cluster, but no other kind of overlap is permitted.
- Overlapping clusters.
- Fuzzy clusters, defined by a probability of membership of each object in each cluster.

1. Similarity Measures
A measure of closeness is required to form simple group structures from complex data sets. A great deal of subjectivity is involved in the choice of similarity measures. Important considerations are the nature of the variables (discrete, continuous or binary), the scales of measurement (nominal, ordinal, interval, ratio, etc.) and subject matter knowledge. If items are to be clustered, proximity is usually indicated by some sort of distance. Variables, on the other hand, are grouped on the basis of some measure of association, such as the correlation coefficient. Some of the measures are given below.

Qualitative Variables
Consider k binary variables observed on n units. For a pair of units i and j, the responses can be cross-tabulated as follows:

                     Unit j
Unit i      Yes           No            Total
Yes         K11           K12           K11 + K12
No          K21           K22           K21 + K22
Total       K11 + K21     K12 + K22     K

Simple matching coefficient (proportion of matches, i.e. the Yes-Yes and No-No cells):

   d_ij = (K11 + K22) / K,    i, j = 1, 2, ..., n

This can easily be generalised to polytomous responses.
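A minimal SAS sketch of this workflow is given below; it is illustrative only. It assumes SAS/STAT's PROC DISTANCE with METHOD=DMATCH (a dissimilarity based on the simple matching coefficient), and the data set binary, the unit identifier unit and the binary variables y1-y3 are made-up names.

/* Illustrative only: made-up binary data for four units.                    */
data binary;
  input unit $ y1 y2 y3;
  cards;
u1 1 0 1
u2 1 1 1
u3 0 0 1
u4 0 1 0
;
run;

/* Assumes METHOD=DMATCH gives a dissimilarity based on the simple matching  */
/* coefficient; the variables are declared as nominal.                       */
proc distance data=binary method=dmatch out=dmat;
  var nominal(y1 y2 y3);
  id unit;
run;

/* The resulting distance matrix can be fed straight into PROC CLUSTER.      */
proc cluster data=dmat method=average;
  id unit;
run;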

Quantitative Variables
In the case of k quantitative variables recorded on n cases, the observations can be arranged as an n x k data matrix:

X11  X12  X13  ...  X1k
X21  X22  X23  ...  X2k
 .    .    .         .
Xn1  Xn2  Xn3  ...  Xnk

Similarity: r_ij (i, j = 1, 2, ..., n), the correlation between the rows (X_i1, ..., X_ik) and (X_j1, ..., X_jk). (This is not the same as the correlation between variables.)

Dissimilarity: the Euclidean distance

   d_ij = sqrt( Σ_k (X_ik - X_jk)^2 ),

usually computed after the X's have been standardised. It can also be calculated for a single variable.

Hierarchical Agglomeration
Hierarchical clustering techniques begin with either a series of successive mergers or a series of successive divisions. Consider the natural agglomerative process of grouping:
- Each unit is an entity to start with.
- Merge first the two units which are most similar (smallest d_ij); the merged pair now becomes a single entity.
- Examine the mutual distances between the resulting (n-1) entities and merge the two that are most similar.
- Repeat the process, merging at each step, until all units are merged into one entity.
- At each stage of the agglomerative process, note the distance between the two merging entities.
- Choose the stage which shows a sudden jump in this distance, since a jump indicates that two very dissimilar entities are being merged. This choice can be subjective.

Distance between entities
As a large number of methods is available, it is not possible to enumerate them all here, but some of them are:
- Single linkage: works on the principle of the smallest distance (nearest neighbour).
- Complete linkage: works on the principle of the largest distance (farthest neighbour).
- Average linkage: works on the principle of the average distance (the average of the distances between each unit of one entity and each unit of the second entity).
- Centroid (k-means type): this method assigns each item to the cluster having the nearest centroid (mean). The process has three steps:

  1. Partition the items into k initial clusters.
  2. Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest; recalculate the centroid (mean) for the cluster receiving the new item and for the cluster losing it.
  3. Repeat step 2 until no more reassignments take place.
- Ward's method (minimum variance).
- Two-stage density linkage: units are assigned to modal entities on the basis of densities (frequencies of the kth nearest neighbours); modal entities are allowed to join later on.

SAS Cluster Procedures
The SAS procedures for clustering are oriented towards disjoint or hierarchical clusters formed from coordinate data, distances, or a correlation or covariance matrix. The following procedures are used for clustering:

CLUSTER     Does hierarchical clustering of observations.
FASTCLUS    Finds disjoint clusters of observations using a k-means method applied to coordinate data; recommended for large data sets.
VARCLUS     Does both hierarchical and disjoint clustering of variables.
TREE        Draws tree diagrams (dendrograms) using output from the CLUSTER or VARCLUS procedures.

The TREE Procedure
The CLUSTER and VARCLUS procedures create output data sets giving the results of hierarchical clustering as a tree structure. The TREE procedure uses these output data sets to print a tree diagram. The following terminology relates to the TREE procedure:

Leaves            The objects that are clustered.
Root              The cluster containing all the objects.
Branch            A cluster containing at least two objects but not all of them.
Node              A general term for leaves, branches and the root.
Parent & Child    If A is the union of clusters B and C, then A is the parent and B and C are its children.

Specifications
The TREE procedure is invoked by the following statements:

   PROC TREE <options>;

Optional statements:

   NAME
   HEIGHT
   PARENT
   BY
   COPY
   FREQ
   ID

If the data set has been created by CLUSTER or VARCLUS, the only requirement is the statement PROC TREE. The other optional statements listed above are described after the PROC TREE statement.

PROC TREE statement
   PROC TREE <options>;
The PROC TREE statement starts the TREE procedure. The options that usually find a place in the PROC TREE statement are:

FUNCTION                  OPTION
Specify data sets         DATA=  DOCK=  LEVEL=  NCLUSTERS=  OUT=
Specify cluster heights   HEIGHT=  DISSIMILAR=  SIMILAR=

FUNCTION                              OPTION
Print horizontal trees                HORIZONTAL
Control the height axis               INC=  MAXHEIGHT=  MINHEIGHT=  NTICK=  PAGES=  POS=  SPACES=  TICKPOS=
Control characters printed in trees   FILLCHAR=  JOINCHAR=  LEAFCHAR=  TREECHAR=
Control sort order                    DESCENDING  SORT
Control output                        LIST  NOPRINT  PAGES=
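As a hedged illustration of how a few of these options work together, the sketch below assumes a tree output data set named tree (for example, one saved with OUTTREE= on PROC CLUSTER) and an ID variable named item; both names are hypothetical and not part of the worked example that follows.

/* Hypothetical sketch: horizontal, sorted tree, cut at the 4-cluster level. */
proc tree data=tree horizontal sort nclusters=4 out=groups;
  id item;
run;

/* GROUPS holds the cluster assignment (CLUSTER, CLUSNAME) of each item      */
/* at the 4-cluster level.                                                    */
proc print data=groups;
run;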

By default, the tree diagram is oriented with the height axis vertical and the object names at the top of the diagram. For a horizontal orientation, the HORIZONTAL option can be used.

Example: The data below, along with the SAS code, relate to different kinds of teeth for a variety of mammals. The objective of the study is to identify suitable clusters of mammals based on the eight variables.

data teeth;
  input mammal $ v1 v2 v3 v4 v5 v6 v7 v8;
  cards;
A  2 3 1 1 3 3 3 3
B  3 2 1 0 3 3 3 3
C  2 3 1 1 2 3 3 3
D  2 3 1 1 2 2 3 3
E  2 3 1 1 1 2 3 3
F  1 3 1 1 2 2 3 3
G  2 1 0 0 2 2 3 3
H  2 1 0 0 3 2 3 3
I  1 1 0 0 2 1 3 3
J  1 1 0 0 2 1 3 3
K  1 1 0 0 1 1 3 3
L  1 1 0 0 0 0 3 3
M  1 1 0 0 1 1 3 3
N  3 3 1 1 4 4 2 3
O  3 3 1 1 4 4 2 3
P  3 3 1 1 4 4 3 2
Q  3 3 1 1 4 4 1 2
R  3 3 1 1 3 3 1 2
S  3 3 1 1 4 4 1 2
T  3 3 1 1 3 3 1 2
U  3 3 1 1 4 3 1 2
V  3 2 1 1 3 3 1 2
W  3 3 1 1 3 2 1 1
X  3 3 1 1 3 2 1 1
Y  3 2 1 1 4 4 1 1
Z  3 2 1 1 4 4 1 1
AA 3 2 1 1 3 3 2 2
BB 2 1 1 1 4 4 1 1
CC 0 4 1 0 3 3 3 3
DD 0 4 1 0 3 3 3 3
EE 0 4 0 0 3 3 3 3
FF 0 4 0 0 3 3 3 3
;
run;

proc cluster method=average std pseudo ndesign;
  var v1-v8;
  id mammal;
run;

This will perform clustering using the average linkage distance method. The following PROC TREE statement uses the average linkage distances as the height axis by default:

proc tree;
run;

The following PROC TREE statement sorts the clusters at each branch in order of formation and uses the number of clusters for the height axis:

proc tree sort height=n;
run;

The following PROC TREE statements produce no printed output but create an output data set indicating the cluster to which each observation belongs at the 6-cluster level in the tree; the data set is then listed by PROC PRINT:

proc tree noprint out=part nclusters=6;
  id mammal;
  copy v1-v8;
run;

proc sort;
  by cluster;
run;

proc print label uniform;
  id mammal;
  var v1-v8;
  format v1-v8 1.;
  by cluster;
run;
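For comparison with the hierarchical clustering above, a disjoint (k-means type) solution for the same teeth data can be obtained with PROC FASTCLUS, as noted in the procedure table earlier. This is only a sketch; the choice of MAXCLUSTERS=6 and the output data set name kpart are arbitrary assumptions, not part of the original example.

/* Illustrative sketch: disjoint k-means style clustering of the teeth data. */
/* MAXCLUSTERS=6 is an arbitrary choice; for consistency with the STD option */
/* used above, the variables could first be standardised (e.g., PROC STDIZE).*/
proc fastclus data=teeth maxclusters=6 out=kpart;
  var v1-v8;
  id mammal;
run;

/* KPART contains the original variables plus CLUSTER (assigned cluster)     */
/* and DISTANCE (distance to the cluster seed) for each mammal.              */
proc print data=kpart;
  var mammal cluster distance;
run;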