
DATA CLASSIFICATORY TECHNIQUES

AMRENDER KUMAR AND V.K. BHATIA
Indian Agricultural Statistics Research Institute
Library Avenue, New Delhi-110 012
akjha@iasri.res.in

1. Introduction

Rudimentary, exploratory procedures are often quite helpful in understanding the complex nature of multivariate relationships. Searching the data for a structure of "natural" groupings is an important exploratory technique. The two most important techniques for data classification are:
- Cluster analysis
- Discriminant analysis

Although both cluster analysis and discriminant analysis classify objects into categories, discriminant analysis requires group membership to be known for the cases used to derive the classification rule, whereas in cluster analysis group membership is unknown for all cases. In addition to membership, the number of groups is also generally unknown. In cluster analysis the units within a cluster are similar to one another but differ from the units in other clusters. The grouping is done on the basis of some criterion, such as a similarity measure. Thus, in the case of cluster analysis, the inputs are similarity measures or the data from which these can be computed.

2. Cluster Analysis

Cluster analysis is a technique used for combining observations into groups such that:
(a) each group is homogeneous or compact with respect to certain characteristics, i.e. observations in each group are similar to each other; and
(b) each group is different from the other groups with respect to those characteristics, i.e. observations of one group differ from the observations of other groups.

There are various mathematical methods which help to sort objects into groups of similar objects, each such group being called a cluster. Cluster analysis is used in diversified research fields. In biology, cluster analysis is used to identify diseases and their stages; for example, by examining patients who are diagnosed as depressed, one may find that there are several distinct subgroups of patients with different types of depression. In marketing, cluster analysis is used to identify persons with similar buying habits; by examining their characteristics it becomes possible to plan future marketing strategies more efficiently.

The objective of cluster analysis is to group observations into clusters such that each cluster is as homogeneous as possible with respect to the clustering variables. The main steps in a cluster analysis are:
(i) select a measure of similarity;
(ii) decide on the type of clustering technique to be used;
(iii) choose the clustering method for the selected technique;
(iv) decide on the number of clusters;
(v) interpret the cluster solution.
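These steps can be sketched as a short SAS program. The data set name MYDATA, the variables x1-x5 and the choice of average linkage on standardised variables below are placeholders used only for illustration; they are not part of the original material.

/* Steps (i)-(ii): standardise the variables and opt for hierarchical     */
/* clustering of the coordinate data (Euclidean distance by default).     */
proc standard data=mydata mean=0 std=1 out=std_data;
   var x1-x5;
run;

/* Step (iii): average-linkage agglomeration; OUTTREE= keeps the merge    */
/* history needed for the dendrogram.                                     */
proc cluster data=std_data method=average outtree=tree;
   var x1-x5;
run;

/* Step (iv): cut the tree at a chosen number of clusters.                */
proc tree data=tree noprint out=clus nclusters=3;
run;

/* Step (v): inspect the solution, e.g. the cluster sizes.                */
proc freq data=clus;
   tables cluster;
run;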

No single generalization about cluster analysis is possible, as a vast number of clustering methods have been developed in several different fields, with different definitions of clusters and of similarity. There are many kinds of clusters, namely:
- Disjoint clusters, where every object appears in a single cluster.
- Hierarchical clusters, where one cluster can be completely contained in another cluster, but no other kind of overlap is permitted.
- Overlapping clusters.
- Fuzzy clusters, defined by a probability of membership of each object in each cluster.

2.1 Similarity Measures

A measure of closeness is required to form simple group structures from complex data sets. A great deal of subjectivity is involved in the choice of similarity measure. Important considerations are the nature of the variables (discrete, continuous or binary), the scale of measurement (nominal, ordinal, interval, ratio, etc.) and subject-matter knowledge. If items are to be clustered, proximity is usually indicated by some sort of distance. Variables, on the other hand, are grouped on the basis of some measure of association, such as the correlation coefficient. Some of the measures are given below.

Qualitative variables
For k binary variables observed on n units, the agreement between the i-th and j-th units can be summarised in a 2 x 2 table of counts:

                          j-th unit
                       Yes          No           Total
 i-th unit   Yes       K11          K12          K11 + K12
             No        K21          K22          K21 + K22
             Total     K11 + K21    K12 + K22    K

Simple matching coefficient (proportion of matches):
   d_ij = (K11 + K22) / K,   i, j = 1, 2, ..., n.
This can easily be generalised to polytomous responses.

Quantitative variables
In the case of k quantitative variables recorded on n cases, the observations can be written as an n x k data matrix:

   X11  X12  X13  ...  X1k
   X21  X22  X23  ...  X2k
   ...
   Xn1  Xn2  Xn3  ...  Xnk

Similarity: r_ij (i, j = 1, 2, ..., n) is the correlation between the rows (X_i1, ..., X_ik) and (X_j1, ..., X_jk); note that this is not the same as the correlation between variables.

Dissimilarity: d_ij = sqrt( sum_k (X_ik - X_jk)^2 ), the Euclidean distance, usually computed after the X's have been standardised. It can also be calculated for a single variable.
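A small numerical illustration of these two measures is given below, assuming SAS/IML is available; the two example units and their five values are made up for illustration only.

proc iml;
   /* two hypothetical units measured on k = 5 variables */
   xi = {1 0 1 1 0};
   xj = {1 1 1 0 0};
   k  = ncol(xi);
   /* simple matching coefficient: proportion of positions that agree */
   agree = (xi = xj);
   match = sum(agree) / k;
   /* Euclidean distance, as used for quantitative variables */
   d_ij = sqrt(ssq(xi - xj));
   print match d_ij;
quit;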

2.2 Hierarchical Agglomeration

Hierarchical clustering techniques proceed by either a series of successive mergers or a series of successive divisions. Consider the natural process of agglomerative grouping:
- Each unit is an entity to start with.
- Merge first the two units that are most similar (smallest d_ij); the merged pair now becomes a single entity.
- Examine the mutual distances between the resulting (n-1) entities and merge the two that are most similar.
- Repeat the process, merging at each step, until all units have been merged into one entity.

At each stage of the agglomerative process, note the distance between the two merging entities. Choose the stage at which there is a sudden jump in this distance, since such a jump indicates that two very dissimilar entities are being merged. This choice can be subjective.

2.3 Distance Between Entities

A large number of methods are available for defining the distance between entities, so it is not possible to enumerate them all here, but some of them are:
- Single linkage: works on the principle of the smallest distance, i.e. the nearest neighbour.
- Complete linkage: works on the principle of the largest dissimilarity, i.e. the farthest neighbour.
- Average linkage: works on the principle of average distance, i.e. the average of the distances between the units of one entity and the units of the other entity.
- Centroid: assigns each item to the cluster having the nearest centroid (mean). The process has three steps: (1) partition the items into k initial clusters; (2) proceed through the list of items, assigning each item to the cluster whose centroid is nearest; (3) recalculate the centroids of the cluster receiving the new item and the cluster losing it. Repeat until no more reassignments take place.
- Ward's method: forms clusters by maximising within-cluster homogeneity; the within-group sum of squares is used as the measure of homogeneity.
- Two-stage density linkage: units are assigned to modal entities on the basis of densities (frequencies of the k-th nearest neighbour); modal entities are allowed to join later on.

SAS Clustering Procedures

The SAS procedures for clustering are oriented towards disjoint or hierarchical clusters formed from coordinate data, a distance matrix, or a correlation or covariance matrix. The following procedures are used for clustering:
- CLUSTER: performs hierarchical clustering of observations.
- FASTCLUS: finds disjoint clusters of observations using a k-means method applied to coordinate data; recommended for large data sets.
- VARCLUS: used for hierarchical as well as non-hierarchical clustering (of variables).
- TREE: draws tree diagrams (dendrograms) using the output from the CLUSTER or VARCLUS procedures.
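The nearest-centroid reassignment described under the centroid method above is essentially the k-means procedure implemented by PROC FASTCLUS. A minimal sketch follows, again using the placeholder data set MYDATA, variables x1-x5 and an arbitrary choice of k = 4.

/* k-means style clustering: MAXCLUSTERS= sets k, MAXITER= limits the     */
/* assign-to-nearest-centroid / recompute-centroid cycles, and OUT= saves */
/* each observation's cluster number and distance to its cluster centroid.*/
proc fastclus data=mydata maxclusters=4 maxiter=20 out=kout;
   var x1-x5;
run;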

The TREE Procedure

The TREE procedure is considered very important because it produces the dendrogram from a data set created by the CLUSTER or VARCLUS procedure, and it can also create output data sets giving the results of the hierarchical clustering as a tree structure. The TREE procedure uses these output sets to print the diagram. The terminology related to the TREE procedure is as follows:
- Leaves: the objects that are clustered.
- Root: the cluster containing all the objects.
- Branch: a cluster containing at least two objects but not all of them.
- Node: a general term for leaves, branches and the root.
- Parent and child: if cluster A is the union of clusters B and C, then A is the parent and B and C are the children.

Specifications
The TREE procedure is invoked by the following statements:

PROC TREE <options>;
Optional statements:
   NAME variable;
   HEIGHT variable;
   PARENT variable;
   BY variables;
   COPY variables;
   FREQ variable;
   ID variable;

If the data set has been created by CLUSTER or VARCLUS, the only required statement is PROC TREE. The PROC TREE statement starts the TREE procedure; the options that usually find a place in it are:

Function                               Options
Specify data sets                      DATA=  DOCK=  LEVEL=  NCLUSTERS=  OUT=
Specify cluster heights                HEIGHT=  DISSIMILAR  SIMILAR
Print horizontal trees                 HORIZONTAL
Control the height axis                INC=  MAXHEIGHT=  MINHEIGHT=  NTICK=  PAGES=  POS=  SPACES=  TICKPOS=
Control characters printed in trees    FILLCHAR=  JOINCHAR=  LEAFCHAR=  TREECHAR=
Control sort order                     DESCENDING  SORT
Control printed output                 LIST  NOPRINT  PAGES

By default, the tree diagram is oriented with the height axis vertical and the object names at the top of the diagram. For a horizontal orientation the HORIZONTAL option can be used.

EXERCISE
The data below, together with the SAS code, describe the numbers of different kinds of teeth for a variety of mammals. The objective of the study is to identify suitable clusters of mammals based on the eight variables.

data teeth;
   input mammal $ v1 v2 v3 v4 v5 v6 v7 v8;
   cards;
A 2 3 1 1 3 3 3 3
B 3 2 1 0 3 3 3 3
C 2 3 1 1 2 3 3 3
D 2 3 1 1 2 2 3 3
E 2 3 1 1 1 2 3 3
F 1 3 1 1 2 2 3 3
G 2 1 0 0 2 2 3 3
H 2 1 0 0 3 2 3 3
I 1 1 0 0 2 1 3 3
J 1 1 0 0 2 1 3 3
K 1 1 0 0 1 1 3 3
L 1 1 0 0 0 0 3 3
M 1 1 0 0 1 1 3 3
N 3 3 1 1 4 4 2 3
O 3 3 1 1 4 4 2 3
P 3 3 1 1 4 4 3 2
Q 3 3 1 1 4 4 1 2
R 3 3 1 1 3 3 1 2
S 3 3 1 1 4 4 1 2
T 3 3 1 1 3 3 1 2
U 3 3 1 1 4 3 1 2
V 3 2 1 1 3 3 1 2
W 3 3 1 1 3 2 1 1
X 3 3 1 1 3 2 1 1
Y 3 2 1 1 4 4 1 1
Z 3 2 1 1 4 4 1 1
AA 3 2 1 1 3 3 2 2
BB 2 1 1 1 4 4 1 1
CC 0 4 1 0 3 3 3 3
DD 0 4 1 0 3 3 3 3
EE 0 4 0 0 3 3 3 3
FF 0 4 0 0 3 3 3 3
;

proc cluster method=average std pseudo ndesign;
   var v1-v8;
   id mammal;

This performs clustering using the average linkage distance method. The following PROC TREE statement uses the average linkage distances as the height axis by default:

proc tree;

The following PROC TREE statement sorts the clusters at each branch in order of formation and uses the number of clusters for the height axis:

proc tree sort height=n;

The following PROC TREE statements produce no printed output but create an output data set indicating the cluster to which each observation belongs at the 6-cluster level of the tree; the data set is then listed with PROC PRINT:

proc tree noprint out=part nclusters=6;
   id mammal;
   copy v1-v8;

proc sort;
   by cluster;
proc print label uniform;
   id mammal;
   var v1-v8;
   format v1-v8 1.;
   by cluster;

Data Entry and Procedure in SPSS
In SPSS, the corresponding procedures are reached through the menus Analyze > Classify, which offers K-Means Cluster, Hierarchical Cluster and Discriminant.

K-Means Cluster
(The SPSS data-entry and dialog screenshots shown in the original document are not reproduced here.)

Output
(The SPSS output listing shown in the original document is not reproduced here.)

3. Discriminant Analysis

Discriminant analysis is a multivariate technique concerned with separating distinct sets of objects (or observations) and with allocating new objects or observations to previously defined groups. It involves deriving variates, i.e. combinations of two or more independent variables, that discriminate best between a priori defined groups. The objectives of discriminant analysis are:
(i) identifying the set of variables that best discriminates between the groups;
(ii) identifying a new axis, Z, such that the new variable given by the projection of the observations onto this axis provides maximum separation or discrimination between the groups;
(iii) classifying future observations into one of the groups.

3.1 Linear Discriminant Function

If the population covariance matrices of the groups are equal, a linear discriminant function is used for classification; otherwise a quadratic discriminant function is used. The maximum number of discriminant functions that can be computed equals the minimum of G-1 and p, where G is the number of groups and p is the number of variables.

Suppose the first discriminant function is

   Z1 = W11*X1 + W12*X2 + ... + W1p*Xp,

where W1j is the weight of the j-th variable in the first discriminant function. The weights are chosen such that the ratio

   λ1 = (between-groups SS of Z1) / (within-groups SS of Z1)

is maximised. Suppose the second discriminant function is

   Z2 = W21*X1 + W22*X2 + ... + W2p*Xp.

Its weights are estimated such that the ratio

   λ2 = (between-groups SS of Z2) / (within-groups SS of Z2)

is maximised, subject to the constraint that the discriminant scores Z1 and Z2 are uncorrelated. The procedure is repeated until all possible discriminant functions have been identified.

Once the discriminant functions are identified, the next step is to determine a rule for classifying future observations. The classification procedure divides the discriminant space into G mutually exclusive and collectively exhaustive regions. To classify a given observation, its discriminant scores are computed, the observation is plotted in the discriminant space, and it is assigned to the group in whose region it falls.
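The choice between the linear and the quadratic rule can also be left to the data: in PROC DISCRIM the POOL=TEST option tests the homogeneity of the within-group covariance matrices and then uses the pooled (linear) or the within-group (quadratic) covariance matrices accordingly. A minimal sketch, in which the data set GRPDATA, the class variable GROUP and the predictors x1-x5 are placeholders:

/* POOL=TEST performs a chi-square test of equality of the within-group   */
/* covariance matrices and chooses the linear or quadratic rule from it;  */
/* CROSSVALIDATE reports leave-one-out classification results.            */
proc discrim data=grpdata method=normal pool=test crossvalidate;
   class group;
   var x1-x5;
run;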

EXERCISE
Given below are annual data on the weather parameters maximum and minimum temperature (MXT, MNT), morning and evening relative humidity (RHM, RHE), sunshine hours (SS), amount of rainfall in mm (RAINFALL) and number of rainy days (NOR) for three groups.

GROUP  MXT   MNT   RHM  RHE  SS   RAINFALL  NOR
1      29.2  24.3  87   74   3.1  102.1     24
1      31.3  25.1  86   64   2.8  57.8      17
1      30.5  24.3  90   74   2.9  97.8      24
1      30.8  24.7  89   71   3.9  115.9     20
1      31.3  23.8  91   76   3.1  179.8     29
1      30.3  23.6  91   76   3.3  117.5     26
1      29.7  23.6  91   76   2.9  122.1     23
1      30.2  22.8  92   75   2.8  68.5      27
2      30.0  23.6  91   73   2.9  88.3      22
2      31.7  23.1  89   67   2.8  69.4      22
2      29.4  24.1  91   75   3.1  141.3     28
2      30.7  23.0  89   73   3.1  73.9      22
2      29.7  23.5  92   76   2.9  116.2     29
2      31.5  23.7  89   70   2.8  82.3      20
3      30.3  23.6  93   71   3.2  156.1     23
3      30.4  24.1  91   75   3.0  114.8     26
3      30.3  24.3  92   78   2.4  87.3      25
3      30.8  23.7  92   73   3.0  131.3     24
3      31.4  23.7  90   72   2.8  115.9     29
3      30.5  24.1  91   74   3.2  155.9     24
3      30.8  24.2  91   75   2.6  45.9      16

Analysis through SPSS (analysis based on 20 observations)

DATA LIST filename /GROUP MXT MNT RHM RHE SS RAINFALL NOR.
LIST.
SET LIST result filename.
DISCRIMINANT /GROUPS = GROUP(1,3)
   /VARIABLES MXT TO NOR
   /ANALYSIS ALL
   /METHOD DIRECT
   /STATISTICS ALL.

Discriminant Functions
Function  Eigenvalue  % of variance  Wilks' Lambda  Chi-square
1         0.7925      69.21          0.4124         12.40
2         0.3527      30.79          0.7392         4.29

Discriminant Function Coefficients of Weather Variables
Function  Constant   MXT    MNT    RHM     RHE    SS      RAINFALL  NOR
1         -106.89    0.617  1.426  0.697   -0.07  -1.579  0.0123    -0.006
2         -55.02     0.528  1.427  -0.229  0.286  1.645   -0.003    0.0028

Group Centroids
Group  Function 1  Function 2
1      -0.475       0.590
2      -0.616      -0.728
3       1.250      -0.059

Calculation of Discriminant Scores
Z1 = -106.89 + 0.617*MXT + 1.426*MNT + 0.697*RHM - 0.07*RHE - 1.579*SS + 0.0123*RAINFALL - 0.006*NOR
Z2 = -55.02 + 0.528*MXT + 1.427*MNT - 0.229*RHM + 0.286*RHE + 1.645*SS - 0.003*RAINFALL + 0.00028*NOR

Misclassification Results
Actual group  No. of cases  Predicted group 1  Predicted group 2  Predicted group 3
Group 1       8             5 (62.5%)          2 (25.0%)          1 (12.5%)
Group 2       6             1 (16.7%)          5 (83.3%)          0 (0.0%)
Group 3       6             0 (0.0%)           0 (0.0%)           6 (100%)

Percentage of grouped cases correctly classified = 76.19%

Classification of future data can be done on the basis of the probability of group membership, i.e. the future observation is placed in the group with the highest probability.

Observation  Observed group  Predicted group  P(group 1)  P(group 2)  P(group 3)
21           3               3                0.30        0.10        0.60

The highest probability for the last observation (21) is 0.60, which corresponds to group 3; thus this observation is classified into group 3.
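To make the classification rule concrete, the discriminant scores can be computed for each observation in a DATA step using the coefficients reported above, and each observation can then be assigned to the group whose centroid is nearest in the (Z1, Z2) space. The sketch below is illustrative only: it assumes the weather data have been read into a SAS data set WEATHER (as in the SAS commands at the end of this section), and nearest-centroid assignment is a simplification of the highest-posterior-probability rule described above.

data scores;
   set weather;
   /* discriminant scores from the coefficient table above */
   z1 = -106.89 + 0.617*MXT + 1.426*MNT + 0.697*RHM - 0.07*RHE
        - 1.579*SS + 0.0123*RAINFALL - 0.006*NOR;
   z2 = -55.02 + 0.528*MXT + 1.427*MNT - 0.229*RHM + 0.286*RHE
        + 1.645*SS - 0.003*RAINFALL + 0.00028*NOR;
   /* squared distances to the three group centroids */
   d1 = (z1 + 0.475)**2 + (z2 - 0.590)**2;
   d2 = (z1 + 0.616)**2 + (z2 + 0.728)**2;
   d3 = (z1 - 1.250)**2 + (z2 + 0.059)**2;
   if d1 <= min(d2, d3) then predicted = 1;
   else if d2 <= d3 then predicted = 2;
   else predicted = 3;
run;

proc print data=scores;
   var GROUP z1 z2 predicted;
run;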

SAS Commands for Discriminant Analysis

data weather;
   title 'Discriminant Analysis';
   input GROUP MXT MNT RHM RHE SS RAINFALL NOR;
   datalines;
;

proc discrim data=weather method=normal pool=yes list crossvalidate CAN;
   class group;
   var MXT MNT RHM RHE SS RAINFALL NOR;
run;

References and Suggested Reading
Chatfield, C. and Collins, A.J. (1990). Introduction to Multivariate Analysis. Chapman and Hall.
Johnson, R.A. and Wichern, D.W. (1996). Applied Multivariate Statistical Analysis. Prentice-Hall of India Private Limited.
Sharma, S. (1996). Applied Multivariate Techniques. John Wiley & Sons, New York.