Cluster Analysis, Multidimensional Scaling and Graph Theory
Dpto. de Estadística, E.E. y O.E.I., Universidad de Alcalá
luisf.rivera@uah.es
Outline
1. The problem of statistical classification
2. Cluster analysis
3. Multidimensional scaling and graph theory
4. The adjacency matrix
5. The Iris data
6. Conclusions and references
1. The problem of Statistical Classification
Introduction
Identifying groups of similar cases is an important task in everyday research. Measurements of p variables may be taken on n individuals, collected in the data matrix
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$
What is the group structure of these cases?
1. The problem of Statistical Classification
Classification and statistical learning
Classification systems look for a rule to classify objects. They can be supervised or unsupervised, depending on whether there is prior knowledge of the classes to which the objects belong.
Classical methods: Discriminant Analysis, Cluster Analysis.
Modern methods: Statistical learning (supervised, unsupervised).
1. The problem of Statistical Classification
Supervised vs. unsupervised learning (I)
To develop a supervised classification system, it is necessary to know the classes (C) into which the population is divided, and also the class to which each observed individual belongs. For each case i, we must know its class label from the set {1, 2, ..., C}:
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad y_i \in \{1, 2, \ldots, C\}, \quad i = 1, \ldots, n.$$
1. The problem of Statistical Classification
Supervised vs. unsupervised learning (II)
A supervised classification system provides some kind of mathematical function Y = Y(X, w), where w is a vector of parameters adjusted from the data. The values of these parameters are determined by a learning algorithm, which usually tries to minimize a classification error function.
Supervised classifiers: Discriminant Analysis, Neural Networks, SVM, Trees, ...
1. The problem of Statistical Classification
Supervised vs. unsupervised learning (III)
Unsupervised classification tries to discover the group structure that exists naturally in the data. Normally, the real classes (C) in the population are unknown, so there is no knowledge about the class each object belongs to. This kind of problem is sometimes referred to as pattern recognition, in the sense that the aim is to discover classes of objects in the data.
1. The problem of Statistical Classification
Supervised vs. unsupervised learning (IV)
Unsupervised classification algorithms seek to divide the dataset into groups or classes of elements. Normally, a group is described as a set of similar cases that are different from the cases classified in other groups. It is necessary to find a way to measure the closeness between cases; dissimilarity measures are used for this.
Unsupervised classifiers: Cluster analysis, Neural networks, K-NN, ...
2. Cluster analysis
Introduction
The purpose of cluster analysis is to discover groups of elements in data with the following properties:
- Each element belongs to only one group.
- Every element must be classified into some group.
- Elements in a group must be homogeneous (similar to each other) and different from elements in other groups.
Clustering methods can be:
- Partitioning: based directly on the elements of the dataset.
- Hierarchical: based on the distances between elements of the dataset.
2. Cluster analysis
Example
Let's consider this dataset (2 dimensions, 20 cases):

Case  X1    X2      Case  X1    X2
 1    2.25  3.50     11   2.25  1.00
 2    2.50  4.00     12   2.50  1.75
 3    2.25  3.00     13   2.75  1.25
 4    3.00  3.50     14   2.50  1.50
 5    3.25  3.00     15   2.75  1.50
 6    2.75  3.25     16   4.00  0.00
 7    3.50  2.25     17   4.25  1.00
 8    3.25  2.00     18   4.25  0.25
 9    3.75  2.50     19   4.50  0.50
10    4.00  2.25     20   4.50  0.75

How many groups are there?
2. Cluster analysis
Example. k-means (I)
[Scatter plots of the k-means partitions for k = 2 and k = 3]
2. Cluster analysis
Example. k-means (II)
[Scatter plots of the k-means partitions for k = 4 and k = 5]
What's the structure of this dataset?
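A minimal sketch of these k-means runs, assuming scikit-learn is available; the coordinates are the 20 cases from the data table above, and the random seed is an arbitrary choice, so the exact partitions may differ slightly from the plots.

```python
# k-means over the 20-case example (coordinates from the data table above).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [2.25, 3.50], [2.50, 4.00], [2.25, 3.00], [3.00, 3.50], [3.25, 3.00],
    [2.75, 3.25], [3.50, 2.25], [3.25, 2.00], [3.75, 2.50], [4.00, 2.25],
    [2.25, 1.00], [2.50, 1.75], [2.75, 1.25], [2.50, 1.50], [2.75, 1.50],
    [4.00, 0.00], [4.25, 1.00], [4.25, 0.25], [4.50, 0.50], [4.50, 0.75],
])

for k in (2, 3, 4, 5):
    # several restarts (n_init) guard against bad local minima; seed is arbitrary
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}:", labels)
```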
2. Cluster analysis
Example. Hierarchical methods
Whereas k-means operates on the dataset itself, hierarchical methods are based on the distances between cases in the dataset (an n×n matrix).
[20×20 matrix of pairwise Euclidean distances between the cases; this is a dissimilarity matrix]
Which proximity measure is better? Which clustering method should be used?
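A sketch of how such a dissimilarity matrix can be computed with SciPy, reusing the array X defined above:

```python
# The n x n Euclidean distance matrix that hierarchical methods start from.
from scipy.spatial.distance import pdist, squareform

D = squareform(pdist(X, metric='euclidean'))
print(D.shape)              # (20, 20), symmetric, zero diagonal
print(round(D[0, 1], 3))    # 0.559, the distance between cases 1 and 2
```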
2. Cluster analysis
Example. Hierarchical methods: single linkage
[Dendrogram of the 20 cases using single linkage, rescaled distance cluster combine]
2. Cluster analysis
Example. Hierarchical methods: complete linkage
[Dendrogram of the 20 cases using complete linkage, rescaled distance cluster combine]
2. Cluster analysis
Example. Hierarchical methods: centroid method
[Dendrogram of the 20 cases using the centroid method, rescaled distance cluster combine]
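The three dendrograms can be reproduced along these lines with SciPy's hierarchical clustering; this is a sketch reusing X from the example above, and plotting details will differ from the original figures.

```python
# Single, complete and centroid linkage dendrograms for the example data.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

d = pdist(X, metric='euclidean')   # condensed distance vector (Euclidean)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, method in zip(axes, ('single', 'complete', 'centroid')):
    # centroid linkage is only well defined for Euclidean distances,
    # which is what we pass here
    dendrogram(linkage(d, method=method), ax=ax)
    ax.set_title(f"{method} linkage")
plt.tight_layout()
plt.show()
```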
2. Cluster analysis
Shortcomings
- In hierarchical methods, decisions such as the proximity measure and the clustering (linkage) method must be made, and the results depend on those decisions.
- k-means can be used only if the Euclidean distance is valid for the variables in the dataset.
- In both methodologies, there is no specific criterion to determine the number of groups.
- As the dimensionality of the problem grows, no geometric interpretation is possible.
3. Multidimensional scaling and graph theory
Graphs
A (weighted) graph on V is a pair G = (V, E), where V is the set of nodes and E is the set of edges or lines that connect them. The edges connect nodes from V and define the shape of G. In graph theory, only the essential structure of the drawing matters: the exact layout of the edges is irrelevant, only which nodes they connect. The position of the nodes is not important either, so they can be moved to obtain a simpler drawing of the graph.
In the unsupervised classification problem, the cases are the nodes of the graph, and the dissimilarity matrix determines the set of edges. Initially, the graph is complete (each pair of nodes is connected).
3. Multidimensional scaling and graph theory
Graphs. Example
In the graph representation of our example, each node (case) is connected with all the rest. If there are n nodes, then there are n(n−1)/2 edges. For graph theory and cluster analysis to meet, V and E must be given an adequate structure.
3. Multidimensional scaling and graph theory
Multidimensional scaling
Multidimensional scaling is a statistical method for representing a set of cases, whose matrix of proximities is known, by a configuration of points in a low-dimensional Euclidean space, in such a way that the Euclidean distances between the points in the new space reproduce the original dissimilarities.
This method is useful for placing the cases of a classification problem in a Euclidean space, where k-means clustering and hierarchical methods based on the Euclidean distance become equally applicable.
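A sketch of classical (Torgerson) scaling, one standard way to obtain such a configuration; the slides do not say which MDS variant was actually used, so treating it as classical scaling is an assumption.

```python
# Classical (Torgerson) multidimensional scaling: double-center the squared
# dissimilarities and keep the eigenvectors of the largest eigenvalues.
import numpy as np

def classical_mds(D, m=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                 # eigendecomposition (ascending order)
    idx = np.argsort(w)[::-1][:m]            # indices of the m largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

coords = classical_mds(D, m=2)   # D is the dissimilarity matrix from above
```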
3. Multidimensional scaling and graph theory
Multidimensional scaling. Where does it come in?
Data matrix X (dim. n×p) → Proximities → Multidimensional scaling → Euclidean configuration (dim. n×m, m ≪ p) and its Euclidean distance matrix, which provide the nodes V and the edges E of the graph.
3. Multidimensional scaling and graph theory
Cluster analysis related to graph theory
Applying multidimensional scaling to the matrix of dissimilarities between the cases in a dataset allows cluster analysis techniques to be homogenized with classification in graph theory. Thus, the classification problem is reduced to the analysis of the distribution of the edges of a graph, taking into account the distances in the Euclidean space derived from multidimensional scaling.
4. The adjacency matrix
Introduction
In a graph, the adjacency matrix is the most important element, because it can be used to analyse the connectivity between nodes (the cases of a dataset). Pursuing the analogy with cluster analysis: the stronger the connection between two nodes (the smaller their distance), the more similar they are.
Not every edge has the same importance. Long edges may be useless to take into account, as they connect very different cases. It is necessary to find a strategy to define the number of groups in a dataset in terms of the distribution of the edges.
4. The adjacency matrix
Distribution of edges. Finding a threshold
The edges represent the Euclidean distances between the cases of the dataset in the Euclidean space derived by multidimensional scaling. The distribution of these distances can give us some clues about the existence of group structure in the data. Cases can be classified into groups if a correct threshold is selected; candidates include the mean value, half of the mean value, the median, ... (a sketch of the thresholding step follows).
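The thresholding step sketched here drops every edge longer than the threshold and reads the groups off as the connected components of what remains; SciPy's csgraph module is assumed, and D is a distance matrix in the MDS space as built above.

```python
# Threshold the edges of the complete graph; connected components = groups.
import numpy as np
from scipy.sparse.csgraph import connected_components

def groups_from_threshold(D, threshold):
    # adjacency: keep only edges shorter than the threshold, no self-loops
    A = (D < threshold) & ~np.eye(D.shape[0], dtype=bool)
    n_groups, labels = connected_components(A, directed=False)
    return n_groups, labels

edges = D[np.triu_indices_from(D, k=1)]           # all n(n-1)/2 edge lengths
print(groups_from_threshold(D, edges.mean()))      # threshold = mean
print(groups_from_threshold(D, edges.mean() / 2))  # threshold = mean/2
```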
4. The adjacency matrix
Distribution of edges. Example (I)
[Histogram of the distribution of the edge lengths of the two-dimensional example]
mean = 1.7858; mean/2 = 0.8929; median = 1.8028.
4. The adjacency matrix
Distribution of edges. Example (II)
Threshold = mean = 1.7858 → one group. [Plot of the resulting graph]
4. The adjacency matrix
Distribution of edges. Example (III)
Threshold = mean/2 = 0.8929 → two groups. [Plot of the resulting graph]
4. The adjacency matrix
Distribution of edges. Example (IV)
Threshold = mean/3 = 0.5953 → four groups. [Plot of the resulting graph]
4. The adjacency matrix
Distribution of edges. Example (V)
Threshold = 0.6718 (smallest mode of the kernel density estimate, at density value 0.2747) → four groups. [Kernel density plot of the edge lengths]
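"Smallest mode" is read here as the first local minimum (antimode) of a kernel density estimate of the edge lengths; that interpretation, and the grid resolution, are assumptions. A sketch:

```python
# Pick the cutting threshold at the first local minimum of a Gaussian
# kernel density estimate of the edge lengths. Reading the slide's
# "smallest mode" as this antimode is an assumption.
import numpy as np
from scipy.stats import gaussian_kde

def kde_threshold(edges, n_grid=512):
    kde = gaussian_kde(edges)
    grid = np.linspace(edges.min(), edges.max(), n_grid)
    dens = kde(grid)
    # interior points where the density dips below both neighbours
    minima = [i for i in range(1, n_grid - 1)
              if dens[i] < dens[i - 1] and dens[i] < dens[i + 1]]
    return grid[minima[0]] if minima else edges.mean()  # fall back to the mean

print(groups_from_threshold(D, kde_threshold(edges)))   # edges, D as above
```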
5. The Iris data
The Iris data
Fisher, R.A.: "The use of multiple measurements in taxonomic problems". Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
Variables: 1. sepal length in cm; 2. sepal width in cm; 3. petal length in cm; 4. petal width in cm.
Summary statistics:

Variable       Min  Max  Mean  SD    Class correlation
sepal length   4.3  7.9  5.84  0.83   0.7826
sepal width    2.0  4.4  3.05  0.43  -0.4194
petal length   1.0  6.9  3.76  1.76   0.9490 (high!)
petal width    0.1  2.5  1.20  0.76   0.9565 (high!)
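For experimentation, the dataset ships with scikit-learn; the following sketch loads it and recomputes the summary statistics (small discrepancies with the table above may come from rounding).

```python
# Load the Iris data and recompute the summary statistics quoted above.
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X_iris, y_iris = iris.data, iris.target   # (150, 4) measurements, labels 0-2
print(iris.feature_names)
print(X_iris.min(axis=0), X_iris.max(axis=0))
print(X_iris.mean(axis=0).round(2), X_iris.std(axis=0, ddof=1).round(2))
```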
5. The Iris data
Multidimensional scaling
[Scatter plot of the 2-dimensional configuration derived by multidimensional scaling]
5. The Iris data
K-means
[k-means solutions with 2 groups and with 3 groups; some objects are misclassified]
5. The Iris data
Graph
[Complete graph over the 2-dimensional configuration, and histogram of the edge lengths]
mean = 0.8260; mean/2 = 0.4130; median = 0.7431; median/2 = 0.3716.
5. The Iris data
Distribution of edges
Threshold = 0.2605 (smallest mode of the kernel density estimate, at density value 0.8633). [Graphs obtained after thresholding, and kernel density plot of the edge lengths]
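An end-to-end sketch of the whole procedure on Iris, reusing classical_mds, kde_threshold and groups_from_threshold from the earlier snippets; the exact threshold found depends on the MDS variant and the KDE bandwidth, so matching 0.2605 exactly is not guaranteed.

```python
# End-to-end on Iris: MDS embedding, edge-length distribution,
# KDE-based threshold, connected components as groups.
import numpy as np
from scipy.spatial.distance import pdist, squareform

D_iris = squareform(pdist(X_iris))          # dissimilarities, 150 x 150
coords = classical_mds(D_iris, m=2)         # 2-D Euclidean configuration
D_emb = squareform(pdist(coords))           # edge lengths in that space
edges_emb = D_emb[np.triu_indices_from(D_emb, k=1)]
thr = kde_threshold(edges_emb)
n_groups, labels = groups_from_threshold(D_emb, thr)
print(f"threshold={thr:.4f}, groups={n_groups}")
```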
6. Conclusions and references
General conclusions
1. Cluster analysis explores data, searching for groups.
2. Depending on the method employed, cluster analysis requires some previous decisions (number of clusters, proximity measure, hierarchical method, ...).
3. Multidimensional scaling makes it possible to represent the proximity relationships of a dataset in a Euclidean space.
4. The classification problem can be understood as the analysis of the edges of a graph; therefore, graph theory can be applied to classify the objects of a dataset.
6. Conclusions and references
Particular conclusions
1. Elements of graph theory have been used to explore cluster analysis problems.
2. To apply graph theory, the study of the distribution of the edges is proposed, and the use of several parameters derived from that distribution is analysed.
3. Experimentally, the best threshold is located at the smallest mode of the edge distribution.
6. Conclusions and references
Further research
1. There is a need to deepen the relationship between the distribution of distances and the selection of the optimal cutting point (simulation and use of robust measures?).
2. It may be possible to use some of the elements presented here to determine multivariate outliers (objects that are very far from the rest).
3. The incidence matrix could be used to search for the best classification, if permutations of the cases are evaluated.
4. Why use all distances simultaneously? Triangulation in graphs.
6. Conclusions and references
References
Anderberg, M.R. Cluster Analysis for Applications. Academic Press, 1973.
Cheong, M.-Y.; Lee, H. Determining the number of clusters in cluster analysis. Journal of the Korean Statistical Society (2008), to appear.
Eldershaw, C.; Hegland, M. Cluster analysis using triangulation. Computational Techniques and Applications: CTAC97, 201-208, 1997.
Gentle, J.E. Elements of Computational Statistics. Springer Verlag, 2002.
Ghahramani, Z. Unsupervised Learning. In Bousquet, O.; Raetsch, G.; von Luxburg, U. (Eds.): Advanced Lectures on Machine Learning. Springer Verlag, 2004.
Gordon, A.D. Classification. Chapman and Hall, 1981.
Hansen, P.; Jaumard, B. Cluster analysis and mathematical programming. Mathematical Programming, 79, 191-215, 1997.
Van Ryzin, J. (Ed.) Classification and Clustering. Academic Press, 1977.
Xu, R.; Wunsch, D. Survey of Clustering Algorithms. IEEE Transactions on Neural Networks, Vol. 16(3), 645-678, 2005.
Yu, K.; Yu, S.; Tresp, V. Soft clustering on graphs. Advances in Neural Information Processing Systems, 18 (NIPS 2005).