An Experiment in Visual Clustering Using Star Glyph Displays

by Hanna Kazhamiaka

A Research Paper presented to the University of Waterloo in partial fulfillment of the requirements for the degree of Master of Mathematics in Statistics

Waterloo, Ontario, Canada
September 30, 2011

Contents

1 Introduction
2 Theory and Methods
2.1 Graph Traversal Algorithms Applied to Star Glyph Displays
2.2 Tree Distances Metric
3 Experimental Design
3.1 Hypothesis
3.2 Data Sets
3.3 Protocol
3.4 Interface
4 Results
4.1 Data Description
4.2 Analysis
4.2.1 Exploratory Analysis
4.2.2 Models
4.2.3 Comparing Standard Deviations
4.3 Conclusions
Acknowledgments
References

Abstract

A star glyph plot is a visualization technique often employed for the task of clustering high-dimensional data points. The order in which the data variables are assigned to the axes affects the shape of a star glyph; changing this ordering may result in a different clustering outcome for the same data. To reduce the order-dependence, a sequence for axis assignment in which all pairs of variables appear adjacently is suggested. An experiment is performed to compare subjects' performance when clustering with the ordinary star glyph plot and when working with the improved version. The analysis and results of this study are presented in this paper.

1 Introduction

The ordering of components in a statistical graphical display often requires some consideration on the part of the statistician. Choosing an alternate ordering may change certain aspects of a graphical display, thus revealing or hiding patterns, trends, anomalies and relations in the data. A star glyph plot is one such display which depends on the ordering of data variables.

A star glyph plot is a tool used to visualize, and then cluster, multidimensional data points. A visual representation is created for each data point in the form of a star-shaped glyph. Every variable in the data, or dimension, is assigned to an axis emanating from the origin of the glyph; this is where the ordering of variables has an effect. Typically, this assignment follows the natural order of variables found in the data. Changing the order of variables as they are assigned to the axes will change the shape of the glyph; this effect is explained in more detail in Section 2.1. Clustering is done through a visual inspection of the glyphs' attributes, such as shape and size, and a grouping of like glyphs.

It is of interest to remove this dependence on order from the star glyph plot, so as not to distort patterns in the data and to allow for a more accurate visual clustering. A method to achieve this goal by creating glyphs in which each pair of variables appears adjacently is suggested by Hurley and Oldford

[1]. Eulerian tours and Hamiltonian decompositions of complete graphs are used to generate such orderings of variables. The resulting ordering is used in the assignment of variables to axes. A star glyph display created in this manner is thought to reduce the order effect and to produce a clustering closer to the true one found in the data. Whether or not such a display is an improvement over a standard star glyph plot is the question investigated in this paper.

For quantitative evidence, a study was conducted to assess users' performance when working with different types of star glyph plots. Subjects were presented with several glyph displays of the same data set, where the ordering of variables was different for each display. Some glyph displays were produced by a standard assignment of variables to the axes, referred to as an ordinary glyph sequence; other displays were created by ensuring that all pairs of variables appear adjacently once. The subjects were asked to cluster each glyph display. The metric used for subjects' performance was the distance from the target clustering; this measure is described in Section 2.2. Section 3 outlines the experimental protocol.

Experimental results showed that using an ordering generated by an Eulerian tour yields a more reliable visual clustering of the data. The variation around the target clustering was smaller when subjects worked with star glyphs with such an ordering. There were no significant differences in accuracy, or proximity to the target, between the different types of orderings. Data analysis and conclusions are found in Section 4.
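As a rough illustration of an ordering in which every pair of variables appears adjacently, the following Python sketch joins Hamiltonian cycles of a complete graph at a shared node. It uses a classical zigzag construction of a Hamiltonian decomposition, which is an assumption made here for illustration and not necessarily the procedure of [1] or the one used in this study. The function names and the labels V1 to V7 are hypothetical, and the sketch handles only an odd number of variables.

```python
from itertools import combinations


def hamiltonian_sequence(variables):
    """Return a cyclic axis sequence in which each pair of variables is adjacent once.

    Illustrative zigzag construction; requires an odd number of variables.
    """
    n = len(variables)
    if n < 3 or n % 2 == 0:
        raise ValueError("this sketch handles an odd number (>= 3) of variables")
    m = (n - 1) // 2
    hub = n - 1  # the last variable is the node shared by all cycles
    # Zigzag path 0, 1, -1, 2, -2, ... over the remaining 2m nodes (mod 2m).
    zigzag = [((k + 1) // 2 if k % 2 else -(k // 2)) % (2 * m) for k in range(2 * m)]
    sequence = []
    for shift in range(m):  # each shift gives one Hamiltonian cycle through the hub
        sequence.append(hub)
        sequence.extend((z + shift) % (2 * m) for z in zigzag)
    return [variables[i] for i in sequence]


def adjacent_pairs(sequence):
    """Unordered neighbouring pairs, treating the sequence as closed (cyclic)."""
    return [frozenset(p) for p in zip(sequence, sequence[1:] + sequence[:1])]


variables = ["V1", "V2", "V3", "V4", "V5", "V6", "V7"]  # hypothetical labels
axes = hamiltonian_sequence(variables)
pairs = adjacent_pairs(axes)
print(" ".join(axes))
```

With seven variables the resulting glyph has 21 axes, one for each of the 21 variable pairs, so each variable is drawn three times. An even number of variables needs extra care, because the complete graph then has nodes of odd degree and no Eulerian tour exists without repeating some edges.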

2 Theory and Methods

2.1 Graph Traversal Algorithms Applied to Star Glyph Displays

In an ordinary star glyph display, each variable is assigned a radius emanating from an origin, values are plotted on the radii to determine length, and lines connecting them are drawn to form a polygon. Clustering of star glyphs is done through a visual inspection of their shapes. The shape of a star glyph, however, depends largely on the ordering of the variables assigned to the axes. The same data point can look very different if the ordering of variables is changed, as illustrated in Figure 1. Changes in glyph shapes may in turn lead to a different clustering of the data points.

To lessen the order-dependence effect, Hurley and Oldford [1] suggest using a sequence for axis assignment in which all pairs of variables appear adjacently. Such a sequence can be generated through graph traversal algorithms [1]. Applying graph theory results is a convenient way to formalize the procedure for glyph construction. The method is as follows. A complete graph is formed, where each variable in the data set is assigned to a node. From the definition of completeness, each pair of nodes in the graph is connected by an edge. An arrangement of all pairings of nodes is obtained by finding an Eulerian tour, or a Hamiltonian decomposition, of the graph. An Eulerian tour of a graph is a closed path which visits every edge of

Figure 1: Star glyph plots for six data points. The plot on the right was created by reordering the variables.

the graph exactly once. It is easy to see that such a traversal would generate an ordering on the nodes where each pair appears adjacently. It may be the case that the Eulerian tour is composed of Hamiltonian decompositions of the graph, which is the setup in the second method suggested for constructing glyphs. A Hamiltonian decomposition is a series of edge-distinct Hamiltonian cycles (closed paths which visit each node exactly once), the union of which is the complete graph. Joining these Hamiltonian cycles at the same node results in an Eulerian tour and produces the desired arrangement of variables.

In summary, there are two methods to generate a sequence in which all pairs of variables appear adjacently: by means of an Eulerian tour or a Hamiltonian decomposition of a complete graph. These two sequences, along with the ordinary glyph sequence, produce three different star glyph

representations of the same data.

2.2 Tree Distances Metric

With three different star glyph representations presumably leading to different clusterings, a means of comparing them is necessary. An intuitive way to assess the quality of a clustering is to compare its closeness to the true grouping of the data. Using a tree-based distance measure developed by Oldford and Zhou [2], a distance from one clustering to the target clustering can be obtained. A clustering tree is created for each clustering outcome; the procedure is outlined below.

1. The entire set of observations to be clustered is assigned to the root node of the tree.
2. The root has at least two branches; its children are mutually exclusive subsets of the data, that is, the clusters found.

The distance measure takes two such clustering trees, transforms each of them into a vector, and finds the Euclidean distance between the two vectors. This method is especially effective for hierarchical clustering, where each layer of the tree is a consecutive split of the data into subgroups. This, however, is not relevant for the experiment discussed here, since hierarchical visual clustering using star glyphs is a time-consuming and laborious procedure. Each clustering tree resulting from an outcome of this experiment had two

layers: one for the root node, and the second representing the exact clustering indicated by the subject.

To test the performance of the three glyph sequences discussed in the previous section (an ordinary sequence and the ones produced by an Eulerian tour and a Hamiltonian decomposition), the same data set must be clustered three times, once for each sequence. For each clustering, a tree is created and its distance to the true clustering tree is calculated. This produces a set of distances recorded as the response variable in the study.
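The idea of turning each two-layer clustering tree into a vector and taking a Euclidean distance can be sketched in Python as follows. The vectorization used here, a 0/1 indicator for every pair of observations that share a cluster, is a stand-in chosen for illustration; it is an assumption and not the tree distance of Oldford and Zhou [2], whose exact transformation is not reproduced in this paper. The function names and the example labels are hypothetical.

```python
from itertools import combinations
from math import sqrt


def comembership_vector(labels):
    """0/1 vector over all pairs of observations: 1 if the pair shares a cluster."""
    n = len(labels)
    return [int(labels[i] == labels[j]) for i, j in combinations(range(n), 2)]


def clustering_distance(labels_a, labels_b):
    """Euclidean distance between the co-membership vectors of two flat clusterings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same observations")
    va = comembership_vector(labels_a)
    vb = comembership_vector(labels_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))


# Made-up example: a target grouping and the colours a subject brushed.
target = [0, 0, 1, 1, 2, 2]
subject = ["red", "red", "blue", "blue", "blue", "green"]
distance = clustering_distance(target, subject)
print(distance)
```

Because only co-membership is compared, the distance does not depend on which colour or label is attached to which cluster, which matches the way subjects were free to pick any colour for a group.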

3 Experimental Design

3.1 Hypothesis

The objective of the study was to compare the effectiveness of three different techniques for arranging star glyph radii. Since the main purpose of star glyph plots is to facilitate visual clustering, their assessment naturally involved human participants who were asked to perform visual clustering tasks on a computer screen. The response measured was the distance between the true clustering of the data and the one produced by the subject. Of interest were the differences in distances using an ordinary glyph sequence versus one produced by an Eulerian tour or a Hamiltonian decomposition.

3.2 Data Sets

Participants were presented with star glyph representations of two data sets. These were carefully selected to be reasonably characteristic of data sets one might want to cluster in practice. Many considerations were addressed with regard to the suitability of data sets, such as number of observations, dimensionality, and number of clusters.

For the purposes of the experiment, the data sets had to be small enough so that subjects could finish the task in a reasonable amount of time, but large enough to mimic the qualities of a data set found in practice. Selected data sets contained points. The optimal dimensionality of the data points was another issue. Star glyphs are considered ineffective for very high-dimensional data; as the number of radii increases, the patterns in glyph shapes become difficult to identify. For small

dimensions, the ordinary glyph sequence is too similar to one in which all pairs of variables appear adjacently, and the star glyph shapes look almost the same. Six- and seven-dimensional data were used in this study; this was thought to be optimal. Both data sets contained four true clusters of different sizes.

Ideally, to have greater confidence in the effect of changing glyph shapes, the subjects would be tested with more than two data sets; the more, the better. This kind of experiment, however, is very difficult to set up in practice. Trial runs of the experiment indicated that it took subjects minutes on average, to group six star glyph plots consisting of points. Adding another data set would increase the time required to complete the experiment, and could affect the participants' performance. It was assumed that performance would get worse towards the end of the experiment, as participants could become tired and careless with the task. A reasonable expected completion time was chosen as 45 minutes. Based on these considerations, the decision was made to use two data sets in the experiment.

Participants first worked with an artificial data set consisting of 50 points, forming 4 clusters in a 7-dimensional space. This data set consisted of observations randomly generated from four Gaussian distributions with varying means. One such data set is displayed in Figure 2. The true grouping of the data points corresponded to the four different distributions from which the

points were generated. The sizes of clusters were assigned randomly.

The second data set used was a subset of a real data set giving birth rates, death rates, life expectancies, and Gross National Product for 97 countries in the year The annotated data set, referred to as Poverty data, can be found at A group indicator for each country was given in the file, and is based on geographic location as well as general economic factors (type of economy, first-world, third-world, etc.). To meet time constraints, several groups, as well as individual records which contained missing data, were removed from the data set. The total number of observations was reduced to 67, each belonging to one of four groups. The data set contained 6 variables. The star glyph plot of the resulting data is displayed in Figure 3.

3.3 Protocol

The experiment was divided into two parts: a demonstration of the interface, and the participants' tasks. It was conducted on a computer interface programmed to facilitate clustering tasks. Participants were first shown a demo of the interface functions using a test data set. After they had had a chance to familiarize themselves with the interface, the actual experiment began. The instructions given were to group the star glyphs based on perceived similarity and to indicate the chosen grouping by brushing each cluster in a unique colour. Participants were shown a total of 6 displays, each having

Figure 2: Artificial data set.

Figure 3: Poverty data set.

either 50 or 67 star glyphs neatly arranged in a grid pattern. The first 3 were star glyph representations of an artificial data set, the latter 3 of a real data set. For each data set, the order in which participants were shown plots with ordinary, Eulerian tour or Hamiltonian decomposition glyph sequences was randomized. Upon completion of the entire experiment, the participants received remuneration in the form of a $10 gift certificate. This study was reviewed and received ethics clearance through the Office of Research Ethics at the University of Waterloo.

The subjects recruited for this study were undergraduate and graduate students at the University of Waterloo, with a mathematics, science, or engineering background. Almost none had prior exposure to star glyph plots. A total of 32 students participated in the study.

3.4 Interface

The graphical user interface for this experiment was written in R by Adrian Waddell. The program takes a data set, location parameters for the glyphs and a glyph sequence as input, and produces a window displaying the corresponding star glyph plot. The user is able to move the glyphs around in the window, and brush them in different colours. Participants typically arranged

the glyphs in groups at different corners of the window, then brushed each group in a unique colour. Snapshots of the window can be found in Figure 4 and Figure 5.

Figure 4: Experimental interface.

Figure 5: After clustering.

4 Results

4.1 Data Description

The raw data collected from the experiment consisted of two files for each participant: one file containing the results of clustering the artificial data set, and the other containing the results for the real data set. Each file was a data frame where every row corresponded to an observation from the data set, and the columns indicated the colour it was brushed for each of the three glyph sequences. An indicator for the true group that each observation belonged to was also appended to the file.

The data processing stage involved calculating the distance from each of the three clustering outcomes to the true clustering of the data. This was done by applying the tree distance methods discussed in Section 2.2. The resulting data set was a list of distance measures, with indicators of the data set that was clustered (real or artificial), the glyph sequence used, and the subject ID for each record. This formatted data was used for the analysis stage.

4.2 Analysis

The experiment was analysed as a randomized block design with two treatment factors and one blocking factor. The two treatment factors were Data set and Glyph sequence. The first takes two levels: artificial Gaussian data or real Poverty data. The second takes three levels: an ordinary sequence in which variables appear in their natural order, a sequence produced by an Eulerian tour on the complete graph of variables, or one produced by a Hamiltonian decomposition of the same graph. The effect of the Glyph sequence factor was of primary interest, as per the study objectives.

4.2.1 Exploratory Analysis

This section outlines some of the results of an initial exploratory analysis of the data. Before imposing a strict model on the data, it was viewed and analysed via graphical tools.

One of the goals of an exploratory investigation is outlier detection. A boxplot of the distances grouped by subject is displayed in Figure 6. Subjects 8, 16 and 17 are identified as outliers due to their lower medians and greater variation in their scores, relative to the distance scores of the other participants. Note that a lower score is better in the context of this experiment. Since the response variable is distance from the true clustering, a small distance implies that the subject's clustering was very close to the target one. Observations collected from subjects 8, 16 and 17 were removed from the data set used for further analysis, so as not to skew the results. Possible reasons for their exceptional performance may be failure to follow instructions, or prior experience with similar tasks.
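The per-subject comparison of medians and spread that underlies such a boxplot can be sketched numerically in Python. This is only an illustration of the summary step: the function name and the record layout are hypothetical, the distances are made-up values, and the paper does not state a numerical rule for flagging outliers, so none is assumed here.

```python
from statistics import median, quantiles


def subject_summary(records):
    """Map each subject ID to (median distance, interquartile range).

    `records` is an iterable of (subject_id, distance) pairs.
    """
    by_subject = {}
    for subject, distance in records:
        by_subject.setdefault(subject, []).append(distance)
    summary = {}
    for subject, values in by_subject.items():
        # Quartiles computed with the inclusive method (data treated as the full sample).
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        summary[subject] = (median(values), q3 - q1)
    return summary


# Made-up distances: six per subject (three glyph sequences, two data sets).
records = [("A", d) for d in (1, 2, 3, 4, 5, 6)] + [("B", d) for d in (10, 10, 11, 11, 12, 12)]
summary = subject_summary(records)
for subject, (med, iqr) in sorted(summary.items()):
    print(subject, med, iqr)
```

Subjects whose median or interquartile range sits far from the rest of the group would then stand out in the same way as they do in the boxplot.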

Figure 6: Boxplots of distances grouped by subject. Subjects 8, 16 and 17 are identified as outliers.

Next, the data is summarized across subjects and grouped by glyph sequence. Boxplots of distances by glyph sequence can be found in Figure 7. This allows for a visual comparison of median and spread for clustering distances obtained by using the three different types of glyph sequence. Inspection of these plots shows only small differences in the medians for the three different glyph sequences; the median distances for the Eulerian and Hamiltonian glyph sequences are slightly lower than that of the ordinary glyph sequence. A larger effect is seen in the spread of the data. The interquartile range for values associated with an ordinary glyph sequence is much

larger than the range of the other two. In addition, note that the higher end of the interquartile range is the same for all glyph sequences, but the lower end is much lower for the ordinary one. This suggests that it is more common to see a smaller distance to the true clustering when using the ordinary glyph sequence.

Figure 7: Boxplots of distances grouped by glyph sequence.

Another useful graphical summary is a comparison of the distribution of distances across the two data sets. Figure 8 presents two boxplots, both grouped by glyph sequence: one for the artificial Gaussian data, the other for the real Poverty data set. The most noticeable difference between the two

data sets is the range of values. This, however, should be attributed to the nature of the data sets; the artificial data set was randomly generated each time the experiment was conducted, whereas the real data set remained the same throughout. As a result, each participant was presented with a unique artificial data set, and all participants worked with the same real data set. Some of the variation in distance scores with the artificial data sets can be explained by the natural variation between the data sets themselves.

Figure 8: Boxplots of distances grouped by glyph sequence for the artificial and real data sets.

For the real data set, the results can be displayed in an MDS plot, as shown in Figure 9. An average clustering tree is created for each glyph sequence by combining the clustering outcomes across all subjects. Multidimensional

Figure 9: An MDS plot of distances between average clustering trees for the real data set.

scaling techniques are applied to the distances between the average clustering trees to produce a visualization. The Euclidean distances between the points on the MDS plot are close to the distances between trees, which allows for comparison of the performance of the three glyph sequences. Note that it would not make sense to produce such a plot for the artificial Gaussian data, since the data set, and hence its true clustering, was different for each subject. Combining the clustering outcomes across subjects is not appropriate in this case.

The points associated with the three ordering methods appear almost equally far from the true clustering in the plot, suggesting that no glyph sequence performs better than the others in terms of accuracy.

The graphical representations of the experimental data suggest that the impact of glyph sequence on the accuracy of clustering is not large. To validate this assumption, a formal model is fit to the data in the next section.

4.2.2 Models

A standard linear regression model was fit to the experimental data. The response variable was Distance, and the explanatory variables were the blocking factor Subject, and the treatment factors Data Set and Glyph Sequence. No significant interactions between treatment factors were found, and thus no interaction term is included in the model. The model can be written as follows:

$$Y_{ijk} = \mu + S_i + D_j + G_k + \epsilon_{ijk}$$

where $S_i$ is the subject effect, $D_j$ is the data set effect, and $G_k$ is the glyph sequence effect. The indices $i = 1, \ldots, 29$ correspond to the 29 subjects whose results were used in the analysis, $j = 1, 2$ indexes the data set factor (artificial, Poverty), and $k = 1, 2, 3$ is associated with the glyph sequence used for the clustering.

The results of the model are found in the table below. No treatment effects are found to be significant, as can be seen from the corresponding p-values. This confirms the results suggested by the plots in the exploratory

analysis section: glyph sequence does not have a significant impact on the accuracy of clustering performance.

Effect      Df   SS   MS   F-value   P-value
Subject
Data Set
Glyphs
Residuals

4.2.3 Comparing Standard Deviations

The boxplots found in Figure 7 and Figure 8 in Section 4.2.1 suggest that there are differences in the sample standard deviations for the three different glyph sequences. This is of interest, because a smaller deviation around the target is indicative of a more reliable visual clustering method. To validate this hypothesis, a one-sided F-test was performed for each pair of glyph sequences, at the 5% significance level. Results of these tests are displayed in the table below. The standard deviation for Eulerian glyph sequences is found to be smaller than that of ordinary and Hamiltonian sequences; there is no significant difference between ordinary and Hamiltonian glyph sequences.

H_0                       H_a                       F-value   P-value
$\sigma_H = \sigma_O$     $\sigma_H < \sigma_O$
$\sigma_E = \sigma_O$     $\sigma_E < \sigma_O$
$\sigma_E = \sigma_H$     $\sigma_E < \sigma_H$
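The statistic behind a one-sided F-test of two standard deviations is the ratio of the sample variances, which can be sketched in Python as follows. The function name is hypothetical and the two samples are made-up values, not the distances from this study. The sketch stops at the statistic and its degrees of freedom; obtaining a p-value additionally requires the lower tail of the F distribution, and the usual test assumes independent, approximately normal samples.

```python
from statistics import variance


def variance_ratio(sample_a, sample_b):
    """Return F = s_a^2 / s_b^2 and its (numerator, denominator) degrees of freedom.

    For the alternative sigma_a < sigma_b, small values of F count as evidence
    against the null hypothesis sigma_a = sigma_b.
    """
    f_value = variance(sample_a) / variance(sample_b)
    return f_value, (len(sample_a) - 1, len(sample_b) - 1)


# Made-up distances for two glyph sequences.
eulerian = [1, 2, 3, 4, 5]
ordinary = [2, 4, 6, 8, 10]
f_value, dof = variance_ratio(eulerian, ordinary)
print(f_value, dof)
```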

4.3 Conclusions

Based on an exploration of the experimental data through graphical aids, as well as formal analysis of the results, a significant difference in precision, but not in accuracy, was found between the performances of the three different glyph sequences. A glyph sequence produced by an Eulerian tour of a complete graph on all of the variables yields a more precise visual clustering than one created by a Hamiltonian decomposition of the same graph, or an ordinary glyph sequence where all variables appear once, in their natural order.

Although the average distance from the true clustering did not differ across the three glyph sequences, the smaller variation around the true clustering found for Eulerian sequences is a significant improvement on the ordinary method for creating star glyphs. A more reliable technique for visual clustering is one which is less likely to result in clusterings far from the true one that is inherent in the data. Thus, using a glyph sequence obtained by an Eulerian tour results in a gain in precision, and leads to a more reliable visual clustering.

It is important to keep in mind that the experiments described in this paper are sensitive to many factors, including test subjects' backgrounds and characteristics of the data sets used. Perhaps targeting only subjects with experience in data visualization and previous exposure to star glyph plots would yield different results. With regard to the data sets, the difficulty lies in finding those which are appropriate for evaluating clustering methods.

Clusters in the data are intrinsically somewhat arbitrary; it is up to the researcher to define what constitutes a cluster in any given data set. This lack of a clear ground truth against which a technique can be evaluated introduces subjectivity and affects the reliability of results. A way to counter that is to test subjects on a variety of data sets.

Acknowledgments

I would like to thank my supervisor Professor Wayne Oldford for his guidance and support in the writing of this research essay, and Adrian Waddell for his patience and assistance with the programming aspect of this project.

References

[1] Hurley, C.B. and Oldford, R.W. (2010). Pairwise Display of High-Dimensional Information via Eulerian Tours and Hamiltonian Decompositions. Journal of Computational and Graphical Statistics, 19.

[2] Reference for Tree Distances
