Revitalizing the Scatter Plot

David A. Rabenhorst
IBM Research, T.J. Watson Research Center, Yorktown Heights, NY

Abstract

Computer-assisted interactive visualization has become a valuable tool for discovering the underlying meaning of tabular data, including categorical tabular data. The capabilities of traditionally mundane kinds of pictures such as scatter plots can be expanded to usefully depict categorical tabular data by incorporating annotations and transforms, and by integrating these extensions into an interactive system.

Keywords: annotation, categorical, data, glyph, multivariate, plot, scatter, tabular, transform, visualization.

1. Introduction

Computer graphics and interactivity are great enablers for effective visual mining of multivariate tabular data. Larger and more diverse data domains are explorable and minable. Increasingly diverse types of meaningful pictures can be created, linked, and refined in real time. Transform functions, which remap data values according to an algorithm, can be dynamically applied and cascaded onto data variables to seek even more revealing visualizations.

1.1 Tabular Data

Tabular data pervades databases, spreadsheets, and even the world wide web. The arrangement of tabular data elements in rows and columns of cells usually implies a set of relationships and dependencies among the cell values. Each column of data is potentially a dependent or an independent variable. The number of columns in a table can be called the dimensionality, and each row in a table can be called a case. The individual cell values in a table may be numeric, non-numeric (character strings), or even missing.

Often, the meaning and implications of the relationships in a table of data are not obvious, especially if there are many cells. Visualization and increasingly sophisticated computer graphics are enormously helpful in discovering, revealing, and exploring the underlying multivariate relationships in tabular data.
1.2 Scatter Plots

The visualizations most appropriate for a given table typically depend upon the type of data it contains. Relevant picture types include pie charts, bar graphs, histograms, scatter plots, parallel coordinates plots, similarity dendrograms, and others. Scatter plots are traditionally most appropriate for numeric real-valued data of low dimensionality. A typical variation on a scatter plot is the line graph, where lines are drawn between consecutive points, and the points themselves may even be elided.

Various techniques exist for increasing the visual dimensionality of scatter plot visualizations by one, two, or even several dimensions through the use of creative annotations. Thus, a two-dimensional planar scatter plot of variables X and Y can also show additional variables by parameterizing one or more visible characteristics of the plotted points, which then might become glyphs. Color is commonly used for this purpose to show a third dimension via a color map. Glyph size and various aspects of glyph shape can be used similarly. If the parameterized visible characteristics are clearly distinguishable, then multiple methods can be used simultaneously and independently to achieve still higher visual dimensionality. However, the visual clarity and usefulness of simultaneously depicting more and more dimensions in this manner rapidly deteriorate.

Typically, the points or glyphs in a simple scatter plot are plotted independently of one another, and there may be no visible connection between them. However, inter-connections between glyphs can be used to depict additional information. For example, in a parametric snake plot, which is an extension of the line graph, connecting lines are drawn between the points of the scatter plot in increasing order of the values of an arbitrary third variable. Or, in a quadwise plot, linking lines are drawn between the corresponding points of two scatter plots, so that four dimensions can be seen.

More generally, if the plotted points are colored according to a color map, then corresponding points in another plot may be identically colored, and implicit linking of correspondence is accomplished through the point colors without drawing the connecting lines. This is an especially powerful technique, because it can be used to link together an essentially arbitrary number of complementary pictures of possibly different kinds without cluttering or visibly degrading any of them. The effectiveness of such a linked combination may reach significantly beyond the sum of the effectiveness of the individual unlinked pictures.

These techniques can be used statically for unchanging pictures, but they can become even more powerful when applied dynamically to pictures that change under user control. For example, an interactive technique called brushing enables a user to dynamically refine the meaning of the colors which are automatically linked between different visualizations of the same data. Animation can be used to show additional information over time.

2. Purely and Partially Categorical Data

Categorical data is sometimes narrowly equated with simply non-numeric data.
It typically consists of relatively few unique values which are heavily repeated, like a long spreadsheet column that contains only the values "yes" or "no". But these same repetition characteristics could be attributed to numeric data as well. There is also the question of what it might mean for data to be only partially categorical.

This leads to a formal mathematical definition of categorality that is not restricted to merely a boolean yes or no, but includes the concept of partial categorality. It can be generalized to a continuously-valued quotient between 0 and 1 inclusive, calculated as the number of unique values in a variable that are repeated divided by the total number of unique values. Thus, a variable is purely categorical, with a quotient of 1, if every unique value is repeated. If no values are repeated, then the categorality is 0. But if just some unique values are not repeated, then the categorality is partial, and valued somewhere between 0 and 1.

Thus, variables of both numeric and non-numeric data can be fully, partially, or not categorical. In this sense, numeric variables can be categorical with many repeated values if they consist of only small integers. Floating point values can be categorical if they have insufficient precision. Non-categorical floating point values can become categorical if they are transformed to lose precision, as for example by rounding.

Mathematically then, a variable's categorality can be described as a scalar descriptive univariate statistic. This definition can be easily extended to a descriptive bivariate statistic as well, by considering unique combinations of pairs of corresponding case values. Similarly, categorality can be multivariate by considering unique combinations of N-tuples of values. Ranking variables by their categorality may help determine which are most suitable for certain kinds of visualizations.
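The definition above can be made concrete with a short sketch. It assumes the reading that categorality is the number of unique values occurring more than once divided by the total number of unique values, and that the multivariate form treats each case's N-tuple of values as a single value. The function names `categorality` and `joint_categorality` are illustrative and do not come from the paper.

```python
from collections import Counter


def categorality(values):
    """Fraction of the unique values that occur more than once (0.0 to 1.0)."""
    counts = Counter(values)
    if not counts:
        raise ValueError("categorality is undefined for an empty variable")
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated / len(counts)


def joint_categorality(*variables):
    """Bivariate or multivariate form: each case's tuple of values is one value."""
    return categorality(list(zip(*variables)))


answers = ["yes", "no", "yes", "no", "no"]  # every unique value is repeated
ids = [101, 102, 103, 104]                  # no value is repeated
mixed = ["a", "a", "b", "c"]                # only "a" is repeated

print(categorality(answers))  # 1.0, purely categorical
print(categorality(ids))      # 0.0, not categorical
print(categorality(mixed))    # one of three unique values repeats
```

Under this reading, rounding a floating point variable before calling `categorality` is one way to observe a non-categorical variable becoming categorical.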
The actual repetition count for each of the unique case values in a purely or partially categorical variable may well be a vitally important characteristic for visualizing the variable informatively, and for discovering its relationships to other variables. An interactive system that deals with categorical data usually needs to determine the point repetition counts, and expose them either as additional derived variables, or through a Count transform, which can be easily applied and removed.
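A Count transform of this kind might be sketched as follows, taking it to replace each case value with the number of times that value occurs in the variable. The function name `count_transform` and the sample rows are illustrative assumptions; the rows are not taken from the example dataset.

```python
from collections import Counter


def count_transform(values):
    """Replace each case value with its repetition count within the variable."""
    counts = Counter(values)
    return [counts[v] for v in values]


# Made-up rows for illustration only.
family = ["Fabaceae", "Fabaceae", "Pittosporaceae", "Fabaceae"]
print(count_transform(family))  # [3, 3, 1, 3]
```

Because the result is a new list, removing the transform amounts to going back to the original variable, which is left unchanged.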

When the Count transform is applied to the original categorical values, each value is replaced with its number of occurrences.

3. Visualizing Categorical Data with Scatter Plots

The following examples use data from the PLANTS database of the U.S National Plant Center, USDA, NRCS 1999. This database includes the scientific name, accepted name/common name, family, and other characteristics for over 82,000 plants. The full plant scientific names are highly un-categorical. The plant family names are highly categorical. The plant common names are partially categorical.

A relatively small subset of this database was extracted. First, the full plant scientific names were truncated to just their first word, thus making them highly categorical. Then, the cases were filtered to include only a few hundred cases whose truncated scientific names corresponded with a common name which included the word "poison". Thus, the resulting fairly small dataset consisted of poisonous plants, and plants which are closely related to poisonous plants.

3.1 Mapping Non-Numeric Values Into Numbers

Sometimes the case values of tabular data are non-numeric character strings that may or may not have a natural order. This is the case with the example dataset. When unordered, the character strings can often be mapped or transformed into relevant numbers for purposes of useful visualization, including plotting them in scatter plots.

Although such mapping can be quite useful, it should be done with some care to avoid possibly misleading visualizations. Artificially mapped numeric values which are unavoidably adjacent may exhibit or even emphasize visual proximities that are simply not relevant to the nature of the data. So, depending on the data, the actual mapping algorithm used can be visually important or even critical if there are more than two unique values involved. Algorithms for mapping character strings to numbers can be either simple or elaborate.
Since there may be a combinatorially explosive number of mapping variations, exhaustive searching or testing of them is often not possible. An optimal mapping algorithm for the best visualization may not be known, or might not even be possible. For that reason, it can be useful with certain kinds of data to be able to interactively explore some selected families of mapping schemes and their variations to find an acceptably good one, while observing possible visual deficiencies or improvements.

Perhaps the simplest, and yet often quite useful, mapping scheme is to transform alphabetical values directly to integer values. All the character string case values within a variable are sorted alphabetically (as in a dictionary). Then each unique string value is replaced with an ordinal integer, starting with 0. Thus, the last and highest ordinal assigned will be one less than the number of unique values. This mapping has the useful properties that strings which come alphabetically before other strings will have mapped values lower than the others, and that identical string values retain identically mapped integer values.

Figure 1 shows the simple alphabetical mapping and the bivariate case repetition counts for corresponding pairs of values in the example dataset for the non-numeric categorical variable values Scientific and Family.
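The alphabetical mapping and the bivariate repetition counts can be sketched as below. The names `alphabetical_codes` and `pair_counts` are illustrative, and the rows are made-up placeholders rather than values from the PLANTS extract. Note one assumption: Python's default string sort orders by code point, so it matches dictionary order only when the strings use consistent capitalization.

```python
from collections import Counter


def alphabetical_codes(values):
    """Map each unique string to its 0-based rank in sorted order."""
    ordinal = {s: i for i, s in enumerate(sorted(set(values)))}
    return [ordinal[v] for v in values]


def pair_counts(xs, ys):
    """Repetition count of each (x, y) combination of corresponding case values."""
    return Counter(zip(xs, ys))


# Placeholder rows for illustration only.
scientific = ["beta", "alpha", "beta", "gamma"]
family = ["Y", "X", "Y", "Y"]

x = alphabetical_codes(scientific)  # [1, 0, 1, 2]
y = alphabetical_codes(family)      # [1, 0, 1, 1]
counts = pair_counts(x, y)
print(counts[(1, 1)])  # 2, the pair ("beta", "Y") occurs twice
```

The highest code assigned is one less than the number of unique values, as the text describes, and identical strings always receive identical codes.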


3.2 Visibly Lost Categorical Characteristics

Basic two-dimensional scatter plots are most naturally suited for depicting two dimensions of numeric data. And, as already described, annotations to scatter plot points can accommodate and simultaneously depict up to a few additional dimensions of data. But scatter plots showing categorical data on the X and/or the Y axis can easily be non-informative or misleading if the vital repetition counts of the categorical coordinates of the plotted points are not visible.

Figure 2 shows a scatter plot of the two highly categorical Scientific and Family variables from the example dataset. The case repetition counts are not visible, and the truly unique non-repeated points cannot be distinguished from the heavily repeated ones.

3.3 Revealing Categorical Characteristics

If point annotations are used to depict the point repetition counts, then a scatter plot can become a perfectly reasonable and useful way to visualize repeated and categorical data. Any of the point annotations already described can be used for this purpose.
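One such annotation, sizing a box glyph by the repetition count of its coordinate pair, might be computed as in the sketch below. The linear scaling and the `max_side` parameter are assumptions made for illustration; the paper does not specify how count is converted to size. The function name `box_glyphs` is likewise hypothetical.

```python
from collections import Counter


def box_glyphs(xs, ys, max_side=0.9):
    """Return one (x, y, side) box per unique coordinate pair.

    The side length grows linearly with the pair's repetition count, and the
    most repeated pair gets max_side. Keeping max_side below 1.0 stops boxes
    at adjacent integer codes from touching (an assumed design choice).
    """
    pairs = Counter(zip(xs, ys))
    largest = max(pairs.values())  # raises ValueError if there are no cases
    return [(x, y, max_side * n / largest) for (x, y), n in pairs.items()]


# Integer-coded placeholder cases: the pair (0, 0) occurs twice.
xs = [0, 0, 1, 0]
ys = [0, 0, 1, 1]
for glyph in box_glyphs(xs, ys):
    print(glyph)
```

Each unique pair is drawn once, so a heavily repeated point is no longer indistinguishable from a point that occurs a single time.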


Figure 3 shows a scatter plot of the same two categorical variables Scientific and Family, but with the points plotted as boxes whose size is parameterized by the repetition count of that value pair combination. The case repetition counts are clearly visible, and the scatter plot becomes a more useful depiction of the categorical data.

An annotated scatter plot of bivariate categorality such as listed in Figure 1 and plotted as in Figure 3 need not be restricted to depicting only bivariate repetition counts. The univariate repetition counts of the unique values of each variable can be considered separately, such as listed in Figures 4 and 5.


The two univariate repetition counts of the variables Scientific and Family can be represented simultaneously by independently parameterizing the widths and heights of the plotted boxes, as in Figure 6.

An alternative to using point annotations to represent the repetition counts of the plotted coordinates is to transform the X and Y coordinate variables so that they are not precisely repeated, and so will not precisely overlay one another when plotted. This can be done by adding just enough random noise to each case value of each coordinate to visually fuzz and separate the previously overlaid points, but not so much as to make the categories overlap. Figure 7 plots the same variables as Figure 1, but with up to 10% random noise added to each case value to depict the repetition counts, instead of using glyph annotations. It bears a strong similarity to the parameterized boxes of Figure 3 and Figure 6, but with point density replacing box size.
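The noise transform can be sketched as follows. The sketch assumes that the coordinates are consecutive integer codes and that "up to 10%" means a uniform offset of at most 0.10 of the unit spacing between adjacent codes; the paper does not state the noise distribution or its reference scale. The name `jitter` and its parameters are illustrative.

```python
import random


def jitter(codes, fraction=0.10, seed=None):
    """Add uniform noise in [-fraction, +fraction] to each integer-coded value.

    Any fraction below 0.5 keeps neighboring categories from overlapping,
    because adjacent codes are one unit apart.
    """
    rng = random.Random(seed)  # a private generator keeps results repeatable
    return [c + rng.uniform(-fraction, fraction) for c in codes]


codes = [0, 0, 0, 1, 1, 2]
noisy = jitter(codes, fraction=0.10, seed=42)
print(noisy)
```

Applying `jitter` independently to the X and Y codes spreads each stack of repeated points into a small cloud whose density reflects the repetition count.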

Other important and revealing characteristics of categorality can be directly derived from the repetition counts of unique values. Perhaps the most important is essentially the reverse rank order of the repetition counts, which can be implemented in a transform called Classify. That is, all case values equal to the most commonly repeated case value are replaced by 1, all case values equal to the second most commonly repeated case value are replaced by 2, and so on. The uniqueness of repeated case values is preserved by using different but consecutive result values for tied repetition counts.

Figure 8 shows the same glyphs as in Figure 6, but the variables are each transformed by the Classify transform, which remaps the categorical values so that the larger glyphs in each dimension are closer to its origin.

3.4 Remapping Non-Numeric Values to Improve Visualization

The simplest mapping algorithms, like alphabetic to integer, might provide useful visualizations, but perhaps not the best visualizations. Mapping algorithms can also be constructed in a huge variety of other ways. One such family of algorithms utilizes the value of some numeric parameter or parameters which are independently derived for each subset of unique string case values. The Classify transform described above actually belongs to this family.

A more general parameterized categorical remapping function is remap(a,b), where a is a categorical vector variable, and b is either a numeric vector variable or a function producing such. A parameter is independently derived from each of the vector subsets of values of b which corresponds to all the cases with a unique value of a. Useful subset parameters might include the statistical minimum, maximum, mean, etc. For example, suppose the function uniques(c,d) counts the number of unique values of vector c over the subset of cases for each unique value of d. Then the function remap(a,uniques(b,a)) counts the degree of fanout to unique values of b from each unique value of a.

Figure 9 shows the same glyphs as in Figures 6 and 8, but the categorical variables Scientific and Family on the X and Y axes are each transformed by the fanout number of corresponding unique values in the other variable, as described above. The stacks of glyphs at the unique values in each dimension are the most populated away from its origin.

Useful categorical remapping transforms can be defined that preserve both the original categories and the original set of unique case values, and only permute the existing set of unique category values. One such transform is called FuzzClass, which displaces and permutes the existing set of unique categorical values by a random amount up to a parameterized value. For example, if a given categorical variable v has many repetitions of the four unique values 1, 2, 3, and 4, then FuzzClass(v,1) would have precisely the same unique values, but with a displacement of up to 1 for each. That is, all values which were 1 might have values of either 1 or 2, all values which were 2 might have values of either 1, 2, or 3, all values which were 3 might have values of either 2, 3, or 4, and all values which were 4 might have values of either 3 or 4. Similarly, the transform FuzzClass(v,4) would apply the maximum possible displacement, and completely randomize the permutation of unique values.
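The transforms of Sections 3.3 and 3.4 can be sketched together as below. These are illustrative readings, not the author's implementations. Ties in `classify` are broken by order of first appearance. `remap` takes the subset statistic as an assumed `stat` argument and, unlike Classify, does not separate categories whose parameters tie. `fuzz_class` measures displacement in rank positions among the sorted unique values and produces the permutation by sorting on randomly perturbed ranks, which is one possible construction among many.

```python
import random
from collections import Counter, defaultdict


def classify(values):
    """Most repeated value -> 1, next -> 2, ...; tied counts get distinct ranks."""
    ranked = Counter(values).most_common()  # ties keep first-appearance order
    rank = {v: i + 1 for i, (v, _) in enumerate(ranked)}
    return [rank[v] for v in values]


def uniques(c, d):
    """For each case, the number of unique c values among cases sharing its d."""
    seen = defaultdict(set)
    for cv, dv in zip(c, d):
        seen[dv].add(cv)
    return [len(seen[dv]) for dv in d]


def remap(a, b, stat=min):
    """Replace each value of a with stat() of the b values in its subset."""
    groups = defaultdict(list)
    for av, bv in zip(a, b):
        groups[av].append(bv)
    param = {av: stat(bvs) for av, bvs in groups.items()}
    return [param[av] for av in a]


def fuzz_class(values, k, seed=None):
    """Permute the unique values so that each moves at most k rank positions."""
    rng = random.Random(seed)
    ordered = sorted(set(values))
    # Sorting on rank + uniform(0, k + 1) cannot move any item more than k places.
    keys = [i + rng.uniform(0, k + 1) for i in range(len(ordered))]
    order = sorted(range(len(ordered)), key=lambda i: keys[i])
    mapping = {ordered[i]: ordered[p] for p, i in enumerate(order)}
    return [mapping[v] for v in values]


print(classify(["b", "a", "b", "c", "a", "b"]))  # [1, 2, 1, 3, 2, 1]

# Placeholder rows: family "F" fans out to two scientific names, "P" to one.
family = ["F", "F", "F", "P"]
scientific = ["s1", "s2", "s1", "s3"]
print(remap(family, uniques(scientific, family)))  # [2, 2, 2, 1]

v = [1, 2, 3, 4] * 3
print(fuzz_class(v, 1, seed=0))
```

In `fuzz_class`, identical input values always map to the same output value, so the categories and the set of unique values are preserved and only their assignment is permuted.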


Figure 10 shows the same glyphs as in Figure 9, but the variables are each transformed by FuzzClass(v,2), so that the plotted coordinate positions are each slightly mixed up and displaced by up to 2.

The fidelity of an algorithm for mapping categorical data can be examined by applying the mapping, visualizing the result, and interactively applying various remapping transforms while judging their possible detrimental effects.

3.5 Depicting a Categorical Parameter

Scatter plots showing non-categorical data on the X and Y axes can additionally show a categorical parameter on the Z axis by utilizing the plot areas between the plotted points. Essentially, the technique is to color the plot background with the same color as the nearest plotted point. Thus, the entire plot space gets partitioned by color into the categories mapped from the Z parameter, and their degree of complexity is obvious at a glance.

If the categorical Z parameter has relatively few unique values, or if only a few are so parameterized, then the clusters of plotted points falling into each of the parametric categories are quite distinctly revealed. If the category boundaries are fairly smooth, then it should be relatively easy to formulate them into functional rules which divide the case values of the X and Y variables into the categories of the Z variable. Large bumps or irregularities in the category boundaries roughly correspond to special rules or exceptions to the rules.

The colored background areas can be kept from visually overwhelming the plotted points by filling them with slightly darker shades than those used for the points themselves.
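The nearest-point background coloring can be sketched with a brute-force search over a grid of background cells, which in effect rasterizes a Voronoi partition of the plotted points. The grid resolution, the use of Euclidean distance, and the names `nearest_category` and `background_grid` are assumptions for illustration; an interactive system would likely use a faster method.

```python
def nearest_category(x, y, points):
    """Category of the plotted point closest to (x, y).

    points is a list of (px, py, category) tuples; distance is Euclidean.
    """
    return min(points, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)[2]


def background_grid(points, width, height, x_range, y_range):
    """Label each background cell with the category of its nearest point.

    Returns height rows (in order of increasing y) of width labels each,
    evaluated at the cell centers.
    """
    (x0, x1), (y0, y1) = x_range, y_range
    grid = []
    for row in range(height):
        y = y0 + (row + 0.5) * (y1 - y0) / height
        grid.append([
            nearest_category(x0 + (col + 0.5) * (x1 - x0) / width, y, points)
            for col in range(width)
        ])
    return grid


# Two placeholder points with categories "A" and "B".
points = [(0.2, 0.5, "A"), (0.8, 0.5, "B")]
for row in background_grid(points, width=4, height=2,
                           x_range=(0.0, 1.0), y_range=(0.0, 1.0)):
    print(row)  # ['A', 'A', 'B', 'B'] for each row
```

Mapping each label to a slightly darker shade of the corresponding point color then gives the background fill described above.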


Figure 11 shows a plot of the Scientific vs Common variables, but divides the plot background according to two categories of the Family parameter. The Fabaceae family is one color, and the Pittosporaceae family is another color. The visible holes in the background areas are indicative of irregularities in the categorical patterns of the parameter.

Conclusion

The notion of categorality of tabular data variables can be mathematically defined as a continuously-valued quotient between 0 and 1. Non-numeric character string values, categorical or not, can often be usefully mapped by a variety of mechanisms to numeric values for visualization purposes. Scatter plots can be used to effectively visualize partially or purely categorical tabular data by using any of a variety of annotation methods tailored to the task. Many of the same methods can often be used as well for other kinds of visualizations, such as three-dimensional scatter plots and parallel coordinates plots.

