Revitalizing the Scatter Plot


David A. Rabenhorst
IBM Research, T.J. Watson Research Center, Yorktown Heights, NY

Abstract

Computer-assisted interactive visualization has become a valuable tool for discovering the underlying meaning of tabular data, including categorical tabular data. The capabilities of traditionally mundane kinds of pictures, such as scatter plots, can be expanded to usefully depict categorical tabular data by incorporating annotations and transforms, and by integrating these extensions into an interactive system.

Keywords: annotation, categorical, data, glyph, multivariate, plot, scatter, tabular, transform, visualization.

1. Introduction

Computer graphics and interactivity are great enablers for effective visual mining of multivariate tabular data. Larger and more diverse data domains are explorable and minable. Increasingly diverse types of meaningful pictures can be created, linked, and refined in real time. Transform functions, which remap data values according to an algorithm, can be dynamically applied and cascaded onto data variables to seek even more revealing visualizations.

1.1 Tabular Data

Tabular data pervades databases, spreadsheets, and even the world wide web. The arrangement of tabular data elements in rows and columns of cells usually implies a set of relationships and dependencies among the cell values. Each column of data is potentially a dependent or an independent variable. The number of columns in a table can be called the dimensionality, and each row in a table can be called a case. Individual cell values in a table may be non-numeric (character strings), or may even be missing. Often, the meaning and implications of the relationships in a table of data are not obvious, especially if there are many cells. Visualization and increasingly sophisticated computer graphics are enormously helpful in discovering, revealing, and exploring the underlying multivariate relationships in tabular data.
1.2 Scatter Plots

The visualizations most appropriate for a given table typically depend on the type of data it contains. Relevant picture types include pie charts, bar graphs, histograms, scatter plots, parallel coordinates plots, similarity dendrograms, and others. Scatter plots are traditionally most appropriate for numeric real-valued data of low dimensionality. A typical variation on a scatter plot is the line graph, where lines are drawn between consecutive points, and the points may even be elided.

Various techniques exist for increasing the visual dimensionality of scatter plot visualizations by one, two, or even several dimensions through the use of creative annotations. Thus, a two-dimensional planar scatter plot of variables X and Y can also show additional variables by parameterizing one or more visible characteristics of the plotted points, which then might become glyphs. Color is commonly used for this purpose to show a third dimension via a color map. Glyph size and various aspects of glyph shape can be used similarly. If the parameterized visible characteristics are clearly distinguishable, then multiple methods can be used simultaneously and independently to achieve still higher visual dimensionality. However, the visual clarity and usefulness of simultaneously depicting more and more dimensions in this manner rapidly deteriorates.

Typically, the points or glyphs in a simple scatter plot are plotted independently of one another, and there may be no visible connection between them. However, such interconnections of glyphs can be used to depict additional information. For example, in a parametric snake plot, which is an extension of the line graph, connecting lines are drawn between the points of the scatter plot in increasing order of the values of an arbitrary third variable. Or, in a quadwise plot, linking lines are drawn between the corresponding points of two scatter plots, so that four dimensions can be seen.

More generally, if the plotted points are colored according to a color map, then corresponding points in another plot may be identically colored, and implicit linking of correspondence is accomplished through the point colors without drawing connecting lines. This is an especially powerful technique, because it can be used to link together an essentially arbitrary number of complementary pictures of possibly different kinds without cluttering or visibly degrading any of them. The effectiveness of such a linked combination may reach significantly beyond the sum of the effectiveness of the individual unlinked pictures.

These techniques can be used statically for unchanging pictures, but they can become even more powerful when applied dynamically to pictures that change under user control. For example, an interactive technique called brushing enables a user to dynamically refine the meaning of the colors which are automatically linked between different visualizations of the same data. Animation can be used to show additional information over time.

2. Purely and Partially Categorical Data

Categorical data is sometimes narrowly equated with non-numeric data. It typically consists of relatively few unique values which are heavily repeated, like a long spreadsheet column that contains only the values yes or no. But these same repetition characteristics could be attributed to numeric data as well. There is also the question of what it might mean for data to be only partially categorical.

This leads to a formal mathematical definition of categorality that is not restricted to merely a boolean yes or no, but includes the concept of partial categorality. It can be generalized to a continuously-valued quotient between 0 and 1 inclusive, calculated as the number of unique values in a variable that are repeated (occur more than once) divided by the total number of unique values. Thus, a variable is purely categorical, with a quotient of 1, if every unique value is repeated. If no values are repeated, then the categorality is 0. But if just some unique values are not repeated, then the categorality is partial, and valued somewhere between 0 and 1. Thus, variables of both numeric and non-numeric data can be fully, partially, or not categorical.

In this sense, numeric variables can be categorical with many repeated values if they consist of only small integers. Floating point values can be categorical if they have insufficient precision. Non-categorical floating point values can become categorical if they are transformed to lose precision, for example by rounding.

Mathematically then, a variable's categorality can be described as a scalar descriptive univariate statistic. This definition can be easily extended to a descriptive bivariate statistic as well, by considering unique combinations of pairs of corresponding case values. Similarly, categorality can be multivariate by considering unique combinations of N-tuples of values. Ranking variables by their categorality may help determine which are most suitable for certain kinds of visualizations.

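
The categorality quotient can be illustrated with a short sketch. It assumes the reading that "repeated values" means unique values occurring more than once; the function name categorality and the sample columns are hypothetical, not taken from the paper. Passing several columns gives the bivariate or multivariate form by counting unique tuples of corresponding case values.

```python
from collections import Counter

def categorality(*columns):
    """Fraction of unique values (or value tuples) that occur more than once."""
    if not columns:
        raise ValueError("at least one column is required")
    # zip() turns one column into 1-tuples, two columns into pairs, and so on,
    # so the same counting covers the univariate and multivariate cases.
    counts = Counter(zip(*columns))
    if not counts:
        return 0.0  # assumption: an empty variable is treated as not categorical
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated / len(counts)

# Hypothetical sample columns, one entry per case.
answers = ["yes", "no", "yes", "no", "yes"]
ids = [101, 102, 103, 104, 105]
mixed = ["a", "a", "b", "c"]

print(categorality(answers))       # 1.0: every unique value is repeated
print(categorality(ids))           # 0.0: no value is repeated
print(categorality(mixed))         # one of three unique values is repeated
print(categorality(answers, ids))  # 0.0: every (answer, id) pair is unique
```

Ranking the columns of a table by this statistic is then a matter of sorting them by the returned quotient.
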
The actual repetition count for each of the unique case values in either a purely or partially categorical variable may well be a vitally important characteristic for visualizing it informatively, and for discovering its relationships to other variables. An interactive system that deals with categorical data usually needs to determine the point repetition counts, and expose them either as additional derived variables, or through a Count transform, which can be easily applied and removed. When the Count transform is applied to the original categorical values, each value is replaced with its number of occurrences.

3. Visualizing Categorical Data with Scatter Plots

The following examples use data from the PLANTS database of the U.S. National Plant Center, USDA, NRCS 1999. This database includes the scientific name, accepted name/common name, family, and other characteristics for over 82,000 plants. The full plant scientific names are highly un-categorical. The plant family names are highly categorical. The plant common names are partially categorical.

A relatively small subset of this database was extracted. First, the full plant scientific names were truncated to just their first word, thus making them highly categorical. Then, the cases were filtered to include only a few hundred cases whose truncated scientific names corresponded with a common name which included the word poison. Thus, the resulting fairly small dataset consisted of poisonous plants, and plants which are closely related to poisonous plants.

3.1 Mapping Non-Numeric Values Into Numbers

Sometimes the case values of tabular data are non-numeric character strings that may or may not have a natural order. This is the case with the example dataset. When unordered, the character strings can often be mapped or transformed into relevant numbers for purposes of useful visualization, including plotting them in scatter plots.

Although such mapping can be quite useful, it should be done with some care to avoid possibly misleading visualizations. Artificially mapped numeric values which are unavoidably adjacent may exhibit or even emphasize visual proximities that are simply not relevant to the nature of the data. So, depending on the data, the actual mapping algorithm used can be visually important or even critical if there are more than two unique values involved.

Algorithms for mapping character strings to numbers can be either simple or elaborate. Since there may be a combinatorially explosive number of mapping variations, exhaustive searching or testing of them is often not possible. An optimal mapping algorithm for the best visualization may not be known, or might not even exist. For that reason, it can be useful with certain kinds of data to interactively explore some selected families of mapping schemes and their variations to find an acceptably good one, while observing possible visual deficiencies or improvements.

Perhaps the simplest, and yet often quite useful, mapping scheme is to transform alphabetical values directly to integer values. All the character string case values within a variable are sorted alphabetically (as in a dictionary). Then each unique string value is replaced with an ordinal integer, starting with 0. Thus, the last and highest ordinal assigned will be one less than the number of unique values. This mapping has the useful properties that strings which come alphabetically before other strings will have mapped values lower than the others, and that identical string values retain identically mapped integer values.

Figure 1 shows the simple alphabetical mapping and the bivariate case repetition counts for corresponding pairs of values in the example dataset for the non-numeric categorical variables Scientific and Family.
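
The simple alphabetical mapping described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name alphabetical_ordinals is hypothetical, the sample family names are illustrative only, and Python's default string ordering (by code point, so case-sensitive) stands in for dictionary order.

```python
def alphabetical_ordinals(values):
    """Map each string to the 0-based rank of its value among the sorted unique values."""
    # Assumption: plain code-point ordering approximates "alphabetical" order.
    ordered = sorted(set(values))
    code = {value: ordinal for ordinal, value in enumerate(ordered)}
    return [code[value] for value in values]

# Illustrative values only; not rows from the example dataset.
family = ["Fabaceae", "Anacardiaceae", "Fabaceae", "Pittosporaceae"]
print(alphabetical_ordinals(family))  # [1, 0, 1, 2]
```

Identical strings always receive the same ordinal, and the highest ordinal is one less than the number of unique values, matching the two properties noted in the text.
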

3.2 Visibly Lost Categorical Characteristics

Basic two-dimensional scatter plots are most naturally suited for depicting two dimensions of numeric data. And, as already described, annotations to scatter plot points can accommodate and simultaneously depict up to a few additional dimensions of data. But scatter plots showing categorical data on the X and/or the Y axis can easily be non-informative or misleading if the vital repetition counts of the categorical coordinates of the plotted points are not visible.

Figure 2 shows a scatter plot of the two highly categorical Scientific and Family variables from the example dataset. The case repetition counts are not visible, and the truly unique non-repeated points cannot be distinguished from the heavily repeated ones.

3.3 Revealing Categorical Characteristics

If point annotations are used to depict the point repetition counts, then a scatter plot can become a perfectly reasonable and useful way to visualize repeated and categorical data. Any of the point annotations already described can be used for this purpose.
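
As a rough sketch of how repetition counts might drive a point annotation, the code below counts occurrences (a Count transform as described in Section 2) and derives one box size per plotted coordinate pair. The names count_transform and glyph_sizes are hypothetical, and scaling box area (rather than side length) in proportion to the count is an assumption made here, not something the paper specifies.

```python
from collections import Counter

def count_transform(values):
    """Replace each value with the number of times it occurs in the variable."""
    counts = Counter(values)
    return [counts[v] for v in values]

def glyph_sizes(xs, ys, max_side=1.0):
    """Return {(x, y): box side} with box area proportional to the pair's repetition count."""
    pair_counts = Counter(zip(xs, ys))
    if not pair_counts:
        return {}
    largest = max(pair_counts.values())
    # Assumption: area scales with count, so the side scales with its square root.
    return {pair: max_side * (n / largest) ** 0.5 for pair, n in pair_counts.items()}

# Hypothetical ordinal codes for two categorical variables.
xs = [0, 0, 1, 0]
ys = [2, 2, 1, 2]
print(count_transform(xs))  # [3, 3, 1, 3]
print(glyph_sizes(xs, ys))  # the pair (0, 2) gets the largest box
```

Each unique pair is drawn once, so heavily repeated coordinates become visibly larger instead of silently overplotting.
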

Figure 3 shows a scatter plot of the same two categorical variables Scientific and Family, but with the points plotted as boxes whose size is parameterized by the repetition count of that value pair combination. The case repetition counts are clearly visible, and the scatter plot becomes a more useful depiction of the categorical data.

An annotated scatter plot of bivariate categorality, such as listed in Figure 1 and plotted as in Figure 3, need not be restricted to depicting only bivariate repetition counts. The univariate repetition counts of the unique values of each variable can be considered separately, such as listed in Figures 4 and 5.

The two univariate repetition counts of the variables Scientific and Family can be represented simultaneously by independently parameterizing the widths and heights of the plotted boxes, as in Figure 6.

An alternative to using point annotations to represent the repetition counts of the plotted coordinates is to transform the X and Y coordinate variables so that they are not precisely repeated, and so they will not precisely overlay one another when plotted. This can be done by adding just enough random noise to each case value of each coordinate to visually fuzz and separate the previously overlaid points, but not so much as to make the categories overlap.

Figure 7 plots the same variables as Figure 1, but with up to 10% random noise added to each case value to depict the repetition counts, instead of using glyph annotations. It bears a strong similarity to the parameterized boxes of Figure 3 and Figure 6, but with point density replacing box size.
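
The noise-based alternative can be sketched as follows. The text does not say what the 10% is measured against, so this sketch assumes the coordinates are ordinal codes one unit apart and that the noise is a fraction of that unit spacing; the function name jitter and its parameters are hypothetical.

```python
import random

def jitter(values, fraction=0.10, rng=None):
    """Add uniform noise of at most +/- fraction (of the unit category spacing) to each value."""
    if not 0 <= fraction < 0.5:
        # At 0.5 or more, neighboring integer categories could overlap.
        raise ValueError("fraction must be in [0, 0.5)")
    rng = rng or random.Random()
    return [v + rng.uniform(-fraction, fraction) for v in values]

# Hypothetical ordinal codes with repeated values.
codes = [0, 0, 0, 1, 2, 2]
fuzzed = jitter(codes, fraction=0.10, rng=random.Random(1))
# Rounding recovers the original categories because the noise stays below half a unit.
print([round(v) for v in fuzzed] == codes)  # True
```

Applying the function separately to the X and Y coordinates spreads each repeated point into a small cloud whose density reflects its repetition count.
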

Other important and revealing characteristics of categorality can be directly derived from the repetition counts of unique values. Perhaps the most important of these is the reverse rank order of the repetition counts, which can be implemented in a transform called Classify. That is, all case values equal to the most commonly repeated case value will be replaced by 1, all case values equal to the second most commonly repeated case value will be replaced by 2, and so on. The uniqueness of repeated case values is preserved by using different but consecutive result values for tied repetition counts.

Figure 8 shows the same glyphs as in Figure 6, but the variables are each transformed by the Classify transform, which remaps the categorical values so that the larger glyphs in each dimension are closer to its origin.

3.4 Remapping Non-Numeric Values to Improve Visualization

The simplest mapping algorithms, like alphabetic to integer, might provide useful visualizations, but perhaps not the best visualizations. Mapping algorithms can also be constructed in a huge variety of other ways. One such family of algorithms utilizes the value of some numeric parameter or parameters which are independently derived for each subset of unique string case values. The Classify transform described above actually belongs to this family.

A more general parameterized categorical remapping function is remap(a,b), where a is a categorical vector variable, and b is either a numeric vector variable or a function producing such. A parameter is independently derived from each of the vector subsets of values of b which corresponds to all the cases with a unique value of a. Useful subset parameters might include statistical minimum, maximum, mean, etc. For example, suppose the function uniques(c,d) counts the number of unique values of vector c over the subset of cases for each unique value of d. Then, the function remap(a,uniques(b,a)) counts the degree of fanout to unique values of b from each unique value of a.

Figure 9 shows the same glyphs as in Figures 6 and 8, but the categorical variables Scientific and Family on the X and Y axes are each transformed by the fanout number of corresponding unique values in the other variable, as described above. The stacks of glyphs at the unique values in each dimension are most populated away from its origin.

Useful categorical remapping transforms can be defined that preserve both the original categories and the original set of unique case values, and only permute the existing set of unique category values. One such transform is called FuzzClass, which displaces and permutes the existing set of unique categorical values by a random amount up to a parameterized value. For example, if a given categorical variable v has many repetitions of the four unique values 1, 2, 3, and 4, then FuzzClass(v,1) would have precisely the same unique values, but with a displacement of up to 1 for each. That is, all values which were 1 might have values of either 1 or 2, all values which were 2 might have values of either 1, 2, or 3, all values which were 3 might have values of either 2, 3, or 4, and all values which were 4 might have values of either 3 or 4. Similarly, the transform FuzzClass(v,4) would apply the maximum possible displacement, and completely randomize the permutation of unique values.
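
The transforms named in this section can be approximated by the sketch below. It is one possible reading under stated assumptions, not the paper's implementation: the Python names, the tie-breaking by first appearance in classify, the summarize parameter of remap, and the noisy-sort scheme in fuzz_class are all choices made here. In particular, fuzz_class bounds each unique value's displacement in rank position, but a large max_shift does not produce a uniformly random permutation.

```python
import random
from collections import Counter, defaultdict

def classify(values):
    """Replace each value with the rank of its repetition count (1 = most repeated)."""
    counts = Counter(values)  # remembers first-appearance order
    # Stable sort: tied counts keep first-appearance order and get consecutive ranks.
    ranked = sorted(counts, key=lambda v: -counts[v])
    rank = {v: i + 1 for i, v in enumerate(ranked)}
    return [rank[v] for v in values]

def uniques(c, d):
    """For each case, count the distinct values of c among cases sharing its d value."""
    seen = defaultdict(set)
    for cv, dv in zip(c, d):
        seen[dv].add(cv)
    return [len(seen[dv]) for dv in d]

def remap(a, b, summarize=max):
    """Replace each value of a with a parameter summarizing the b values of its cases."""
    groups = defaultdict(list)
    for av, bv in zip(a, b):
        groups[av].append(bv)
    parameter = {av: summarize(bvs) for av, bvs in groups.items()}
    return [parameter[av] for av in a]

def fuzz_class(values, max_shift, rng=None):
    """Permute the unique values so each moves at most max_shift rank positions."""
    rng = rng or random.Random()
    ordered = sorted(set(values))
    # Sorting indices by (index + noise in [0, max_shift]) bounds every displacement.
    shuffled = sorted(range(len(ordered)), key=lambda i: i + rng.uniform(0, max_shift))
    mapping = {ordered[i]: ordered[position] for position, i in enumerate(shuffled)}
    return [mapping[v] for v in values]

# Hypothetical categorical columns, one entry per case.
a = ["x", "x", "y", "x"]
b = ["p", "q", "p", "q"]
print(classify(["b", "a", "b", "c", "c", "b"]))  # [1, 3, 1, 2, 2, 1]
print(remap(a, uniques(b, a)))                   # fanout from a to b: [2, 2, 1, 2]
print(fuzz_class([1, 2, 3, 4, 1, 2], 1, random.Random(0)))
```

Because every function returns a plain list aligned with the cases, the transforms can be cascaded, for example by applying classify to the output of remap.
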

Figure 10 shows the same glyphs as in Figure 9, but the variables are each transformed by FuzzClass(v,2), so that the plotted coordinate positions are each slightly mixed up and displaced by up to 2.

The fidelity of an algorithm for mapping categorical data can be examined by applying the mapping, visualizing the result, and interactively applying various remapping transforms while judging their possible detrimental effects.

3.5 Depicting a Categorical Parameter

Scatter plots showing non-categorical data on the X and Y axes can be used to additionally show a categorical parameter on the Z axis by utilizing the plot areas between the plotted points. Essentially, the technique is to color the plot background with the same color as the nearest plotted point. Thus, the entire plot space gets partitioned by color into the categories mapped from the Z parameter, and their degree of complexity is obvious at a glance.

If the categorical Z parameter has relatively few unique values, or if only a few are so parameterized, then the clusters of plotted points falling into each of the parametric categories are quite distinctly revealed. If the category boundaries are fairly smooth, then it should be relatively easy to formulate them into functional rules which divide the case values of the X and Y variables into the categories of the Z variable. Large bumps or irregularities in the category boundaries roughly correspond to special rules or exceptions to the rules.

The colored background areas can be kept from visually overwhelming the plotted points by filling them with slightly darker shades than those used for the points themselves.
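
The background-coloring technique amounts to labeling every background cell with the category of its nearest plotted point. The sketch below does this by brute force on a small integer grid; the function name nearest_category_grid, the grid coordinate convention, and the single-letter category labels are assumptions for illustration, and an interactive system would likely use a faster nearest-neighbor method.

```python
def nearest_category_grid(points, categories, width, height):
    """Return a height x width grid holding the category of the nearest point per cell."""
    if not points or len(points) != len(categories):
        raise ValueError("points and categories must be non-empty and of equal length")
    grid = []
    for row in range(height):
        line = []
        for col in range(width):
            # Squared Euclidean distance is enough for choosing the nearest point;
            # ties go to the earliest point in the list.
            nearest = min(
                range(len(points)),
                key=lambda i: (points[i][0] - col) ** 2 + (points[i][1] - row) ** 2,
            )
            line.append(categories[nearest])
        grid.append(line)
    return grid

# Two hypothetical plotted points with categories "F" and "P".
points = [(0, 0), (3, 0)]
categories = ["F", "P"]
print(nearest_category_grid(points, categories, width=4, height=1))
# [['F', 'F', 'P', 'P']]
```

Mapping each grid label to a darkened shade of the corresponding point color then gives the partitioned background described above.
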

Figure 11 shows a plot of the Scientific vs Common variables, but divides the plot background according to two categories of the Family parameter. The Fabaceae family is one color, and the Pittosporaceae family is another color. The visible holes in the background areas are indicative of irregularities in the categorical patterns of the parameter.

Conclusion

The notion of categorality of tabular data variables can be mathematically defined as a continuously-valued quotient from 0 to 1. Non-numeric character string values, categorical or not, can often be usefully mapped by a variety of mechanisms to numeric values for visualization purposes. Scatter plots can be used to effectively visualize partially or purely categorical tabular data by using any of a variety of annotation methods tailored to the task. Many of the same methods can often be used as well for other kinds of visualizations, such as three-dimensional scatter plots and parallel coordinates plots.



More information

Use of GeoGebra in teaching about central tendency and spread variability

Use of GeoGebra in teaching about central tendency and spread variability CREAT. MATH. INFORM. 21 (2012), No. 1, 57-64 Online version at http://creative-mathematics.ubm.ro/ Print Edition: ISSN 1584-286X Online Edition: ISSN 1843-441X Use of GeoGebra in teaching about central

More information
