IAT 355 Visual Analytics. Data and Statistical Models. Lyn Bartram


1 IAT 355 Visual Analytics Data and Statistical Models Lyn Bartram

2 Exploring data. Example: US Census. People: # of people in group. Year: (every decade). Age. Sex (Gender): Male, Female. Marital status: Single, Married, Divorced, ... 2348 data points. Data and Image Models IAT 4355. Slide adapted from Jeff Heer

3 Census data: What type (N, O, Q)? Example: US Census. People: Q-Ratio. Year: Q-Interval. Age: Q-Ratio. Sex (Gender): N. Marital status: N. Slide adapted from Jeff Heer

4 Census data: What type (N, O, Q)? Example: US Census. People: Count, Measure (dependent variable). Year: Dimension. Age: ??. Sex (Gender): Dimension. Marital status: Dimension. Slide adapted from Jeff Heer

5 Roll-up and Drill-Down. Want to examine marital status in each decade? Roll up the data along the desired dimension. Slide adapted from Jeff Heer

6 Roll-up and Drill-Down. Need more detailed information? Drill down into additional dimensions. Slide adapted from Jeff Heer
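Roll-up and drill-down can be sketched as grouping the same rows by fewer or more dimensions. The rows, field names and counts below are made up for illustration (they are not the census figures from the slides), and `aggregate` is a hypothetical helper, not part of any tool used in the course.

```python
from collections import defaultdict

# Hypothetical census-style rows: (year, age_group, sex, marital_status, people).
# The counts are invented for illustration only.
ROWS = [
    (1900, "20-29", "Male", "Single", 50),
    (1900, "20-29", "Female", "Single", 40),
    (1900, "30-39", "Male", "Married", 70),
    (1900, "30-39", "Female", "Married", 75),
    (1910, "20-29", "Male", "Single", 55),
    (1910, "20-29", "Female", "Married", 45),
    (1910, "30-39", "Male", "Married", 80),
    (1910, "30-39", "Female", "Divorced", 5),
]
FIELDS = ("year", "age_group", "sex", "marital_status")


def aggregate(rows, dims):
    """Sum the people measure, keeping only the dimensions listed in dims."""
    idx = [FIELDS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]  # last column is the measure
    return dict(totals)


# Roll-up: marital status in each decade (age and sex are summed away).
rollup = aggregate(ROWS, ("year", "marital_status"))

# Drill-down: add sex as a further dimension to split each total.
drilldown = aggregate(ROWS, ("year", "marital_status", "sex"))

for key, people in sorted(rollup.items()):
    print(key, people)
```

Both views cover the same people, so their grand totals agree; only the level of detail changes.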

7 Slide adapted from Jeff Heer

8 Distribution is important for understanding data. Visualization helps us see relations, or trends in them, as visual patterns. A lot of what we visualize are descriptive statistics. Example: mean income vs. median income. We need to ensure that the aggregate units of visualization are legitimate. Rule: check your core units/variables. If they are descriptive, look at the distribution.
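A small sketch of the mean vs. median income example. The income values are invented; the point is only that one very large value pulls the mean far from the median, which is why the distribution should be checked before trusting an average.

```python
from statistics import mean, median

# Hypothetical annual incomes (made-up values); one very high earner skews the set.
incomes = [28_000, 31_000, 35_000, 38_000, 42_000, 45_000, 1_000_000]

avg = mean(incomes)
mid = median(incomes)

print(f"mean   = {avg:,.0f}")
print(f"median = {mid:,.0f}")
```

Here the median stays at a value a typical person in the group actually earns, while the mean describes nobody in the group.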

9 Example: job losses in US over time

10 Example: job losses in US over time

12 Visualizing distribution. We can't really tell much about this data set. Even the min and max are hard to see. We can get a better idea of this data by looking at its distribution. [Chart: Data Values plotted against X-axis labels]

13 Data distribution. Measures of dispersion characterise how spread out the distribution is, i.e., how variable the data are. Commonly used measures of dispersion include: 1. Range; 2. Variance and standard deviation; 3. Coefficient of variation (or relative standard deviation); 4. Inter-quartile range.
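The four measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on a made-up sample; it assumes sample (n - 1) variance and the module's default quartile method, and other tools may use slightly different quartile conventions.

```python
from statistics import mean, quantiles, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

data_range = max(data) - min(data)   # 1. range
var = variance(data)                 # 2. sample variance (divides by n - 1)
sd = stdev(data)                     #    sample standard deviation
cv = sd / mean(data)                 # 3. coefficient of variation (relative SD)
q1, q2, q3 = quantiles(data, n=4)    # quartiles, default "exclusive" method
iqr = q3 - q1                        # 4. inter-quartile range

print(f"range={data_range} variance={var:.3f} sd={sd:.3f} cv={cv:.3f} iqr={iqr}")
```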

14 Distribution and symmetry. Median, mean and mode of symmetric, positively skewed and negatively skewed data. [Figure panels: symmetric, positively skewed, negatively skewed] January 15, 2014. Data Mining: Concepts and Techniques. Adapted from Han, Kamber and Pei 2013

15 Properties of Normal Distribution Curve. The normal (distribution) curve (μ: mean, σ: standard deviation): from μ - σ to μ + σ contains about 68% of the measurements; from μ - 2σ to μ + 2σ contains about 95% of them; from μ - 3σ to μ + 3σ contains about 99.7% of them.
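The 68 / 95 / 99.7 figures can be checked against the normal cumulative distribution function. A minimal sketch using `statistics.NormalDist` (Python 3.8+); `coverage` is a helper name introduced here for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k):
    """Probability mass within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sigma of the mean: {coverage(k):.2%}")
```

Because the result depends only on the number of standard deviations, the same proportions hold for any mean and standard deviation.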

16 Normal and Skewed Distributions. When data are skewed, the mean and SD can be misleading. Skewness: sk = 3(mean - median)/SD. If |sk| > 1 then the distribution is non-symmetrical. Negatively skewed: mean < median, sk is negative. Positively skewed: mean > median, sk is positive.
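The skewness formula on this slide can be sketched directly. The slide does not say whether SD is the sample or population standard deviation; the sketch assumes the sample version, and both data sets are made up.

```python
from statistics import mean, median, stdev


def pearson_skew(values):
    """sk = 3 * (mean - median) / SD, using the sample standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)


right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]          # long tail of high values
left_tailed = [-20, -4, -3, -3, -3, -2, -2, -1]   # mirror image of the above
symmetric = [1, 2, 3, 4, 5]

for name, values in [("right-tailed", right_tailed),
                     ("left-tailed", left_tailed),
                     ("symmetric", symmetric)]:
    print(f"{name:12s} sk = {pearson_skew(values):+.3f}")
```

The sign follows the slide: the mean sits above the median for the right-tailed set (positive sk) and below it for the left-tailed set (negative sk).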

17 Measuring the Dispersion of Data. Quartiles, outliers and boxplots. Quartiles: Q1 (25th percentile), Q3 (75th percentile). Inter-quartile range: IQR = Q3 - Q1. Five-number summary: min, Q1, median, Q3, max. Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually. Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1. Variance and standard deviation (sample: s, population: σ): measure of how spread out the numbers are. Adapted from Han, Kamber and Pei
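A minimal sketch of the five-number summary that a boxplot draws. It assumes the "inclusive" quartile method from the standard library; quartile conventions differ between tools, so other software may report slightly different Q1 and Q3 for the same data. The scores are made up.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data
summary = five_number_summary(scores)
print(summary)
```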

18 Measures of variance. Variance: one measure of dispersion (deviation from the mean) of a data set; it is the average squared deviation from the mean. The larger the variance, the greater the standard deviation. Standard deviation: the square root of the variance, roughly the typical deviation from the mean of a data set. It indicates overall how spread out the data values are. Variance and SD are critical in analysing your data distribution and in determining how meaningful the chosen average is.

19 Graphic Displays of Basic Statistical Descriptions. Boxplot: graphic display of the five-number summary. Histogram: x-axis shows values, y-axis represents frequencies. Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane. Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i. Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another. Adapted from Han, Kamber and Pei

20 Inter-quartile range. The median divides a distribution into two halves. The first and third quartiles (denoted Q1 and Q3) are defined as follows: 25% of the data lie below Q1 (and 75% lie above Q1); 25% of the data lie above Q3 (and 75% lie below Q3). The inter-quartile range (IQR) is the difference between the first and third quartiles, i.e. IQR = Q3 - Q1.

21 Outliers. An outlier is a datum which does not appear to belong with the other data. Outliers can arise because of a measurement or recording error, or because of equipment failure during an experiment, etc. An outlier might be indicative of a sub-population, e.g. an abnormally low or high value in a medical test could indicate the presence of an illness in the patient.

22 Box-plots. A box-plot is a visual description of the distribution based on: minimum, Q1, median, Q3, maximum. Useful for comparing large sets of data.

23 Example 1: Box-plot

24 Outlier Boxplot. Re-define the lower and upper limits of the boxplot (the whisker lines) as: Lower limit = Q1 - 1.5 x IQR, and Upper limit = Q3 + 1.5 x IQR. Note that the lines may not go as far as these limits. If a data point is < lower limit or > upper limit, the data point is considered to be an outlier.
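The outlier rule can be sketched as follows, assuming the usual 1.5 x IQR multiplier and the standard library's "inclusive" quartile method. The function names and the readings are hypothetical.

```python
from statistics import quantiles


def outlier_limits(values, k=1.5):
    """Return (lower limit, upper limit) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values lying outside the limits."""
    lower, upper = outlier_limits(values, k)
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 12, 13, 14, 15, 15, 16, 40]  # made-up data
limits = outlier_limits(readings)
outliers = find_outliers(readings)
print(limits, outliers)
```

In a boxplot drawn this way the upper whisker would stop at 16, the largest value inside the limits, rather than at the limit itself.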

25 Visualization of Data Dispersion: 3-D Boxplots. Data Mining: Concepts and Techniques

26 Histogram. Q → Q → N. Most common form: split the data range into equal-sized bins and count the number of points from the data set that fall into each bin. Vertical axis: frequency (i.e., counts for each bin). Horizontal axis: response variable. The histogram graphically shows the following: 1. center (i.e., the location) of the data; 2. spread (i.e., the scale) of the data; 3. skewness of the data; 4. presence of outliers; and 5. presence of multiple modes in the data.
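Equal-width binning can be sketched in a few lines. This is an illustrative helper, not a library function: it assumes at least two distinct values (otherwise the bin width would be zero) and places the maximum value in the last bin. The scores are made up.

```python
def histogram(values, bins):
    """Split [min, max] into equal-width bins and count the values in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value goes in the last bin rather than a bin of its own.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


scores = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75, 78, 82, 85, 91, 92]
edges, counts = histogram(scores, bins=4)

# Text rendering: one row per bin, one '#' per data point.
for left, right, c in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * c}")
```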

27 Histograms Often Tell More than Boxplots. The two histograms shown on the left may have the same boxplot representation: the same values for min, Q1, median, Q3, max. But they have rather different data distributions.

28 Plotting the distribution. Determine a frequency table (bins). A histogram is a column chart of the frequencies. [Frequency table and chart: Category Labels vs. Frequency, with a top bin of >90; x-axis: Scores]

29 Issues with Histograms. For small data sets, histograms can be misleading: small changes in the data or to the bucket boundaries can result in very different histograms. Interactive bin-width example (online applet). For large data sets, histograms can be quite effective at illustrating general properties of the distribution. Histograms effectively only work with 1 variable at a time: they are difficult to extend to 2 dimensions and not possible for >2. So histograms tell us nothing about the relationships among variables.
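The sensitivity to bucket boundaries is easy to reproduce. In this sketch the same nine made-up values are binned twice with the same bin width, shifting only where the boundaries fall; `bin_counts` is a hypothetical helper.

```python
def bin_counts(values, start, width, bins):
    """Count values in `bins` consecutive intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < bins:  # values outside the covered range are ignored
            counts[i] += 1
    return counts


small = [9, 10, 11, 19, 20, 21, 29, 30, 31]  # made-up data

# Boundaries at 0, 10, 20, 30, 40
aligned = bin_counts(small, start=0, width=10, bins=4)
# Boundaries at 5, 15, 25, 35
shifted = bin_counts(small, start=5, width=10, bins=3)

print("boundaries at 0, 10, 20, ...:", aligned)
print("boundaries at 5, 15, 25, ...:", shifted)
```

One choice suggests an uneven distribution with a peak in the middle, the other a perfectly flat one, although the data are identical.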

30 Scatter plot. Provides a first look at bivariate data to see clusters of points, outliers, etc. Each pair of values is treated as a pair of coordinates and plotted as points in the plane.

31 Positively and Negatively Correlated Data. The left half fragment is positively correlated. The right half is negatively correlated.

32 Uncorrelated Data

33 Correlation. A correlation exists between two variables when one of them is related to the other in some way. A scatterplot is a graph in which the paired (x, y) sample data are plotted as points. The linear correlation coefficient r measures the strength of the linear relationship; it is also called the Pearson correlation coefficient. It ranges from -1 to 1. r = 1 represents a perfect positive correlation. r = 0 represents no linear correlation. r = -1 represents a perfect negative correlation. Slide adapted from David Lippman's
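A minimal sketch of the Pearson coefficient, written out from its definition (covariance divided by the product of the spreads). The function name and the data are illustrative; the function assumes neither variable is constant, since that would make the denominator zero.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Linear (Pearson) correlation coefficient of paired samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y = 2x: perfect positive
falling = [10, 8, 6, 4, 2]   # perfect negative
tent = [1, 2, 3, 2, 1]       # clearly related to x, but not linearly

for name, ys in [("rising", rising), ("falling", falling), ("tent", tent)]:
    print(f"{name:8s} r = {pearson_r(xs, ys):+.3f}")
```

The tent-shaped data give r of zero even though y clearly depends on x, which is why r should be read as a measure of linear association only.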

34 Correlation. Assesses the linear relationship between two variables. Example: height and weight. Strength of the association is described by a correlation coefficient r. Bands, from weakest to strongest (the numeric ranges for the first four are missing in the source): low, probably meaningless; low, possible importance; moderate correlation; high correlation; r = .8 to 1, very high correlation. Can be positive or negative. Pearson's and Spearman correlation coefficients. Tells nothing about causation.

35 [Figure panels: perfect positive correlation, r = 1; strong positive correlation, r = 0.99; positive correlation, r = 0.80; strong negative correlation, r = (value missing); no correlation, r = 0.16; non-linear relationship] Slide adapted from David Lippman's

36 Meanings. r² represents the proportion of the variation in y that is explained by the linear relationship between x and y. Example: using the heights and weights for a group of people, you find the correlation coefficient to be r = 0.796, so r² = 0.634. So we conclude that about 63.4% of the variation in people's weights can be explained by the relationship between height and weight. This suggests that 36.6% of the variation in weights cannot be explained by height. Slide adapted from David Lippman's
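The reading of r² as "proportion of variation explained" can be checked numerically: for a least-squares line, r² equals 1 minus the ratio of residual variation to total variation. The heights and weights below are invented (they are not the data behind r = 0.796), so only the final line reproduces the slide's arithmetic.

```python
from math import sqrt

heights = [150, 160, 165, 170, 180, 185]  # cm, made-up values
weights = [52, 58, 66, 64, 80, 79]        # kg, made-up values

n = len(heights)
mx, my = sum(heights) / n, sum(weights) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(heights, weights))
sxx = sum((x - mx) ** 2 for x in heights)
syy = sum((y - my) ** 2 for y in weights)  # total variation in y

r = sxy / sqrt(sxx * syy)

# Least-squares line y = intercept + slope * x
slope = sxy / sxx
intercept = my - slope * mx
sse = sum((y - (intercept + slope * x)) ** 2
          for x, y in zip(heights, weights))  # unexplained variation
explained_fraction = 1 - sse / syy

print(f"r^2 = {r ** 2:.4f}, explained fraction = {explained_fraction:.4f}")

# The slide's own arithmetic: r = 0.796 gives r^2 of about 0.634.
slide_r_squared = round(0.796 ** 2, 3)
```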

37 r² in Tableau

38 Example: Relationship between Tree Circumference and Height. [Scatter plot axes: Height (ft), Circumference (ft)] Slide adapted from David Lippman's

39 Relationship between Tree Circumference and Height. [Scatter plot axes: Height (ft), Circumference (ft)] Outliers can strongly influence the graph of the regression line and inflate the correlation coefficient. In the above example, removing the outlier drops the correlation coefficient from r = (value missing) to r = (value missing). Slide adapted from David Lippman's
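The effect of a single outlier on r can be reproduced with a small sketch. The tree measurements below are hypothetical (the slide's actual data and r values did not survive), and `pearson_r` is a helper written out here from the definition of the coefficient.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Linear (Pearson) correlation coefficient of paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical trees: circumference (ft) and height (ft). The last tree is far
# larger than the rest and acts as the outlier.
circumference = [2, 3, 2, 4, 3, 3, 14]
height = [30, 28, 33, 29, 34, 31, 80]

r_with = pearson_r(circumference, height)
r_without = pearson_r(circumference[:-1], height[:-1])

print(f"with outlier:    r = {r_with:+.3f}")
print(f"without outlier: r = {r_without:+.3f}")
```

With the outlier the coefficient is close to 1; without it the remaining six trees show only a weak relationship, so the apparent strength came almost entirely from one point.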

40 Correlation. [Figure panels: correlation coefficient 0; correlation coefficient .3] Source: Altman. Practical Statistics for Medical Research

41 Correlation. [Figure panels: correlation coefficient -.5; correlation coefficient .7] Source: Altman. Practical Statistics for Medical Research

42 Summary. Statistical models serve to inspect and categorise the nature of trends and relations between variables and factors (effects). Distribution is a critical element in deciding what statistical measures to use; it should be the lens by which you determine the appropriate metric. Eyeballing your distribution is a first step in forming your next queries.

43 Four sets of data with the same correlation of 0.816
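The slide does not name its data, but four data sets sharing a correlation of 0.816 matches Anscombe's quartet; the sketch below assumes that is what is shown and uses the published quartet values. `pearson_r` is a helper written out from the definition of the coefficient.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Linear (Pearson) correlation coefficient of paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Anscombe's quartet (assumed to be the data on the slide).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: pearson_r(x, y) for name, (x, y) in QUARTET.items()}
for name, r in correlations.items():
    print(f"set {name:3s} r = {r:.3f}")
```

The four coefficients agree to three decimals, yet a scatter plot of each set looks completely different (a linear cloud, a curve, a line with one outlier, and a vertical stack with one outlier), which is the argument for looking at the distribution before relying on a summary statistic.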

44 [Tableau sheet: average of user_score vs. average of actual_score. Color shows details about action_name.]

45 [Tableau sheet: average of actual_score for each display (chart, treewithberry, treewithoutber..). Color shows details about action_name.]
