Statistics 111 - Lecture 6
Looking at data: one variable
Chapter 1.1, Moore, McCabe and Craig

Probability vs. Statistics

Probability
1. We know the distribution of the random variable (Normal, Binomial).
2. We know how to compute different parameters of the random variable (expected value, variance).

Statistics
1. We have a sample (dataset) of n observations.
2. We don't know the underlying distribution.
3. We try to estimate and make inferences about the parameters of the population.
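A rough sketch of the contrast above: the first half treats the distribution as known (the probability view), and the second half pretends only a sample is available and estimates the parameters from it (the statistics view). The Normal population with mean 68 and standard deviation 3, the seed, and the sample size are made-up values for illustration, not taken from the lecture.

```python
import random
import statistics

# Probability view: the distribution is known, so its parameters are known.
# Hypothetical population: Normal with mean 68 and standard deviation 3.
TRUE_MEAN = 68.0
TRUE_SD = 3.0


def draw_sample(n, seed=111):
    """Draw n observations from the (assumed) Normal population."""
    rng = random.Random(seed)
    return [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]


# Statistics view: we only see the sample and estimate the parameters.
sample = draw_sample(500)
estimated_mean = statistics.mean(sample)
estimated_variance = statistics.variance(sample)  # sample variance (n - 1 divisor)

print(f"true mean {TRUE_MEAN}, estimated mean {estimated_mean:.2f}")
print(f"true variance {TRUE_SD ** 2}, estimated variance {estimated_variance:.2f}")
```

The estimates should land near the true values but will not match them exactly; that gap is what statistical inference has to quantify.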
Dataset Definitions
Individuals are the objects described in a set of data.
A variable is any characteristic of an individual. A variable can take different values for different individuals.
A categorical variable places an individual into one of several groups.
  Examples: gender, race
A quantitative variable takes on numerical values, which are usually treated as continuous.
  Examples: height, age, wages

Distributions
A distribution describes what values a variable takes and how frequently these values occur.
The distribution of a variable can be described graphically:
  Categorical variable: bar plot, pie chart
  Quantitative variable: boxplot, histogram
Characteristics of distributions: center, spread, shape, outliers
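To make "what values a variable takes and how frequently these values occur" concrete, here is a minimal sketch that tabulates a categorical variable and prints a crude text bar plot. The color responses and the helper name `frequency_table` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical answers to a "favorite color" question (made-up data).
colors = ["blue", "red", "blue", "green", "blue", "red", "purple", "green"]


def frequency_table(values):
    """Return {category: (count, relative frequency)} for a categorical variable."""
    counts = Counter(values)
    n = len(values)
    return {category: (count, count / n) for category, count in counts.items()}


table = frequency_table(colors)
for category, (count, share) in sorted(table.items(), key=lambda kv: -kv[1][0]):
    # A crude text bar plot: one '#' per individual in the category.
    print(f"{category:<7} {'#' * count:<4} {count} ({share:.3f})")
```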
Barplots and Pie Charts
For categorical variables, we can graph the distribution using bar plots and pie charts.
[Figure: Your favorite color]
Barplots and Pie Charts
Pie charts are generally not as useful as bar plots:
Need to have all categories to make a pie chart
  harder to compare subsets of categories
Scale of pie charts can sometimes be misleading
  harder to see small differences

Boxplots
Box plots are an effective tool for conveying information about continuous variables.
The box contains the central 50% of the data, with a line indicating the median.
  The median is the value with 50% of the data on either side.
The whiskers contain most of the rest of the data, except for suspected outliers.
  Outliers are suspiciously large or small values.
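The pieces of a boxplot described above can be computed directly. The sketch below is one possible version: the rule that flags values more than 1.5 times the interquartile range beyond the box is a common convention assumed here (the slides do not name a rule), the shoe sizes are made up, and `boxplot_summary` is a hypothetical helper. Quartiles come from Python's `statistics.quantiles`, whose default interpolation may differ slightly from other software.

```python
import statistics


def boxplot_summary(values, k=1.5):
    """Compute the pieces a boxplot draws.

    The k * IQR fence for flagging outliers is a common convention,
    assumed here; the lecture does not specify a rule.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    iqr = q3 - q1  # width of the box (central 50% of the data)
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),  # most extreme non-outlier values
        "outliers": outliers,
    }


# Made-up shoe sizes, not the class data.
shoe_sizes = [6, 7, 7.5, 8, 8, 8.5, 8.5, 9, 9.5, 10, 10, 11, 16]
summary = boxplot_summary(shoe_sizes)
print(summary)
```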
Boxplots
Box plots were originally designed to visually diagnose a normal distribution.

Boxplot: Shoe Size of Stat 111 Class
Almost all values are between 5 and 13
50% of values are between 7.5 and 10
Center (median) is around 8.5
A couple of suspected outliers: 14 and 14.5
6/3/2010

Summary of Boxplots
Useful for displaying the center and spread of a distribution, as well as potential outliers.
However, a boxplot doesn't really give us much of an idea of the shape of the distribution.
Histograms are much better graphical summaries of shape.

Histograms
Histograms emphasize the frequency of different values in the distribution.
[Figure: histogram of Height (x-axis, 60 to 74) against Frequency (y-axis, 0 to 12)]
X-axis: values are divided into bins
Y-axis: the height of each bin is the frequency with which values from that bin appear in the dataset
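The binning step behind a histogram can be sketched as follows. The heights, the bin width of 2, and the helper `histogram_counts` are illustrative assumptions; the class data is not reproduced here.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each bin [left, left + width).

    Values outside the covered range are ignored in this sketch.
    """
    counts = [0] * n_bins
    for x in values:
        index = int((x - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Made-up heights in inches, not the class data.
heights = [61, 63, 64, 64, 65, 66, 66, 66, 67, 67, 68, 68, 69, 70, 71, 73]
counts = histogram_counts(heights, start=60, width=2, n_bins=7)

for i, count in enumerate(counts):
    left = 60 + 2 * i
    # One '#' per observation, so bar length equals the bin's frequency.
    print(f"[{left}, {left + 2}): {'#' * count}")
```

Changing `width` changes how the same data looks, which is worth keeping in mind when reading shape off a histogram.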
Another Example: Height in Stat 111
[Figure: two histograms of Height (60 to 72), one with Frequency (0 to 12) and one with Density (0.00 to 0.15) on the vertical axis]
The vertical axis is sometimes the density (or relative frequency): equal to the frequency of the bin divided by the total number of observations.

Histograms versus Boxplots
Both graphs give a good idea of the spread.
Boxplots may be a little clearer in terms of the center and outliers of a distribution.
[Figure annotations: center, outliers, spread of likely values, center]
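A small sketch of the rescaling described above, using the slide's definition (bin frequency divided by the total number of observations). The bin frequencies and the helper `relative_frequencies` are hypothetical.

```python
def relative_frequencies(counts):
    """Divide each bin's frequency by the total number of observations."""
    total = sum(counts)
    return [count / total for count in counts]


# Hypothetical bin frequencies for a histogram of 16 heights.
counts = [1, 1, 3, 5, 3, 2, 1]
rel_freq = relative_frequencies(counts)

for count, share in zip(counts, rel_freq):
    print(f"frequency {count:>2} -> relative frequency {share:.4f}")

# The shape is unchanged; only the vertical scale differs.
print("sum of relative frequencies:", sum(rel_freq))
```

Some software goes one step further and also divides by the bin width, so that the total area of the bars equals 1; the shape of the histogram is the same either way.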
Histograms versus Boxplots
Histograms are much more effective at displaying the shape of a distribution:
Skewness: departure from left-right symmetry
Multi-modality: presence of multiple high-frequency values
[Figure annotations: "clearly not symmetric"; "not symmetric?"; "second peak"]

Symmetry - Histograms vs. Boxplots
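Skewness can also be checked numerically. The sketch below uses a standard moment-based skewness coefficient, which is an assumption of this example and not something the slides define; both datasets are made up. A value near 0 suggests left-right symmetry, and a positive value suggests a longer right tail.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (assumed formula, not from the lecture)."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]      # mirror image around 3
right_skewed = [1, 1, 1, 2, 2, 3, 4, 7, 12]  # long right tail

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(
        f"{name}: mean={statistics.fmean(data):.2f}, "
        f"median={statistics.median(data)}, "
        f"skewness={sample_skewness(data):.2f}"
    )
```

In the right-skewed dataset the mean is pulled above the median by the long tail, which is the numerical counterpart of the asymmetry a histogram shows.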
Density Curves
It is often easier to examine a distribution with a smooth curve instead of a histogram.
Example: vocabulary scores from 947 seventh graders in Gary, Indiana

Example with Test Score Data
The number of scores less than 6 in the population is 287 out of 947, so the relative frequency is 287/947 = 0.303.
Using a density curve (normal distribution), the approximate relative frequency is 0.293.
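The comparison above can be sketched in code: the exact relative frequency comes straight from the counts on the slide, while the density-curve approximation is the area under a normal curve to the left of 6. The slide does not state the parameters of the normal curve, so the mean of 6.84 and standard deviation of 1.55 below are assumptions for illustration, and the result need not reproduce 0.293 exactly.

```python
import math


def normal_cdf(x, mean, sd):
    """Area under a normal density curve to the left of x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# From the slide: 287 of the 947 scores are less than 6.
relative_frequency = 287 / 947

# ASSUMED parameters for the normal curve; the slide does not state them.
ASSUMED_MEAN = 6.84
ASSUMED_SD = 1.55
approx = normal_cdf(6.0, ASSUMED_MEAN, ASSUMED_SD)

print(f"relative frequency from data: {relative_frequency:.3f}")
print(f"area under assumed normal curve: {approx:.3f}")
```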
Approximations
Real data will never exactly fit a density curve, i.e., be exactly symmetric or normally distributed.

Graphs that made a difference.
Time to JMP!