Stat 528 (Autumn 2008)
Density Curves and the Normal Distribution
Reading: Section 1.3

Density curves
An example: GRE scores
Measures of center and spread
The normal distribution
Features of the normal distribution
Useful rules of thumb
z-scores and the standard normal
Calculations using the normal distribution
Normal quantile plots (checking normality graphically)
Density curves
Think of the histogram. It is often useful to have a mathematical description of the distributions we observe in data.
A density curve is a function describing the height of a relative frequency histogram at given values of the distribution. It is a mathematical model for the pattern of a distribution.
The density curve always has non-negative height.
The area under the density curve is one.
Is it possible to model our data using a smooth curve? If so, then given a useful model for the data we observe, we can answer questions about the data using the model.
An example: GRE scores
The Graduate Record Examinations (GRE) are widely used to help predict the performance of applicants to graduate schools. Suppose a psychology department at a university has 34 applicants with the following quantitative GRE scores:
569 528 698 676 543 702 436 482 655 500 536 617
548 334 620 567 605 564 545 647 579 518 465 744
645 513 728 627 584 572 449 575 399 797
[Figure: histogram of the GRE scores. x-axis: GRE score (350 to 800); y-axis: Frequency (0 to 9).]
Histogram of the GRE scores
Summarize the distribution of the GRE scores. Is there a smooth curve that gives a good summary of this distribution?
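As a starting point for the summary, the sketch below computes a few numerical summaries of the 34 scores and prints a rough text histogram. The bin width of 50 starting at 300 is an assumption made for illustration; the bins used in the slide's histogram are not stated.

```python
import statistics

# GRE scores copied from the example slide (34 applicants).
scores = [
    569, 528, 698, 676, 543, 702, 436, 482, 655, 500, 536, 617,
    548, 334, 620, 567, 605, 564, 545, 647, 579, 518, 465, 744,
    645, 513, 728, 627, 584, 572, 449, 575, 399, 797,
]

n = len(scores)
mean = statistics.mean(scores)
median = statistics.median(scores)
sd = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)

# Assumed bins for illustration only: width 50, first bin starting at 300.
bin_width = 50
bin_start = 300
counts = {}
for s in scores:
    left = bin_start + ((s - bin_start) // bin_width) * bin_width
    counts[left] = counts.get(left, 0) + 1

print(f"n = {n}, mean = {mean:.1f}, median = {median:.1f}, sd = {sd:.1f}")
for left in sorted(counts):
    # One star per applicant whose score falls in [left, left + bin_width).
    print(f"{left}-{left + bin_width - 1}: {'*' * counts[left]}")
```

A roughly symmetric, single-peaked shape in this output suggests that a smooth bell-shaped curve may be worth trying as a model.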
Measures of center and spread for density curves
The median of a density curve is the value for which 50% of the area under the curve is on the left and 50% is on the right.
The mean of a density curve is the balancing point.
The mode(s) of a density curve is/are the value(s) with the highest density (height of the curve).
The mean and median are the same for a symmetric density curve.
Calculating the measures of spread for a density curve requires some math (calculus).
An example
Example 1.83: Figure 1.36 of the book (page 85) displays three density curves, each with three points marked on it. At which points do the mean and median fall?
The normal distribution
The most important distribution in statistics. It turns up everywhere, e.g., heights and weights, test scores, measurement errors in scientific experiments, concentrations of chemicals.
It also arises because certain averages are approximately normally distributed (see later).
A normal distribution is described by a density curve with a given equation. The height of the density curve at a point x is
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$
The normal distribution is determined by two parameters called µ and σ. µ can be any value, but σ > 0.
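The density formula can be evaluated directly. The sketch below is a minimal illustration: the function names `normal_pdf` and `area_under_curve` are made up for this example, and the area check uses a simple midpoint rule over µ ± 8σ, which is an arbitrary but generous range.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the N(mu, sigma) density curve at x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def area_under_curve(mu, sigma, width=8.0, steps=20000):
    """Midpoint-rule approximation of the area over mu +/- width * sigma."""
    a = mu - width * sigma
    b = mu + width * sigma
    h = (b - a) / steps
    return h * sum(normal_pdf(a + (k + 0.5) * h, mu, sigma) for k in range(steps))

print(normal_pdf(0.0))             # peak height of the standard normal curve
print(normal_pdf(5.0, 5.0, 5.0))   # peak height when mu = 5, sigma = 5
print(area_under_curve(5.0, 5.0))  # should be very close to 1
```

Changing µ shifts the curve without changing its shape; increasing σ spreads the curve out and lowers its peak, so the total area stays at one.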
Plots of the normal distribution
[Figure: four panels plotting the normal density f(x) against x, with panel titles mu=0, sigma=1; mu=5, sigma=1; mu=0, sigma=5; mu=5, sigma=5.]
Features of the normal distribution
The distribution is symmetric, so the median is equal to µ.
There are points of inflection at µ - σ and µ + σ. What this means: the curve turns downwards between µ - σ and µ + σ, and turns upwards outside µ - σ and µ + σ.
The mean of a normal distribution is µ.
The standard deviation is σ.
Useful rules of thumb for the normal distribution
For a normal distribution with mean µ and standard deviation σ:
about 68% of the observations are in the range µ - σ to µ + σ.
about 95% of the observations are in the range µ - 2σ to µ + 2σ.
about 99.7% of the observations are in the range µ - 3σ to µ + 3σ.
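These rules of thumb can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `prob_within` is made up for this example.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Proportion of a N(mu, sigma) distribution within k standard deviations of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    # The proportion does not depend on which mu and sigma are used.
    print(f"within {k} sd: {prob_within(k):.4f}")
```

The computed proportions are slightly different from the rounded figures of 68%, 95% and 99.7%, which is why the rules say "about".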
Example
Systolic pressure is the force of blood in the arteries as the heart beats. Suppose that the systolic blood pressure for males aged 40-49 is normally distributed with a mean of 134.7 mmHg and a standard deviation of 3.1 mmHg. Answer the following questions.
1. Plot the density curve for this distribution.
2. Between what systolic blood pressure values do the middle 95% of all males aged 40-49 lie?
3. How small are the smallest 2.5% of all blood pressures for males aged 40-49?
4. How large are the largest 2.5% of all blood pressures for males aged 40-49?
Example (cont.)
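One way to work questions 2 to 4 of the blood pressure example is with the 95% rule of thumb: the middle 95% lies within about two standard deviations of the mean, leaving about 2.5% in each tail. The sketch below is an illustration rather than the worked solution from class; it also prints the more precise cutoffs from `statistics.NormalDist` for comparison.

```python
from statistics import NormalDist

mu, sigma = 134.7, 3.1  # values given in the example (mmHg)

# Rule of thumb: the middle 95% lies between mu - 2*sigma and mu + 2*sigma.
lower_rule = mu - 2 * sigma
upper_rule = mu + 2 * sigma

# More precise cutoffs: the 2.5th and 97.5th percentiles of N(mu, sigma).
bp = NormalDist(mu, sigma)
lower_exact = bp.inv_cdf(0.025)
upper_exact = bp.inv_cdf(0.975)

print(f"rule of thumb: {lower_rule:.1f} to {upper_rule:.1f} mmHg")
print(f"percentiles:   {lower_exact:.1f} to {upper_exact:.1f} mmHg")
```

Under the rule of thumb, the smallest 2.5% of pressures fall below the lower value and the largest 2.5% fall above the upper value.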
Remarks
We use the notation N(µ, σ) to denote a normal distribution with mean µ and standard deviation σ, e.g., the distribution of blood pressures is N(134.7, 3.1).
In the last example, we answered questions about a variable using a mathematical model, not actual data (the source of the model for the data is not specified in the example).
GRE example
Suppose that the quantitative GRE scores for applicants in the psychology department are approximately normal with mean µ = 544 and standard deviation σ = 103.
What proportion of applicants have a score less than 500?
What proportion of applicants have a score larger than 700?
What proportion of applicants have a score between 500 and 700?
Evaluating proportions using a density curve
Areas under the density curve represent the relative frequency or proportion of ranges of values occurring.
For the normal distribution, we evaluate these areas using z-scores.
The z-score or standardized value of an observation x from a distribution with mean µ and standard deviation σ is
$$z = \frac{x - \mu}{\sigma}.$$
The z-score measures how many standard deviations x is away from the mean. A z-score can be positive or negative.
The standard normal distribution
µ = 0 and σ = 1 corresponds to the standard normal distribution, i.e., N(0, 1).
Key Fact: If we have a variable, X, with a N(µ, σ) distribution, then the standardized variable
$$Z = \frac{X - \mu}{\sigma}$$
has a N(0, 1) distribution.
Table A (inside cover of the textbook) tabulates the areas to the left of a value in the standard normal distribution; this is the only table we need.
Game plan:
State the problem.
Standardize by converting from N(µ, σ) to N(0, 1).
Use the table to find the area under the curve to the left of the standardized value.
Answer the question.
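The game plan can be sketched in code. Here the table lookup is replaced by a formula for the standard normal area based on the error function `math.erf`; this is a stand-in for Table A, not the textbook's method, and the function names are made up for this example.

```python
import math

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

def standard_normal_cdf(z):
    """Area to the left of z under the N(0, 1) curve (plays the role of Table A)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_left(x, mu, sigma):
    """Area to the left of x under the N(mu, sigma) curve."""
    return standard_normal_cdf(z_score(x, mu, sigma))

# Example: a value one standard deviation above the mean of N(134.7, 3.1).
x = 134.7 + 3.1
print(z_score(x, 134.7, 3.1))    # close to 1
print(area_left(x, 134.7, 3.1))  # area to the left of z = 1
```

Areas to the right are one minus the area to the left, and areas between two values are the difference of the two left areas.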
GRE example (cont.)
Let X denote the quantitative GRE scores for psychology applicants. X has a N(544, 103) distribution.
Part (a): What proportion of applicants have a score less than 500?
GRE example (cont.)
Part (b): What proportion of applicants have a score larger than 700?
GRE example (cont.)
Part (c): What proportion of applicants have a score between 500 and 700?
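Parts (a) to (c) can be checked with software in place of the table. The sketch below uses `statistics.NormalDist` with the N(544, 103) model stated in the example; answers read from a table with z rounded to two decimals may differ slightly in the last digit.

```python
from statistics import NormalDist

gre = NormalDist(mu=544, sigma=103)  # model stated in the example

# Standardized values for the two cutoffs.
z_500 = (500 - 544) / 103
z_700 = (700 - 544) / 103

# Part (a): area to the left of 500.
p_less_500 = gre.cdf(500)
# Part (b): area to the right of 700 is one minus the area to the left.
p_more_700 = 1 - gre.cdf(700)
# Part (c): area between 500 and 700 is a difference of two left areas.
p_between = gre.cdf(700) - gre.cdf(500)

print(f"z for 500: {z_500:.2f}, z for 700: {z_700:.2f}")
print(f"(a) P(X < 500)       = {p_less_500:.4f}")
print(f"(b) P(X > 700)       = {p_more_700:.4f}")
print(f"(c) P(500 < X < 700) = {p_between:.4f}")
```

The three proportions add to one, which is a quick check on the arithmetic.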
Blood pressures revisited
Suppose that the systolic blood pressure for males aged 40-49 is normally distributed with a mean of 134.7 mmHg and a standard deviation of 3.1 mmHg.
What blood pressure value will place a male aged 40-49 in the top 5%? In the top 1%?
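This question runs the game plan backwards: find the z-score with the required area to its left, then unstandardize with x = µ + zσ. The sketch below uses `NormalDist.inv_cdf` from the Python standard library in place of reading Table A in reverse; the helper name `cutoff_for_top` is made up for this example.

```python
from statistics import NormalDist

mu, sigma = 134.7, 3.1  # values given in the example (mmHg)
std_normal = NormalDist()  # N(0, 1)

def cutoff_for_top(fraction):
    """Value with the given fraction of the N(mu, sigma) distribution above it."""
    z = std_normal.inv_cdf(1 - fraction)  # z-score with area 1 - fraction to its left
    return mu + z * sigma                 # unstandardize: x = mu + z * sigma

top_5 = cutoff_for_top(0.05)
top_1 = cutoff_for_top(0.01)
print(f"top 5% starts at about {top_5:.1f} mmHg")
print(f"top 1% starts at about {top_1:.1f} mmHg")
```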
Normal quantile plots
The normal quantile plot is a method we use to determine whether a sample of observations can be modeled by a normal distribution.
The procedure:
1. Sort the data from smallest to largest.
2. Calculate the percentile of each data value (for i = 1, ..., n, the ith smallest value is taken to be at the 100(i - 0.5)/n percentile).
3. Calculate the z-score for each percentile.
4. Plot the data values on the y-axis versus the z-scores on the x-axis.
If the distribution is close to normal, the plotted points will lie close to a straight line.
We let MINITAB do the calculations.
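The procedure can also be sketched by hand in code. The example below computes the plotting positions for the 34 GRE scores, taking the ith smallest value to sit at proportion (i - 0.5)/n; it returns (z-score, data value) pairs and leaves the plotting itself to whatever graphing tool is available. The function name `normal_quantile_points` is made up for this example, and MINITAB's own plotting positions may differ.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Return (z-score, data value) pairs for a normal quantile plot."""
    ordered = sorted(data)                # step 1: sort smallest to largest
    n = len(ordered)
    std_normal = NormalDist()             # N(0, 1)
    points = []
    for i, value in enumerate(ordered, start=1):
        p = (i - 0.5) / n                 # step 2: proportion for the ith smallest value
        z = std_normal.inv_cdf(p)         # step 3: z-score with area p to its left
        points.append((z, value))         # step 4: z is horizontal, data is vertical
    return points

# GRE scores copied from the example slide (34 applicants).
scores = [
    569, 528, 698, 676, 543, 702, 436, 482, 655, 500, 536, 617,
    548, 334, 620, 567, 605, 564, 545, 647, 579, 518, 465, 744,
    645, 513, 728, 627, 584, 572, 449, 575, 399, 797,
]

points = normal_quantile_points(scores)
for z, value in points[:3] + points[-3:]:
    print(f"z = {z:6.2f}, score = {value}")
```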
MINITAB Example
Load the GRE scores data from the class website.
Select the menu command Graph > Probability Plot.
Select the Simple graph type.
In the dialog box, for Graph variables select C3.
Click Distribution: under the Data Display tab, untick Show confidence interval and click OK.
Click Scale: under the Axes and Ticks tab, select Transpose Y and X. Under tab Y-scale Type, select Score, and click OK.
Click OK again to produce the figure.
Normal quantile plot of GRE scores
Conclusions?
Normal quantile plot of the hurricane losses (from the Introduction notes)
Are the hurricane losses normally distributed?