Statistical Tests for Variable Discrimination

1 Statistical Tests for Variable Discrimination University of Trento - FBK 26 February, 2015 (UNITN-FBK) Statistical Tests for Variable Discrimination 26 February, / 31

2 General statistics
Descriptive: describing the statistical properties of samples.
Mathematical: studying probability distributions. Questions about the samples start from a known distribution. Knowing that 50% of the population reads books, what is the probability that, in a sample of 100 subjects, 70 of them read books?
Inferential: starting from the samples, what can we say about the statistical distribution? In a sample of 100 subjects, 65 of them read books. May I infer that more than 50% of the general population reads books? What is the probability of an error?
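
The two questions above can be sketched in R. This is only an illustration under the assumption that each subject reads books independently with the same probability, so that the count follows a binomial distribution; the variable names are chosen here and are not part of the slides.

```r
n  <- 100   # sample size
p0 <- 0.5   # known (or hypothesized) population proportion

# Mathematical question: probability of observing 70 readers when p = 0.5
p_exactly_70  <- dbinom(70, size = n, prob = p0)                      # exactly 70
p_at_least_70 <- pbinom(69, size = n, prob = p0, lower.tail = FALSE)  # 70 or more

# Inferential question: 65 readers observed; is p greater than 0.5?
test_65 <- binom.test(65, n, p = p0, alternative = "greater")
test_65$p.value   # probability of a result at least this extreme if p = 0.5
```

The p-value of the one-sided exact test is the upper-tail binomial probability of observing 65 or more readers when the true proportion is 0.5.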

3 Relative frequencies and Percentage
Example: given the birthwt dataset, where n = 189, what are the relative frequencies for the race variable?
Relative frequencies can be computed as $n_c / n$, where $n_c$ is the number of observations in category $c$.
    head(birthwt)
    ## low age lwt race smoke ptl ht ui ftv bwt
    table(birthwt$race)   ## Frequencies
    ## White African-American Other
    (table(birthwt$race) / nrow(birthwt)) * 100   ## Relative frequencies, in percent
    ## White African-American Other
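
An equivalent way to get relative frequencies is `prop.table()`. The sketch below uses a small made-up vector (not the birthwt data) so that the result can be checked by hand.

```r
# Hypothetical categorical variable, for illustration only
race <- factor(c("White", "Other", "White", "African-American", "White", "Other"))

counts   <- table(race)                # absolute frequencies n_c
rel_freq <- prop.table(counts)         # relative frequencies n_c / n
percent  <- round(100 * rel_freq, 1)   # the same values as percentages
percent
```

Here 3 of the 6 values are "White", so its relative frequency is 0.5.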

4 Statistical Inference
Definition: the process of using the data to draw conclusions about the whole population.
Example: let's say I want to test a hypothesis about the average normal body temperature.
1. Get the body temperature of the whole population: NOT FEASIBLE.
2. Study a sample of representative members selected from the population. Samples should be chosen randomly. Samples are assumed to be independent.
3. Try to estimate the unknown population average.
NB: The real population average remains unknown. The estimate depends on our observations. There is always some uncertainty.

5 How to choose the population? More on sampling
How do we select samples from a population?
SRS (Simple Random Sampling): the most straightforward sampling procedure. Assign a number 1, ..., N to each member of the population, then randomly extract n numbers. The chance of being selected is the same for any group of n members of the population.
SS (Stratified Sampling): the sample should be comparable to the whole population with respect to representative groups. No subgroup should be overrepresented in the observations.
CS (Cluster Sampling): start by grouping the population into clusters, sample from the clusters, then subsample some or all members of each selected cluster.
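
The three schemes can be sketched with base R's `sample()`. The population below, its strata and its clusters are entirely hypothetical, and proportional allocation (10% per stratum) is one possible choice, not something the slides prescribe.

```r
set.seed(1)

# Hypothetical population: 1000 members, 4 strata, 20 clusters of 50
pop <- data.frame(
  id      = 1:1000,
  stratum = rep(c("A", "B", "C", "D"), times = c(400, 300, 200, 100)),
  cluster = rep(1:20, each = 50)
)

# SRS: every group of n = 100 members has the same chance of selection
srs <- pop[sample(nrow(pop), 100), ]

# Stratified sampling: draw 10% of each stratum
strat_idx <- unlist(lapply(split(pop$id, pop$stratum),
                           function(ids) sample(ids, length(ids) %/% 10)))
strat <- pop[strat_idx, ]   # ids coincide with row numbers here

# Cluster sampling: pick 4 clusters at random, keep all their members
chosen <- sample(unique(pop$cluster), 4)
clus   <- pop[pop$cluster %in% chosen, ]

table(strat$stratum)   # strata keep their population proportions
```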

6 Population vs Samples
Population parameters (population of size N):
Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
Sample estimates (sample of size n):
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
$\bar{x}$ is an estimator of $\mu$ (the true population mean). In particular, $\bar{x} \to \mu$ for $n \to \infty$.
    mean(birthwt$smoke)   ## Smoking mothers mean
    var(birthwt$smoke)    ## Smoking mothers variance
    mean(birthwt$smoke) * (1 - mean(birthwt$smoke))   ## See the Bernoulli dist.
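
The last two commands above give slightly different numbers: `var()` divides by n - 1, while p(1 - p) corresponds to dividing by n. A minimal sketch on simulated 0/1 data (a stand-in for a smoking indicator, not the real dataset) makes the relation explicit.

```r
set.seed(42)
x <- rbinom(189, size = 1, prob = 0.4)   # hypothetical 0/1 indicator

n     <- length(x)
p_hat <- mean(x)               # sample proportion
s2    <- var(x)                # sample variance, divisor n - 1
bern  <- p_hat * (1 - p_hat)   # Bernoulli variance, divisor n

# For 0/1 data the two differ exactly by the factor n / (n - 1)
all.equal(s2, bern * n / (n - 1))
```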

7 Law of Large Numbers
If the sample size is large enough, the mean estimator $\hat{\mu}$ converges to the population mean.
[Figure: mean estimate $\hat{\mu}$ for n → ∞ from N(0,1); x-axis: number of extractions (logarithmic scale, up to 10^5).]
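
A plot of this kind can be reproduced with a running mean. The sketch below is one way to do it, not necessarily how the original figure was drawn; the seed and the number of draws are arbitrary.

```r
set.seed(123)
n_max <- 1e5
x <- rnorm(n_max)                          # draws from N(0, 1)

# Mean estimate after 1, 2, ..., n_max extractions
running_mean <- cumsum(x) / seq_along(x)
running_mean[c(1e2, 1e3, 1e4, 1e5)]        # approaches the true mean 0

# Uncomment to draw the figure:
# plot(running_mean, type = "l", log = "x",
#      xlab = "Number of extractions", ylab = "Mean estimate")
# abline(h = 0, lty = 2)
```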

8 Sample distributions
The probability distributions of estimators are called sampling distributions.
Assumptions: the random variable X has a normal $N(\mu, \sigma^2)$ distribution; $\sigma^2$ is known; we use $\bar{X}$ to estimate $\mu$.
What is the sampling distribution of $\bar{X}$?
Extract n samples from the population, $X_1, \dots, X_n \sim N(\mu, \sigma^2)$, with $X_1, \dots, X_n$ independent. The sum of n independent, identically distributed normal variables is itself normally distributed:
$$\sum_{i=1}^{n} X_i = X_1 + X_2 + \dots + X_n \sim N(n\mu, n\sigma^2)$$
Given the sample mean estimator $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, its mean and variance are $n\mu/n = \mu$ and $n\sigma^2/n^2 = \sigma^2/n$, so:
$$\bar{X} \sim N(\mu, \sigma^2/n)$$

9 Sample distributions II
Example: consider the random variable $X \sim N(125, 15^2)$ representing systolic blood pressure. Extract 100 samples $X_1, \dots, X_{100} \sim N(125, 15^2)$; then $\bar{X} \sim N(125, 15^2/100)$.
Estimators depend on the specific sample selected from the population. Repeating the sampling leads to different values of the estimator.
[Figure, two panels: "Theoretical Distribution" (density of x) and "Sample mean probability distribution" (density of $\bar{X}$).]

10 Hints on how to compute those plots
Draw the population density distribution:
1. Extract 100 samples from the population distribution.
2. Create the probability distribution.
3. Plot everything.
Draw the sample mean distribution:
1. Extract 100 samples from the distribution.
2. Estimate the mean of the distribution.
3. Repeat the same operation 1000 times.
4. Plot everything.
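
The hints above can be turned into a short simulation. This is a sketch under the blood pressure example's parameters (mean 125, standard deviation 15, n = 100); the seed and the plotting choices are assumptions, and the plotting lines are left commented so the block runs without a graphics device.

```r
set.seed(2015)
mu <- 125; sigma <- 15; n <- 100; reps <- 1000

# One sample of size n from the population distribution
x <- rnorm(n, mean = mu, sd = sigma)

# Sample mean of n draws, repeated reps times
xbar <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

# Empirical spread of the sample mean vs the theoretical sigma / sqrt(n)
c(mean = mean(xbar), sd = sd(xbar), theoretical_sd = sigma / sqrt(n))

# Uncomment to draw the two panels:
# par(mfrow = c(1, 2))
# curve(dnorm(x, mu, sigma), from = 80, to = 170,
#       ylab = "Density", main = "Theoretical Distribution")
# hist(xbar, freq = FALSE, xlab = "Sample mean",
#      main = "Sample mean probability distribution")
# curve(dnorm(x, mu, sigma / sqrt(n)), add = TRUE)
```

The standard deviation of the simulated means should be close to 15/10 = 1.5, much narrower than the population's 15.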

11 Confidence Intervals
Definition: a confidence interval describes how the estimator would vary if different members of the population were selected.
Example: consider the systolic blood pressure example. We know the sample mean distribution is $\bar{X} \sim N(\mu, \sigma^2/n)$, whose standard deviation is $\sigma/\sqrt{n} = 15/\sqrt{100} = 1.5$.
Since the 68-95-99.7% rule applies, with probability 0.95:
$$\mu - 3 \le \bar{X} \le \mu + 3$$
We want to estimate the true population mean $\mu$. Rearranging, with probability 0.95:
$$\bar{X} - 3 \le \mu \le \bar{X} + 3$$
that is, $\mu$ falls within $[\bar{X} - 3, \bar{X} + 3]$.
We could repeatedly draw samples of size n, find the sample mean and determine the interval each time. In reality we have only one sample, so we state that the true $\mu$ lies, with 0.95 confidence, in $[\bar{x} - 3, \bar{x} + 3]$.
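
The "repeated sampling" reading of the interval can be checked by simulation. The sketch below assumes the same population as the example (mean 125, standard deviation 15, n = 100) and counts how often the interval of half-width 3 contains the true mean; the number of repetitions and the seed are arbitrary.

```r
set.seed(11)
mu <- 125; sigma <- 15; n <- 100

half_width <- 2 * sigma / sqrt(n)   # two standard errors, i.e. 3

# Sample means from 5000 independent samples of size n
xbar <- replicate(5000, mean(rnorm(n, mu, sigma)))

# Does [xbar - 3, xbar + 3] contain the true mean?
covered  <- (xbar - half_width <= mu) & (mu <= xbar + half_width)
coverage <- mean(covered)
coverage   # close to 0.95
```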

12 Confidence intervals for the Population Proportion
Suppose we want to find the 95% CI for the population proportion of mothers who smoke during pregnancy, using the birthwt dataset.
The sample proportion is $p = \bar{x} = 0.39$:
    sum(birthwt$smoke) / 189
Estimate the variance, $s^2 = p(1 - p) = 0.24$:
    s <- (sum(birthwt$smoke) / 189) * (1 - sum(birthwt$smoke) / 189)
The Standard Error (SE) for the sample mean is $\sigma/\sqrt{n} = \sqrt{p(1 - p)/n} \approx 0.036$:
    SE <- sqrt(s / 189)
The 95% CI is $[p - z_{crit} \times SE, \; p + z_{crit} \times SE]$. With $z_{crit} = 1.96$: $[0.39 - 1.96 \times 0.036, \; 0.39 + 1.96 \times 0.036] = [0.32, 0.46]$.
Therefore we can define the Margin of Error as $e = z_{crit} \times \sigma/\sqrt{n}$.

13 The 68-95-99.7% rule
The 68-95-99.7% rule for normally distributed values:
68% of values fall within 1 standard deviation of the mean: $P(\mu - \sigma < X \le \mu + \sigma) = 0.68$
95% of values fall within 2 standard deviations of the mean: $P(\mu - 2\sigma < X \le \mu + 2\sigma) = 0.95$
99.7% of values fall within 3 standard deviations of the mean: $P(\mu - 3\sigma < X \le \mu + 3\sigma) = 0.997$
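
The percentages in the rule are rounded. The exact probabilities follow from the standard normal CDF, as in this short sketch; the reverse lookup with `qnorm()` is included to show where a multiplier such as 1.96 comes from.

```r
k <- 1:3

# P(mu - k*sigma < X <= mu + k*sigma) for a normal variable
prob_within <- pnorm(k) - pnorm(-k)
round(prob_within, 4)   # 0.6827 0.9545 0.9973

# Reverse direction: multiplier that leaves 5% outside the interval
z_95 <- qnorm(1 - 0.05 / 2)
z_95                    # about 1.96, which the rule rounds to 2
```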

14 Check the 68-95-99.7% rule with R
For a sufficient number of samples we can estimate the typical ranges:
    n <- 10000           # assumed value: the sample size used on the slide is not legible
    mynorm <- rnorm(n)   # Extract n samples from N(0,1)
    sum(mynorm > mean(mynorm) - sd(mynorm) & mynorm <= mean(mynorm) + sd(mynorm)) / n
    sum(mynorm > mean(mynorm) - 2*sd(mynorm) & mynorm <= mean(mynorm) + 2*sd(mynorm)) / n
    sum(mynorm > mean(mynorm) - 3*sd(mynorm) & mynorm <= mean(mynorm) + 3*sd(mynorm)) / n

15 What the rule looks like
[Figure, two panels: "68% Interval" (density of x, area between $-\sigma$ and $+\sigma$) and "95% Interval" (density of x, area between $-2\sigma$ and $+2\sigma$).]

16 Exercises
Recall the 68-95-99.7% rule and find the multiplier for the confidence intervals at 70, 80 and 90% for a normally distributed variable.
We assume that the probability distribution of blood pressure is $X \sim N(\mu, \sigma^2)$. Suppose we know that $\sigma = 6$. To estimate $\mu$, we randomly selected 9 people and measured their blood pressure. The sample mean is $\bar{x} =$ [value missing].
1. Write down the sampling distribution of the sample mean $\bar{X}$ and find its standard deviation.
2. Find the 75% CI estimation for $\mu$.

17 Case-Control study
Example: we want to study the effect of smoking on lung cancer.
Retrospective: select a group of patients with lung cancer and survey them to determine whether they have smoked in the past.
Prospective: select a group of smokers and observe them over time without influencing the natural process.
To draw reasonable conclusions we need to compare the patients in the study with subjects without lung cancer who have the same habits and are similar in all other aspects.
Compare cases (lung cancer patients) with controls (no lung cancer).
Individuals in the case group should not be related to those in the control group.

18 Hypothesis Testing Assumptions
Idea: starting with a hypothesis, we want to test whether it is true.
In the Body Temperature dataset the hypothesis is that, on average, the body temperature is less than 98.6 degrees F. The statement can be expressed as $\mu < 98.6$.
We can now create a hypothesis which invalidates the previous one, $\mu = 98.6$. This is called the null hypothesis $H_0$. The null hypothesis reflects the "nothing of interest" situation.
We can define the alternative hypothesis, denoted $H_A$ or $H_1$, which is what we want to investigate.
The procedure of evaluating the hypotheses is called hypothesis testing: examine the evidence the data provide against the null hypothesis. If the evidence is strong, we reject $H_0$.

19 Testing the mean
In particular we want to test: $\bar{X} \mid H_0 \sim N(\mu, \sigma^2/n)$.
Example, from the body temperature dataset: we have $H_0: \mu = 98.6$ and $H_A: \mu < 98.6$. Select 25 healthy patients, with $\sigma^2 = 1$; thus $\bar{X} \mid H_0 \sim N(98.6, 1/25)$.
From the 25 samples we have only one $\bar{x}$. Suppose $\bar{x} = 98.4$. We want to evaluate the lower-tail probability for $\bar{x} = 98.4$.
The observed significance level is the p-value, defined as $p_{obs} = P(\bar{X} \le \bar{x} \mid H_0)$.

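20 Visualizing the hypothesis testing
The probability $p_{obs} = P(\bar{X} \le \bar{x} \mid H_0)$ is the area under the density of $\bar{X}$ to the left of the observed $\bar{x}$.
[Figure: density of $\bar{X}$ under $H_0$, with the lower tail $p_{obs}$ marked up to $\bar{x}$.]
$p_{obs} = P(\bar{X} \le 98.4)$
    m <- 98.6; s <- sqrt(1/25)      ## Mean and standard deviation of the sample mean under H0
    pnorm(98.4, mean = m, sd = s)   ## Compute the above probability
    ## [1] 0.1586553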

21 Hypothesis testing: One-sided vs Two-sided
One-sided test: $H_0: \mu = \mu_0$ against $H_1: \mu < \mu_0$. The departure from the mean is in one direction.
Example with body temperature: $H_0: \mu = 98.6$ and $H_1: \mu < 98.6$.
Computing: $p_{obs} = P(Z \le z)$, where $Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$ and $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
Two-sided test: we might be indifferent to the direction, thus $H_0: \mu = \mu_0$ and $H_1: \mu \ne \mu_0$.
Example with body temperature: $H_0: \mu = 98.6$ and $H_1: \mu \ne 98.6$.
Computing: $p_{obs} = P(Z \le -|z|) + P(Z \ge |z|) = 2\,P(Z \ge |z|)$.
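
Both p-values can be computed with a few lines of R. The function `z_test` below is a hypothetical helper written for this illustration (base R does not ship a z-test), and it assumes that the population standard deviation is known.

```r
# Hypothetical helper, not part of base R: z-test with known sigma
z_test <- function(xbar, mu0, sigma, n,
                   alternative = c("less", "greater", "two.sided")) {
  alternative <- match.arg(alternative)
  z <- (xbar - mu0) / (sigma / sqrt(n))   # standardized sample mean
  p <- switch(alternative,
              less      = pnorm(z),                      # P(Z <= z)
              greater   = pnorm(z, lower.tail = FALSE),  # P(Z >= z)
              two.sided = 2 * pnorm(abs(z), lower.tail = FALSE))
  list(z = z, p.value = p)
}

# Body temperature example: xbar = 98.4, mu0 = 98.6, sigma = 1, n = 25
one_sided <- z_test(98.4, 98.6, sigma = 1, n = 25, alternative = "less")
two_sided <- z_test(98.4, 98.6, sigma = 1, n = 25, alternative = "two.sided")
c(z = one_sided$z, one_sided = one_sided$p.value, two_sided = two_sided$p.value)
```

With z = -1 the one-sided p-value is about 0.159 and the two-sided p-value is twice that.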

22 Two-sided hypothesis tests
Distribution of the standardized variable Z, with z = 1.
[Figure: "Z distribution", density of Z with the two tails beyond $\pm z$ marked as $p_{obs}$.]

23 Hypothesis Testing
Aim: answering questions about the distribution of a variable in the general population, starting from the samples collected.
From populations A and B we observe the sample averages $m_A$ and $m_B$.
Hypothesis: the means $\mu_A$ and $\mu_B$ of populations A and B, respectively, are equal ($H_0$, the null hypothesis).
Alternatively, and of more interest: $\mu_A \ne \mu_B$ ($H_1$ = not $H_0$).
Result: whether to accept or reject $H_0$, minimizing the type I error.

24 Hypothesis testing: T-test
Example, T-test:
1. Assumptions: observations are independent; observations come from Gaussian variables with means $\mu_a$ and $\mu_b$ and standard deviations $\sigma_a$ and $\sigma_b$; $\sigma_a = \sigma_b$.
2. Null hypothesis $H_0$: $\mu_a = \mu_b$.
3. Compute the T variable:
$$y = \frac{m_a - m_b}{s_p \sqrt{\frac{1}{n_a} + \frac{1}{n_b}}}, \qquad s_p = \sqrt{\frac{(n_a - 1)s_a^2 + (n_b - 1)s_b^2}{n_a + n_b - 2}}$$
Under $H_0$, $y$ follows a t distribution with $n_a + n_b - 2$ degrees of freedom.
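
The pooled statistic can be computed by hand and compared with `t.test(..., var.equal = TRUE)`. The two groups below are simulated, hypothetical data; only the formulas come from the slide.

```r
set.seed(24)
a <- rnorm(30, mean = 98.4, sd = 0.7)   # hypothetical group A
b <- rnorm(25, mean = 98.1, sd = 0.7)   # hypothetical group B
na <- length(a); nb <- length(b)

# Pooled standard deviation and T variable, as in the formulas above
sp       <- sqrt(((na - 1) * var(a) + (nb - 1) * var(b)) / (na + nb - 2))
t_manual <- (mean(a) - mean(b)) / (sp * sqrt(1 / na + 1 / nb))

# Two-sided p-value from the t distribution with na + nb - 2 df
p_manual <- 2 * pt(abs(t_manual), df = na + nb - 2, lower.tail = FALSE)

# Built-in equal-variance two-sample t-test for comparison
builtin <- t.test(a, b, var.equal = TRUE)
c(manual = t_manual, builtin = unname(builtin$statistic))
```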

25 Examples in R
One-sided: using the Pima.tr dataset to test $H_0: \mu = 30$ and $H_1: \mu > 30$.
    t.test(Pima.tr$bmi, alternative = "greater", mu = 30, conf.level = 0.95)
    ## One Sample t-test
    ## data: Pima.tr$bmi
    ## t = , df = 199, p-value = 1.331e-07
    ## alternative hypothesis: true mean is greater than 30
    ## 95 percent confidence interval:
    ##  Inf
    ## sample estimates:
    ## mean of x
Two-sided, two-sample: use the BodyTemperature dataset to test whether there are differences in body temperature between genders.
    t.test(Temperature ~ Gender, data = bt, var.equal = TRUE)
    ## Two Sample t-test
    ## data: Temperature by Gender
    ## t = , df = 98, p-value =
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ## sample estimates:
    ## mean in group F mean in group M

26 Paired t-test
Until now we assumed that the variables in the two groups are independent. What if the variables are dependent? Is the t-test still valid?
Example: test the effect of a diet on blood pressure. A subject can have a lower blood pressure before starting the experiment. There can be differences due to the age of the subjects. How can we avoid the effect of these issues?
A possible solution is to assign the same subjects to each diet group. Each subject follows the prescribed diet, and we measure the blood pressure; then they are asked to follow another diet for six months, and we measure the blood pressure again.
NB: Individuals in the two groups are paired.

27 Paired t-test Examples
Example: to show the use of the paired version of t.test we use the study on the effect of tobacco smoke on platelet function by Levine.
Hypothesis: the higher frequency of arterial thrombosis in cigarette smokers could be partially explained by increased platelet aggregation caused by smoking.
Study: in a group of eleven people, he measured the platelet aggregation before and after smoking a cigarette.
Testing: test the difference in platelet aggregation, $H_0: \mu = 0$ and $H_1: \mu < 0$.
    t.test(pt$before, pt$after, paired = TRUE)
    ## Paired t-test
    ## data: pt$before and pt$after
    ## t = , df = 10, p-value =
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ## sample estimates:
    ## mean of the differences
As written, the call runs the two-sided test; adding alternative = "less" tests the one-sided $H_1$ stated above.
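
A paired t-test is the same as a one-sample t-test on the within-subject differences. The sketch below shows this on simulated measurements for eleven subjects; the numbers are hypothetical and are not Levine's data.

```r
set.seed(27)
before <- rnorm(11, mean = 42, sd = 15)           # hypothetical measurements
after  <- before + rnorm(11, mean = 10, sd = 8)   # hypothetical paired follow-up

# Paired test of H1: mean(before - after) < 0
paired <- t.test(before, after, paired = TRUE, alternative = "less")

# Equivalent one-sample test on the differences
one_samp <- t.test(before - after, mu = 0, alternative = "less")

c(paired = paired$p.value, one_sample = one_samp$p.value)
```

Both calls use 11 - 1 = 10 degrees of freedom, matching the df shown in the output above.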

28 Testing for normality
Everything we have seen so far assumes that the variables are normally distributed. How do we check this?
1: Visual inspection
Test normality for Body Mass Index:
    qqnorm(Pima.tr$bmi)
[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles.]
Test normality for Age:
    qqnorm(Pima.tr$age)
[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles.]

29 Testing for normality
2: Normality tests
The Shapiro-Wilk test checks normality: it evaluates the null hypothesis that the distribution of a random variable is normal.
Test normality for Body Mass Index:
    shapiro.test(Pima.tr$bmi)
    ## Shapiro-Wilk normality test
    ## data: Pima.tr$bmi
    ## W = 0.991, p-value =
Test normality for Age:
    shapiro.test(Pima.tr$age)
    ## Shapiro-Wilk normality test
    ## data: Pima.tr$age
    ## W = , p-value = 1.853e-12
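
To see how the test reacts to a clear departure from normality, one can compare a simulated normal sample with a right-skewed one. Both samples below are hypothetical stand-ins (they are not the Pima.tr variables), chosen only to contrast a bell-shaped and a skewed variable.

```r
set.seed(29)
x_norm <- rnorm(200, mean = 32, sd = 6)   # hypothetical, bell-shaped
x_skew <- rexp(200, rate = 1 / 10) + 21   # hypothetical, right-skewed

sw_norm <- shapiro.test(x_norm)
sw_skew <- shapiro.test(x_skew)

# A small p-value is evidence against the null hypothesis of normality
c(normal = sw_norm$p.value, skewed = sw_skew$p.value)

# Visual check for the skewed sample (uncomment to draw):
# qqnorm(x_skew); qqline(x_skew)
```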

30 Testing for Homoscedasticity
Null hypothesis $H_0$: the variances of the groups are equal.
Parametric tests:
Bartlett test: bartlett.test(x, y)
Levene test (from the car library): leveneTest(y ~ x)
Non-parametric tests:
Fligner-Killeen test: fligner.test(y ~ x)
    bartlett.test(bt$temperature, bt$gender)
    ## Bartlett test of homogeneity of variances
    ## data: bt$temperature and bt$gender
    ## Bartlett's K-squared = 2.189, df = 1, p-value =
[Figure: density plot, N = 51.]
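
The base R tests listed above can be tried on simulated data where the variances are known to differ. The data frame below is hypothetical; `leveneTest()` is left out because it needs the car package.

```r
set.seed(30)

# Hypothetical data: two groups with standard deviations 1 and 3
d <- data.frame(
  y = c(rnorm(50, sd = 1), rnorm(50, sd = 3)),
  g = factor(rep(c("F", "M"), each = 50))
)

bt_res <- bartlett.test(y ~ g, data = d)   # parametric, assumes normality
fk_res <- fligner.test(y ~ g, data = d)    # rank-based, robust to non-normality

# Small p-values: reject the hypothesis of equal variances
c(bartlett = bt_res$p.value, fligner = fk_res$p.value)
```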

31 Exercise I
1. We assume that the probability distribution of blood pressure is $X \sim N(\mu, \sigma^2)$. Suppose that we did not know $\sigma$ and estimated it using the sample standard deviation s = 6.
   1. Find the standard error for the sample mean as the estimator of the population mean.
   2. Find the 80% CI estimation for $\mu$ based on this sample.
2. Given a distribution with 20 degrees of freedom, compute the confidence interval at 0.99, 0.95 and 0.90 probability.
3. Using the bodytemperature dataset, find the point estimate and the 78% confidence interval estimate for the population means of heart rate and normal body temperature.
4. Suppose that we interviewed a random sample of 2000 people and found that 320 of them smoke regularly. Find the 90% confidence interval for the population proportion of smokers.
5. With the Pima.tr dataset, suppose a BMI greater than 30 denotes obesity. We know obesity and diabetes are related. Suppose the sample size is n = 100 and $\sigma^2 = 6^2$. How can you test if this population is obese? Write the formulas and test it using R.
6. Use the Pima.tr dataset to find the difference between the sample means of diastolic blood pressure for diabetic and nondiabetic Pima Indian women. Is the difference between the means of diastolic blood pressure statistically significant at the 0.01 level?
7. Answer the above question for the number of pregnancies and BMI.
8. Use the birthwt dataset to examine the relationship between hypertension history (ht) and the risk of having a low-birthweight baby (low).
9. Use the birthwt dataset to examine the effect of smoking on birth weight. Is there any significant difference? What is the p-value?
