Statistical Tests for Variable Discrimination

1 Statistical Tests for Variable Discrimination University of Trento - FBK 26 February, 2015 (UNITN-FBK) Statistical Tests for Variable Discrimination 26 February, / 31

2 General statistics
Descriptive: describing the statistical properties of samples.
Mathematical: studying probability distributions, i.e. asking questions about samples starting from a known distribution.
  Knowing that 50% of the population reads books, what is the probability that in a sample of 100 subjects 70 of them read books?
Inferential: starting from the samples, what can we say about the statistical distribution?
  In a sample of 100 subjects, 65 of them read books. May I infer that more than 50% of the general population reads books? What is the probability of an error?
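The two book-reading questions can be sketched in R as follows. This is a minimal illustration, assuming the number of readers in the sample follows a binomial distribution; the variable names are illustrative and the choice of an exact binomial test is an assumption, not something the slides prescribe.

```r
# Mathematical question: the population proportion is known (p0 = 0.5).
n  <- 100
p0 <- 0.5
p_exactly_70  <- dbinom(70, size = n, prob = p0)                      # P(X = 70)
p_at_least_70 <- pbinom(69, size = n, prob = p0, lower.tail = FALSE)  # P(X >= 70)

# Inferential question: 65 readers observed; is the true proportion above 0.5?
test_65 <- binom.test(65, n, p = p0, alternative = "greater")

print(c(p_exactly_70 = p_exactly_70, p_at_least_70 = p_at_least_70))
print(test_65$p.value)  # probability of a result at least this extreme if p = 0.5
```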

3 Relative frequencies and percentages
Example: given the birthwt dataset, where n = 189, what are the relative frequencies for the race variable?
Relative frequencies can be computed as $n_c / n$, where $n_c$ is the number of observations in category c.
    head(birthwt)
    ## low age lwt race smoke ptl ht ui ftv bwt
    table(birthwt$race)  ## Frequencies
    ## White African-American Other
    (table(birthwt$race) / nrow(birthwt)) * 100  # Relative frequencies (as percentages)
    ## White African-American Other
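A minimal sketch of the same computation, assuming the birthwt data shipped with the MASS package, where race is coded 1, 2, 3. The recoding into the labels used on the slide is an assumption about how the lecturer prepared the data.

```r
library(MASS)  # provides the birthwt data frame

# Recode the numeric race codes into labelled categories (assumed mapping).
race <- factor(birthwt$race, levels = 1:3,
               labels = c("White", "African-American", "Other"))

counts   <- table(race)           # absolute frequencies n_c
rel_freq <- prop.table(counts)    # relative frequencies n_c / n
percent  <- round(100 * rel_freq, 1)

print(counts)
print(percent)
```

`prop.table()` divides each count by the total, so it is equivalent to `table(race) / nrow(birthwt)`.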

4 Statistical Inference
Definition: the process of using the data to draw conclusions about the whole population.
Example: let's say I want to test a hypothesis about the average normal body temperature.
1. Getting the body temperature of the whole population is NOT FEASIBLE.
2. Study a sample of representative members selected from the population.
   Samples should be chosen randomly.
   Samples are assumed to be independent.
3. Try to estimate the unknown population average.
NB: The real population average remains unknown. The estimate depends on our observations, so there is always some uncertainty.

5 How to choose the population? More on sampling
How do we select samples from a population?
SRS, Simple Random Sampling: the most straightforward sampling procedure.
  Give a number 1...N to each member of the population.
  Extract n numbers at random.
  The chance of being selected is the same for any group of n members of the population.
SS, Stratified Sampling: the sample should be comparable to the whole population with respect to representative groups.
  No subgroup should be overrepresented in the observations.
CS, Cluster Sampling:
  Start by grouping the population into clusters.
  Sample from the clusters.
  Subsample some or all members of each selected cluster.
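The three sampling schemes can be sketched on a toy population. Everything here (the population size, the four strata, the fifty clusters) is made up for illustration only.

```r
set.seed(1)

# Hypothetical population: 1000 members, 4 strata, 50 clusters of 20 members.
N <- 1000
population <- data.frame(
  id      = 1:N,
  stratum = rep(c("A", "B", "C", "D"), each = 250),
  cluster = rep(1:50, each = 20)
)
n <- 100

# Simple random sampling: every group of n members is equally likely.
srs <- population[sample(N, n), ]

# Stratified sampling: draw the same number of members from each stratum.
strat_ids <- unlist(lapply(split(population$id, population$stratum),
                           function(ids) sample(ids, 25)))
strat <- population[strat_ids, ]

# Cluster sampling: pick 5 clusters at random, then keep all their members.
chosen_clusters <- sample(unique(population$cluster), 5)
clus <- population[population$cluster %in% chosen_clusters, ]

print(table(srs$stratum))    # strata are represented only approximately
print(table(strat$stratum))  # strata are represented exactly
```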

6 Population vs Samples
Population parameters and their sample estimates:
Population (size N)
  Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
  Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
Sample (size n)
  Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
$\bar{x}$ is an estimator of $\mu$ (the true population mean). In particular $\bar{x} \to \mu$ for $n \to \infty$.
    mean(birthwt$smoke)  ## Smoking mothers mean
    var(birthwt$smoke)   ## Smoking mothers variance
    mean(birthwt$smoke) * (1 - mean(birthwt$smoke))  ## See the Bernoulli distribution
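The last two R lines on this slide give close but not identical numbers, because `var()` divides by n - 1 while the Bernoulli formula p(1 - p) corresponds to dividing by n. A small sketch on simulated 0/1 data (the data are simulated, not the birthwt values):

```r
set.seed(42)

# Simulated binary variable, e.g. 1 = smoker, 0 = non-smoker (illustrative only).
x <- rbinom(189, size = 1, prob = 0.4)
n <- length(x)

p_hat     <- mean(x)              # sample mean = sample proportion
s2        <- var(x)               # sample variance, denominator n - 1
bernoulli <- p_hat * (1 - p_hat)  # plug-in Bernoulli variance, denominator n

# The two estimates differ exactly by the factor n / (n - 1).
print(c(s2 = s2, bernoulli = bernoulli, rescaled = bernoulli * n / (n - 1)))
```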

7 Law of Large Numbers
If the sample size is large enough, the mean estimator converges to the population mean.
[Figure: mean estimation for n → Inf from N(0,1); x-axis: number of extractions (log scale, up to 10^5), y-axis: estimated mean $\hat{\mu}$]
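A sketch of how such a convergence plot could be produced, assuming draws from N(0,1) and sample sizes from 10 to 10^5. The seed and the exact sizes are arbitrary choices.

```r
set.seed(123)

sizes <- 10^(1:5)             # 10, 100, ..., 100000 extractions
draws <- rnorm(max(sizes))    # one long sequence from N(0, 1)

# Mean of the first n draws, for each n: this should approach the true mean 0.
mean_hat <- sapply(sizes, function(n) mean(draws[1:n]))

print(data.frame(n = sizes, mean_hat = mean_hat))

# Only draw the plot in an interactive session.
if (interactive()) {
  plot(sizes, mean_hat, log = "x", type = "b",
       xlab = "Number of extractions", ylab = "Estimated mean")
  abline(h = 0, lty = 2)      # true population mean
}
```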

8 Sample distributions
Probability distributions of estimators are called sampling distributions.
Assumptions:
  The random variable X has a normal $N(\mu, \sigma^2)$ distribution.
  $\sigma^2$ is known.
  We use $\bar{X}$ to estimate $\mu$.
What is the sampling distribution of $\bar{X}$?
Extract n samples from the population: $X_1, \dots, X_n \sim N(\mu, \sigma^2)$, with $X_1, \dots, X_n$ independent.
The sum of n independent, identically distributed normal variables is itself normally distributed:
  $X_1 + X_2 + \dots + X_n = \sum_{i=1}^{n} X_i \sim N(n\mu, n\sigma^2)$
Given the sample mean estimator $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, its mean and variance are $n\mu/n = \mu$ and $n\sigma^2/n^2 = \sigma^2/n$, so
  $\bar{X} \sim N(\mu, \sigma^2/n)$

9 Sample distributions II
Example: consider the random variable $X \sim N(125, 15^2)$ representing the systolic blood pressure.
Extract 100 samples: $X_1, \dots, X_{100} \sim N(125, 15^2)$ and $\bar{X} \sim N(125, 15^2/100)$.
Estimators depend on the specific sample selected from the population.
Repeating the sampling leads to different values of the estimator.
[Figure, two panels: "Theoretical Distribution" (density of x) and "Sample mean probability distribution" (density of $\bar{X}$)]

10 Hints on how to compute those plots
Draw the population density distribution:
  1. Extract 100 samples from the population distribution.
  2. Create the probability distribution.
  3. Plot everything.
Draw the sample mean distribution:
  1. Extract 100 samples from the distribution.
  2. Estimate the mean of the distribution.
  3. Repeat the same operation 1000 times.
  4. Plot everything.
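The hints above could be turned into R roughly as follows. This is a sketch assuming the systolic blood pressure population N(125, 15^2) from the previous slide; the seed, the histogram styling and the variable names are arbitrary.

```r
set.seed(2015)

mu <- 125; sigma <- 15   # population parameters from the blood pressure example
n <- 100                 # samples per extraction
reps <- 1000             # number of repetitions

# Population distribution: one sample of 100 values.
one_sample <- rnorm(n, mean = mu, sd = sigma)

# Sample mean distribution: repeat "extract 100 samples, take the mean" 1000 times.
sample_means <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

print(c(mean_of_means  = mean(sample_means),
        sd_of_means    = sd(sample_means),
        theoretical_sd = sigma / sqrt(n)))

# Plot both panels with the theoretical densities overlaid (interactive use only).
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  hist(one_sample, freq = FALSE, main = "Theoretical distribution", xlab = "x")
  curve(dnorm(x, mu, sigma), add = TRUE)
  hist(sample_means, freq = FALSE, main = "Sample mean distribution",
       xlab = "sample mean")
  curve(dnorm(x, mu, sigma / sqrt(n)), add = TRUE)
  par(op)
}
```

The standard deviation of the simulated means should be close to 15/sqrt(100) = 1.5, far smaller than the population standard deviation of 15.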

11 Confidence Intervals
Definition: a confidence interval describes how the estimator would vary if different members of the population were selected.
Example: consider the systolic blood pressure example.
We know the sample mean distribution is $\bar{X} \sim N(\mu, \sigma^2/n)$.
Since the 68-95-99.7% rule applies, with probability 0.95: $\mu - 3 \le \bar{X} \le \mu + 3$ (here $2\sigma/\sqrt{n} = 2 \cdot 15/\sqrt{100} = 3$).
We want to estimate the true population $\mu$: with probability 0.95, $\bar{X} - 3 \le \mu \le \bar{X} + 3$, i.e. $\mu$ falls within $[\bar{X} - 3, \bar{X} + 3]$.
We could repeatedly sample n subjects, find the sample mean and determine the interval.
In reality we have only one sample, so with 0.95 confidence the true $\mu$ is in $[\bar{x} - 3, \bar{x} + 3]$.
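The "repeatedly sample and determine the interval" idea can be checked by simulation. A sketch assuming the same N(125, 15^2) population and n = 100; the number of repetitions is an arbitrary choice.

```r
set.seed(7)

mu <- 125; sigma <- 15; n <- 100
reps <- 2000
half_width <- 2 * sigma / sqrt(n)   # = 3, from the 2-standard-deviation rule

# For each repetition, build [xbar - 3, xbar + 3] and check whether it contains mu.
covered <- replicate(reps, {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  (xbar - half_width <= mu) && (mu <= xbar + half_width)
})

coverage <- mean(covered)
print(coverage)   # fraction of intervals that contain the true mean
```

The observed coverage should be close to 0.95, which is what the 0.95 attached to a single interval refers to.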

12 Confidence intervals for the Population Proportion
Suppose we want to find the 95% CI for the population proportion of mothers who smoke during pregnancy, using the birthwt dataset.
The sample proportion is $\bar{x} = p = 0.39$:
    sum(birthwt$smoke)/189
Estimate the variance, $s^2 = p(1-p) = 0.24$:
    s <- (sum(birthwt$smoke)/189) * (1 - sum(birthwt$smoke)/189)
The Standard Error (SE) of the sample mean is $\sigma/\sqrt{n} = \sqrt{p(1-p)/n} \approx 0.036$:
    SE <- sqrt(s/189)
The 95% CI is $[p - z_{crit} \cdot SE,\; p + z_{crit} \cdot SE]$, which with $z_{crit} \approx 1.96$ gives approximately [0.32, 0.46].
Therefore we can define the Margin of Error as $e = z_{crit} \cdot \sigma/\sqrt{n}$.
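The whole computation can be written compactly. A sketch assuming the birthwt data from the MASS package and a normal approximation with the critical value taken from `qnorm()`; the variable names are illustrative.

```r
library(MASS)  # provides the birthwt data frame

n      <- nrow(birthwt)
p_hat  <- mean(birthwt$smoke)              # sample proportion of smoking mothers
se     <- sqrt(p_hat * (1 - p_hat) / n)    # standard error of the proportion
z_crit <- qnorm(0.975)                     # two-sided 95% critical value, about 1.96
margin <- z_crit * se                      # margin of error

ci <- c(lower = p_hat - margin, upper = p_hat + margin)
print(round(c(p_hat = p_hat, se = se, margin = margin), 3))
print(round(ci, 2))
```

For another confidence level, only the argument of `qnorm()` changes: for example `qnorm(0.95)` for a 90% interval.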

13 The 68-95-99.7% rule
The 68-95-99.7% rule for normally distributed values:
  68% of values fall within 1 standard deviation of the mean: $P(\mu - \sigma < X \le \mu + \sigma) \approx 0.68$
  95% of values fall within 2 standard deviations of the mean: $P(\mu - 2\sigma < X \le \mu + 2\sigma) \approx 0.95$
  99.7% of values fall within 3 standard deviations of the mean: $P(\mu - 3\sigma < X \le \mu + 3\sigma) \approx 0.997$

14 Check the 68-95-99.7% rule with R
For a sufficient number of samples we can estimate the typical ranges:
    n <- ...             # sample size
    mynorm <- rnorm(n)   # Extract n samples from N(0,1)
    sum(mynorm > mean(mynorm) - sd(mynorm) & mynorm <= mean(mynorm) + sd(mynorm)) / n
    sum(mynorm > mean(mynorm) - 2*sd(mynorm) & mynorm <= mean(mynorm) + 2*sd(mynorm)) / n
    sum(mynorm > mean(mynorm) - 3*sd(mynorm) & mynorm <= mean(mynorm) + 3*sd(mynorm)) / n
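A runnable variant of the check above, with the exact normal probabilities next to the empirical fractions. The sample size of 100000 and the helper name `within_k` are assumptions made for this sketch.

```r
set.seed(14)

n <- 1e5          # assumed sample size for this sketch
x <- rnorm(n)     # n samples from N(0, 1)

# Fraction of values within k sample standard deviations of the sample mean.
within_k <- function(x, k) {
  mean(x > mean(x) - k * sd(x) & x <= mean(x) + k * sd(x))
}

empirical   <- sapply(1:3, function(k) within_k(x, k))
theoretical <- pnorm(1:3) - pnorm(-(1:3))   # exact normal probabilities

print(round(rbind(empirical, theoretical), 4))
```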

15 What the rule looks like
[Figure, two panels: normal density with the 68% interval ($-\sigma$ to $+\sigma$) and with the 95% interval ($-2\sigma$ to $+2\sigma$) marked; x-axis: x, y-axis: density]

16 Exercises
Recall the 68-95-99.7% rule and find the multiplier for the confidence intervals at 70, 80, 90% for a normally distributed variable.
We assume that the probability distribution of blood pressure is $X \sim N(\mu, \sigma^2)$. Suppose we know that $\sigma = 6$. To estimate $\mu$, we randomly selected 9 people and measured their blood pressure. The sample mean is $\bar{x}$ = ...
1. Write down the sampling distribution of the sample mean $\bar{X}$ and find its standard deviation.
2. Find the 75% CI estimate for $\mu$.

17 Case-Control study
Example: we want to study the effect of smoking on lung cancer.
Retrospective: select a group of patients with lung cancer and survey them to determine whether they have smoked in the past.
Prospective: select a group of smokers and observe them over time without influencing the natural process.
To draw reasonable conclusions we need to compare the patients in the study with subjects with the same habits, without lung cancer, who are similar in all other aspects.
Compare cases (lung cancer patients) with controls (no lung cancer).
Individuals in the case group should not be related to those in the control group.

18 Hypothesis Testing
Assumptions
Idea: starting with a hypothesis, we want to test whether it is true.
In the Body Temperature dataset the hypothesis is that, on average, the body temperature is less than 98.6 degrees F.
The statement can be expressed as $\mu < 98.6$.
We can now create a hypothesis which negates the previous one: $\mu \ge 98.6$.
This is called the null hypothesis $H_0$. The null hypothesis reflects the "nothing of interest" situation.
We can define the alternative hypothesis, denoted $H_A$ or $H_1$, which is what we want to investigate.
The procedure of evaluating the hypotheses is called hypothesis testing.
Examine the evidence the data provides against the null hypothesis. If the evidence is strong, we reject $H_0$.

19 Testing the mean
In particular we want to test: $\bar{X} \mid H_0 \sim N(\mu_0, \sigma^2/n)$
Example, from the body temperature dataset:
We have $H_0: \mu = 98.6$ and $H_A: \mu < 98.6$.
Select 25 healthy patients and assume $\sigma^2 = 1$, thus $\bar{X} \mid H_0 \sim N(98.6, 1/25)$.
From the 25 samples we have only one $\bar{x}$. Suppose $\bar{x} = 98.4$.
We want to evaluate the lower tail probability for $\bar{x} = 98.4$.
The observed significance level is the p-value, defined as: $p_{obs} = P(\bar{X} \le \bar{x} \mid H_0)$

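20 Visualizing the hypothesis testing
See the probability $p_{obs}$ and the observed $\bar{x}$: $p_{obs} = P(\bar{X} \le \bar{x} \mid H_0)$
[Figure: density of $\bar{X}$ under $H_0$, with the lower-tail area $p_{obs}$ to the left of $\bar{x}$]
$p_{obs} = P(\bar{X} \le 98.4)$
    m <- 98.6; s <- sqrt(1/25)      # mean and standard deviation of the sample mean under H0
    pnorm(98.4, mean = m, sd = s)   ## Compute the above probability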
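The same p-value can be obtained either on the original scale or after standardizing. A sketch using the numbers of the body temperature example (null mean 98.6, known variance 1, n = 25, observed mean 98.4); the variable names are illustrative.

```r
mu0    <- 98.6   # mean under the null hypothesis
sigma2 <- 1      # known population variance
n      <- 25     # sample size
xbar   <- 98.4   # observed sample mean

se <- sqrt(sigma2 / n)    # standard deviation of the sample mean under H0
z  <- (xbar - mu0) / se   # standardized test statistic

p_obs    <- pnorm(xbar, mean = mu0, sd = se)  # lower tail on the original scale
p_from_z <- pnorm(z)                          # same probability on the Z scale

print(c(z = z, p_obs = p_obs, p_from_z = p_from_z))
```

Here the observed mean lies one standard error below 98.6, so the p-value is the probability that a standard normal variable is at most -1.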

21 Hypothesis testing: One-sided vs Two-sided
One-sided test: $H_0: \mu = \mu_0$ against $H_1: \mu < \mu_0$
  The departure from the mean is in one direction.
  Example with body temperature: $H_0: \mu = 98.6$ and $H_1: \mu < 98.6$
  Computing: $p_{obs} = P(Z \le z)$, where $Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$ and $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
Two-sided test: we might be indifferent to the direction, thus $H_0: \mu = \mu_0$ and $H_1: \mu \ne \mu_0$
  Example with body temperature: $H_0: \mu = 98.6$ and $H_1: \mu \ne 98.6$
  Computing: $p_{obs} = P(Z \le -|z|) + P(Z \ge |z|) = 2\,P(Z \ge |z|)$

22 Two-sided hypothesis tests
Distribution of the standardized variable Z, with |z| = 1.
[Figure: Z distribution density, with the two tail areas that make up $p_{obs}$ beyond -|z| and +|z|]
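A short sketch of the two-sided computation for the standardized statistic shown in the figure, assuming z = -1 as in the body temperature example:

```r
z <- -1   # standardized statistic from the body temperature example (assumed)

p_one_sided <- pnorm(z)                                   # P(Z <= z)
p_two_sided <- pnorm(-abs(z)) +
               pnorm(abs(z), lower.tail = FALSE)          # both tails
p_shortcut  <- 2 * pnorm(abs(z), lower.tail = FALSE)      # 2 * P(Z >= |z|)

print(c(one_sided = p_one_sided, two_sided = p_two_sided, shortcut = p_shortcut))
```

By symmetry of the normal density the two tails are equal, which is why the two-sided p-value is twice the one-sided one.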

23 Hypothesis Testing
Aim: answering questions about the distribution of a variable in the general population, starting from the samples collected.
From populations A and B we compute the sample averages $m_A$ and $m_B$.
Hypothesis: the means $\mu_A$ and $\mu_B$ of populations A and B, respectively, are equal ($H_0$, null hypothesis).
Alternatively, and of more interest: $\mu_A \ne \mu_B$ ($H_1$ = not $H_0$).
Result: whether or not to reject $H_0$, while keeping the type I error under control.

24 Hypothesis testing: T-test
Example, T-test:
1. Assumptions:
   Observations are independent.
   Observations come from Gaussian variables with means $\mu_a$ and $\mu_b$ and standard deviations $\sigma_a$ and $\sigma_b$.
   $\sigma_a = \sigma_b$
2. Null hypothesis $H_0$: $\mu_a = \mu_b$
3. Compute the T variable:
   $T = \frac{m_a - m_b}{s_p \sqrt{\frac{1}{n_a} + \frac{1}{n_b}}}$, with pooled standard deviation $s_p = \sqrt{\frac{(n_a - 1)s_a^2 + (n_b - 1)s_b^2}{n_a + n_b - 2}}$
   Under $H_0$, T follows a t distribution with $n_a + n_b - 2$ degrees of freedom.
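The T variable can be computed by hand and compared with R's `t.test()` with `var.equal = TRUE`. A sketch on simulated data; the group sizes, means and standard deviation are arbitrary.

```r
set.seed(24)

# Two simulated independent Gaussian groups with equal variance (illustrative).
a <- rnorm(30, mean = 10, sd = 2)
b <- rnorm(25, mean = 11, sd = 2)
na <- length(a); nb <- length(b)

# Pooled standard deviation and T statistic, following the formulas above.
sp       <- sqrt(((na - 1) * var(a) + (nb - 1) * var(b)) / (na + nb - 2))
t_manual <- (mean(a) - mean(b)) / (sp * sqrt(1 / na + 1 / nb))
p_manual <- 2 * pt(-abs(t_manual), df = na + nb - 2)   # two-sided p-value

# Same test with the built-in function.
fit <- t.test(a, b, var.equal = TRUE)

print(c(t_manual = t_manual, t_builtin = unname(fit$statistic)))
print(c(p_manual = p_manual, p_builtin = fit$p.value))
```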

25 Examples in R
One-sided: using the Pima.tr dataset to test $H_0: \mu = 30$ against $H_1: \mu > 30$.
    t.test(Pima.tr$bmi, alternative="greater", mu=30, conf.level=0.95)
    ##
    ##  One Sample t-test
    ##
    ## data: Pima.tr$bmi
    ## t = ..., df = 199, p-value = 1.331e-07
    ## alternative hypothesis: true mean is greater than 30
    ## 95 percent confidence interval:
    ##  ... Inf
    ## sample estimates:
    ## mean of x
    ##  ...
Two-sided, two-sample: use the BodyTemperature dataset to test whether there are differences in body temperature between genders.
    t.test(Temperature~Gender, data=bt, var.equal=TRUE)
    ##
    ##  Two Sample t-test
    ##
    ## data: Temperature by Gender
    ## t = ..., df = 98, p-value = ...
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  ...
    ## sample estimates:
    ## mean in group F mean in group M
    ##  ...
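The one-sided test on Pima.tr can be reproduced step by step to see where the statistic and the p-value come from. A sketch assuming the Pima.tr data from the MASS package; only the manual computation is new here.

```r
library(MASS)  # provides the Pima.tr data frame

bmi <- Pima.tr$bmi
n   <- length(bmi)

# Built-in one-sample, one-sided t-test of H0: mu = 30 against H1: mu > 30.
res <- t.test(bmi, alternative = "greater", mu = 30, conf.level = 0.95)

# The same statistic by hand: (sample mean - mu0) / (s / sqrt(n)).
t_manual <- (mean(bmi) - 30) / (sd(bmi) / sqrt(n))
p_manual <- pt(t_manual, df = n - 1, lower.tail = FALSE)   # upper tail

print(res)
print(c(t_manual = t_manual, p_manual = p_manual))
```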

26 Paired t-test
Until now we assumed that the variables in the two groups are independent.
Example: what if the variables are dependent? Is the t-test still valid?
1. Test the effect of a diet on blood pressure.
   A subject can have a lower blood pressure before starting the experiment.
   There can be differences due to the age of the subjects.
How do we avoid the effect of these issues?
A possible solution is to assign the same subjects to each diet group.
Each subject follows the prescribed diet and we measure the blood pressure; then they are asked to follow another diet for six months and we measure the blood pressure again.
NB: Individuals in the two groups are paired.

27 Paired t-test Examples
Example: to show the use of the paired version of the t-test we use the study by Levine on the effect of tobacco smoke on platelet function.
Hypothesis: the higher frequency of arterial thrombosis in cigarette smokers could be partially explained by increased platelet aggregation caused by smoking.
Study: in a group of eleven people he measured the platelet aggregation before and after smoking a cigarette.
Testing: test the difference in platelet aggregation, $H_0: \mu = 0$ and $H_1: \mu < 0$.
    t.test(pt$before, pt$after, paired=TRUE)
    ##
    ##  Paired t-test
    ##
    ## data: pt$before and pt$after
    ## t = ..., df = 10, p-value = ...
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  ...
    ## sample estimates:
    ## mean of the differences
    ##  ...
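A paired t-test is a one-sample t-test on the within-subject differences. The sketch below shows this on simulated before/after values; these are not the Levine measurements, and all numbers are made up.

```r
set.seed(27)

# Simulated paired measurements for 11 subjects (illustrative only).
before <- rnorm(11, mean = 40, sd = 10)
after  <- before + rnorm(11, mean = 5, sd = 5)

# Paired test on the two vectors ...
paired <- t.test(before, after, paired = TRUE)

# ... and the equivalent one-sample test on the differences.
one_sample <- t.test(before - after, mu = 0)

print(c(t_paired     = unname(paired$statistic),
        t_one_sample = unname(one_sample$statistic)))
print(c(df = unname(paired$parameter), p_value = paired$p.value))
```

To test a one-sided alternative such as the $H_1: \mu < 0$ stated on the slide, `alternative = "less"` can be passed to `t.test()`.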

28 Testing for normality
Everything we have seen so far assumes that the variables are normally distributed. How do we check this?
1: Visual inspection
Test normality for Body Mass Index:
    qqnorm(Pima.tr$bmi)
[Figure: Normal Q-Q plot; x-axis: theoretical quantiles, y-axis: sample quantiles]
Test normality for Age:
    qqnorm(Pima.tr$age)
[Figure: Normal Q-Q plot; x-axis: theoretical quantiles, y-axis: sample quantiles]

29 Testing for normality
2: Normality tests
The Shapiro-Wilk test checks for normality: it evaluates the null hypothesis that the distribution of a random variable is normal.
Test normality for Body Mass Index:
    shapiro.test(Pima.tr$bmi)
    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data: Pima.tr$bmi
    ## W = 0.991, p-value = ...
Test normality for Age:
    shapiro.test(Pima.tr$age)
    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data: Pima.tr$age
    ## W = ..., p-value = 1.853e-12
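How the test behaves on normal versus clearly non-normal data can be seen on simulated samples. A sketch in which a normal sample and a right-skewed exponential sample stand in for variables such as BMI and age; the data are simulated, not taken from Pima.tr.

```r
set.seed(29)

normal_x <- rnorm(200)   # sample from a normal distribution
skewed_x <- rexp(200)    # right-skewed sample (exponential)

sw_normal <- shapiro.test(normal_x)
sw_skewed <- shapiro.test(skewed_x)

# W close to 1 and a large p-value are compatible with normality;
# a small p-value is evidence against the null hypothesis of normality.
print(c(W_normal = unname(sw_normal$statistic), p_normal = sw_normal$p.value))
print(c(W_skewed = unname(sw_skewed$statistic), p_skewed = sw_skewed$p.value))

# Visual check with a reference line (interactive use only).
if (interactive()) {
  qqnorm(skewed_x)
  qqline(skewed_x)
}
```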

30 Testing for Homoscedasticity
Null hypothesis $H_0$: the variances of the groups are equal.
Parametric tests:
  Bartlett test: bartlett.test(x, y)
  Levene test (from the car library): leveneTest(y ~ x)
Non-parametric tests:
  Fligner-Killeen test: fligner.test(y ~ x)
    bartlett.test(bt$Temperature, bt$Gender)
    ##
    ##  Bartlett test of homogeneity of variances
    ##
    ## data: bt$Temperature and bt$Gender
    ## Bartlett's K-squared = 2.189, df = 1, p-value = ...
[Figure: density plot, N = 51]
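The Bartlett and Fligner-Killeen tests are both available in base R and accept a formula. A sketch on two simulated groups with the same variance; the group labels and parameters are made up, and the Levene test is left out because it needs the car package.

```r
set.seed(30)

# Two simulated groups with equal variance (illustrative values).
y <- c(rnorm(50, mean = 98.4, sd = 0.7), rnorm(50, mean = 98.1, sd = 0.7))
g <- factor(rep(c("F", "M"), each = 50))

bartlett_res <- bartlett.test(y ~ g)   # parametric, sensitive to non-normality
fligner_res  <- fligner.test(y ~ g)    # rank-based, more robust alternative

print(c(bartlett_p = bartlett_res$p.value, fligner_p = fligner_res$p.value))
```

A large p-value means the data give no evidence against equal variances, which supports using `var.equal = TRUE` in the two-sample t-test.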

31 Exercise I
1. We assume that the probability distribution of blood pressure is $X \sim N(\mu, \sigma^2)$. Suppose that we did not know $\sigma$ and estimated it using the sample standard deviation s = 6.
   1. Find the standard error for the sample mean as the estimator of the population mean.
   2. Find the 80% CI estimate for $\mu$ based on this sample.
2. Given a distribution with 20 degrees of freedom, compute the confidence interval at 0.99, 0.95, 0.90 probability.
3. Using the bodytemperature dataset, find the point estimate and the 78% confidence interval estimate for the population means of heart rate and normal body temperature.
4. Suppose that we interviewed a random sample of 2000 people and found that 320 of them smoke regularly. Find the 90% confidence interval for the population proportion of smokers.
5. With the Pima.tr dataset, suppose a BMI greater than 30 denotes obesity. We know obesity and diabetes are related. Suppose the sample size is n = 100 and $\sigma^2 = 6^2$. How can you test if this population is obese? Write the formulas and test it using R.
6. Use the Pima.tr dataset to find the difference between the sample means of diastolic blood pressure for diabetic and nondiabetic Pima Indian women. Is the difference between the means of diastolic blood pressure statistically significant at the 0.01 level?
7. Answer the above question for the number of pregnancies and BMI.
8. Use the birthwt dataset to examine the relationship between hypertension history (ht) and the risk of having a low-birthweight baby (low).
9. Use the birthwt dataset to examine the effect of smoking on birth weight. Is there any significant difference? What is the p-value?


More information

Assumption 1: Groups of data represent random samples from their respective populations.

Assumption 1: Groups of data represent random samples from their respective populations. Tutorial 6: Comparing Two Groups Assumptions The following methods for comparing two groups are based on several assumptions. The type of test you use will vary based on whether these assumptions are met

More information

Week 7: The normal distribution and sample means

Week 7: The normal distribution and sample means Week 7: The normal distribution and sample means Goals Visualize properties of the normal distribution. Learning the Tools Understand the Central Limit Theorem. Calculate sampling properties of sample

More information

MAT 142 College Mathematics. Module ST. Statistics. Terri Miller revised July 14, 2015

MAT 142 College Mathematics. Module ST. Statistics. Terri Miller revised July 14, 2015 MAT 142 College Mathematics Statistics Module ST Terri Miller revised July 14, 2015 2 Statistics Data Organization and Visualization Basic Terms. A population is the set of all objects under study, a sample

More information

So..to be able to make comparisons possible, we need to compare them with their respective distributions.

So..to be able to make comparisons possible, we need to compare them with their respective distributions. Unit 3 ~ Modeling Distributions of Data 1 ***Section 2.1*** Measures of Relative Standing and Density Curves (ex) Suppose that a professional soccer team has the money to sign one additional player and

More information

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA This lab will assist you in learning how to summarize and display categorical and quantitative data in StatCrunch. In particular, you will learn how to

More information

E-Campus Inferential Statistics - Part 2

E-Campus Inferential Statistics - Part 2 E-Campus Inferential Statistics - Part 2 Group Members: James Jones Question 4-Isthere a significant difference in the mean prices of the stores? New Textbook Prices New Price Descriptives 95% Confidence

More information

Stat 427/527: Advanced Data Analysis I

Stat 427/527: Advanced Data Analysis I Stat 427/527: Advanced Data Analysis I Chapter 3: Two-Sample Inferences September, 2017 1 / 44 Stat 427/527: Advanced Data Analysis I Chapter 3: Two-Sample Inferences September, 2017 2 / 44 Topics Suppose

More information

The Normal Distribution & z-scores

The Normal Distribution & z-scores & z-scores Distributions: Who needs them? Why are we interested in distributions? Important link between distributions and probabilities of events If we know the distribution of a set of events, then we

More information

Soci Statistics for Sociologists

Soci Statistics for Sociologists University of North Carolina Chapel Hill Soci708-001 Statistics for Sociologists Fall 2009 Professor François Nielsen Stata Commands for Module 7 Inference for Distributions For further information on

More information

Spatial Patterns Point Pattern Analysis Geographic Patterns in Areal Data

Spatial Patterns Point Pattern Analysis Geographic Patterns in Areal Data Spatial Patterns We will examine methods that are used to analyze patterns in two sorts of spatial data: Point Pattern Analysis - These methods concern themselves with the location information associated

More information

Chapter 2: The Normal Distribution

Chapter 2: The Normal Distribution Chapter 2: The Normal Distribution 2.1 Density Curves and the Normal Distributions 2.2 Standard Normal Calculations 1 2 Histogram for Strength of Yarn Bobbins 15.60 16.10 16.60 17.10 17.60 18.10 18.60

More information

Chapter2 Description of samples and populations. 2.1 Introduction.

Chapter2 Description of samples and populations. 2.1 Introduction. Chapter2 Description of samples and populations. 2.1 Introduction. Statistics=science of analyzing data. Information collected (data) is gathered in terms of variables (characteristics of a subject that

More information

Table Of Contents. Table Of Contents

Table Of Contents. Table Of Contents Statistics Table Of Contents Table Of Contents Basic Statistics... 7 Basic Statistics Overview... 7 Descriptive Statistics Available for Display or Storage... 8 Display Descriptive Statistics... 9 Store

More information

Exploring Persuasiveness of Just-in-time Motivational Messages for Obesity Management

Exploring Persuasiveness of Just-in-time Motivational Messages for Obesity Management Exploring Persuasiveness of Just-in-time Motivational Messages for Obesity Management Megha Maheshwari 1, Samir Chatterjee 1, David Drew 2 1 Network Convergence Lab, Claremont Graduate University http://ncl.cgu.edu

More information

One Factor Experiments

One Factor Experiments One Factor Experiments 20-1 Overview Computation of Effects Estimating Experimental Errors Allocation of Variation ANOVA Table and F-Test Visual Diagnostic Tests Confidence Intervals For Effects Unequal

More information

The problem we have now is called variable selection or perhaps model selection. There are several objectives.

The problem we have now is called variable selection or perhaps model selection. There are several objectives. STAT-UB.0103 NOTES for Wednesday 01.APR.04 One of the clues on the library data comes through the VIF values. These VIFs tell you to what extent a predictor is linearly dependent on other predictors. We

More information

Notes on Simulations in SAS Studio

Notes on Simulations in SAS Studio Notes on Simulations in SAS Studio If you are not careful about simulations in SAS Studio, you can run into problems. In particular, SAS Studio has a limited amount of memory that you can use to write

More information

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency Math 1 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency lowest value + highest value midrange The word average: is very ambiguous and can actually refer to the mean,

More information

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order. Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good

More information

Confidence Intervals: Estimators

Confidence Intervals: Estimators Confidence Intervals: Estimators Point Estimate: a specific value at estimates a parameter e.g., best estimator of e population mean ( ) is a sample mean problem is at ere is no way to determine how close

More information

Mixed Effects Models. Biljana Jonoska Stojkova Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC.

Mixed Effects Models. Biljana Jonoska Stojkova Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC. Mixed Effects Models Biljana Jonoska Stojkova Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC March 6, 2018 Resources for statistical assistance Department of Statistics

More information

CHAPTER 2: Describing Location in a Distribution

CHAPTER 2: Describing Location in a Distribution CHAPTER 2: Describing Location in a Distribution 2.1 Goals: 1. Compute and use z-scores given the mean and sd 2. Compute and use the p th percentile of an observation 3. Intro to density curves 4. More

More information

Machine Learning A W 1sst KU. b) [1 P] For the probability distribution P (A, B, C, D) with the factorization

Machine Learning A W 1sst KU. b) [1 P] For the probability distribution P (A, B, C, D) with the factorization Machine Learning A 708.064 13W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence a) [1 P] For the probability distribution P (A, B, C, D) with the factorization P (A, B,

More information

WHO STEPS Surveillance Support Materials. STEPS Epi Info Training Guide

WHO STEPS Surveillance Support Materials. STEPS Epi Info Training Guide STEPS Epi Info Training Guide Department of Chronic Diseases and Health Promotion World Health Organization 20 Avenue Appia, 1211 Geneva 27, Switzerland For further information: www.who.int/chp/steps WHO

More information

Equivalence Tests for Two Means in a 2x2 Cross-Over Design using Differences

Equivalence Tests for Two Means in a 2x2 Cross-Over Design using Differences Chapter 520 Equivalence Tests for Two Means in a 2x2 Cross-Over Design using Differences Introduction This procedure calculates power and sample size of statistical tests of equivalence of the means of

More information

Using R. Liang Peng Georgia Institute of Technology January 2005

Using R. Liang Peng Georgia Institute of Technology January 2005 Using R Liang Peng Georgia Institute of Technology January 2005 1. Introduction Quote from http://www.r-project.org/about.html: R is a language and environment for statistical computing and graphics. It

More information

Resampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016

Resampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016 Resampling Methods Levi Waldron, CUNY School of Public Health July 13, 2016 Outline and introduction Objectives: prediction or inference? Cross-validation Bootstrap Permutation Test Monte Carlo Simulation

More information

MICROSOFT EXCEL BASIC FORMATTING

MICROSOFT EXCEL BASIC FORMATTING MICROSOFT EXCEL BASIC FORMATTING To create a new workbook: [Start All Programs Microsoft Office - Microsoft Excel 2010] To rename a sheet(1): Select the sheet whose tab you want to rename (the selected

More information

Advanced Statistical Computing Week 2: Monte Carlo Study of Statistical Procedures

Advanced Statistical Computing Week 2: Monte Carlo Study of Statistical Procedures Advanced Statistical Computing Week 2: Monte Carlo Study of Statistical Procedures Aad van der Vaart Fall 2012 Contents Sampling distribution Estimators Tests Computing a p-value Permutation Tests 2 Sampling

More information

SD 372 Pattern Recognition

SD 372 Pattern Recognition SD 372 Pattern Recognition Lab 2: Model Estimation and Discriminant Functions 1 Purpose This lab examines the areas of statistical model estimation and classifier aggregation. Model estimation will be

More information

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem. STAT 2607 REVIEW PROBLEMS 1 REMINDER: On the final exam 1. Word problems must be answered in words of the problem. 2. "Test" means that you must carry out a formal hypothesis testing procedure with H0,

More information

Nonparametric and Simulation-Based Tests. STAT OSU, Spring 2019 Dalpiaz

Nonparametric and Simulation-Based Tests. STAT OSU, Spring 2019 Dalpiaz Nonparametric and Simulation-Based Tests STAT 3202 @ OSU, Spring 2019 Dalpiaz 1 What is Parametric Testing? 2 Warmup #1, Two Sample Test for p 1 p 2 Ohio Issue 1, the Drug and Criminal Justice Policies

More information

Regression Lab 1. The data set cholesterol.txt available on your thumb drive contains the following variables:

Regression Lab 1. The data set cholesterol.txt available on your thumb drive contains the following variables: Regression Lab The data set cholesterol.txt available on your thumb drive contains the following variables: Field Descriptions ID: Subject ID sex: Sex: 0 = male, = female age: Age in years chol: Serum

More information

CHAPTER 6. The Normal Probability Distribution

CHAPTER 6. The Normal Probability Distribution The Normal Probability Distribution CHAPTER 6 The normal probability distribution is the most widely used distribution in statistics as many statistical procedures are built around it. The central limit

More information

Section 2.3: Simple Linear Regression: Predictions and Inference

Section 2.3: Simple Linear Regression: Predictions and Inference Section 2.3: Simple Linear Regression: Predictions and Inference Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.4 1 Simple

More information

Confidence Interval of a Proportion

Confidence Interval of a Proportion Confidence Interval of a Proportion FPP 20-21 Using the sample to learn about the box Box models and CLT assume we know the contents of the box (the population). In real-world problems, we do not. In random

More information

Distributions of random variables

Distributions of random variables Chapter 3 Distributions of random variables 31 Normal distribution Among all the distributions we see in practice, one is overwhelmingly the most common The symmetric, unimodal, bell curve is ubiquitous

More information

The ctest Package. January 3, 2000

The ctest Package. January 3, 2000 R objects documented: The ctest Package January 3, 2000 bartlett.test....................................... 1 binom.test........................................ 2 cor.test.........................................

More information

This is a good time to refresh your memory on double-integration. We will be using this skill in the upcoming lectures.

This is a good time to refresh your memory on double-integration. We will be using this skill in the upcoming lectures. Chapter 5: JOINT PROBABILITY DISTRIBUTIONS Part 1: Sections 5-1.1 to 5-1.4 For both discrete and continuous random variables we will discuss the following... Joint Distributions (for two or more r.v. s)

More information

Data Statistics Population. Census Sample Correlation... Statistical & Practical Significance. Qualitative Data Discrete Data Continuous Data

Data Statistics Population. Census Sample Correlation... Statistical & Practical Significance. Qualitative Data Discrete Data Continuous Data Data Statistics Population Census Sample Correlation... Voluntary Response Sample Statistical & Practical Significance Quantitative Data Qualitative Data Discrete Data Continuous Data Fewer vs Less Ratio

More information

for statistical analyses

for statistical analyses Using for statistical analyses Robert Bauer Warnemünde, 05/16/2012 Day 6 - Agenda: non-parametric alternatives to t-test and ANOVA (incl. post hoc tests) Wilcoxon Rank Sum/Mann-Whitney U-Test Kruskal-Wallis

More information

Chapter Two: Descriptive Methods 1/50

Chapter Two: Descriptive Methods 1/50 Chapter Two: Descriptive Methods 1/50 2.1 Introduction 2/50 2.1 Introduction We previously said that descriptive statistics is made up of various techniques used to summarize the information contained

More information

i2itracks Population Health Analytics (ipha) Custom Reports & Dashboards

i2itracks Population Health Analytics (ipha) Custom Reports & Dashboards i2itracks Population Health Analytics (ipha) Custom Reports & Dashboards 377 Riverside Drive, Suite 300 Franklin, TN 37064 707-575-7100 www.i2ipophealth.com Table of Contents Creating ipha Custom Reports

More information

PART III APPLICATIONS

PART III APPLICATIONS S. Vieira PART III APPLICATIONS Fuzz IEEE 2013, Hyderabad India 1 Applications Finance Value at Risk estimation based on a PFS model for density forecast of a continuous response variable conditional on

More information

Box-Cox Transformation for Simple Linear Regression

Box-Cox Transformation for Simple Linear Regression Chapter 192 Box-Cox Transformation for Simple Linear Regression Introduction This procedure finds the appropriate Box-Cox power transformation (1964) for a dataset containing a pair of variables that are

More information

BIOS: 4120 Lab 11 Answers April 3-4, 2018

BIOS: 4120 Lab 11 Answers April 3-4, 2018 BIOS: 4120 Lab 11 Answers April 3-4, 2018 In today s lab we will briefly revisit Fisher s Exact Test, discuss confidence intervals for odds ratios, and review for quiz 3. Note: The material in the first

More information

Week 4: Simple Linear Regression II

Week 4: Simple Linear Regression II Week 4: Simple Linear Regression II Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Algebraic properties

More information

Introduction to hypothesis testing

Introduction to hypothesis testing Introduction to hypothesis testing Mark Johnson Macquarie University Sydney, Australia February 27, 2017 1 / 38 Outline Introduction Hypothesis tests and confidence intervals Classical hypothesis tests

More information

For our example, we will look at the following factors and factor levels.

For our example, we will look at the following factors and factor levels. In order to review the calculations that are used to generate the Analysis of Variance, we will use the statapult example. By adjusting various settings on the statapult, you are able to throw the ball

More information

Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242

Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242 Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242 Creation & Description of a Data Set * 4 Levels of Measurement * Nominal, ordinal, interval, ratio * Variable Types

More information

Chapter 6. THE NORMAL DISTRIBUTION

Chapter 6. THE NORMAL DISTRIBUTION Chapter 6. THE NORMAL DISTRIBUTION Introducing Normally Distributed Variables The distributions of some variables like thickness of the eggshell, serum cholesterol concentration in blood, white blood cells

More information