Epidemiological analysis
PhD course in epidemiology
Lau Caspar Thygesen, Associate professor, PhD
25th February 2014



2 Age standardization
- Incidence and prevalence are strongly age-dependent
- Risks rise (e.g. chronic diseases) or decline (e.g. measles) with age
- Comparisons between populations and over time may be very misleading
- A single age-independent index representing a set of age-specific rates may be more appropriate


3 Mortality in Denmark and Greenland, men, 1975
Direct standardization
Please interpret this table.
IR (DK, standardized to the Greenlandic age distribution) = 0.016 × … + … + … × 66.5 = 3.8
Indirect standardization
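The direct standardization on this slide is a weighted sum: each age-specific rate of the study population is multiplied by the share of that age group in the standard population. The table values did not survive in this transcript, so the following is only a minimal sketch with hypothetical rates and age distributions; the function name `direct_standardized_rate` is likewise an assumption made for illustration.

```python
# Direct standardization: weight the study population's age-specific rates
# by the age distribution of a standard population.
# All numbers below are hypothetical and only illustrate the arithmetic.

def direct_standardized_rate(age_specific_rates, standard_weights):
    """Weighted average of age-specific rates using standard-population weights."""
    if len(age_specific_rates) != len(standard_weights):
        raise ValueError("need exactly one weight per age group")
    total_weight = sum(standard_weights)
    weighted_sum = sum(r * w for r, w in zip(age_specific_rates, standard_weights))
    return weighted_sum / total_weight

# Hypothetical mortality rates per 1,000 person-years in three age groups
rates_per_1000 = [1.0, 5.0, 40.0]

# Two hypothetical standard age distributions (proportions per age group)
young_standard = [0.6, 0.3, 0.1]
old_standard = [0.3, 0.4, 0.3]

rate_young = direct_standardized_rate(rates_per_1000, young_standard)
rate_old = direct_standardized_rate(rates_per_1000, old_standard)
print(f"standardized to the young population: {rate_young:.1f} per 1,000")
print(f"standardized to the old population:   {rate_old:.1f} per 1,000")
```

The same age-specific rates give very different summary rates depending on the standard, which is why standardized rates are only comparable when they use the same standard population.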

4 Example: trend study of lung cancer incidence among women, Denmark
[Figure: Lung Cancer, Denmark, Women; plotted series: ratecrude, segi, scand]
Example 2: incidence of multiple sclerosis, Denmark, European Standard Population

5 Example: indirect standardization
- 19,185 subjects (3,817 women) who attended outpatient clinics for alcohol abusers, Copenhagen 1952-1992
- Compare their incidence of heart disease with the incidence rate in the greater Copenhagen area
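Indirect standardization of this kind applies the reference population's age-specific rates to the cohort's person-years to get an expected number of cases, and then compares observed with expected. The sketch below uses hypothetical person-years, rates and case counts (the study's actual figures are not given in these slides); `standardized_incidence_ratio` is an illustrative name.

```python
# Indirect standardization: expected cases come from reference rates applied
# to the cohort's own person-years; the ratio observed/expected is the SIR.
# All numbers are hypothetical.

def standardized_incidence_ratio(observed_cases, person_years, reference_rates):
    """Return (SIR, expected cases) for one cohort against a reference population."""
    if len(person_years) != len(reference_rates):
        raise ValueError("need one reference rate per age group")
    expected = sum(py * rate for py, rate in zip(person_years, reference_rates))
    return observed_cases / expected, expected

# Hypothetical cohort person-years in three age groups
cohort_person_years = [2000.0, 3000.0, 1000.0]
# Hypothetical reference incidence rates (cases per person-year) in the same age groups
reference_rates = [0.001, 0.004, 0.010]
# Hypothetical number of cases observed in the cohort
observed = 36

sir, expected = standardized_incidence_ratio(observed, cohort_person_years, reference_rates)
print(f"expected cases: {expected:.1f}, SIR: {sir:.2f}")
```

Because the weights are the cohort's own person-years, two such ratios from different cohorts are each comparable to the reference population but not directly to each other.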

6 Problems
- Direct standardisation can produce unreliable estimates when the calculations are based on small numbers
- Indirectly standardised results from different populations cannot be compared directly with each other, only with the standard

Compared to regression methods
- Regression-based methods are available but are rarely applied in practice
- When individual data are available (presence/absence of disease, age and sex), a logistic regression can be used to estimate the standardized rate
- The main advantage is that it allows adjustment for continuous variables in addition to categorical variables

Missing data
- What does missing mean?
- The pattern of missingness (nomenclature)
- How and why is it missing?

Missing values
- Common in research
- Nonresponse
- Loss to follow-up
- Lack of overlap between linked data sets (not so common)

Methods for handling

7 What is item nonresponse?
Unit nonresponse vs. item nonresponse
[Tables: two example data sets with columns ID, Q1, Q2, Q…; missing entries are shown as "?"]

Unit nonresponse, examples
- Person who is not at home
- Person who does not pick up the phone
- Person who hangs up on you
- Rat that dies before the study
- The country you could not get data on
- etc.

Item nonresponse
- "I don't know"
- Refusals to respond
- Questions left blank
- Failed measurement
- etc.

The best way to deal with missing data is not to have any

8 Minimizing unit nonresponse
- Call back if not home
- Refusal conversion
- Don't mess up
- Clear and understandable questionnaire
- Polite request
- Incentives

Minimizing item nonresponse
- Well-written questions
- Minimize misunderstandings (cross-cultural example)
- Standardized vs. non-standardized
- Minimize skip patterns

What kind of missing data should be modeled?
- Model an item that is missing from your dataset if you suspect that it has a true value
- "I don't know" might simply mean "I don't know": don't model it as if there was a true value
- Dead people (attrition)

The pattern of missingness (nomenclature)
- Ignorable
  - MCAR: Missing Completely at Random
  - MAR: Missing at Random
- Non-ignorable
  - NMAR: Not Missing at Random

9 Missing completely at random (MCAR)
- If the data are missing completely at random, the missing values cannot be predicted any better from the other information in the data
- Cause of missingness is a completely random process (like a coin flip)
- Cause is uncorrelated with the variables of interest
- Example: parents move
- No bias if the cause is omitted
- In the unlikely event that the process is missing completely at random, inferences based on complete cases are unbiased but inefficient, because we have lost some cases

Missing at random (MAR)
- Missingness may be related to measured variables
- But no residual relationship with unmeasured variables
- No bias if you control for the measured variables
- For example, if the highly educated are more likely to participate in a survey, the process is missing at random as long as we know the educational level of all persons
- If data are missing at random, inferences based on complete cases that do not account for those measured variables will be biased and inefficient

Missing not at random (NMAR)
- Non-ignorable / NMAR: the probability that a cell is missing depends on the unobserved value itself
- For example, responses to income questions, where high-income people are more likely to refuse to answer and the other variables in the data set cannot predict which respondents have high income
- If your missing data are non-ignorable, inferences based on complete cases will be biased and inefficient

Classical missing data treatments
Whatever you do, you are doing something
- Case deletion
  - Listwise (complete case analysis)
  - Pairwise (available case analysis)
- Indicator variable (dummy variable)
- Single imputation
  - (Unconditional) mean imputation
  - Conditional mean imputation (expected value)
- Weighting
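The three missingness mechanisms can be made concrete with a small simulation. The sketch below is an illustration under assumed, made-up parameters (a binary covariate `x` standing in for education, an outcome `y`, and arbitrary missingness probabilities); it compares the complete-case mean of `y` with the full-data mean, and with a mean that is stratified on `x` and reweighted to the full sample.

```python
import random

random.seed(2014)
n = 20000

# x: 1 = highly educated, 0 = not; y: an outcome that is higher when x = 1
x = [1 if random.random() < 0.5 else 0 for _ in range(n)]
y = [10.0 + 5.0 * xi + random.gauss(0.0, 2.0) for xi in x]

def mean(values):
    return sum(values) / len(values)

true_mean = mean(y)

def observed_mask(mechanism):
    """Return True where y is observed under the given missingness mechanism."""
    mask = []
    for xi, yi in zip(x, y):
        if mechanism == "MCAR":
            p_missing = 0.3                       # pure coin flip
        elif mechanism == "MAR":
            p_missing = 0.1 if xi == 1 else 0.5   # depends only on observed x
        else:  # NMAR
            p_missing = 0.6 if yi > 12.5 else 0.1  # depends on y itself
        mask.append(random.random() >= p_missing)
    return mask

results = {}
for mechanism in ("MCAR", "MAR", "NMAR"):
    mask = observed_mask(mechanism)
    complete_case = mean([yi for yi, m in zip(y, mask) if m])
    # Stratify on x, then weight each stratum by its share of the full sample
    adjusted = 0.0
    for level in (0, 1):
        stratum = [yi for xi, yi, m in zip(x, y, mask) if m and xi == level]
        share = sum(1 for xi in x if xi == level) / n
        adjusted += share * mean(stratum)
    results[mechanism] = (complete_case - true_mean, adjusted - true_mean)
    print(f"{mechanism}: complete-case bias {results[mechanism][0]:+.2f}, "
          f"bias after adjusting for x {results[mechanism][1]:+.2f}")
```

In this setup the complete-case mean is close to the truth under MCAR, is biased under MAR until `x` is taken into account, and stays biased under NMAR even after adjusting for `x`.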

10 Listwise deletion and multi-item
- Excludes the whole case
- Default in most software
- Works if the mechanism is MCAR and if the pattern and sample size allow it (need to have enough complete cases)
- Can be biased

Pairwise deletion
- An option for using all available information (correlation/covariance matrices)
- Different calculations may be based on different populations
- Very unpredictable bias

Indicator method
- For each variable with missing values, create a missing-value indicator to accompany the variable in all analyses
- Assumes MCAR
- Even if the stratum with missing values is just a random sample of all subjects, that stratum will yield a confounded estimate of the exposure effect

Mean imputation
Technique
- Calculate the mean over cases that have values for $Y$
- Impute this mean where $Y$ is missing
- Ditto for $X_1$, $X_2$, etc.
Problems
- Ignores relationships among $X$ and $Y$
- Underestimates covariances
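The problems listed for unconditional mean imputation can be checked numerically. The following is a minimal sketch on simulated data with assumed parameters (a slope of 2 and 40% of `y` deleted completely at random); the helper `sample_covariance` is written out here only to keep the example self-contained.

```python
import random
import statistics

random.seed(7)
n = 5000

# Simulated data: y depends on x; the parameters are arbitrary
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

# Delete 40% of the y values completely at random
is_observed = [random.random() >= 0.4 for _ in range(n)]

# Unconditional mean imputation: every missing y gets the observed mean of y
observed_mean = statistics.fmean([yi for yi, obs in zip(y, is_observed) if obs])
y_imputed = [yi if obs else observed_mean for yi, obs in zip(y, is_observed)]

def sample_covariance(a, b):
    """Sample covariance with an n - 1 denominator."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

var_full = statistics.variance(y)
var_imputed = statistics.variance(y_imputed)
cov_full = sample_covariance(x, y)
cov_imputed = sample_covariance(x, y_imputed)

print(f"variance of y:    full data {var_full:.2f}, after mean imputation {var_imputed:.2f}")
print(f"covariance(x, y): full data {cov_full:.2f}, after mean imputation {cov_imputed:.2f}")
```

Every imputed point sits exactly at the mean, so it adds nothing to the spread of `y` or to its co-movement with `x`, while still being counted in the sample size.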


11 (Unconditional) mean imputation
- Standard errors too low
- CI difficult to calculate
- Scatterplots are from Joe Schafer's website

Conditional mean imputation
Technique & implicit models
- If $Y$ is missing, impute the mean of cases with similar values for $X_1$, $X_2$: $Y = b_0 + X_1 b_1 + X_2 b_2$
- Likewise, if $X_2$ is missing, impute the mean of cases with similar values for $X_1$, $Y$: $X_2 = g_0 + X_1 g_1 + Y g_2$
- If both $Y$ and $X_2$ are missing, impute the means of cases with similar values for $X_1$: $Y = d_0 + X_1 d_1$ and $X_2 = f_0 + X_1 f_1$
Problem
- Ignores random components (no $e$) → underestimates variances and standard errors

Imputation of expected value
- Good for creating expected values
- Bad for multivariate analysis
- Decreases standard errors
- Creates overconfident outcomes
- Increases the probability of Type I error
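To illustrate why leaving out the random component matters, the sketch below imputes a missing $Y$ from a single predictor in two ways: with the fitted conditional mean only, and with the conditional mean plus a random residual draw. The data, the one-predictor model and all parameter values are assumptions chosen for illustration; adding a residual draw is one common remedy and is not taken from the slides.

```python
import random
import statistics

random.seed(11)
n = 5000

# Simulated data with a known linear relationship (arbitrary parameters)
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 2.0) for xi in x]

# Half of the y values are deleted completely at random
is_observed = [random.random() >= 0.5 for _ in range(n)]
x_obs = [xi for xi, obs in zip(x, is_observed) if obs]
y_obs = [yi for yi, obs in zip(y, is_observed) if obs]

# Least-squares fit of y on x using the complete cases only
mean_x, mean_y = statistics.fmean(x_obs), statistics.fmean(y_obs)
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x_obs, y_obs))
         / sum((a - mean_x) ** 2 for a in x_obs))
intercept = mean_y - slope * mean_x
residual_sd = (sum((b - intercept - slope * a) ** 2 for a, b in zip(x_obs, y_obs))
               / (len(x_obs) - 2)) ** 0.5

# Conditional mean imputation: fitted value only, no random component
y_conditional_mean = [yi if obs else intercept + slope * xi
                      for xi, yi, obs in zip(x, y, is_observed)]
# Same fitted value plus a random residual draw
y_with_noise = [yi if obs else intercept + slope * xi + random.gauss(0.0, residual_sd)
                for xi, yi, obs in zip(x, y, is_observed)]

var_full = statistics.variance(y)
var_conditional_mean = statistics.variance(y_conditional_mean)
var_with_noise = statistics.variance(y_with_noise)

print(f"variance of y, full data:                {var_full:.2f}")
print(f"variance after conditional mean only:    {var_conditional_mean:.2f}")
print(f"variance after adding a residual draw:   {var_with_noise:.2f}")
```

Adding the residual draw restores the spread of the imputed values, but a single imputed data set still treats those values as if they had been observed.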

12 Problem with single imputation
- Underestimates standard errors!
- Treats imputed values like observed values when they are actually less certain
- Ignores imputation variation

Sampling variation
- If you take a different sample you get different parameter estimates
- Standard errors reflect this
- One way to estimate sampling variation: measure variation across multiple samples (called bootstrapping)

Imputation variation
- If you impute different values you get different parameter estimates
- Standard errors should reflect this, too
- One way to estimate imputation variation: measure variation across multiple imputed data sets (called multiple imputation)

Multiple imputation: example
- Models both expected value and uncertainty
- Using the missing data model you specify, it simulates and imputes missing values multiple times, creating M complete datasets (M=5 is usually OK, but it is a good idea to simulate more)
- Analyze each dataset independently
- Combine the results to get unbiased estimates
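The combining step is usually done with the standard pooling rules attributed to Rubin: the pooled estimate is the average of the M estimates, and its variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The sketch below applies those rules to five hypothetical estimates and squared standard errors; `pool_rubin` is an illustrative name, not a function from any package.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances (squared standard errors)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and one variance per estimate")
    pooled_estimate = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # imputation variance, divides by m - 1
    total_variance = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, total_variance ** 0.5

# Hypothetical regression coefficients from M = 5 imputed data sets
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
# Hypothetical squared standard errors of those coefficients
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

pooled_estimate, pooled_se = pool_rubin(estimates, variances)
naive_se = statistics.fmean(variances) ** 0.5  # ignores imputation variation

print(f"pooled estimate: {pooled_estimate:.2f}")
print(f"pooled SE: {pooled_se:.3f} (within-imputation only: {naive_se:.3f})")
```

The pooled standard error is larger than the within-imputation one because it also carries the variation between imputed data sets, which is exactly what single imputation ignores.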

13 Multiple imputation: simple procedure
1. Impute using PROC MI
3. Do analysis: PROC REG, LOGISTIC, etc., using "by _imputation_;" in the procedure
4. Combine results using PROC MIANALYZE

[Figure: PROC MI sample output]

PROC MI typical syntax:
    proc mi data=bmx out=impdat seed=33155;
      var bmxbmi bmxht bmxwt bmxarmc bmxarml;
    run;
- data=  1 copy of the data with missing values
- out=  5 copies of the data with imputed values (will be different across copies)
- seed=  random seed; keep it the same to reconstruct your results
- var  variables with missing values that you need imputed, variables in the model, and those that may be helpful with imputation

PROC MI options
- nimpute=  number of imputations, default=5; 0 gives missing patterns
- minimum=, maximum=  set min & max; sometimes doesn't converge as well
- round=  round-off option

14 [Figure: output dataset]

Regression
Fit your model as if the data had no missing values, using "by _imputation_;"
    proc reg data=impdat outest=parmcov covout;
      model bmxbmi=bmxht bmxwt bmxarmc bmxarml;
      by _imputation_;
    run;
- You'll get nimpute (usually 5) sets of output
- Estimates, covariances and errors will be combined in MIANALYZE
- Need to generate parameter estimates and a covariance data set (varies by procedure)

Parameter estimates & covariance matrix
    proc logistic data=impdat descending;
      model bmxbmi=bmxht bmxwt bmxarmc bmxarml /covb;
      by _imputation_;
      ods output ParameterEstimates=parmsdat CovB=covbdat;
    run;

    proc mixed data=impdat;
      model bmxbmi=bmxht bmxwt bmxarmc bmxarml /solution covb;
      by _imputation_;
      ods output covparms=parmcov;
    run;

    proc genmod data=impdat;
      model bmxbmi=bmxht bmxwt bmxarmc bmxarml /covb;
      by _imputation_;
      ods output ParameterEstimates=parmsdat CovB=covbdat;
    run;

15 PROC MIANALYZE
Syntax depends on what procedure you used in the previous step:
    proc mianalyze data=parmcov;
(or)
    proc mianalyze parms=parmsdat covb=covbdat;
(or)
    proc mianalyze parms=parmsdat xpxi=xpxidat;
(then type this:)
      modeleffects intercept bmxht bmxwt bmxarmc bmxarml;
    run;
- Note the var statement is now modeleffects
- Note that the dependent variable is omitted

[Figure: PROC MIANALYZE output]

STATA
    * preparing dataset for multiple imputation
    mi query
    mi set mlong
    mi describe, detail
    mi register imputed total
    set seed
    mi impute mvn total = i.smoking i.isced4 i.samliv3 i.s57a_ i.alder4 i.gender, add(20) force
    mi describe, detail
    * rounding the imputed binary values to the nearest integer
    * replace bingedrinking = 0 if bingedrinking <0.5
    * replace bingedrinking = 1 if bingedrinking >0.5
    * replace change_new = round(change_new)
    * examination of imputations: comparing main descriptive statistics from some imputations to those from the observed data
    mi xeq : summarize total
    mi estimate: xtmixed total i.gender group##month || username:, mle
    mi estimate: mean total, over(sex group month)

Weighted regression
- Suppose that a national survey sampled 2000 subjects: 1000 men and 1000 women
- The number of responses was 500 for men and 750 for women
- If there are large differences between men and women, a simple average of the 1250 responses will be a distorted representation of the population mean
- By down-weighting women and up-weighting men we can obtain a more accurate picture of the population
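The weighting idea in the survey example can be worked through directly: each respondent is weighted by the inverse of the response rate in their group, that is 1000/500 = 2 for men and 1000/750 ≈ 1.33 for women. The sample and response counts below come from the example; the group means of the outcome are hypothetical, chosen only to show the effect.

```python
# Sample sizes and numbers of respondents from the survey example
sampled = {"men": 1000, "women": 1000}
responded = {"men": 500, "women": 750}

# Hypothetical mean outcome among respondents in each group
respondent_mean = {"men": 30.0, "women": 20.0}

# Inverse response-rate weights: each respondent stands in for this many sampled subjects
weights = {group: sampled[group] / responded[group] for group in sampled}

# Simple average over all 1250 respondents
unweighted_mean = (sum(responded[g] * respondent_mean[g] for g in sampled)
                   / sum(responded.values()))

# Weighted average: restores the 1000/1000 split of the original sample
weighted_mean = (sum(weights[g] * responded[g] * respondent_mean[g] for g in sampled)
                 / sum(weights[g] * responded[g] for g in sampled))

print(f"weights: men {weights['men']:.2f}, women {weights['women']:.2f}")
print(f"unweighted mean: {unweighted_mean:.1f}, weighted mean: {weighted_mean:.1f}")
```

The unweighted mean is pulled towards the women's value because women are over-represented among respondents; the weighted mean matches what an equal split of men and women would give, provided nonresponse within each sex is unrelated to the outcome.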


16 Values not missing at random (NMAR)
- The probability that values are missing depends on the missing values themselves
- E.g., the probability that weight Y is missing
  - is higher for the overweight (depends on Y)
  - is higher for women (depends on X1)
  - and sometimes X1 is missing, too
- Methods are available, but not covered today
