Epidemiological analysis PhD-course in epidemiology

1 Epidemiological analysis
PhD-course in epidemiology
Lau Caspar Thygesen, Associate professor, PhD
9 October 2012

Multivariate tables
Agenda today
- Age standardization
- Missing data

5 Age standardization
- Incidence and prevalence are strongly age-dependent
- Risks rising (e.g. chronic diseases) or declining (e.g. measles) with age
- Comparisons between populations and over time may be very misleading
- A single age-independent index representing a set of age-specific rates may be more appropriate

6 Mortality in Denmark and Greenland, men, 1975
[Table not recovered from the slide]
Please interpret this table.

Direct standardization
IR (DK standardized to the Greenlandic age distribution) = 0.016 * [...] * 66.5 = 3.8
(the intermediate terms are missing in the source)
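The arithmetic behind direct standardization can be sketched as follows. The slide's own table was not recovered, so the rates and the standard population below are hypothetical values chosen only to make the calculation easy to follow; the function name is likewise an assumption. The directly standardized rate is the weighted average of the study population's age-specific rates, with weights taken from the standard population's age distribution.

```python
def direct_standardized_rate(age_specific_rates, standard_weights):
    """Weighted average of age-specific rates, weighted by the standard population."""
    if len(age_specific_rates) != len(standard_weights):
        raise ValueError("need exactly one weight per age group")
    total_weight = sum(standard_weights)
    weighted_sum = sum(r * w for r, w in zip(age_specific_rates, standard_weights))
    return weighted_sum / total_weight


# Hypothetical values (NOT the Denmark/Greenland figures from the slide).
rates_per_1000 = [1.0, 2.0, 10.0, 50.0]  # deaths per 1000 person-years, by age group
standard_pop = [400, 300, 200, 100]      # persons per age group in the standard population

standardized = direct_standardized_rate(rates_per_1000, standard_pop)
print(round(standardized, 2))  # 8.0 per 1000 person-years
```

With these numbers the sum is (1*400 + 2*300 + 10*200 + 50*100) / 1000 = 8.0. Using a standard population with a younger age distribution would give a lower figure from the same age-specific rates, which is why two standardized rates are only comparable when they use the same standard.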

7 Indirect standardization
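The content of this slide did not survive, so the following is only a minimal sketch of the usual indirect-standardization calculation, with hypothetical numbers and function names. The age-specific rates of a reference population are applied to the study population's person-years to obtain the expected number of cases, and the observed count is divided by that expectation to give a standardized morbidity (or mortality) ratio.

```python
def expected_cases(person_years_by_age, reference_rates_by_age):
    """Cases expected if the study group had the reference population's rates."""
    return sum(py * rate for py, rate in zip(person_years_by_age, reference_rates_by_age))


def standardized_ratio(observed_cases, person_years_by_age, reference_rates_by_age):
    """Observed / expected ratio (SMR) for the study population."""
    return observed_cases / expected_cases(person_years_by_age, reference_rates_by_age)


# Hypothetical values for illustration only.
study_person_years = [2000.0, 3000.0, 1000.0]  # person-years per age group in the study group
reference_rates = [0.001, 0.004, 0.010]        # cases per person-year in the reference population
observed = 36                                  # cases observed in the study group

expected = expected_cases(study_person_years, reference_rates)
smr = standardized_ratio(observed, study_person_years, reference_rates)
print(round(expected, 3), round(smr, 3))  # 24.0 1.5
```

A ratio of 1.5 would mean 50% more cases than expected from the reference rates. Because the weights come from each study population's own age structure, ratios computed for two different study populations are each comparable with the reference, but not directly with each other.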

8 Example
Trend study of lung cancer incidence among women, Denmark
[Figure: Lung Cancer, Denmark, Women; series: ratecrude]

9 [Figure: Lung Cancer, Denmark, Women; series: ratecrude, segi, scand]
Example 2
Incidence of multiple sclerosis, Denmark
European Standard Population

10 Example: indirect standardization
- 19,185 subjects (3,817 women) who attended outpatient clinics for alcohol abusers, Copenhagen
- Compare incidence of heart disease with the incidence rate in the greater Copenhagen area

12 Problems
- Direct standardisation can produce unreliable estimates when the calculations are based on small numbers
- Indirectly standardised measures from different populations cannot be compared directly with each other; each can only be compared with the standard

Compared to regression methods
- Regression-based methods are available but are rarely applied in practice
- When individual data are available (presence/absence of disease, age and sex), a logistic regression can be used to estimate the standardized rate
- The main advantage is that it allows adjustment for continuous variables in addition to categorical variables

13 Missing data
- What does missing mean?
- The pattern of missingness (nomenclature)
- How and why is it missing?
- Methods for handling

Common in research
- Missing values
- Nonresponse
- Loss to follow-up
- Lack of overlap between linked data sets (not so common)

14 What is item nonresponse?
Unit Nonresponse vs. Item Nonresponse
[Two example data tables with columns ID, Q1, Q2, ...; missing entries are shown as "?"]

Unit Nonresponse
Examples
- Person who is not at home
- Person who does not pick up the phone
- Person who hangs up on you
- Rat that dies before the study
- The country you could not get data on
- etc.

15 Item Nonresponse
- "I don't know"
- Refusals to respond
- Questions left blank
- Failed measurement
- etc.

The best way to deal with missing data is not to have any

16 Minimizing Unit Nonresponse
- Call back if not home
- Refusal conversion
- Don't mess up
- Clear and understandable questionnaire
- Polite request
- Incentives

Minimizing Item Nonresponse
- Well-written questions
- Minimize misunderstandings (cross-cultural example)
- Standardized vs. non-standardized
- Minimize skip patterns

17 What kind of missing data should be modeled?
- Model an item if it is missing from your dataset but you suspect that it has a true value
- "I don't know" might simply mean "I don't know": don't model it as if there were a true value
- Dead people (attrition)

The pattern of missingness (nomenclature)
- Ignorable
  - MCAR: Missing Completely at Random
  - MAR: Missing at Random
- Non-ignorable
  - NMAR: Not Missing at Random

18 Missing completely at random
- Missing Completely at Random (MCAR): the missing values cannot be predicted any better from other information in the data
- Cause of missingness is a completely random process (like a coin flip)
- Cause is uncorrelated with the variables of interest
- Example: parents move
- No bias if the cause is omitted
- In the unlikely event that the data are missing completely at random, inferences based on complete cases are unbiased but inefficient, because we have lost some cases

Missing at random
- Missingness may be related to measured variables
- But there is no residual relationship with unmeasured variables
- No bias if you control for the measured variables
- For example, if the highly educated are more likely to participate in a survey, the data are missing at random as long as we know the educational level of all persons
- If data are missing at random, inferences based on complete cases are inefficient and can be biased unless the analysis accounts for the measured variables that predict missingness
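A small simulation can make the difference between the two mechanisms concrete. This is only a sketch under invented assumptions: the effect of education on the outcome, the response probabilities, the seed and all variable names are hypothetical. Education is known for everyone, which matches the slide's condition for missing at random.

```python
import random
from statistics import mean

random.seed(2012)  # fixed seed so the run is reproducible

# Simulated population: the outcome y is higher among the highly educated.
N = 20000
people = []
for _ in range(N):
    educated = random.random() < 0.5
    y = 10.0 + 4.0 * educated + random.gauss(0.0, 2.0)
    people.append((educated, y))

full_mean = mean(y for _, y in people)

# MCAR: everyone responds with the same probability (a coin flip).
mcar_cases = [(e, y) for e, y in people if random.random() < 0.5]
# MAR: the highly educated respond more often; education is observed for all.
mar_cases = [(e, y) for e, y in people if random.random() < (0.8 if e else 0.3)]

mcar_mean = mean(y for _, y in mcar_cases)
mar_mean = mean(y for _, y in mar_cases)

# Controlling for the measured variable: combine the respondents' stratum means
# using the education shares in the full sample.
share_educated = mean(1.0 if e else 0.0 for e, _ in people)
mean_high = mean(y for e, y in mar_cases if e)
mean_low = mean(y for e, y in mar_cases if not e)
mar_adjusted = share_educated * mean_high + (1.0 - share_educated) * mean_low

print(round(full_mean, 2), round(mcar_mean, 2), round(mar_mean, 2), round(mar_adjusted, 2))
```

In this setup the complete-case mean under MCAR stays close to the full-sample mean, the unadjusted complete-case mean under MAR is pulled upward because the highly educated are over-represented, and the education-adjusted estimate is close to the full-sample mean again.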

19 Missing not at random
- Non-ignorable / NMAR: the probability that a cell is missing depends on the unobserved value of the missing item
- For example, individuals' responses to income questions, where high-income people are more likely to refuse to answer survey questions about income and the other variables in the data set cannot predict which respondents have high income
- If your missing data are non-ignorable, inferences based on complete cases will be biased and inefficient

Classical Missing Data Treatments
Whatever you do, you are doing something
- Case deletion
  - Listwise (complete case analysis)
  - Pairwise (available case analysis)
- Indicator variable (dummy variable)
- Single imputation
  - (Unconditional) mean imputation
  - Conditional mean imputation (expected value)
- Weighting

20 Listwise Deletion and Multi-Item
- Excludes the whole case
- Default in most software
- Works if the mechanism is MCAR and if pattern and sample size allow (need to have enough complete cases)
- Can be biased

Pairwise Deletion
- An option for using all available information
- Correlation/covariance matrices
- Different calculations may be based on different populations
- Very unpredictable bias

21 Indicator method
- For each variable with missing values, create a missing-value indicator to accompany the variable in all analyses
- Assumes MCAR
- Even if the missing-value stratum is just a random sample of all subjects, that stratum will yield a confounded estimate of the exposure effect

Mean imputation
Technique
- Calculate the mean over cases that have values for Y
- Impute this mean where Y is missing
- Ditto for X_1, X_2, etc.
Problems
- Ignores relationships among X and Y
- Underestimates covariances

22 (Unconditional) Mean Imputation
[Scatterplots not reproduced; they are from Joe Schafer's website]

Mean imputation
- Standard errors too low
- CI difficult to calculate
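Why mean imputation gives standard errors that are too low can be checked with a few numbers. The values below are hypothetical and the sketch only looks at the standard error of a mean: every imputed value sits exactly at the observed mean, so the variance shrinks while the apparent sample size grows.

```python
from math import sqrt
from statistics import mean, variance

observed_y = [2.0, 4.0, 6.0, 8.0]  # hypothetical observed values
n_missing = 4                      # hypothetical number of cases with Y missing

# Unconditional mean imputation: fill every missing Y with the observed mean.
completed_y = observed_y + [mean(observed_y)] * n_missing

var_observed = variance(observed_y)    # sample variance of the observed values
var_completed = variance(completed_y)  # sample variance after imputation

se_observed = sqrt(var_observed / len(observed_y))
se_completed = sqrt(var_completed / len(completed_y))

print(round(var_observed, 3), round(var_completed, 3))  # 6.667 2.857
print(round(se_observed, 3), round(se_completed, 3))    # 1.291 0.598
```

The estimated mean is unchanged at 5.0, but the standard error is more than halved although no new information was added.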

23 Conditional mean imputation
Technique & implicit models
- If Y is missing, impute the mean of cases with similar values for X_1, X_2
    Y = b_0 + X_1 b_1 + X_2 b_2
- Likewise, if X_2 is missing, impute the mean of cases with similar values for X_1, Y
    X_2 = g_0 + X_1 g_1 + Y g_2
- If both Y and X_2 are missing, impute the means of cases with similar values for X_1
    Y = d_0 + X_1 d_1
    X_2 = f_0 + X_1 f_1
Problem
- Ignores random components (no e)
- Underestimates variances, se's

Imputation of Expected Value
- Good for creating expected values
- Bad for multivariate analysis
- Decreases standard errors
- Creates overconfident outcomes
- Increases probability of Type I error
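Conditional mean imputation with a single predictor can be sketched as a least-squares fit on the complete cases followed by prediction for the incomplete ones. The data and helper names below are hypothetical, and the model is the simplest case Y = d_0 + X_1 d_1 from the slide.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = d0 + d1 * x, fitted on complete cases."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    d1 = sxy / sxx
    d0 = y_bar - d1 * x_bar
    return d0, d1


# Hypothetical complete cases (X_1 and Y both observed).
x_complete = [1.0, 2.0, 3.0, 4.0]
y_complete = [2.1, 3.9, 6.2, 7.8]
# Hypothetical cases where Y is missing but X_1 is observed.
x_incomplete = [5.0, 6.0]

d0, d1 = fit_line(x_complete, y_complete)
# Each missing Y is replaced by its expected value given X_1; no residual e is added.
y_imputed = [d0 + d1 * xi for xi in x_incomplete]

print(round(d0, 2), round(d1, 2))        # 0.15 1.94
print([round(v, 2) for v in y_imputed])  # [9.85, 11.79]
```

The imputed points lie exactly on the fitted line, whereas the observed points scatter around it. Treating the imputed values as if they were observed therefore understates the residual variance, which is the problem the slide lists.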

24 Problem with single imputation
- Underestimates se's!
- Treats imputed values like observed values when they are actually less certain
- Ignores imputation variation

Imputation variation
Sampling variation
- If you take a different sample, you get different parameter estimates
- Standard errors reflect this
- One way to estimate sampling variation: measure variation across multiple samples (called bootstrapping)
Imputation variation
- If you impute different values, you get different parameter estimates
- Standard errors should reflect this, too
- One way to estimate imputation variation: measure variation across multiple imputed data sets (called multiple imputation)

25 Multiple Imputation
- Models both expected value and uncertainty
- Using the missing-data model you specify, it simulates and imputes the missing values multiple times, creating M complete datasets (M=5 is usually OK; it is a good idea to simulate more)
- Analyze each dataset independently
- Combine the results to get unbiased estimates

Example
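The combining step can be sketched with the standard pooling rules for multiple imputation (Rubin's rules). The slide does not spell these out, so this is an illustration, and the five estimates and variances below are hypothetical. The pooled estimate is the average of the M estimates; the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance.

```python
from math import sqrt
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Pool M point estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need M >= 2 estimates, each with its variance")
    pooled = mean(estimates)       # pooled point estimate
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # variation across imputations (divisor M - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, sqrt(total)


# Hypothetical results for one regression coefficient from M = 5 imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]  # squared standard errors

pooled, pooled_se = pool_estimates(estimates, variances)
print(round(pooled, 3), round(pooled_se, 3))  # 1.0 0.272
```

The pooled standard error (0.272) is larger than the square root of the average within-imputation variance (about 0.210), because it also carries the imputation variation that single imputation ignores.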

26 Multiple Imputation: Simple Procedure
1. Impute using PROC MI
3. Do analysis: PROC REG, LOGISTIC, etc., using "by _imputation_;" in the procedure
4. Combine results using PROC MIANALYZE

PROC MI
Typical syntax:
    proc mi data=bmx out=impdat seed=33155;
      var bmxbmi bmxht bmxwt bmxarmc bmxarml;
    run;
- data=  1 copy of data with missing values
- out=   5 copies of data with imputed values (will be different across copies)
- seed=  random seed; keep it the same to reproduce your results
- var    variables with missing values that you need imputed, variables in the model, and those that may be helpful with imputation

27 PROC MI Sample Output
[Output listing not reproduced]

PROC MI Options
- nimpute=5            number of imputations, default=5; 0 gives missing patterns
- minimum=, maximum=   set min & max; sometimes doesn't converge as well
- round=               round-off option

28 Output dataset
[Listing not reproduced]

Regression
Fit your model as if the data had no missing values, using "by _imputation_;"
    proc reg data=impdat outest=parmcov covout;
      model bmxbmi=bmxht bmxwt bmxarmc bmxarml;
      by _imputation_;
    run;
- You'll get nimpute (usually 5) sets of output
- Estimates, covariances, errors will be combined in MIANALYZE
- Need to generate parameter estimates and covariance data set (varies by procedure)

29 Parameter Est. & Covariance Matrix
    proc logistic data=impdat descending;
      model bmxbmi=bmxht bmxwt bmxarmc bmxarml /covb;
      by _imputation_;
      ods output ParameterEstimates=parmsdat CovB=covbdat;
    run;

    proc mixed data=impdat;
      model bmxbmi=bmxht bmxwt bmxarmc bmxarml /solution covb;
      by _imputation_;
      /* SolutionF holds the fixed-effect estimates and CovB their covariance matrix; */
      /* the CovParms table holds variance components, not the fixed effects.         */
      /* For MIXED output, read it in MIANALYZE with covb(effectvar=rowcol)=covbdat.  */
      ods output SolutionF=parmsdat CovB=covbdat;
    run;

Parameter Est. & Covariance Matrix
    proc genmod data=impdat;
      model bmxbmi=bmxht bmxwt bmxarmc bmxarml /covb;
      by _imputation_;
      ods output ParameterEstimates=parmsdat CovB=covbdat;
    run;

30 PROC MIANALYZE
Syntax depends on what procedure you used in the previous step:
    proc mianalyze data=parmcov;
(or)
    proc mianalyze parms=parmsdat covb=covbdat;
(or)
    proc mianalyze parms=parmsdat xpxi=xpxidat;
(then type this:)
      modeleffects intercept bmxht bmxwt bmxarmc bmxarml;
    run;
- Note that the var statement is now modeleffects
- Note that the dependent variable is omitted

PROC MIANALYZE Output
[Output listing not reproduced]

31 STATA
    * preparing dataset for multiple imputation
    mi query
    mi set mlong
    mi describe, detail
    mi register imputed total
    set seed
    mi impute mvn total = i.smoking i.isced4 i.samliv3 i.s57a_ i.alder4 i.gender, add(20) force
    mi describe, detail

    * rounding the imputed binary values to the nearest integer
    *replace bingedrinking = 0 if bingedrinking <0.5
    *replace bingedrinking = 1 if bingedrinking >0.5
    *replace change_new = round(change_new)

    * examination of imputations: comparing main descriptive statistics
    * from some imputations to those from the observed data
    mi xeq : summarize total

    mi estimate: xtmixed total i.gender group##month || username:, mle
    mi estimate: mean total, over(sex group month)

Weighted regression
- Suppose that a national survey sampled 2000 subjects, 1000 men and 1000 women
- The responses were 500 for men and 750 for women
- If there are large differences between men and women, a simple average of the 1250 responses will be a distorted representation of the population mean
- By down-weighting women and up-weighting men we can obtain an accurate picture of the population
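The weighting example can be worked through with the slide's sample sizes and response counts. The group means are hypothetical and were added only so that the two averages differ visibly. Each respondent is weighted by the inverse of the response rate in his or her group (sampled / responded).

```python
sampled = {"men": 1000, "women": 1000}     # from the slide
responded = {"men": 500, "women": 750}     # from the slide
group_mean = {"men": 30.0, "women": 20.0}  # hypothetical outcome means among respondents

# Inverse response-rate weights: 2.0 for men, about 1.33 for women.
weights = {g: sampled[g] / responded[g] for g in sampled}

# Simple average over the 1250 respondents: women are over-represented.
unweighted = sum(responded[g] * group_mean[g] for g in sampled) / sum(responded.values())

# Weighted average: each group counts in proportion to the original sample.
weighted_total = sum(weights[g] * responded[g] * group_mean[g] for g in sampled)
weighted_n = sum(weights[g] * responded[g] for g in sampled)
weighted = weighted_total / weighted_n

print(round(unweighted, 2), round(weighted, 2))  # 24.0 25.0
```

The weighted mean of 25.0 is what the full sample of 1000 men and 1000 women would give if nonrespondents resemble respondents within each sex, which is the assumption behind this kind of weighting.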

32 Values not missing at random (NMAR)
- The probability that values are missing depends on the missing values themselves
- e.g., the probability that weight Y is missing
  - is higher for the overweight (depends on Y)
  - is higher for women (depends on X_1), and sometimes X_1 is missing, too
- Methods are available, but not today!
