Variance Estimation in Presence of Imputation: an Application to an Istat Survey Data


Marco Di Zio, Stefano Falorsi, Ugo Guarnera, Orietta Luzi, Paolo Righi

1 Introduction

Imputation is the approach commonly used to compensate for missing or invalid values in sample surveys (Kalton and Kasprzyk, 1986). It consists in substituting predicted values for the unavailable or unacceptable ones. In Official Statistics, imputation is preferred to other treatments (such as analysis of complete data, analysis of available data, or weighting adjustment) for several reasons. In sample surveys it is usually desirable to produce complete and consistent sets of microdata, which allows the application of standard complete-data methods in subsequent analyses. Furthermore, the results obtained from different analyses are consistent with one another, unlike the results of analyses based on an incomplete data set. Finally, imputation allows the use of the same survey weight for all items, unlike the weighting adjustment method.

In this paper we focus on hot deck imputation (Little and Rubin, 2002). In hot deck methods, the missing or invalid data of a (partially) nonresponding unit are replaced by the observed values taken from a properly chosen "similar" sampling unit (donor). The accuracy of imputations can be improved by first forming imputation classes and then performing imputations separately within each of them. In random hot deck, the donor is randomly chosen among the responding units in the imputation cell. In nearest-neighbour donor imputation, the donor is the respondent in the imputation class that is most similar with respect to some covariates.

The popularity of hot deck methods is due to their attractive features. First of all, hot deck imputation can handle variables that are difficult to treat by explicit modelling, so it is expected to be more robust against departures from model assumptions than methods based on parametric models, such as ratio and regression imputation. Furthermore, since observed values are used for imputation, no synthetic values are imputed (as opposed to methods such as mean, ratio or regression imputation, which may produce nonsensical values).

The main problem with imputation is that it introduces an extra component of variability that must be considered in estimation. When imputed values are treated as if they were observed, the variance estimates do not include the variability due to imputation. As a result, the precision of estimates is overstated and subsequent statistical analyses can be misleading (e.g., confidence intervals have coverage lower than the nominal level).

The approaches proposed in the literature to obtain valid variance estimators for parameter estimates in the presence of imputed data include model-assisted techniques (Särndal et al., 1992), multiple imputation (Rubin, 1987), and resampling methods (Rao and Shao, 1992). Among nonparametric resampling techniques, the jackknife is very popular. A major advantage of the jackknife is that the variance of complicated estimators (e.g. nonlinear statistics) can be computed relatively easily, without the theoretical derivation of variance formulas required by other approaches. The naive jackknife, which applies the standard jackknife formulas and treats imputed values as if they were observed, underestimates the true variance. For imputed data, different jackknife estimators have been proposed for variance estimation under different imputation models, target parameters and sampling designs. Among others, Chen and Shao (2001) proposed asymptotically unbiased and consistent jackknife variance estimators for the sample mean in stratified multistage surveys under nearest-neighbour hot deck imputation; Saigo and Sitter (2005) proposed consistent reimputation jackknife variance estimators for sample totals and means in stratified multistage samples under ratio and regression imputation.

In general, the jackknife is a computationally intensive technique, and its application may be prohibitive, especially for large scale surveys. To overcome this limitation, Kott (2001) proposed, for the case of complete data, a revised version of the jackknife procedure that is computationally feasible. The first version of this method, suitable when there is a large number of sampled units per stratum, is the delete-a-group jackknife (DAGJK). It was extended by Kott to deal with situations where the number of sampled units per stratum is less than the number of jackknife replicates (EDAGJK). In this paper we propose an adjusted version of DAGJK (AD-DAGJK) and of EDAGJK (AD-EDAGJK) to deal with variance estimation in the presence of imputed data. In particular, we focus on random hot deck imputation. The adjustment is analogous to that proposed by Rao and Shao (1992) for jackknife variance estimation under random hot deck imputation.
In the paper, the AD-DAGJK and AD-EDAGJK variance estimators are compared with the adjusted jackknife of Rao and Shao (AJ) and with the standard variance estimation formula that treats imputed data as if they were observed, thus ignoring the extra component of variability due to imputation. The comparison is based on a Monte Carlo experiment in which 500 samples are drawn from the 1991 Population Census according to the Italian Labour Force Survey sampling design. For all the items of interest, a controlled percentage of item nonresponse is simulated using a Missing at Random (MAR) nonresponse mechanism. Missing values are imputed using a stratified random hot deck. The evaluation is based on the relative bias, the relative root mean squared error, the variance, and the coverage and length of the 95% confidence intervals.

The paper is structured as follows. Sections 2 and 3 describe the resampling variance estimation methods. Section 4 describes the simulation study, and Section 5 illustrates the results. Conclusions and future work are reported in Section 6.

2 The adjusted jackknife under random hot deck imputation

Let $A$ be the full sample, and let $A_R$ and $A_M$ denote the subsets of respondents and nonrespondents, respectively. Assume that the sample $A$ of size $n$ is obtained through a stratified sample design with strata $h = 1, \dots, L$ and $n_h$ observations in the $h$th stratum. The sampling weights are denoted by $w_k$ for $k = 1, \dots, n$. Missing items are treated via random hot deck (HD) imputation within classes, i.e., for each missing item we randomly select an observed value within the same class (imputation cell). The imputation cells should be defined so that the missing-data mechanism can be considered missing completely at random (MCAR) within each imputation cell. In the following, the hot deck cells are indexed by $g = 1, \dots, G$, $n_{Rg}$ is the number of respondents ($A_{Rg}$) in the $g$th cell, and the imputed values are denoted by $y_k^{*}$ for $k \in A_M$.

Rao and Shao (1992) proposed an adjusted jackknife (AJ) variance estimator under HD imputation that is consistent assuming a uniform response mechanism within each imputation cell. For a stratified multistage sampling design with negligible finite population correction factor, the AJ is

$$\mathrm{Var}(\hat{Y}_I) = \sum_{h=1}^{L} \frac{n_h - 1}{n_h} \sum_{j=1}^{n_h} \left( \hat{Y}_{Ih}^{(j)} - \hat{Y}_I \right)^2 \qquad (1)$$

where

$$\hat{Y}_I = \sum_{h=1}^{L} \left( \sum_{k \in A_{Rh}} w_k y_k + \sum_{k \in A_{Mh}} w_k y_k^{*} \right) \qquad (2)$$

is the estimate of the total computed on observed and imputed data. The term $\hat{Y}_{Ih}^{(j)}$ in equation (1) is the estimate of the total when the $j$th unit of stratum $h$ is omitted (in a two-stage sampling design, we refer to the PSU $j \in h$):

$$\hat{Y}_{Ih}^{(j)} = \sum_{g=1}^{G} \left[ \sum_{k \in A_{Rg}} w_k^{(j)} y_k + \sum_{k \in A_{Mg}} w_k^{(j)} \left( y_k^{*} + \hat{y}_{Rg}^{(j)} - \bar{y}_{Rg} \right) \right] \qquad (3)$$

where $w_k^{(j)}$ is the sampling weight of unit $k$ adjusted to account for the omission of unit $j$, $\hat{y}_{Rg}^{(j)} = \sum_{k \in A_{Rg}} w_k^{(j)} y_k / \sum_{k \in A_{Rg}} w_k^{(j)}$ and $\bar{y}_{Rg} = \sum_{k \in A_{Rg}} w_k y_k / \sum_{k \in A_{Rg}} w_k$.

3 Delete-A-Group Jackknife and the Extended method

The use of the jackknife in survey statistics has many advantages: no need for model assumptions, straightforward variance computation for nonlinear statistics, and easy calculation of domain estimates. Furthermore, unit nonresponse is easily dealt with and, by means of AJ, item nonresponse can be dealt with as well. Nevertheless, the jackknife is computer intensive, and its application is generally not feasible for large scale surveys. In this section we describe the delete-a-group jackknife (DAGJK) and the extended DAGJK (EDAGJK) methods. These techniques are computationally less intensive than the classical jackknife and can also be applied to large scale surveys.

3.1 Variance estimation in case of complete data

DAGJK and EDAGJK (Kott, 2001) belong to the group of strategies that aim at reducing the number of jackknife replications while maintaining adequate precision of the variance estimates. These methods are based on the following jackknife procedure:

1. the Primary Sampling Units (PSUs) are randomly ordered in each stratum;
2. the PSUs are systematically allocated into R groups;
3. for each unit, R different sampling weights (replicate sampling weights) are computed as follows:

for DAGJK

$$w_k^{(r)} = \begin{cases} w_k, & \text{when } k \in h \text{ and no PSU of } h \text{ belongs to group } r \\ 0, & \text{when PSU } k \text{ is in group } r \\ \left[ n_h / (n_h - n_h^{(r)}) \right] w_k, & \text{otherwise} \end{cases} \qquad (4)$$

for EDAGJK

$$w_k^{(r)} = \begin{cases} w_k, & \text{when } k \in h \text{ and no PSU of } h \text{ belongs to group } r \\ w_k \left[ 1 - (n_h - 1) Z \right], & \text{when PSU } k \text{ is in group } r \\ w_k (1 + Z), & \text{otherwise} \end{cases} \qquad (5)$$

where $Z^2 = R / \left[ (R - 1)\, n_h (n_h - 1) \right]$.

Given the replicate weights expressed by formula (4) for DAGJK or by formula (5) for EDAGJK, the variance estimator is

$$\mathrm{Var}(\hat{Y}) = \frac{R - 1}{R} \sum_{r=1}^{R} \left( \hat{Y}^{(r)} - \hat{Y} \right)^2 \qquad (6)$$

where $\hat{Y}^{(r)} = \sum_{k \in s} w_k^{(r)} y_k$.

One important characteristic of DAGJK and EDAGJK is that the variance estimates improve as the number of random groups increases. A key aspect that distinguishes the two methods is their behaviour with respect to the number of sampled PSUs in the strata. In particular, DAGJK provides upward biased estimates when this number is small, while EDAGJK is expected to fix this problem, as shown in Kott (2001).

3.2 DAGJK and EDAGJK variance estimation with Rao and Shao adjustment for imputation

In order to take item nonresponse into account in variance estimation, we propose an adjusted version of DAGJK (AD-DAGJK) and of EDAGJK (AD-EDAGJK) based on the Rao and Shao adjustment for HD imputation. The proposed variance estimator, which combines the estimator in formula (6) with the jackknife adjustments described in formulas (2) and (3), is

$$\mathrm{Var}(\hat{Y}_I) = \frac{R - 1}{R} \sum_{r=1}^{R} \left( \hat{Y}_I^{(r)} - \hat{Y}_I \right)^2 \qquad (7)$$

where, analogously to formula (3), $\hat{Y}_I^{(r)}$ is defined as

$$\hat{Y}_I^{(r)} = \sum_{g=1}^{G} \left[ \sum_{i \in A_{Rg}} w_i^{(r)} y_i + \sum_{i \in A_{Mg}} w_i^{(r)} \left( y_i^{*} + \hat{y}_{Rg}^{(r)} - \bar{y}_{Rg} \right) \right] \qquad (8)$$

where $w_i^{(r)}$ are the replicate weights computed according to formula (4) or (5), $\hat{y}_{Rg}^{(r)} = \sum_{i \in A_{Rg}} w_i^{(r)} y_i / \sum_{i \in A_{Rg}} w_i^{(r)}$, and $\hat{Y}_I$ is defined in formula (2).

In the case of incomplete data and HD imputation, the theoretical properties of AD-DAGJK and AD-EDAGJK have yet to be investigated. In the next section, we study the properties of these methods empirically.
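As an illustration of formulas (4) to (8), the following Python sketch builds the replicate weights and evaluates both the complete-data estimator (6) and the adjusted estimator (7). It is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, $n_h^{(r)}$ is read as the number of PSUs of stratum $h$ that fall in group $r$, the systematic allocation simply carries the group count over from one stratum to the next, every stratum is assumed to have at least two PSUs, and formula (5) is applied as written to every stratum. The toy data have one unit per PSU.

```python
import math
import random
from collections import defaultdict


def assign_groups(psu_stratum, R, seed=0):
    """Steps 1-2: shuffle the PSUs within each stratum, then deal them into
    R groups in turn (the count carries over from stratum to stratum)."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for psu, h in psu_stratum.items():
        by_stratum[h].append(psu)
    group, position = {}, 0
    for h in sorted(by_stratum):
        psus = sorted(by_stratum[h])
        rng.shuffle(psus)
        for psu in psus:
            group[psu] = position % R
            position += 1
    return group


def replicate_weights(units, group, R, method="EDAGJK"):
    """Step 3: R replicate weight vectors from formula (4) or (5).
    units: list of (stratum, psu, weight) tuples; assumes n_h >= 2."""
    psu_stratum = {psu: h for h, psu, _ in units}
    n_h, n_hr = defaultdict(int), defaultdict(int)
    for psu, h in psu_stratum.items():
        n_h[h] += 1
        n_hr[h, group[psu]] += 1
    reps = []
    for r in range(R):
        w_r = []
        for h, psu, w in units:
            in_r = group[psu] == r
            if n_hr[h, r] == 0:        # no PSU of stratum h is in group r
                w_r.append(w)
            elif method == "DAGJK":    # formula (4)
                w_r.append(0.0 if in_r else w * n_h[h] / (n_h[h] - n_hr[h, r]))
            else:                      # formula (5)
                z = math.sqrt(R / ((R - 1) * n_h[h] * (n_h[h] - 1)))
                w_r.append(w * (1 - (n_h[h] - 1) * z) if in_r else w * (1 + z))
        reps.append(w_r)
    return reps


def total(y, w):
    return sum(wk * yk for wk, yk in zip(w, y))


def variance_complete(y, w, reps):
    """Formula (6)."""
    R = len(reps)
    y_hat = total(y, w)
    return (R - 1) / R * sum((total(y, w_r) - y_hat) ** 2 for w_r in reps)


def adjusted_total(y, w, w_r, cell, imputed):
    """Formula (8): each imputed value is shifted by the change in the weighted
    respondent mean of its cell between replicate and full-sample weights."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    for yk, wk, wrk, g, m in zip(y, w, w_r, cell, imputed):
        if not m:
            s = sums[g]
            s[0] += wrk * yk
            s[1] += wrk
            s[2] += wk * yk
            s[3] += wk
    out = 0.0
    for yk, wrk, g, m in zip(y, w_r, cell, imputed):
        if m:
            s = sums[g]
            # Guard (an assumption of this sketch): no shift when the replicate
            # leaves the cell without positively weighted respondents.
            shift = s[0] / s[1] - s[2] / s[3] if s[1] > 0 else 0.0
            out += wrk * (yk + shift)
        else:
            out += wrk * yk
    return out


def variance_adjusted(y, w, reps, cell, imputed):
    """Formula (7)."""
    R = len(reps)
    y_hat_i = total(y, w)
    return (R - 1) / R * sum(
        (adjusted_total(y, w, w_r, cell, imputed) - y_hat_i) ** 2 for w_r in reps)


# Toy data: 3 strata with 2 PSUs each, one unit per PSU, R = 6 groups.
units = [(h, f"{h}-{j}", 10.0) for h in range(3) for j in range(2)]
group = assign_groups({psu: h for h, psu, _ in units}, R=6)
y = [1, 0, 1, 1, 0, 1]
w = [u[2] for u in units]
reps = replicate_weights(units, group, R=6)
v = variance_complete(y, w, reps)
```

With no imputed values the shift in formula (8) never applies, so `variance_adjusted` reduces to `variance_complete`. This gives a simple consistency check on any implementation.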

4 Description of a Monte Carlo simulation study on real survey data

The AD-DAGJK and AD-EDAGJK are evaluated by means of a Monte Carlo simulation study based on real survey data. A comparison with the standard Horvitz-Thompson variance estimation (which does not take the imputation process into account) and with the AJ estimation under HD imputation is shown. In the experiment, 500 samples are selected from the 1991 Italian Population Census data. The target population of the experiment is the geographical region Lazio, excluding the province of Rome (1,372,572 units). The samples are drawn according to the Italian Labour Force sampling design, which is summarized in the following steps:

- the municipalities of each province are ordered by population size and strata of municipalities with population size equal to a given threshold are formed. Strata with only one municipality are referred to as self-representing (S-R). There are 7 S-R strata;
- in each S-R stratum a sample of households (PSUs) of size proportional to the population is selected (stratified cluster design);
- in each non S-R stratum (NS-R) a pps sample of municipalities (PSUs) of size 2 is drawn, and a sample of households is selected (two-stage stratified design). There are 18 NS-R strata;
- the number of sampled PSUs (municipalities and households) is 552.

We are interested in estimating the total number of employed and unemployed persons. Missing items are introduced in the variable employment (employed/not employed) according to a missing at random (MAR) mechanism. The nonresponse probabilities depend on two observed variables: $X_1$ (levels: 1, 2, 3, 4), referring to the household's type, and $X_2$, an indicator variable of whether the unit belongs to an S-R or an NS-R stratum. The nonresponse probabilities are reported in Table 1. Missing items are imputed by means of HD within imputation cells corresponding to the classes used for generating nonresponse (see Table 1).

Table 1: Missing rates for the simulated nonresponse mechanism

        X1 = 1   X1 = 2   X1 = 3   X1 = 4
NS-R    10%      20%      30%      40%
S-R     40%      30%      20%      10%

It is worth noting that jackknife methods are nearly unbiased when the number of sampled PSUs in each stratum is large. Kott (1998, 2001) suggests that each stratum should have at least 5 PSUs. Therefore, the AD-DAGJK and AJ methods do not seem to be appropriate for this sampling design, in which the NS-R strata have only two PSUs each, while the AD-EDAGJK is expected to overcome this problem. An additional problem is that a non-negligible PSU sampling fraction can cause an upward bias of the variance estimates. We are in this critical situation since, as shown in Table 2, 5 NS-R strata have a sampling fraction greater than 40%.
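The nonresponse and imputation step described above can be sketched in Python as follows. This is only an illustrative sketch: the record layout (`x1`, `x2`, `employed`) and the function names are hypothetical, the toy sample is synthetic, and donors are drawn with equal probability and with replacement, which is one possible reading of "random hot deck" that the text does not spell out.

```python
import random

# Nonresponse probabilities from Table 1, keyed by (stratum type, X1 level).
MISSING_RATE = {
    ("NS-R", 1): 0.10, ("NS-R", 2): 0.20, ("NS-R", 3): 0.30, ("NS-R", 4): 0.40,
    ("S-R", 1): 0.40, ("S-R", 2): 0.30, ("S-R", 3): 0.20, ("S-R", 4): 0.10,
}


def simulate_nonresponse(records, rng):
    """MAR mechanism: blank out 'employed' with the Table 1 probability of the
    record's (X2, X1) cell; the probability never depends on 'employed'."""
    out = []
    for rec in records:
        rec = dict(rec)
        if rng.random() < MISSING_RATE[rec["x2"], rec["x1"]]:
            rec["employed"] = None
        out.append(rec)
    return out


def random_hot_deck(records, rng):
    """Fill each missing value with the value of a donor drawn at random
    (equal probabilities, with replacement: an assumption of this sketch)
    among the respondents of the same imputation cell."""
    donors = {}
    for rec in records:
        if rec["employed"] is not None:
            donors.setdefault((rec["x2"], rec["x1"]), []).append(rec["employed"])
    out = []
    for rec in records:
        rec = dict(rec)
        rec["imputed"] = rec["employed"] is None
        if rec["imputed"]:
            # Raises KeyError if the cell has no respondent at all.
            rec["employed"] = rng.choice(donors[rec["x2"], rec["x1"]])
        out.append(rec)
    return out


# Synthetic toy sample: 8 cells with 50 records each.
rng = random.Random(1)
sample = [
    {"x1": 1 + i % 4, "x2": "S-R" if (i // 4) % 2 else "NS-R", "employed": i % 3 == 0}
    for i in range(400)
]
with_gaps = simulate_nonresponse(sample, rng)
completed = random_hot_deck(with_gaps, rng)
```

The `imputed` flag is kept on every record because the adjusted variance estimators need to know which values were observed and which were donated.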

Table 2: Frequency of NS-R strata by PSU sampling fraction
(columns: < 20%, 20%-40%, 40%-60%, > 60%, Total; row: Frequency; the cell values are not available in this transcription)

5 Results

The first step of the analysis tests the effect of the number of random groups on the variance estimates obtained via EDAGJK and AD-EDAGJK. Two main quality indicators are considered: the Relative Bias (RB) and the Relative Root Mean Squared Error (RRMSE). The two indicators are computed in two cases: in the absence of nonresponse (case A), and with nonresponse treated by HD imputation (case B). In both cases, we need a true reference variance for the evaluation.

In case A, this is computed by the following steps:
- draw 5000 samples according to the sample design previously described;
- for each sample compute the Horvitz-Thompson estimate;
- compute the variance of the 5000 estimates.

In case B:
- draw 5000 samples according to the sample design previously described;
- in each sample, simulate nonresponse according to the previously described mechanism;
- impute each incomplete sample through HD;
- compute the Horvitz-Thompson estimate for each imputed sample;
- compute the variance of the 5000 estimates.

In both cases, we denote the reference variance by $\mathrm{Var}(\tilde{Y})$. The RB indicator is expressed by

$$RB = \frac{1}{500} \sum_{c=1}^{500} \frac{[\mathrm{Var}(\tilde{Y})]_{(c)} - \mathrm{Var}(\tilde{Y})}{\mathrm{Var}(\tilde{Y})} \qquad (9)$$

where $[\mathrm{Var}(\tilde{Y})]_{(c)}$ is the EDAGJK or AD-EDAGJK variance estimate, in case A and B respectively, obtained in the $c$th Monte Carlo iteration. We remark that, in case A, $\tilde{Y}$ indicates $\hat{Y}$, while in case B $\tilde{Y}$ is equal to $\hat{Y}_I$. The RRMSE is given by

$$RRMSE = \sqrt{ \frac{1}{500} \sum_{c=1}^{500} \frac{ \left\{ [\mathrm{Var}(\tilde{Y})]_{(c)} - \mathrm{Var}(\tilde{Y}) \right\}^2 }{ [\mathrm{Var}(\tilde{Y})]^2 } } \qquad (10)$$
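The two indicators can be computed from the Monte Carlo output with a few lines of Python. The sketch below is a hedged illustration: the function names are hypothetical, the RRMSE is read as the root of the mean squared deviation divided by the reference variance, and the reference variance uses the sample-variance divisor (C - 1), which the text does not specify.

```python
import math
import statistics


def reference_variance(point_estimates):
    """Variance of the point estimates over the reference samples.
    The divisor (C - 1) is an assumption of this sketch."""
    return statistics.variance(point_estimates)


def relative_bias(var_estimates, reference_var):
    """Formula (9): mean relative deviation of the variance estimates."""
    c = len(var_estimates)
    return sum(v - reference_var for v in var_estimates) / (c * reference_var)


def relative_rmse(var_estimates, reference_var):
    """Formula (10): root mean squared deviation, relative to the reference."""
    c = len(var_estimates)
    mse = sum((v - reference_var) ** 2 for v in var_estimates) / c
    return math.sqrt(mse) / reference_var


# Two variance estimates scattered symmetrically around a reference of 100:
# the bias cancels out, while the RRMSE does not.
rb = relative_bias([90.0, 110.0], 100.0)
rrmse = relative_rmse([90.0, 110.0], 100.0)
```

The toy values show why both indicators are reported: an estimator can be unbiased on average and still vary widely from sample to sample.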

Table 3 shows that in both cases A and B the number of random groups has only a slight influence on the RB values. The minimum RB is attained with 5 random groups. In general, the imputation process increases the bias of the variance estimates. The RRMSE depends more strongly than the RB on the number of random groups: it decreases as the number of random groups increases. 30 random groups seem a good compromise between the quality of the variance estimates and the computational effort. In the following, the comparative analysis is carried out using 30 random groups.

Table 3: RB and RRMSE of EDAGJK and AD-EDAGJK variance estimators
(columns: R; without missing data: RB, RRMSE; with imputed data: RB, RRMSE; the cell values are not available in this transcription)

Figure 1 depicts the boxplots of the variance estimates obtained in case A through EDAGJK, DAGJK, the standard Horvitz-Thompson variance estimator (STANDARD), and the jackknife method. The dashed line represents the true reference variance. EDAGJK and STANDARD are nearly unbiased estimators. Nevertheless, the STANDARD method gives slightly more efficient variance estimates than EDAGJK. The poor performance of DAGJK and the jackknife may be caused by the sampling design, which has many strata with only two PSUs (NS-R strata): as described in Kott (2001), when the stratum sample size is smaller than the number of random groups, DAGJK has an upward bias. This is also true for the jackknife, which may be viewed as a particular case of DAGJK where the number of random groups equals the overall number of PSUs. On the other hand, the greater number of random groups in the jackknife produces a more efficient estimate than DAGJK.

Figure 2 shows the boxplots of the variance estimates obtained in case B through AD-EDAGJK, AD-DAGJK, STANDARD, and AJ. We may still observe that the proposed AD-EDAGJK method is nearly unbiased, while the STANDARD method, which ignores the adjustment for imputation, produces downward biased estimates. AD-DAGJK and AJ show an upward bias. Hence, the results for case B indicate that AD-EDAGJK outperforms the other methods under study.

The last step of the analysis compares the methods in terms of the coverage rate of the 95% confidence interval (Table 4). In case A, Table 4 shows that the coverage rates of the confidence intervals are generally close to the 95% level. The less biased estimators, EDAGJK and STANDARD, have coverage rates around 91%, while the coverage rates greater than 97% for DAGJK and the jackknife are due to the larger length of the confidence intervals. In case B, the STANDARD method attains an 88% coverage rate, very far from the nominal one. The proposed AD-EDAGJK produces the coverage rate (92.5%) closest to the nominal level, with a quite small relative length of the confidence interval.
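The coverage comparison can be sketched as follows. The text does not state how the confidence intervals or their relative length are defined, so this Python sketch makes two explicit assumptions: intervals are normal-based (estimate plus or minus 1.96 standard errors), and the relative length is the interval length divided by the true total. The function name `ci_summary` is hypothetical.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval (assumed CI form)


def ci_summary(estimates, var_estimates, true_total):
    """Return (coverage rate, mean relative length) over Monte Carlo samples.
    Relative length = interval length / true total (an assumption)."""
    hits, rel_len = 0, 0.0
    for est, var in zip(estimates, var_estimates):
        half = Z95 * math.sqrt(var)
        hits += est - half <= true_total <= est + half
        rel_len += 2 * half / true_total
    n = len(estimates)
    return hits / n, rel_len / n


# Two samples with the same point estimate: the larger variance estimate
# yields an interval that covers the true total, the smaller one does not.
coverage, mean_rl = ci_summary([100.0, 100.0], [25.0, 1.0], 105.0)
```

The toy case mirrors the pattern discussed above: a variance estimator that is biased downward shortens the intervals and lowers coverage, while an upward bias raises coverage at the cost of longer intervals.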

Figure 1: Boxplots of the variance estimation methods without missing values

Figure 2: Boxplots of the variance estimation methods with imputed data

Table 4: 95% confidence interval (CI) coverage and CI relative length (RL) by method
(columns: without missing data: 95% CI coverage, CI RL; with imputed data: 95% CI coverage, CI RL; rows: EDAGJK/AD-EDAGJK, DAGJK/AD-DAGJK, STANDARD, JACKKNIFE/AJ; the cell values are not available in this transcription)

6 Conclusion

Imputation is commonly used to compensate for missing or invalid values in sample surveys, and variance estimation that does not take imputation into account can produce misleading measures of the quality of the estimates. This problem is becoming pressing in Official Statistics. In the case of large scale surveys, the choice of a suitable variance estimation method must also take computational feasibility into account. The proposed approach, based on EDAGJK with the Rao and Shao adjustment, balances quality with computational cost: in the Labour Force Survey Monte Carlo experiment it shows good properties in terms of precision of the variance estimates while remaining computationally feasible. Furthermore, the empirical results suggest that the approach is suitable for the complex designs generally used in Official Statistics. Future analyses may be developed in the following directions: investigation of theoretical properties; study of the effect of introducing a finite population correction factor; application of the method with calibration estimators.

References

Chen, J., Shao, J., (2000). Nearest Neighbor Imputation for Survey Data. Journal of Official Statistics, 16.
Chen, J., Shao, J., (2001). Jackknife Variance Estimation for Nearest-Neighbour Imputation. Journal of the American Statistical Association, 96.
Kalton, G., Kasprzyk, D., (1986). The Treatment of Missing Survey Data. Survey Methodology, 12.
Kott, P. S., (2001). The delete-a-group jackknife. Journal of Official Statistics, 17.
Lee, H., Rancourt, E., Särndal, C.-E., (2002). Variance Estimation from Survey Data under Single Value Imputation, in Survey Nonresponse, Groves, R. et al. (eds.), J. Wiley and Sons, New York.
Little, R.J.A., Rubin, D.B., (2002). Statistical Analysis with Missing Data. Wiley, New York.
Rancourt, E., Särndal, C.-E., Lee, H., (1994). Estimation of the variance in presence of Nearest Neighbour Imputation. Proceedings of the Section on Survey Research Methods, American Statistical Association.
Rao, J.N.K., Shao, J., (1992). Jackknife variance estimation with survey data under hot deck imputation. Biometrika, 79, 4.
Rubin, D.B., (1987). Multiple imputation for nonresponse in surveys. Wiley, New York.
Saigo, H., Sitter, R.R., (2005). Jackknife variance estimator with reimputation for randomly imputed survey data. Statistics and Probability Letters, 73.
Särndal, C.-E., Swensson, B., Wretman, J., (1992). Model assisted survey sampling. Springer-Verlag, New York.


More information

SOS3003 Applied data analysis for social science Lecture note Erling Berge Department of sociology and political science NTNU.

SOS3003 Applied data analysis for social science Lecture note Erling Berge Department of sociology and political science NTNU. SOS3003 Applied data analysis for social science Lecture note 04-2009 Erling Berge Department of sociology and political science NTNU Erling Berge 2009 1 Missing data Literature Allison, Paul D 2002 Missing

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION Introduction CHAPTER 1 INTRODUCTION Mplus is a statistical modeling program that provides researchers with a flexible tool to analyze their data. Mplus offers researchers a wide choice of models, estimators,

More information

MISSING DATA AND MULTIPLE IMPUTATION

MISSING DATA AND MULTIPLE IMPUTATION Paper 21-2010 An Introduction to Multiple Imputation of Complex Sample Data using SAS v9.2 Patricia A. Berglund, Institute For Social Research-University of Michigan, Ann Arbor, Michigan ABSTRACT This

More information

Comparative Evaluation of Synthetic Dataset Generation Methods

Comparative Evaluation of Synthetic Dataset Generation Methods Comparative Evaluation of Synthetic Dataset Generation Methods Ashish Dandekar, Remmy A. M. Zen, Stéphane Bressan December 12, 2017 1 / 17 Open Data vs Data Privacy Open Data Helps crowdsourcing the research

More information

Improved Sampling Weight Calibration by Generalized Raking with Optimal Unbiased Modification

Improved Sampling Weight Calibration by Generalized Raking with Optimal Unbiased Modification Improved Sampling Weight Calibration by Generalized Raking with Optimal Unbiased Modification A.C. Singh, N Ganesh, and Y. Lin NORC at the University of Chicago, Chicago, IL 663 singh-avi@norc.org; nada-ganesh@norc.org;

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 13: The bootstrap (v3) Ramesh Johari ramesh.johari@stanford.edu 1 / 30 Resampling 2 / 30 Sampling distribution of a statistic For this lecture: There is a population model

More information

SAS/STAT 14.2 User s Guide. The SURVEYIMPUTE Procedure

SAS/STAT 14.2 User s Guide. The SURVEYIMPUTE Procedure SAS/STAT 14.2 User s Guide The SURVEYIMPUTE Procedure This document is an individual chapter from SAS/STAT 14.2 User s Guide. The correct bibliographic citation for this manual is as follows: SAS Institute

More information

A Discussion of Weighting Procedures for Unit Nonresponse

A Discussion of Weighting Procedures for Unit Nonresponse Journal of Official Statistics, Vol. 32, No. 1, 2016, pp. 129 145, http://dx.doi.org/10.1515/jos-2016-0006 A Discussion of Weighting Procedures for Unit Nonresponse David Haziza 1 and Éric Lesage 2 Weighting

More information

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a Chapter 3 Missing Data 3.1 Types of Missing Data ˆ Missing completely at random (MCAR) ˆ Missing at random (MAR) a ˆ Informative missing (non-ignorable non-response) See 1, 38, 59 for an introduction to

More information

Correctly Compute Complex Samples Statistics

Correctly Compute Complex Samples Statistics SPSS Complex Samples 15.0 Specifications Correctly Compute Complex Samples Statistics When you conduct sample surveys, use a statistics package dedicated to producing correct estimates for complex sample

More information

Ronald H. Heck 1 EDEP 606 (F2015): Multivariate Methods rev. November 16, 2015 The University of Hawai i at Mānoa

Ronald H. Heck 1 EDEP 606 (F2015): Multivariate Methods rev. November 16, 2015 The University of Hawai i at Mānoa Ronald H. Heck 1 In this handout, we will address a number of issues regarding missing data. It is often the case that the weakest point of a study is the quality of the data that can be brought to bear

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup Random Variables: Y i =(Y i1,...,y ip ) 0 =(Y i,obs, Y i,miss ) 0 R i =(R i1,...,r ip ) 0 ( 1

More information

EFFECTS OF ADJUSTMENTS FOR WAVE NONRESPONSE ON PANEL SURVEY ESTIMATES. 1. Introduction

EFFECTS OF ADJUSTMENTS FOR WAVE NONRESPONSE ON PANEL SURVEY ESTIMATES. 1. Introduction EFFECTS OF ADJUSTMENTS FOR WAVE NONRESPONSE ON PANEL SURVEY ESTIMATES Graham Kalton and Michael E. Miller, University of Michigan 1. Introduction Wave nonresponse occurs in a panel survey when a unit takes

More information

Acknowledgments. Acronyms

Acknowledgments. Acronyms Acknowledgments Preface Acronyms xi xiii xv 1 Basic Tools 1 1.1 Goals of inference 1 1.1.1 Population or process? 1 1.1.2 Probability samples 2 1.1.3 Sampling weights 3 1.1.4 Design effects. 5 1.2 An introduction

More information

Machine Learning: An Applied Econometric Approach Online Appendix

Machine Learning: An Applied Econometric Approach Online Appendix Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail

More information

Machine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017

Machine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017 Machine Learning in the Wild Dealing with Messy Data Rajmonda S. Caceres SDS 293 Smith College October 30, 2017 Analytical Chain: From Data to Actions Data Collection Data Cleaning/ Preparation Analysis

More information

Multiple-imputation analysis using Stata s mi command

Multiple-imputation analysis using Stata s mi command Multiple-imputation analysis using Stata s mi command Yulia Marchenko Senior Statistician StataCorp LP 2009 UK Stata Users Group Meeting Yulia Marchenko (StataCorp) Multiple-imputation analysis using mi

More information

HILDA PROJECT TECHNICAL PAPER SERIES No. 2/08, February 2008

HILDA PROJECT TECHNICAL PAPER SERIES No. 2/08, February 2008 HILDA PROJECT TECHNICAL PAPER SERIES No. 2/08, February 2008 HILDA Standard Errors: A Users Guide Clinton Hayes The HILDA Project was initiated, and is funded, by the Australian Government Department of

More information

Section 4 Matching Estimator

Section 4 Matching Estimator Section 4 Matching Estimator Matching Estimators Key Idea: The matching method compares the outcomes of program participants with those of matched nonparticipants, where matches are chosen on the basis

More information

Comparison of Hot Deck and Multiple Imputation Methods Using Simulations for HCSDB Data

Comparison of Hot Deck and Multiple Imputation Methods Using Simulations for HCSDB Data Comparison of Hot Deck and Multiple Imputation Methods Using Simulations for HCSDB Data Donsig Jang, Amang Sukasih, Xiaojing Lin Mathematica Policy Research, Inc. Thomas V. Williams TRICARE Management

More information

Paper CC-016. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by

Paper CC-016. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by Paper CC-016 A macro for nearest neighbor Lung-Chang Chien, University of North Carolina at Chapel Hill, Chapel Hill, NC Mark Weaver, Family Health International, Research Triangle Park, NC ABSTRACT SAS

More information

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS ABSTRACT Paper 1938-2018 Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS Robert M. Lucas, Robert M. Lucas Consulting, Fort Collins, CO, USA There is confusion

More information

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Multiple Imputation for Missing Data Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Outline Missing data mechanisms What is Multiple Imputation? Software Options

More information

Simulation Study: Introduction of Imputation. Methods for Missing Data in Longitudinal Analysis

Simulation Study: Introduction of Imputation. Methods for Missing Data in Longitudinal Analysis Applied Mathematical Sciences, Vol. 5, 2011, no. 57, 2807-2818 Simulation Study: Introduction of Imputation Methods for Missing Data in Longitudinal Analysis Michikazu Nakai Innovation Center for Medical

More information

Analysis of Complex Survey Data with SAS

Analysis of Complex Survey Data with SAS ABSTRACT Analysis of Complex Survey Data with SAS Christine R. Wells, Ph.D., UCLA, Los Angeles, CA The differences between data collected via a complex sampling design and data collected via other methods

More information

2. Description of the Procedure

2. Description of the Procedure 1. Introduction Item nonresponse occurs when questions from an otherwise completed survey questionnaire are not answered. Since the population estimates formed by ignoring missing data are often biased,

More information

Nonparametric Regression

Nonparametric Regression Nonparametric Regression John Fox Department of Sociology McMaster University 1280 Main Street West Hamilton, Ontario Canada L8S 4M4 jfox@mcmaster.ca February 2004 Abstract Nonparametric regression analysis

More information

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995)

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995) A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995) Department of Information, Operations and Management Sciences Stern School of Business, NYU padamopo@stern.nyu.edu

More information

Locally Weighted Least Squares Regression for Image Denoising, Reconstruction and Up-sampling

Locally Weighted Least Squares Regression for Image Denoising, Reconstruction and Up-sampling Locally Weighted Least Squares Regression for Image Denoising, Reconstruction and Up-sampling Moritz Baecher May 15, 29 1 Introduction Edge-preserving smoothing and super-resolution are classic and important

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

Simulating from the Polya posterior by Glen Meeden, March 06

Simulating from the Polya posterior by Glen Meeden, March 06 1 Introduction Simulating from the Polya posterior by Glen Meeden, glen@stat.umn.edu March 06 The Polya posterior is an objective Bayesian approach to finite population sampling. In its simplest form it

More information

Handling missing data for indicators, Susanne Rässler 1

Handling missing data for indicators, Susanne Rässler 1 Handling Missing Data for Indicators Susanne Rässler Institute for Employment Research & Federal Employment Agency Nürnberg, Germany First Workshop on Indicators in the Knowledge Economy, Tübingen, 3-4

More information

Missing Data and Imputation

Missing Data and Imputation Missing Data and Imputation NINA ORWITZ OCTOBER 30 TH, 2017 Outline Types of missing data Simple methods for dealing with missing data Single and multiple imputation R example Missing data is a complex

More information

Resampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016

Resampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016 Resampling Methods Levi Waldron, CUNY School of Public Health July 13, 2016 Outline and introduction Objectives: prediction or inference? Cross-validation Bootstrap Permutation Test Monte Carlo Simulation

More information

Capture Recapture Sampling and Indirect Sampling

Capture Recapture Sampling and Indirect Sampling Journal of Official Statistics, Vol. 28, No., 22, pp. 27 Capture Recapture Sampling and Indirect Sampling Pierre Lavallée and Louis-Paul Rivest 2 Capture recapture sampling is used to estimate the total

More information

Using Mplus Monte Carlo Simulations In Practice: A Note On Non-Normal Missing Data In Latent Variable Models

Using Mplus Monte Carlo Simulations In Practice: A Note On Non-Normal Missing Data In Latent Variable Models Using Mplus Monte Carlo Simulations In Practice: A Note On Non-Normal Missing Data In Latent Variable Models Bengt Muth en University of California, Los Angeles Tihomir Asparouhov Muth en & Muth en Mplus

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Sample surveys especially household surveys conducted by the statistical agencies

Sample surveys especially household surveys conducted by the statistical agencies VARIANCE ESTIMATION WITH THE JACKKNIFE METHOD IN THE CASE OF CALIBRATED TOTALS LÁSZLÓ MIHÁLYFFY 1 Estimating the variance or standard error of survey data with the ackknife method runs in considerable

More information

Missing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA

Missing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA Missing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA ABSTRACT Statistical analyses can be greatly hampered by missing

More information

MODEL SELECTION AND MODEL AVERAGING IN THE PRESENCE OF MISSING VALUES

MODEL SELECTION AND MODEL AVERAGING IN THE PRESENCE OF MISSING VALUES UNIVERSITY OF GLASGOW MODEL SELECTION AND MODEL AVERAGING IN THE PRESENCE OF MISSING VALUES by KHUNESWARI GOPAL PILLAY A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in

More information

Fast Automated Estimation of Variance in Discrete Quantitative Stochastic Simulation

Fast Automated Estimation of Variance in Discrete Quantitative Stochastic Simulation Fast Automated Estimation of Variance in Discrete Quantitative Stochastic Simulation November 2010 Nelson Shaw njd50@uclive.ac.nz Department of Computer Science and Software Engineering University of Canterbury,

More information

Statistical Matching of Two Surveys with a Common Subset

Statistical Matching of Two Surveys with a Common Subset Marco Ballin, Marcello D Orazio, Marco Di Zio, Mauro Scanu, Nicola Torelli Statistical Matching of Two Surveys with a Common Subset Working Paper n. 124 2009 1 Statistical Matching of Two Surveys with

More information

More Summer Program t-shirts

More Summer Program t-shirts ICPSR Blalock Lectures, 2003 Bootstrap Resampling Robert Stine Lecture 2 Exploring the Bootstrap Questions from Lecture 1 Review of ideas, notes from Lecture 1 - sample-to-sample variation - resampling

More information

ACCURACY AND EFFICIENCY OF MONTE CARLO METHOD. Julius Goodman. Bechtel Power Corporation E. Imperial Hwy. Norwalk, CA 90650, U.S.A.

ACCURACY AND EFFICIENCY OF MONTE CARLO METHOD. Julius Goodman. Bechtel Power Corporation E. Imperial Hwy. Norwalk, CA 90650, U.S.A. - 430 - ACCURACY AND EFFICIENCY OF MONTE CARLO METHOD Julius Goodman Bechtel Power Corporation 12400 E. Imperial Hwy. Norwalk, CA 90650, U.S.A. ABSTRACT The accuracy of Monte Carlo method of simulating

More information

SENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE NORMALITY

SENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE NORMALITY Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 6 SENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE

More information

WELCOME! Lecture 3 Thommy Perlinger

WELCOME! Lecture 3 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important

More information

Missing Data Techniques

Missing Data Techniques Missing Data Techniques Paul Philippe Pare Department of Sociology, UWO Centre for Population, Aging, and Health, UWO London Criminometrics (www.crimino.biz) 1 Introduction Missing data is a common problem

More information

An Average-Case Analysis of the k-nearest Neighbor Classifier for Noisy Domains

An Average-Case Analysis of the k-nearest Neighbor Classifier for Noisy Domains An Average-Case Analysis of the k-nearest Neighbor Classifier for Noisy Domains Seishi Okamoto Fujitsu Laboratories Ltd. 2-2-1 Momochihama, Sawara-ku Fukuoka 814, Japan seishi@flab.fujitsu.co.jp Yugami

More information

arxiv: v1 [stat.me] 29 May 2015

arxiv: v1 [stat.me] 29 May 2015 MIMCA: Multiple imputation for categorical variables with multiple correspondence analysis Vincent Audigier 1, François Husson 2 and Julie Josse 2 arxiv:1505.08116v1 [stat.me] 29 May 2015 Applied Mathematics

More information

Chapter 3. Bootstrap. 3.1 Introduction. 3.2 The general idea

Chapter 3. Bootstrap. 3.1 Introduction. 3.2 The general idea Chapter 3 Bootstrap 3.1 Introduction The estimation of parameters in probability distributions is a basic problem in statistics that one tends to encounter already during the very first course on the subject.

More information

Missing Data. SPIDA 2012 Part 6 Mixed Models with R:

Missing Data. SPIDA 2012 Part 6 Mixed Models with R: The best solution to the missing data problem is not to have any. Stef van Buuren, developer of mice SPIDA 2012 Part 6 Mixed Models with R: Missing Data Georges Monette 1 May 2012 Email: georges@yorku.ca

More information

Research with Large Databases

Research with Large Databases Research with Large Databases Key Statistical and Design Issues and Software for Analyzing Large Databases John Ayanian, MD, MPP Ellen P. McCarthy, PhD, MPH Society of General Internal Medicine Chicago,

More information

4.5 The smoothed bootstrap

4.5 The smoothed bootstrap 4.5. THE SMOOTHED BOOTSTRAP 47 F X i X Figure 4.1: Smoothing the empirical distribution function. 4.5 The smoothed bootstrap In the simple nonparametric bootstrap we have assumed that the empirical distribution

More information

Inclusion of Aleatory and Epistemic Uncertainty in Design Optimization

Inclusion of Aleatory and Epistemic Uncertainty in Design Optimization 10 th World Congress on Structural and Multidisciplinary Optimization May 19-24, 2013, Orlando, Florida, USA Inclusion of Aleatory and Epistemic Uncertainty in Design Optimization Sirisha Rangavajhala

More information

Epidemiological analysis PhD-course in epidemiology

Epidemiological analysis PhD-course in epidemiology Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 9. oktober 2012 Multivariate tables Agenda today Age standardization Missing data 1 2 3 4 Age standardization

More information

STRATIFICATION IN BUSINESS AND AGRICULTURE SURVEYS WITH

STRATIFICATION IN BUSINESS AND AGRICULTURE SURVEYS WITH 4 Th International Conference New Challenges for Statistical Software - The Use of R in Official Statistics Bucharest, 7-8 April 2016 STRATIFICATION IN BUSINESS AND AGRICULTURE SURVEYS WITH Marco Ballin

More information

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Age standardization Incidence and prevalence are strongly agedependent Risks rising

More information

Learning and Evaluating Classifiers under Sample Selection Bias

Learning and Evaluating Classifiers under Sample Selection Bias Learning and Evaluating Classifiers under Sample Selection Bias Bianca Zadrozny IBM T.J. Watson Research Center, Yorktown Heights, NY 598 zadrozny@us.ibm.com Abstract Classifier learning methods commonly

More information

Multiple Imputation with Mplus

Multiple Imputation with Mplus Multiple Imputation with Mplus Tihomir Asparouhov and Bengt Muthén Version 2 September 29, 2010 1 1 Introduction Conducting multiple imputation (MI) can sometimes be quite intricate. In this note we provide

More information

Are We Really Doing What We think We are doing? A Note on Finite-Sample Estimates of Two-Way Cluster-Robust Standard Errors

Are We Really Doing What We think We are doing? A Note on Finite-Sample Estimates of Two-Way Cluster-Robust Standard Errors Are We Really Doing What We think We are doing? A Note on Finite-Sample Estimates of Two-Way Cluster-Robust Standard Errors Mark (Shuai) Ma Kogod School of Business American University Email: Shuaim@american.edu

More information

HOW TO PROVE AND ASSESS CONFORMITY OF GUM-SUPPORTING SOFTWARE PRODUCTS

HOW TO PROVE AND ASSESS CONFORMITY OF GUM-SUPPORTING SOFTWARE PRODUCTS XX IMEKO World Congress Metrology for Green Growth September 9-14, 2012, Busan, Republic of Korea HOW TO PROVE AND ASSESS CONFORMITY OF GUM-SUPPORTING SOFTWARE PRODUCTS N. Greif, H. Schrepf Physikalisch-Technische

More information

VARIANCE REDUCTION TECHNIQUES IN MONTE CARLO SIMULATIONS K. Ming Leung

VARIANCE REDUCTION TECHNIQUES IN MONTE CARLO SIMULATIONS K. Ming Leung POLYTECHNIC UNIVERSITY Department of Computer and Information Science VARIANCE REDUCTION TECHNIQUES IN MONTE CARLO SIMULATIONS K. Ming Leung Abstract: Techniques for reducing the variance in Monte Carlo

More information

Analysis of Imputation Methods for Missing Data. in AR(1) Longitudinal Dataset

Analysis of Imputation Methods for Missing Data. in AR(1) Longitudinal Dataset Int. Journal of Math. Analysis, Vol. 5, 2011, no. 45, 2217-2227 Analysis of Imputation Methods for Missing Data in AR(1) Longitudinal Dataset Michikazu Nakai Innovation Center for Medical Redox Navigation,

More information

BACKGROUND INFORMATION ON COMPLEX SAMPLE SURVEYS

BACKGROUND INFORMATION ON COMPLEX SAMPLE SURVEYS Analysis of Complex Sample Survey Data Using the SURVEY PROCEDURES and Macro Coding Patricia A. Berglund, Institute For Social Research-University of Michigan, Ann Arbor, Michigan ABSTRACT The paper presents

More information

Lecture 26: Missing data

Lecture 26: Missing data Lecture 26: Missing data Reading: ESL 9.6 STATS 202: Data mining and analysis December 1, 2017 1 / 10 Missing data is everywhere Survey data: nonresponse. 2 / 10 Missing data is everywhere Survey data:

More information