Statistical matching: conditional independence assumption and auxiliary information
1 Statistical matching: conditional independence assumption and auxiliary information. Training Course "Record Linkage and Statistical Matching". Mauro Scanu, Istat, scanu [at] istat.it
2 Outline. The conditional independence assumption (CIA) model. Parametric macro methods: the normal case; maximum likelihood. Parametric micro methods: an overview. Nonparametric macro methods: an overview. Nonparametric micro methods: random hot deck; conditional random hot deck; rank hot deck; distance hot deck; constrained hot deck. Auxiliary information.
3 Statistical matching. Let us assume that data are collected in two sample surveys, A and B, of sizes $n_A$ and $n_B$, drawn from the same population. Some variables X are observed in both samples; variables Y are observed only in survey A; variables Z are observed only in survey B. The goal is inference on (X, Y, Z), or at least on the bivariate (Y, Z).
4 Statistical matching Goal: estimation of parameters describing (Y,Z) or (X,Y,Z)
5 A first identifiable model. Let us restrict the class of models F for (X, Y, Z) to the following set: $f(x, y, z) = f_{Y|X}(y \mid x)\, f_{Z|X}(z \mid x)\, f_X(x)$, where $f_{Y|X}$ is the conditional density of Y given X, $f_{Z|X}$ is the conditional density of Z given X, and $f_X$ is the marginal density of X. Consequence 1: this class of distributions for (X, Y, Z) corresponds to the conditional independence of Y and Z given X (the conditional independence assumption, CIA). Consequence 2: this model is identifiable for $A \cup B$. Note: this is not the only identifiable model! Help: this model can be useful in many different cases (use of proxy variables and uncertainty)!
6 The different matching contexts. The contexts are classified by output (macro or micro) and by approach (parametric or nonparametric). Let's tackle this problem in a familiar context for inferential statistics: data are drawn according to a probability law that follows a parametric model, and the objective is macro. In the following we will mainly consider two distributions: the normal and the multinomial.
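7 Parametric macro methods. In a parametric model, each probability law that can generate our sample data can be described by a finite number of parameters. Under the CIA, given the sample $A \cup B$, the likelihood function becomes: $L(\theta; A \cup B) = \prod_{a=1}^{n_A} f_{Y|X}(y_a \mid x_a; \theta_{Y|X}) \prod_{b=1}^{n_B} f_{Z|X}(z_b \mid x_b; \theta_{Z|X}) \prod_{a=1}^{n_A} f_X(x_a; \theta_X) \prod_{b=1}^{n_B} f_X(x_b; \theta_X)$.
8 Parametric macro methods. Parameter estimation becomes straightforward: use sample $A \cup B$ for estimating $\theta_X$; use A for estimating $\theta_{Y|X}$; use B for estimating $\theta_{Z|X}$.
9 Parametric macro methods: the normal case. Let (X, Y, Z) be a trivariate normal r.v. with parameters: the mean vector $\mu = (\mu_X, \mu_Y, \mu_Z)$ and the covariance matrix $\Sigma$ with variances $\sigma^2_X, \sigma^2_Y, \sigma^2_Z$ and covariances $\sigma_{XY}, \sigma_{XZ}, \sigma_{YZ}$. Under the CIA, the parameter $\sigma_{YZ}$ is superfluous, because it is determined by the others: $\sigma_{YZ} = \sigma_{XY}\sigma_{XZ}/\sigma^2_X$. For the statistical matching problem, it is convenient to consider the equivalent distribution defined by this parameterization: $\theta_X$, $\theta_{Y|X}$, $\theta_{Z|X}$.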
10 Parametric macro methods: the normal case Estimates for the re-parameterization
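11 Parametric macro methods: the normal case. For the estimates of the parameters of the marginal distribution of X, the whole sample $A \cup B$ can be used: $\hat\mu_X$ and $\hat\sigma^2_X$ are the mean and the (maximum likelihood) variance of X computed on all $n_A + n_B$ records.
12 Parametric macro methods: the normal case. For the estimates of the parameters of the distribution of Y given X, only sample A can be used: the regression slope is $\hat\beta_{YX} = s_{XY;A}/s^2_{X;A}$, where $\bar x_A$, $\bar y_A$, $s^2_{X;A}$, $s^2_{Y;A}$ and $s_{XY;A}$ are the sample means, variances and covariance computed on A. Hence, the marginal parameters for Y are: $\hat\mu_Y = \bar y_A + \hat\beta_{YX}(\hat\mu_X - \bar x_A)$, $\hat\sigma^2_Y = s^2_{Y;A} + \hat\beta^2_{YX}(\hat\sigma^2_X - s^2_{X;A})$ and $\hat\sigma_{XY} = \hat\beta_{YX}\hat\sigma^2_X$, with $\hat\mu_X$ and $\hat\sigma^2_X$ estimated on $A \cup B$.
13 Parametric macro methods: the normal case. For the estimates of the parameters of the distribution of Z given X, only sample B can be used: the regression slope is $\hat\beta_{ZX} = s_{XZ;B}/s^2_{X;B}$, where $\bar x_B$, $\bar z_B$, $s^2_{X;B}$, $s^2_{Z;B}$ and $s_{XZ;B}$ are the sample means, variances and covariance computed on B. Hence, the marginal parameters for Z are: $\hat\mu_Z = \bar z_B + \hat\beta_{ZX}(\hat\mu_X - \bar x_B)$, $\hat\sigma^2_Z = s^2_{Z;B} + \hat\beta^2_{ZX}(\hat\sigma^2_X - s^2_{X;B})$ and $\hat\sigma_{XZ} = \hat\beta_{ZX}\hat\sigma^2_X$, with $\hat\mu_X$ and $\hat\sigma^2_X$ estimated on $A \cup B$.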
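The three estimation steps of the normal case can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's own code: X, Y and Z are univariate, variances use the maximum likelihood divisor n, and the function name `ml_normal_cia`, the dictionary keys and the toy data are all hypothetical.

```python
from statistics import fmean


def _regression(x, y):
    """ML intercept, slope and residual variance of y regressed on x."""
    mx, my = fmean(x), fmean(y)
    sxx = fmean((xi - mx) ** 2 for xi in x)
    syy = fmean((yi - my) ** 2 for yi in y)
    sxy = fmean((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return my - beta * mx, beta, syy - beta ** 2 * sxx


def ml_normal_cia(xa, ya, xb, zb):
    """Estimates under the CIA: X from A and B together, Y|X from A, Z|X from B."""
    x_all = list(xa) + list(xb)
    mu_x = fmean(x_all)
    var_x = fmean((x - mu_x) ** 2 for x in x_all)
    a_y, b_y, res_y = _regression(xa, ya)   # only sample A observes Y
    a_z, b_z, res_z = _regression(xb, zb)   # only sample B observes Z
    return {
        "mu_x": mu_x, "var_x": var_x,
        "mu_y": a_y + b_y * mu_x, "var_y": res_y + b_y ** 2 * var_x,
        "mu_z": a_z + b_z * mu_x, "var_z": res_z + b_z ** 2 * var_x,
        "cov_xy": b_y * var_x, "cov_xz": b_z * var_x,
        # Under the CIA the Y-Z covariance is implied by the other parameters.
        "cov_yz": b_y * b_z * var_x,
    }


# Hypothetical toy data, for illustration only.
est = ml_normal_cia([1.0, 2.0, 3.0], [2.0, 4.0, 6.0],
                    [3.0, 4.0, 5.0], [1.0, 1.0, 2.0])
```

Note that `cov_yz` is never estimated from data in which Y and Z are jointly observed: it is a pure consequence of the CIA.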
14 Comment: why maximum likelihood estimation? What happens if, instead of the previous maximum likelihood parameter estimation, we consider a direct estimation from the data set where the corresponding variable(s) are observed? For instance, let's consider the case of a direct estimation of the mean of Y on A (i.e. with the sample average of Y in A) instead of using $\hat\mu_Y$ (a kind of regression estimate in double sampling). The comparison depends on $\rho_{XY}$, the correlation coefficient between X and Y: the maximum likelihood estimator becomes much more efficient as the size of sample B increases and as X and Y become more highly correlated.
15 Comment: why maximum likelihood estimation? When parameters are estimated separately, each on the part of $A \cup B$ that is complete for the corresponding r.v., it might happen that the estimates are not coherent. For instance, the estimated variance-covariance matrix for (X, Y) may fail to be positive semidefinite! This does not happen when all the parameters are estimated simultaneously by maximum likelihood.
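The efficiency comparison can be made explicit with the standard large-sample result for the regression estimator in double sampling. The following is a sketch of that textbook approximation, not a transcription of the slide's own formula; it treats the regression slope as known and ignores terms of smaller order.

```latex
\operatorname{Var}(\bar y_A) = \frac{\sigma^2_Y}{n_A},
\qquad
\operatorname{Var}(\hat\mu_Y) \approx \frac{\sigma^2_Y\,(1-\rho^2_{XY})}{n_A}
  + \frac{\rho^2_{XY}\,\sigma^2_Y}{n_A+n_B}
  = \frac{\sigma^2_Y}{n_A}\left[1 - \frac{n_B}{n_A+n_B}\,\rho^2_{XY}\right]
```

The bracketed factor is smaller than 1 whenever $\rho_{XY} \neq 0$, and it decreases as $n_B$ grows or as $|\rho_{XY}|$ approaches 1.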
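A small numerical illustration of incoherence, using made-up data (all values below are hypothetical): the variance of X is taken from both samples, while the variance of Y and the covariance are taken from A alone. The resulting 2 x 2 matrix has a negative determinant, so it cannot be a covariance matrix, whereas the maximum likelihood estimates do not have this problem.

```python
from statistics import fmean


def pvar(v):
    """Variance with divisor n (maximum likelihood form)."""
    m = fmean(v)
    return fmean((t - m) ** 2 for t in v)


def pcov(u, v):
    """Covariance with divisor n."""
    mu, mv = fmean(u), fmean(v)
    return fmean((a - mu) * (b - mv) for a, b in zip(u, v))


# Hypothetical data: X is very spread out in A, constant in B.
xa, ya = [0.0, 10.0], [0.0, 10.0]
xb = [5.0] * 6

var_x = pvar(xa + xb)                      # X estimated on both samples

# Piecewise ("direct") estimates: each moment from wherever it is observed.
det_direct = var_x * pvar(ya) - pcov(xa, ya) ** 2

# Maximum likelihood estimates under the re-parameterization X, Y|X.
beta = pcov(xa, ya) / pvar(xa)
var_y_ml = (pvar(ya) - beta ** 2 * pvar(xa)) + beta ** 2 * var_x
cov_xy_ml = beta * var_x
det_ml = var_x * var_y_ml - cov_xy_ml ** 2

print(det_direct, det_ml)
```

A symmetric 2 x 2 matrix with a negative determinant has one negative eigenvalue, which is why the piecewise estimate is not admissible here.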
16 Example
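17 Example. Under the CIA, the maximum likelihood estimates of the parameters are: From the previous estimates we get:
18 Parametric macro methods: the multinomial case. Let (X, Y, Z) be a multinomial r.v. with parameters $\theta = \{\theta_{ijk}\}$, where $\theta_{ijk} = P(X=i, Y=j, Z=k)$ for $i=1,\dots,I$, $j=1,\dots,J$, $k=1,\dots,K$, and $\theta$ is a vector of parameters with the following characteristics: $\theta_{ijk} \ge 0$ and $\sum_{i,j,k}\theta_{ijk} = 1$.
19 Parametric macro methods: the multinomial case. Adopting the same re-parameterization of the joint distribution, under the CIA the parameters of interest are: $\theta_{i..}$ (marginal distribution of X), $\theta_{j|i}$ (distribution of Y given X) and $\theta_{k|i}$ (distribution of Z given X). In this context, the parameters of the joint distribution are computed according to the following formula: $\theta_{ijk} = \theta_{i..}\,\theta_{j|i}\,\theta_{k|i}$. When the interest is only in the pairwise distribution of (Y, Z): $\theta_{.jk} = \sum_i \theta_{i..}\,\theta_{j|i}\,\theta_{k|i}$.
20 Parametric macro methods: the multinomial case. Given the sample $A \cup B$, the maximum likelihood estimator is: $\hat\theta_{i..} = (n^A_{i.} + n^B_{i.})/(n_A + n_B)$, $\hat\theta_{j|i} = n^A_{ij}/n^A_{i.}$, $\hat\theta_{k|i} = n^B_{ik}/n^B_{i.}$, where $n^A_{ij}$ and $n^B_{ik}$ are the observed cell counts in A and B, and $n^A_{i.}$ and $n^B_{i.}$ their totals over j and k respectively.
21 Example. Let's consider the following two samples A and B, where I=2, J=2, K=3.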
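The multinomial estimator can be sketched as follows. This is an illustrative sketch only: the function name `ml_multinomial_cia` and the contingency tables are hypothetical (they are not the tables of the course example), and the code assumes every category of X is observed at least once in both A and B.

```python
def ml_multinomial_cia(counts_a, counts_b):
    """Joint cell probabilities theta[i][j][k] under the CIA.

    counts_a[i][j]: records of A with X=i, Y=j.
    counts_b[i][k]: records of B with X=i, Z=k.
    """
    n_a = sum(map(sum, counts_a))
    n_b = sum(map(sum, counts_b))
    theta = []
    for row_a, row_b in zip(counts_a, counts_b):
        na_i, nb_i = sum(row_a), sum(row_b)
        p_x = (na_i + nb_i) / (n_a + n_b)          # X: both samples
        p_y_given_x = [c / na_i for c in row_a]    # Y|X: sample A only
        p_z_given_x = [c / nb_i for c in row_b]    # Z|X: sample B only
        theta.append([[p_x * py * pz for pz in p_z_given_x]
                      for py in p_y_given_x])
    return theta


# Hypothetical tables with I=2, J=2, K=3.
counts_a = [[10, 20], [30, 40]]          # X by Y in A
counts_b = [[5, 10, 15], [20, 25, 25]]   # X by Z in B
theta = ml_multinomial_cia(counts_a, counts_b)

# Pairwise (Y, Z) distribution: sum the joint over the categories of X.
theta_yz = [[sum(theta[i][j][k] for i in range(2)) for k in range(3)]
            for j in range(2)]
```

Each of the three factors is a plain relative frequency computed on a complete data subset, which is what makes the estimator easy to apply.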
22 Example The maximum likelihood estimates of the parameters are:
23 Example The maximum likelihood estimates of the parameters of the joint distribution are:
24 Parametric macro methods: conclusions. The CIA model is identifiable (i.e. it admits a unique estimate) for the data set $A \cup B$. The application of the maximum likelihood estimator is very easy. Even if the problem is characterized by missing data, it can be split into three «complete data» subproblems, one for each parameter of the re-parameterization. Other estimation methods can be statistically consistent, but incoherent.
25 Selected references. Anderson T.W. (1957) Maximum likelihood estimates for a multivariate normal distribution when some observations are missing, JASA, 52. Anderson T.W. (1984) An Introduction to Multivariate Statistical Analysis, Wiley. Rubin D.B. (1974) Characterizing the Estimation of Parameters in Incomplete-Data Problems, JASA, 69. D'Orazio M., Di Zio M., Scanu M. (2006) Statistical matching for categorical data: displaying uncertainty and using logical constraints, JOS, 22. Moriarity C., Scheuren F. (2001) Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure, JOS, 17.
26 Parametric micro methods. Context grid: output (macro, micro) by approach (parametric, nonparametric). We are still in the familiar context of parametric data models, but now the objective is micro, i.e. we are interested in a complete data set where (X, Y, Z) are jointly available. This is the context where imputation methods are usually used!
27 Objective and context Objective: to create a complete data set for (X, Y, Z) Context: partially observed data set
28 Parametric micro methods: rationale Method: imputation of missing values. In a parametric context: 1. Estimate the distribution parameters 2. Take a (not necessarily random) value from the estimated distribution
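The two steps can be illustrated with a linear regression of Z on X, a common choice in the normal case. This is a hedged sketch, not the course's implementation: the names `fit_line` and `impute_z` and the data are hypothetical. Without a random generator the imputed value is the estimated conditional mean (the non-random option); with one, a normal residual is added.

```python
import random
from statistics import fmean


def fit_line(x, z):
    """Step 1: ML intercept, slope and residual variance of z on x."""
    mx, mz = fmean(x), fmean(z)
    sxx = fmean((xi - mx) ** 2 for xi in x)
    sxz = fmean((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    beta = sxz / sxx
    alpha = mz - beta * mx
    resid_var = fmean((zi - alpha - beta * xi) ** 2 for xi, zi in zip(x, z))
    return alpha, beta, resid_var


def impute_z(xa, xb, zb, rng=None):
    """Step 2: impute Z for the records of A from the model fitted on B."""
    alpha, beta, resid_var = fit_line(xb, zb)
    means = [alpha + beta * x for x in xa]
    if rng is None:
        return means                      # conditional mean matching
    sd = resid_var ** 0.5
    return [m + rng.gauss(0.0, sd) for m in means]   # random draw


# Hypothetical data: B observes (X, Z), A observes X only.
xb = [20.0, 30.0, 40.0, 50.0]
zb = [11.0, 14.0, 19.0, 22.0]
xa = [25.0, 45.0]
conditional_means = impute_z(xa, xb, zb)
random_draws = impute_z(xa, xb, zb, rng=random.Random(0))
```

Imputing conditional means understates the variability of Z in the completed file; adding the residual is one way to restore it.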
29 Selected references. Little R.J.A., Rubin D.B. (2002) Statistical Analysis with Missing Data, 2nd edition, Wiley. Rubin D.B. (1986) Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations, Journal of Business and Economic Statistics, 4. Kadane J.B. (1978) Some Statistical Problems in Merging Data Files, in Compendium of Tax Research, Department of Treasury, U.S. Government Printing Office, Washington D.C.; republished in Journal of Official Statistics, 17. Marella D., Scanu M., Conti P.L. (2008) On the matching noise of some nonparametric imputation procedures, Statistics and Probability Letters, 78. Conti P.L., Marella D., Scanu M. (2008) Evaluation of matching noise for imputation techniques based on the local linear regression estimator, Computational Statistics and Data Analysis, 53.
30 Nonparametric macro methods. Context grid: output (macro, micro) by approach (parametric, nonparametric). The family of distributions to which the distribution of (X, Y, Z) belongs cannot be represented by a finite number of parameters. Although the statistical literature on nonparametrics is huge, this is by far the most neglected situation in statistical matching! We nevertheless review it in order to link macro and micro methods, as in the parametric case.
31 Nonparametric macro methods: rationale. Although usually neglected in the statistical matching literature, these methods can be developed along the same lines as the parametric macro case. Two situations have mainly been studied: 1. X categorical, Y and Z ordered or numerical: estimation of the empirical cumulative distribution function; 2. X, Y and Z numerical: estimation of the nonparametric regression function. The first approach is helpful for the random generation of imputations in the corresponding micro methods; the second for conditional mean matching, whenever possible with the addition of a random residual.
32 Selected references. Paass G. (1985) Statistical record linkage methodology, state of the art and future prospects, in Bulletin of the International Statistical Institute, Proceedings of the 45th Session, volume LI, Book 2. Marella D., Scanu M., Conti P.L. (2008) On the matching noise of some nonparametric imputation procedures, Statistics and Probability Letters, 78. Conti P.L., Marella D., Scanu M. (2008) Evaluation of matching noise for imputation techniques based on the local linear regression estimator, Computational Statistics and Data Analysis, 53.
33 Nonparametric micro methods. Context grid: output (macro, micro) by approach (parametric, nonparametric). Those who applied these methods seldom assumed anything about the distribution of (X, Y, Z). Each micro method has a macro counterpart, i.e. a representation of how the distribution of (X, Y, Z) should be estimated. The problem is: to be aware of it or not?
34 Nonparametric micro methods. The nonparametric micro matching methods consist essentially of three imputation procedures: 1. random hot deck; 2. rank hot deck; 3. distance hot deck. As already seen in the parametric case, each of these methods corresponds to a specific nonparametric macro approach to the estimation of the distribution $f(x, y, z)$ or of one of its characteristic values. In general, these methods do not organize the two data sets A and B as a unique sample $A \cup B$.
35 Nonparametric micro methods. The idea is to consider one file as the recipient and the other as the donor. A is the recipient file: these are the data to impute. B is the donor file: these are the data to use for imputation.
36 Example. In order to define the different hot deck methods, let's consider an example. Let A and B be the following: A: $n_A = 6$, observed variables: Gender, Age, Income. B: $n_B = 10$, observed variables: Gender, Age, Expenditures. A = recipient, B = donor. Common variables: $X = (X_1 = \text{Gender}, X_2 = \text{Age})$; Y = (Income); Z = (Expenditures).
37 Example
38 Random hot deck: the method. 1. Draw one value at random from B and assign it to the first record to impute in A. 2. Follow the same procedure for all $a \in A$. Example: in general we have $n_B^{n_A} = 10^6$ possible different ways to impute A.
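The procedure amounts to sampling donors with replacement. A minimal sketch follows; the function name `random_hot_deck` and the expenditure values are hypothetical, since the example tables are not reproduced in this transcript.

```python
import random


def random_hot_deck(n_recipients, donor_z, rng):
    """Assign to each recipient one donor value drawn at random, with replacement."""
    return [rng.choice(donor_z) for _ in range(n_recipients)]


# Hypothetical expenditures observed on the n_B = 10 donors.
donor_z = [120, 95, 210, 160, 80, 300, 140, 175, 60, 230]
imputed = random_hot_deck(6, donor_z, random.Random(42))

# Each of the 6 recipients can receive any of the 10 donors.
n_ways = len(donor_z) ** 6
```

Passing a seeded `random.Random` instance keeps the imputation reproducible.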
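39 Conditional random hot deck: the method. 1. Fix a conditioning variable, e.g. $X_1$. 2. For the first record a=1, draw one value at random from the subset of units b in B with the same value of $X_1$ as record a (here $x_{1b} = F$). 3. Follow the same procedure for all $a \in A$. Example: since the draws for different records are independent, the number of different completed data sets we can get is $m_B^{m_A} \times (n_B - m_B)^{n_A - m_A} = 6^4 \times 4^2 = 20736$, where $m_A$ and $m_B$ are the numbers of records of A and B falling in one of the two classes of $X_1$.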
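Conditioning only changes the pool from which each donor is drawn. A minimal sketch, with hypothetical names and data (gender as the conditioning variable, as in the steps above):

```python
import random
from collections import defaultdict


def conditional_random_hot_deck(recipient_classes, donors, rng):
    """Draw each recipient's value at random among donors of the same class.

    recipient_classes: class label of each record in A.
    donors: (class label, z value) pairs observed in B.
    A class without donors raises IndexError: it cannot be imputed this way.
    """
    by_class = defaultdict(list)
    for cls, z in donors:
        by_class[cls].append(z)
    return [rng.choice(by_class[cls]) for cls in recipient_classes]


# Hypothetical data: gender of the 6 recipients, (gender, expenditure) of 10 donors.
recipient_gender = ["F", "M", "F", "F", "M", "F"]
donors = [("F", 120), ("F", 95), ("M", 210), ("F", 160), ("M", 80),
          ("F", 300), ("M", 140), ("F", 175), ("M", 60), ("F", 230)]
imputed = conditional_random_hot_deck(recipient_gender, donors, random.Random(7))
```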
40 Comments. 1. Random hot deck corresponds to a random generation of the values to impute from the empirical cumulative distribution function of Z. 2. Conditional random hot deck corresponds to a random generation of the values to impute from the empirical cumulative distribution function of $Z \mid X$. 3. It is possible to eliminate the already drawn value from the set of possible donors (constrained procedure); however, the preservation of the observed distribution of Z or $Z \mid X$ in B is then jeopardized.
41 Rank hot deck. Let's assume that $n_B = k\,n_A$, with k integer. Compute the empirical cumulative distribution functions of X in the two samples, $\hat F^A_X(x) = \frac{1}{n_A}\sum_{a=1}^{n_A} I(x_a \le x)$ and $\hat F^B_X(x) = \frac{1}{n_B}\sum_{b=1}^{n_B} I(x_b \le x)$. To each $a \in A$ assign the $b^* \in B$ chosen so that $|\hat F^A_X(x_a) - \hat F^B_X(x_{b^*})| = \min_{b \in B} |\hat F^A_X(x_a) - \hat F^B_X(x_b)|$. In other words, this method matches records whose X values correspond to similar quantiles in A and B respectively.
42 Rank hot deck. Rank the two samples A and B according to $X_1$.
43 Rank hot deck. These are the values of the empirical cumulative distribution function of $X_1$ in A and B respectively.
44 Rank hot deck. This is the result. In this example, there is only one way to impute a value.
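Rank hot deck can be sketched by evaluating the two empirical cumulative distribution functions and matching each recipient to the donor with the closest value. The names `ecdf` and `rank_hot_deck` and the data are hypothetical; ties are resolved by taking the first donor found, which is one possible convention.

```python
def ecdf(sample):
    """Empirical cumulative distribution function of a sample."""
    n = len(sample)
    return lambda t: sum(v <= t for v in sample) / n


def rank_hot_deck(xa, xb, zb):
    """Give each recipient the z of the donor whose ECDF value in B is
    closest to the recipient's ECDF value in A."""
    f_a, f_b = ecdf(xa), ecdf(xb)
    f_b_values = [f_b(x) for x in xb]
    imputed = []
    for x in xa:
        target = f_a(x)
        best = min(range(len(xb)), key=lambda b: abs(f_b_values[b] - target))
        imputed.append(zb[best])
    return imputed


# Hypothetical data with n_B = 2 * n_A.
xa = [30, 20, 40]
xb = [25, 35, 45, 15, 55, 65]
zb = [100, 200, 300, 400, 500, 600]
imputed = rank_hot_deck(xa, xb, zb)   # [300, 100, 600]
```

The recipient aged 30 sits at the 2/3 quantile of A, so it receives the value of the donor at the 4/6 quantile of B, even though a donor aged 35 is closer in absolute terms.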
45 Distance hot deck. To each $a \in A$ assign the $b \in B$ that is nearest according to the common variables. This method depends on the distance function, and different choices are available. If X is numeric, it is possible to choose the Manhattan distance, $d_{ab} = |x_a - x_b|$; other distances include the Euclidean. If X is multivariate, the available distances include the Mahalanobis and the Canberra. If X is categorical and unordered, it is possible to consider imputation classes (i.e. the distance is «equality»).
46 Distance hot deck: example. Let's consider $X_2$ as the variable used to compute distances (i.e. we choose as donors those records in B whose age is the most similar to the one in A). Choose one donor at random if more than one donor is at the same minimum distance. The overall distance between donors and recipients is
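A minimal sketch of the unconstrained procedure on a single numeric matching variable, with random tie-breaking as described above. The function name `distance_hot_deck` and the ages and expenditures are hypothetical.

```python
import random


def distance_hot_deck(xa, xb, zb, rng):
    """Impute each recipient with the z of its nearest donor (Manhattan
    distance on one variable); ties are broken at random.
    Returns the imputed values and the overall donor-recipient distance."""
    imputed, total = [], 0
    for x in xa:
        dists = [abs(x - d) for d in xb]
        d_min = min(dists)
        ties = [b for b, d in enumerate(dists) if d == d_min]
        imputed.append(zb[rng.choice(ties)])
        total += d_min
    return imputed, total


# Hypothetical ages and expenditures.
xa = [30, 41]
xb = [28, 40, 55]
zb = [10, 20, 30]
imputed, total_distance = distance_hot_deck(xa, xb, zb, random.Random(1))
```

Nothing prevents the same donor from being selected for several recipients, which is the motivation for the constrained variant.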
47 Constrained distance hot deck. In the former procedure, a donor can be chosen more than once if it is the nearest to more than one record in A. In order to avoid reusing the same information more than once, the following constrained procedure has been defined: a. minimize $\sum_{a=1}^{n_A}\sum_{b=1}^{n_B} d_{ab}\,w_{ab}$; b. under the constraints $\sum_{b=1}^{n_B} w_{ab} = 1$ for every $a \in A$, $\sum_{a=1}^{n_A} w_{ab} \le 1$ for every $b \in B$, and $w_{ab} \in \{0, 1\}$, where $d_{ab}$ is the distance between recipient a and donor b, and $w_{ab} = 1$ if b is chosen as the donor for a (this formulation requires $n_B \ge n_A$).
48 Constrained distance hot deck: advantages and disadvantages. Advantage: constrained matching makes it possible to preserve the marginal distribution of the variable to impute (Z). Disadvantage: the overall distance between donors and recipients is at least as large as the one obtained with the unconstrained distance hot deck.
49 Constrained distance hot deck: example. The overall donor-recipient distance is
50 Constrained distance hot deck: comments. Distance hot deck is equivalent to estimating a nonparametric regression function via the k nearest neighbours (kNN) method with k = 1. These methods always produce live (i.e. actually observed) data as imputations. Sometimes parametric and nonparametric procedures are applied together: mixed methods. Example: 1. regression step: impute intermediate values in A and B; 2. matching step: use a distance hot deck, selecting for each recipient the donor b* with the shortest distance.
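The trade-off can be seen on a tiny made-up example. The sketch below solves the constrained problem by brute force over all assignments of distinct donors, which is only feasible for very small files; in practice a linear assignment solver would be used instead. The function names and the data are hypothetical.

```python
from itertools import permutations


def nearest_donor_total(xa, xb):
    """Overall distance of the unconstrained distance hot deck."""
    return sum(min(abs(x - d) for d in xb) for x in xa)


def constrained_distance_hot_deck(xa, xb, zb):
    """Assign a distinct donor to each recipient, minimizing the overall
    Manhattan distance. Brute force: requires len(xb) >= len(xa)."""
    best_cost, best_donors = None, None
    for donors in permutations(range(len(xb)), len(xa)):
        cost = sum(abs(x - xb[b]) for x, b in zip(xa, donors))
        if best_cost is None or cost < best_cost:
            best_cost, best_donors = cost, donors
    return [zb[b] for b in best_donors], best_cost


# Hypothetical data: both recipients are nearest to the same donor (age 30).
xa = [30, 31]
xb = [30, 40, 29]
zb = [1, 2, 3]
imputed, constrained_total = constrained_distance_hot_deck(xa, xb, zb)
unconstrained_total = nearest_donor_total(xa, xb)
```

Here the unconstrained total is 1 (donor aged 30 used twice) while the constrained total is 2: no donor value is duplicated, at the price of a larger overall distance.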
51 Selected references. Kadane J.B. (1978) Some Statistical Problems in Merging Data Files, in Compendium of Tax Research, Department of Treasury, U.S. Government Printing Office, Washington D.C.; also published in Journal of Official Statistics, 17. Little R.J.A., Rubin D.B. (2002) Statistical Analysis with Missing Data, 2nd edition, Wiley. Okner B.A. (1972) Constructing a new data base from existing microdata sets: the 1966 merge file, Annals of Economic and Social Measurement, 1. Rodgers W.L. (1984) An Evaluation of Statistical Matching, Journal of Business and Economic Statistics, 2. Rubin D.B. (1986) Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations, Journal of Business and Economic Statistics, 4. Sims C.A. (1972) Comments on Okner, Annals of Economic and Social Measurement, 1. Singh A.C., Mantel H., Kinack M., Rowe G. (1993) Statistical Matching: Use of Auxiliary Information as an Alternative to the Conditional Independence Assumption, Survey Methodology, 19. Marella D., Scanu M., Conti P.L. (2008) On the matching noise of some nonparametric imputation procedures, Statistics and Probability Letters, 78. Conti P.L., Marella D., Scanu M. (2008) Evaluation of matching noise for imputation techniques based on the local linear regression estimator, Computational Statistics and Data Analysis, 53.
52 Auxiliary information. In order to avoid the CIA, two different kinds of auxiliary information have usually been considered: 1) a third file C where either (X, Y, Z) or (Y, Z) are jointly observed; 2) a plausible value of the inestimable parameters of either $(Y, Z \mid X)$ or (Y, Z). These additional sources of information may originate from: an outdated statistical investigation; an administrative register; a supplemental (even small) ad hoc survey; proxy variables of (Y, Z). Always pay attention to their accuracy!
53 Auxiliary information on parameters. Auxiliary information on parameters can be expressed in terms of: information about $\theta_{YZ|X}$; information about $\theta_{YZ}$. This kind of information restricts the parameter space $\Theta$ to a subspace $\Theta^*$, where $\Theta^*$ contains all the parameters $\theta \in \Theta$ compatible with the auxiliary information. NOTE: most of the time the unconstrained maximum likelihood estimate is not compatible with this information; this leads to estimates of the estimable parameters that differ from the ones obtained under the CIA.
54 Example: auxiliary information on $\rho_{YZ} = \rho^*_{YZ}$. Let us suppose that the correlations of X with Y and of X with Z have been estimated. The value $\rho^*_{YZ} = 0.7$ is compatible (the determinant of the resulting correlation matrix $\rho$ is non-negative), while $\rho^*_{YZ} = 0.9$ is not compatible: $\det(\rho) = -0.008$.
55 Use of a third file C, complete on X, Y, Z. In a parametric macro approach: use the EM algorithm on the union of A, B and C. In a parametric micro approach: use conditional mean matching. In a nonparametric micro approach: use hot deck (first impute A with records from C, then impute live B values on A using the imputed records). Parametric and nonparametric methods can also be mixed (e.g. impute A records by a conditional mean matching that makes use of the additional file C, then impute live B values).
56 Selected references. Rässler S. (2002) Statistical Matching: a frequentist theory, practical applications and alternative Bayesian approaches, Springer. Moriarity C., Scheuren F. (2001) Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure, Journal of Official Statistics, 17. Moriarity C., Scheuren F. (2003) A Note on Rubin's Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputation, Journal of Business and Economic Statistics, 21. Moriarity C., Scheuren F. (2004) Regression based statistical matching: recent developments, Proceedings of the Section on Survey Research Methods, American Statistical Association. D'Orazio M., Di Zio M., Scanu M. (2006) Statistical Matching for Categorical Data: displaying uncertainty and using logical constraints, Journal of Official Statistics, 22, 1-22. Singh A.C., Mantel M.D., Kinack M.D., Rowe G. (1993) Statistical Matching: Use of Auxiliary Information as an Alternative to the Conditional Independence Assumption, Survey Methodology, vol. 19, n. 1.
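Compatibility of a postulated $\rho^*_{YZ}$ can be checked through the determinant of the 3 x 3 correlation matrix, which must be non-negative. The sketch below uses hypothetical values for the two estimable correlations (0.9 and 0.5); they are not the values of the example above, whose matrix is not reproduced in this transcript, so the determinants differ from the ones quoted there.

```python
import math


def corr_det(r_xy, r_xz, r_yz):
    """Determinant of the correlation matrix of (X, Y, Z)."""
    return 1 + 2 * r_xy * r_xz * r_yz - r_xy ** 2 - r_xz ** 2 - r_yz ** 2


def compatible_range(r_xy, r_xz):
    """Interval of r_yz values giving a non-negative determinant."""
    centre = r_xy * r_xz                      # the value implied by the CIA
    half = math.sqrt((1 - r_xy ** 2) * (1 - r_xz ** 2))
    return centre - half, centre + half


# Hypothetical estimable correlations.
r_xy, r_xz = 0.9, 0.5
low, high = compatible_range(r_xy, r_xz)

det_07 = corr_det(r_xy, r_xz, 0.7)   # positive: 0.7 is compatible
det_09 = corr_det(r_xy, r_xz, 0.9)   # negative: 0.9 is not compatible
```

The midpoint of the compatible interval is the value of $\rho_{YZ}$ implied by the CIA, so the interval also shows how far auxiliary information can move away from conditional independence.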
More informationON SOME METHODS OF CONSTRUCTION OF BLOCK DESIGNS
ON SOME METHODS OF CONSTRUCTION OF BLOCK DESIGNS NURNABI MEHERUL ALAM M.Sc. (Agricultural Statistics), Roll No. I.A.S.R.I, Library Avenue, New Delhi- Chairperson: Dr. P.K. Batra Abstract: Block designs
More informationStatistical Analysis of List Experiments
Statistical Analysis of List Experiments Kosuke Imai Princeton University Joint work with Graeme Blair October 29, 2010 Blair and Imai (Princeton) List Experiments NJIT (Mathematics) 1 / 26 Motivation
More informationDigital Image Classification Geography 4354 Remote Sensing
Digital Image Classification Geography 4354 Remote Sensing Lab 11 Dr. James Campbell December 10, 2001 Group #4 Mark Dougherty Paul Bartholomew Akisha Williams Dave Trible Seth McCoy Table of Contents:
More informationLecture 7: Linear Regression (continued)
Lecture 7: Linear Regression (continued) Reading: Chapter 3 STATS 2: Data mining and analysis Jonathan Taylor, 10/8 Slide credits: Sergio Bacallado 1 / 14 Potential issues in linear regression 1. Interactions
More informationPerformance of Sequential Imputation Method in Multilevel Applications
Section on Survey Research Methods JSM 9 Performance of Sequential Imputation Method in Multilevel Applications Enxu Zhao, Recai M. Yucel New York State Department of Health, 8 N. Pearl St., Albany, NY
More informationPredict Outcomes and Reveal Relationships in Categorical Data
PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,
More informationMULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER
MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER A.Shabbir 1, 2 and G.Verdoolaege 1, 3 1 Department of Applied Physics, Ghent University, B-9000 Ghent, Belgium 2 Max Planck Institute
More informationJMP Book Descriptions
JMP Book Descriptions The collection of JMP documentation is available in the JMP Help > Books menu. This document describes each title to help you decide which book to explore. Each book title is linked
More informationInstance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2015
Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2015 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows K-Nearest
More information( ) =cov X Y = W PRINCIPAL COMPONENT ANALYSIS. Eigenvectors of the covariance matrix are the principal components
Review Lecture 14 ! PRINCIPAL COMPONENT ANALYSIS Eigenvectors of the covariance matrix are the principal components 1. =cov X Top K principal components are the eigenvectors with K largest eigenvalues
More informationTHE 2002 U.S. CENSUS OF AGRICULTURE DATA PROCESSING SYSTEM
Abstract THE 2002 U.S. CENSUS OF AGRICULTURE DATA PROCESSING SYSTEM Kara Perritt and Chadd Crouse National Agricultural Statistics Service In 1997 responsibility for the census of agriculture was transferred
More informationOlmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.
Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)
More informationSection 4 Matching Estimator
Section 4 Matching Estimator Matching Estimators Key Idea: The matching method compares the outcomes of program participants with those of matched nonparticipants, where matches are chosen on the basis
More informationA Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis
Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 1 (2016), pp. 1131-1140 Research India Publications http://www.ripublication.com A Monotonic Sequence and Subsequence Approach
More informationMissing Data Missing Data Methods in ML Multiple Imputation
Missing Data Missing Data Methods in ML Multiple Imputation PRE 905: Multivariate Analysis Lecture 11: April 22, 2014 PRE 905: Lecture 11 Missing Data Methods Today s Lecture The basics of missing data:
More informationMassive Data Analysis
Professor, Department of Electrical and Computer Engineering Tennessee Technological University February 25, 2015 Big Data This talk is based on the report [1]. The growth of big data is changing that
More informationWELCOME! Lecture 3 Thommy Perlinger
Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important
More informationTime Series Analysis by State Space Methods
Time Series Analysis by State Space Methods Second Edition J. Durbin London School of Economics and Political Science and University College London S. J. Koopman Vrije Universiteit Amsterdam OXFORD UNIVERSITY
More informationPrograms for MDE Modeling and Conditional Distribution Calculation
Programs for MDE Modeling and Conditional Distribution Calculation Sahyun Hong and Clayton V. Deutsch Improved numerical reservoir models are constructed when all available diverse data sources are accounted
More informationDetecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference
Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Minh Dao 1, Xiang Xiang 1, Bulent Ayhan 2, Chiman Kwan 2, Trac D. Tran 1 Johns Hopkins Univeristy, 3400
More information- 1 - Fig. A5.1 Missing value analysis dialog box
WEB APPENDIX Sarstedt, M. & Mooi, E. (2019). A concise guide to market research. The process, data, and methods using SPSS (3 rd ed.). Heidelberg: Springer. Missing Value Analysis and Multiple Imputation
More informationAn Introduction to the Bootstrap
An Introduction to the Bootstrap Bradley Efron Department of Statistics Stanford University and Robert J. Tibshirani Department of Preventative Medicine and Biostatistics and Department of Statistics,
More informationA Fast Clustering Algorithm with Application to Cosmology. Woncheol Jang
A Fast Clustering Algorithm with Application to Cosmology Woncheol Jang May 5, 2004 Abstract We present a fast clustering algorithm for density contour clusters (Hartigan, 1975) that is a modified version
More informationRecord Linkage for the American Opportunity Study: Formal Framework and Research Agenda
1 / 14 Record Linkage for the American Opportunity Study: Formal Framework and Research Agenda Stephen E. Fienberg Department of Statistics, Heinz College, and Machine Learning Department, Carnegie Mellon
More informationAutomatic Selection of Compiler Options Using Non-parametric Inferential Statistics
Automatic Selection of Compiler Options Using Non-parametric Inferential Statistics Masayo Haneda Peter M.W. Knijnenburg Harry A.G. Wijshoff LIACS, Leiden University Motivation An optimal compiler optimization
More informationRandom projection for non-gaussian mixture models
Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,
More informationbook 2014/5/6 15:21 page v #3 List of figures List of tables Preface to the second edition Preface to the first edition
book 2014/5/6 15:21 page v #3 Contents List of figures List of tables Preface to the second edition Preface to the first edition xvii xix xxi xxiii 1 Data input and output 1 1.1 Input........................................
More informationCreating a data file and entering data
4 Creating a data file and entering data There are a number of stages in the process of setting up a data file and analysing the data. The flow chart shown on the next page outlines the main steps that
More informationA Bayesian analysis of survey design parameters for nonresponse, costs and survey outcome variable models
A Bayesian analysis of survey design parameters for nonresponse, costs and survey outcome variable models Eva de Jong, Nino Mushkudiani and Barry Schouten ASD workshop, November 6-8, 2017 Outline Bayesian
More informationThe Comparative Study of Machine Learning Algorithms in Text Data Classification*
The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification
More informationA Fast Multivariate Nearest Neighbour Imputation Algorithm
A Fast Multivariate Nearest Neighbour Imputation Algorithm Norman Solomon, Giles Oatley and Ken McGarry Abstract Imputation of missing data is important in many areas, such as reducing non-response bias
More informationChapter 1 Introduction. Chapter Contents
Chapter 1 Introduction Chapter Contents OVERVIEW OF SAS/STAT SOFTWARE................... 17 ABOUT THIS BOOK.............................. 17 Chapter Organization............................. 17 Typographical
More informationThe Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers Detection
Volume-8, Issue-1 February 2018 International Journal of Engineering and Management Research Page Number: 194-200 The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers
More informationIntroduction to Machine Learning. Xiaojin Zhu
Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006
More informationCHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT
CHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT This chapter provides step by step instructions on how to define and estimate each of the three types of LC models (Cluster, DFactor or Regression) and also
More informationModule 1 Lecture Notes 2. Optimization Problem and Model Formulation
Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization
More informationBCLUST -- A program to assess reliability of gene clusters from expression data by using consensus tree and bootstrap resampling method
BCLUST -- A program to assess reliability of gene clusters from expression data by using consensus tree and bootstrap resampling method Introduction This program is developed in the lab of Hongyu Zhao
More informationMissing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA
Missing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA ABSTRACT Statistical analyses can be greatly hampered by missing
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationApplication of Characteristic Function Method in Target Detection
Application of Characteristic Function Method in Target Detection Mohammad H Marhaban and Josef Kittler Centre for Vision, Speech and Signal Processing University of Surrey Surrey, GU2 7XH, UK eep5mm@ee.surrey.ac.uk
More informationPackage midastouch. February 7, 2016
Type Package Version 1.3 Package midastouch February 7, 2016 Title Multiple Imputation by Distance Aided Donor Selection Date 2016-02-06 Maintainer Philipp Gaffert Depends R (>=
More informationUnderstanding Clustering Supervising the unsupervised
Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data
More informationExcel 2010 with XLSTAT
Excel 2010 with XLSTAT J E N N I F E R LE W I S PR I E S T L E Y, PH.D. Introduction to Excel 2010 with XLSTAT The layout for Excel 2010 is slightly different from the layout for Excel 2007. However, with
More informationPSY 9556B (Feb 5) Latent Growth Modeling
PSY 9556B (Feb 5) Latent Growth Modeling Fixed and random word confusion Simplest LGM knowing how to calculate dfs How many time points needed? Power, sample size Nonlinear growth quadratic Nonlinear growth
More informationMultivariate Normal Random Numbers
Multivariate Normal Random Numbers Revised: 10/11/2017 Summary... 1 Data Input... 3 Analysis Options... 4 Analysis Summary... 5 Matrix Plot... 6 Save Results... 8 Calculations... 9 Summary This procedure
More information