This module is part of the Memobust Handbook on Methodology of Modern Business Statistics

26 March 2014

Theme: Donor Imputation

Contents
General section
1. Summary
2. General description
2.1 Introduction to donor imputation
2.2 Random and sequential hot deck imputation
2.3 Nearest-neighbour imputation
2.4 Predictive mean matching
2.5 Practical issues
3. Design issues
4. Available software tools
5. Decision tree of methods
6. Glossary
7. References
Interconnections with other modules
Administrative section

General section

1. Summary

The objective in donor imputation is to fill in the missing values for a given unit by copying observed values of another unit, the donor. Typically, the donor is chosen in such a way that it resembles the imputed unit as much as possible on one or more background characteristics. The rationale behind this is that if the two units match (exactly or approximately) on a number of relevant auxiliary variables, it is likely that their scores on the target variable will also be similar.

2. General description

(Note: this section is to a large extent based on Chapter 6 of Israëls et al. (2011).)

2.1 Introduction to donor imputation

The objective in donor imputation is to fill in the missing values for a given unit (the recipient) by copying the corresponding observed values of another unit (the donor). The term hot deck donor imputation applies when the donor comes from the same data set as the recipient. In the context of business statistics, this is the most commonly encountered form of donor imputation.

If the donor is taken from another data set, this is known as cold deck donor imputation. Most applications of cold deck imputation use data that were collected at a previous point in time. Often, the donor record is then simply an earlier observation of the recipient unit itself. This type of donor imputation is only valid for variables that can be considered more or less constant between observation times; its applicability in the context of business statistics is therefore limited. In the remainder of this module, we shall focus on hot deck imputation.

Letting $y_i$ denote the score of the $i$th unit on the target variable $y$ and using the index $d$ for a donor, we can write the generic formula for hot deck donor imputation as:

$$\tilde{y}_i = y_d. \tag{1}$$

Typically, one searches for a donor that resembles the recipient as much as possible on one or more auxiliary variables. There exist different ways to select a donor, leading to different variants of hot deck imputation. In this module, we shall describe random and sequential hot deck imputation (Section 2.2), nearest-neighbour imputation (Section 2.3), and predictive mean matching (Section 2.4). Some practical issues are discussed in Section 2.5.

In formula (1) and in the description below, we focus on imputing one target variable at a time. In practice, one often encounters records with several missing values. In that case, the standard approach is to impute all missing values in a record from the same donor. This helps to preserve the multivariate relations between the imputed variables. In fact, an important practical advantage of donor imputation compared to model-based imputation is that it can be extended to multivariate imputation in this natural way.
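As a minimal illustration of formula (1) extended to several target variables, the following Python sketch copies every missing value in a recipient record from one and the same donor record. The record layout (dictionaries in which `None` marks a missing value), the variable names and the function name `impute_from_donor` are assumptions made for this example only; they do not come from the handbook.

```python
def impute_from_donor(recipient, donor):
    """Return a copy of `recipient` in which every missing value (None)
    is replaced by the donor's observed value for the same variable."""
    imputed = dict(recipient)
    for variable, value in recipient.items():
        if value is None:
            if donor.get(variable) is None:
                raise ValueError(f"donor has no observed value for {variable!r}")
            imputed[variable] = donor[variable]
    return imputed


# Hypothetical records: two missing values are filled from a single donor.
recipient = {"turnover": None, "costs": None, "employees": 12}
donor = {"turnover": 950.0, "costs": 700.0, "employees": 11}
print(impute_from_donor(recipient, donor))
```

Because both imputed values come from the same donor, the imputed combination of turnover and costs is one that was actually observed for a real unit, which is how the multivariate relations are preserved.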

2.2 Random and sequential hot deck imputation

In random hot deck imputation, imputation classes are formed based on categorical auxiliary variables. For each recipient unit in a given imputation class, the group of potential donors consists of the units within the same class with $y$ observed. Of these potential donors, one is selected at random (typically through equal-probability sampling) and used to impute the recipient. Note that this procedure implies that the donor and the recipient have exactly the same values on all auxiliary variables that are used to define the imputation classes. Conditional on these auxiliary variables, the donor is selected completely at random.

Sequential hot deck imputation also requires that the donor and the recipient have identical values on the auxiliary variables, but here the data set is not explicitly split into groups. Instead, one goes over the records in the data set in order and imputes each missing value by the last previously encountered observed value for a unit with the same scores on the auxiliary variables. Thus, the recipient is imputed using as a donor the last unit with $y$ observed that belongs to the same imputation class and that comes before the recipient in the data file.

Historically, the sequential hot deck method had the advantage that it can be carried out by a computer in a very efficient manner. The algorithm requires just one pass over the data set (Kalton and Kasprzyk, 1986). With the rise of computing power, this is no longer considered a real advantage for most practical applications.

For the sequential hot deck method, the imputations obviously depend on the order of the records in the data set. The method can be applied after a random sorting of the records; this yields stochastic imputations and is sometimes called random sequential hot deck. Alternatively, deterministic imputations may be obtained by sorting the records on one or more background characteristics.
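The two procedures described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a reference implementation: records are dictionaries, `None` marks a missing value, `class_key` is a hypothetical function that returns the imputation class of a record, and a recipient for which no donor is available is simply left missing.

```python
import random


def random_hot_deck(records, class_key, target, rng=None):
    """Impute missing `target` values (None) by drawing, with equal
    probability, an observed value from the same imputation class."""
    rng = rng or random.Random()
    donors = {}
    for rec in records:
        if rec[target] is not None:
            donors.setdefault(class_key(rec), []).append(rec[target])
    result = []
    for rec in records:
        rec = dict(rec)
        if rec[target] is None:
            pool = donors.get(class_key(rec))
            if pool:  # assumption: stay missing if the class has no donor
                rec[target] = rng.choice(pool)
        result.append(rec)
    return result


def sequential_hot_deck(records, class_key, target):
    """One pass over the file: impute each missing value by the last
    previously encountered observed value in the same imputation class."""
    last_seen = {}
    result = []
    for rec in records:
        rec = dict(rec)
        cls = class_key(rec)
        if rec[target] is not None:
            last_seen[cls] = rec[target]  # only observed values are stored
        elif cls in last_seen:
            rec[target] = last_seen[cls]
        result.append(rec)
    return result


def by_size(rec):
    """Hypothetical imputation class: the size class of the unit."""
    return rec["size"]


data = [
    {"size": "small", "y": 10.0},
    {"size": "large", "y": 90.0},
    {"size": "small", "y": None},
    {"size": "small", "y": 14.0},
    {"size": "large", "y": None},
    {"size": "small", "y": None},
]
print([rec["y"] for rec in sequential_hot_deck(data, by_size, "y")])
print([rec["y"] for rec in random_hot_deck(data, by_size, "y", random.Random(1))])
```

In the sequential version the third record receives 10.0 and the last record receives 14.0, which shows how the result depends on the order of the file. In the random version both missing "small" records draw from the pool {10.0, 14.0}, regardless of position.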
Whether the records are sorted randomly or deterministically, it is recommended to perform some form of explicit sorting before applying the sequential hot deck method, because otherwise the results may be biased due to an implicit and unforeseen ordering of the units in the file.

Typically, the standard errors of means and totals of $y$ will be inflated by random (sequential) hot deck imputation (Little and Rubin, 2002). In part, this may be due to the risk of outliers being magnified, which can be avoided by excluding outliers from the group of potential donors. More generally, it is desirable to prevent the same unit from being used as a donor for many different recipients. In random hot deck imputation, this can be achieved by using a more elaborate selection mechanism, so that a repeated use of the same donor is only allowed once all or most of the potential donors within an imputation class have had a turn. In sequential hot deck imputation, a repeated use of the same donor may occur whenever there are several item non-respondents close together in the data file. One way to prevent this is to consider an extension of sequential hot deck imputation. Under this extension, one stores the last $K$ observed values within an imputation class (for some $K > 1$). Whenever an item non-respondent is encountered, it is imputed by choosing at random one of the $K$ potential donor values.

2.3 Nearest-neighbour imputation

In nearest-neighbour imputation, we drop the restriction that the donor and the recipient have identical scores on all auxiliary variables. Instead, the auxiliary variables are used to define a distance function $D(i,k)$ between units $i$ and $k$, where $i$ is the recipient and $k$ is a potential donor. The nearest neighbour of unit $i$ is defined as the respondent $d$ that minimises this distance function. Formally,

$$d = \arg\min_{k \in \mathrm{obs}} D(i,k), \tag{2}$$

where obs denotes the set of units with $y$ observed, i.e., the set of potential donors.

Before going into the imputation method itself, we will briefly discuss possible choices of the distance function in formula (2). Assuming for now that the auxiliary variables $(x_1, \ldots, x_q)$ are all quantitative (but see Section 2.5), a frequently used family of distance functions is given by:

$$D_z(i,k) = \left( \sum_{j=1}^{q} \left| x_{ji} - x_{jk} \right|^{z} \right)^{1/z} \tag{3}$$

with $z > 0$. For $z = 2$, formula (3) yields the well-known Euclidean distance. For $z = 1$, it is just the sum of the absolute differences $|x_{ji} - x_{jk}|$; this is sometimes called the city-block or Manhattan distance. As $z$ becomes larger, formula (3) places a higher penalty on large differences for individual auxiliary variables. In fact, by letting $z$ tend to infinity in (3), we obtain the so-called minimax distance given by

$$D_\infty(i,k) = \max_{j=1,\ldots,q} \left| x_{ji} - x_{jk} \right|. \tag{4}$$

According to distance (4), the nearest neighbour should not deviate strongly from the recipient on any auxiliary variable $x_j$. Practical applications of nearest-neighbour imputation that involve distance function (3) with choices other than $z = 1$, $z = 2$, or $z \to \infty$ are rare.

A generalisation of (3) is obtained by including weight factors $\gamma_j$ that express the importance of each auxiliary variable for the purpose of finding accurate imputations:

$$D_{z,\gamma}(i,k) = \left( \sum_{j=1}^{q} \gamma_j \left| x_{ji} - x_{jk} \right|^{z} \right)^{1/z}. \tag{5}$$

In addition, note that the contributions of the auxiliary variables to (3) or (5) are implicitly weighted if these variables are measured on different scales. For instance, if $x_1$ represents last year's turnover in Euros and $x_2$ represents the number of employees, then the value of $D_1(i,k) = |x_{1i} - x_{1k}| + |x_{2i} - x_{2k}|$ will depend almost exclusively on the first term in practice. To prevent this, one should first standardise the auxiliary variables so that their variances are equal to 1. Alternatively, the so-called Mahalanobis distance could be used, which also takes correlations between variables into account (see, e.g., Little and Rubin, 2002); this can be seen as a generalisation of the Euclidean distance $D_2(i,k)$.
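The distance functions (3) to (5), the standardisation step and the donor search of formula (2) can be sketched in Python as follows. The function names and the small data set are hypothetical and serve only to illustrate the scaling issue described above; standardisation here divides by the sample standard deviation, which is one possible way to obtain unit variances.

```python
import statistics


def standardise(columns):
    """Rescale each auxiliary variable (one list per variable) so that
    its sample variance equals 1."""
    scaled = []
    for col in columns:
        sd = statistics.stdev(col)
        scaled.append([v / sd for v in col])
    return scaled


def minkowski(x_i, x_k, z=2.0, weights=None):
    """Distance (5); with weights=None it reduces to formula (3)."""
    if weights is None:
        weights = [1.0] * len(x_i)
    total = sum(g * abs(a - b) ** z for g, a, b in zip(weights, x_i, x_k))
    return total ** (1.0 / z)


def minimax(x_i, x_k):
    """Distance (4): the largest absolute difference over the variables."""
    return max(abs(a - b) for a, b in zip(x_i, x_k))


def city_block(x_i, x_k):
    """Distance (3) with z = 1."""
    return minkowski(x_i, x_k, z=1.0)


def nearest_neighbour(x_i, donors, distance=minkowski):
    """Formula (2): position of the potential donor minimising the distance."""
    return min(range(len(donors)), key=lambda k: distance(x_i, donors[k]))


# Hypothetical data: unit 0 is the recipient, units 1 and 2 are potential donors.
turnover = [1_000_000.0, 1_050_000.0, 1_400_000.0]   # Euros
employees = [10.0, 400.0, 12.0]

raw = list(zip(turnover, employees))
scaled = list(zip(*standardise([turnover, employees])))

raw_choice = nearest_neighbour(raw[0], raw[1:], city_block)
scaled_choice = nearest_neighbour(scaled[0], scaled[1:], city_block)
print(raw_choice, scaled_choice)
```

On the raw data the turnover term dominates, so the first donor is chosen even though it has 400 employees against the recipient's 10. After standardisation the second donor, which is much closer in number of employees, becomes the nearest neighbour.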
In its basic form, the nearest-neighbour method imputes an item non-respondent by using its nearest neighbour as donor. This yields a deterministic imputation. As before, the underlying idea is that two units that are closely matched on relevant background characteristics [i.e., for which $D(i,k)$ has a small value] are likely to also have a similar score on the target variable.

A stochastic generalisation of nearest-neighbour imputation first selects the $K$ units that are closest to unit $i$ in terms of $D(i,k)$ (i.e., the $K$ nearest neighbours) as potential donors and then draws one of these units at random. In some applications, unequal drawing probabilities are assigned to the $K$ nearest neighbours so that within this group the units with smaller values of $D(i,k)$ are more likely to be selected as donor. Following Bankier et al. (2000), an appropriate choice of drawing probability for the $k$th potential donor is then given by:

$$p_k \propto \left( \frac{D_{\min}}{D(i,k)} \right)^{t}, \quad (k = 1, \ldots, K), \tag{6}$$

where $D_{\min} = \min_{k \in \mathrm{obs}} D(i,k)$ denotes the distance of the nearest neighbour and $t \geq 0$ is a parameter determining the selection mechanism. Equal-probability selection is obtained as a special case of (6) with $t = 0$. The method coincides with ordinary deterministic nearest-neighbour imputation in the limit $t \to \infty$.

2.4 Predictive mean matching

Little (1988) described a variant of donor imputation known as predictive mean matching. In this imputation method, a linear regression is first performed of the target variable $y$ on some auxiliary variables $x_1, \ldots, x_q$. The regression model is fitted on the data of units without item non-response. Next, the resulting regression equation is used to obtain predicted values $\hat{y}_i$ for all records, in accordance with formula (4) in the module "Imputation – Model-Based Imputation". For item non-respondent $i$ with predicted value $\hat{y}_i$, we select as donor the item respondent $d$ for which the predicted value $\hat{y}_d$ is as close as possible to $\hat{y}_i$. Finally, the observed value $y_d$ of the donor is imputed, in accordance with formula (1) above. The latter feature makes this method a form of donor imputation rather than model-based imputation.

It should be noted that predictive mean matching is actually a special case of nearest-neighbour imputation. This is easily seen by considering the distance function $D_{pmm}(i,k) = |\hat{y}_i - \hat{y}_k|$ and choosing the donor according to formula (2). Alternatively, this distance function can be expressed as a weighted sum of differences between the auxiliary variables used in the regression (De Waal et al., 2011, p. 253).

2.5 Practical issues

Random and sequential hot deck imputation require that the auxiliary variables are categorical, because these variables are used to construct imputation classes. Quantitative auxiliary variables can be included by first deriving categorised versions of them (e.g., a size class variable based on the number of employees).
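The stochastic donor draw of formula (6) in Section 2.3 and the predictive mean matching procedure of Section 2.4 can be sketched in Python as follows. Several simplifications are assumptions of this example and not part of the handbook: the regression uses a single auxiliary variable, `None` marks a missing value, the function names are hypothetical, and when the nearest neighbour is at distance zero (where the ratio in (6) is undefined) only exact matches are drawn for $t > 0$.

```python
import random


def draw_probabilities(distances, t):
    """Drawing probabilities (6) for the K nearest neighbours:
    p_k proportional to (D_min / D(i,k)) ** t, normalised to sum to 1."""
    d_min = min(distances)
    if d_min == 0.0:
        # Assumption: with exact matches present and t > 0, only they are drawn.
        raw = [1.0 if (d == 0.0 or t == 0) else 0.0 for d in distances]
    else:
        raw = [(d_min / d) ** t for d in distances]
    total = sum(raw)
    return [r / total for r in raw]


def draw_donor(distances, t, rng=None):
    """Draw the position of one donor among the K nearest neighbours."""
    rng = rng or random.Random()
    weights = draw_probabilities(distances, t)
    return rng.choices(range(len(distances)), weights=weights, k=1)[0]


def fit_line(x, y):
    """Ordinary least squares of y on a single auxiliary variable x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predictive_mean_matching(x, y):
    """Impute each missing y (None) with the observed y of the respondent
    whose predicted value is closest to the recipient's predicted value."""
    resp = [k for k, v in enumerate(y) if v is not None]
    intercept, slope = fit_line([x[k] for k in resp], [y[k] for k in resp])
    y_hat = [intercept + slope * a for a in x]
    imputed = list(y)
    for i, v in enumerate(y):
        if v is None:
            d = min(resp, key=lambda k: abs(y_hat[i] - y_hat[k]))
            imputed[i] = y[d]  # copy the donor's observed value, formula (1)
    return imputed


print(draw_probabilities([1.0, 2.0, 4.0], t=0))
print(draw_probabilities([1.0, 2.0, 4.0], t=1))
print(predictive_mean_matching([1.0, 2.0, 3.0, 4.0, 2.4],
                               [2.1, 3.9, 6.2, 7.8, None]))
```

With $t = 0$ the three candidate donors are equally likely, and with larger $t$ the probability mass shifts towards the nearest neighbour. In the predictive mean matching call, the missing value is replaced by 3.9, an actually observed value, rather than by the regression prediction itself.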
Nearest-neighbour imputation is used mainly with quantitative auxiliary variables. It is also possible to include categorical auxiliary variables, but this requires an appropriate extension of the distance function. One way to do this is to assign, for each categorical variable separately, a distance to each possible pair of values. For an auxiliary variable $x_j$ with $m$ categories, this local distance function can be summarised in the form of an $m \times m$ matrix $A_j$. Next, we can define a global distance function of the form (3) or (5), by replacing the absolute difference $|x_{ji} - x_{jk}|$ by the value $A_j(x_{ji}, x_{jk})$ in these expressions. Similarly, a combination of quantitative and qualitative auxiliary variables can also be handled in nearest-neighbour imputation.

An alternative way to handle a combination of quantitative and qualitative auxiliary variables is to combine the random and nearest-neighbour hot deck methods. That is, we first use the categorical variables to construct imputation classes. Next, within each imputation class, we apply the nearest-neighbour method using a distance function of quantitative variables. In this case, the donor has to match the recipient exactly on the categorical variables but their scores on the quantitative variables may be different. The approach in the previous paragraph offers more flexibility.

It is possible to take sampling weights into account in the selection of the donor; see Kalton (1983) and Andridge and Little (2009). As discussed in "Imputation – Main Module", there is no consensus of opinion on the necessity in general of incorporating sampling weights into imputation procedures. However, it is often useful to ensure that recipients are imputed from donors with similarly-sized weights. Effectively, donor imputation increases the weight of a donor by adding the weights of its recipients (Kalton, 1983). Therefore, if a donor with a small weight is used to impute a recipient with a much larger weight, the influence of that donor on the survey estimates increases disproportionally; as a result, the variances of these estimates will be inflated. To prevent this, the weighting variable or the design variables that constitute the weighting model may be included as auxiliary variables in the donor selection. Andridge and Little (2009) compared the performance of hot deck imputation with and without the inclusion of sampling weights in a simulation study.

3. Design issues

4. Available software tools

Several R packages are available that can perform hot deck donor imputation, including StatMatch and mice. The Banff system by Statistics Canada performs nearest-neighbour imputation for quantitative data.
CANCEIS, another tool by Statistics Canada, offers more advanced nearest-neighbour imputation functionality for quantitative and qualitative data. It should be noted that CANCEIS is mainly aimed at social statistics, in particular the population census.

5. Decision tree of methods

6. Glossary

For definitions of terms used in this module, please refer to the separate "Glossary" provided as part of the handbook.

7. References

Andridge, R. R. and Little, R. J. (2009), The Use of Sampling Weights in Hot Deck Imputation. Journal of Official Statistics 25, 21–36.

Bankier, M., Lachance, M., and Poirier, P. (2000), 2001 Canadian Census Minimum Change Donor Imputation Methodology. Working Paper, UN/ECE Work Session on Statistical Data Editing, Cardiff.

De Waal, T., Pannekoek, J., and Scholtus, S. (2011), Handbook of Statistical Data Editing and Imputation. John Wiley & Sons, New Jersey.

Israëls, A., Kuijvenhoven, L., van der Laan, J., Pannekoek, J., and Schulte Nordholt, E. (2011), Imputation. Methods Series Theme, Statistics Netherlands, The Hague.

Kalton, G. (1983), Compensating for Missing Survey Data. Survey Research Center Institute for Social Research, The University of Michigan.

Kalton, G. and Kasprzyk, D. (1986), The Treatment of Missing Survey Data. Survey Methodology 12, 1–16.

Little, R. J. A. (1988), Missing-Data Adjustments in Large Surveys. Journal of Business & Economic Statistics 6, 287–296.

Little, R. J. A. and Rubin, D. B. (2002), Statistical Analysis with Missing Data, second edition. John Wiley & Sons, New York.

Interconnections with other modules

8. Related themes described in other modules
   1. Imputation – Main Module
   2. Imputation – Model-Based Imputation

9. Methods explicitly referred to in this module
   1.

10. Mathematical techniques explicitly referred to in this module
   1.

11. GSBPM phases explicitly referred to in this module
   1. GSBPM Sub-process 5.4: Impute

12. Tools explicitly referred to in this module
   1. Banff
   2. CANCEIS
   3. R

13. Process steps explicitly referred to in this module
   1. Imputation, i.e., determining and filling in new values for occurrences of missing or discarded values in a data file

Administrative section

14. Module code

Imputation-T-Donor Imputation

15. Version history

Version   Date         Description of changes
0.1       28-03-2013   first version
0.2       15-07-2013   improvements based on Swedish review
0.3       07-10-2013   improvements based on Norwegian review
0.3.1     21-10-2013   preliminary release
1.0       26-03-2014   final version within the Memobust project

Author and institute as listed in the version history: Sander Scholtus, CBS (Netherlands).

16. Template version and print date

Template version used: 1.0 p 4 d.d. 22-11-2012
Print date: 21-3-2014 18:16