This module is part of the Memobust Handbook on Methodology of Modern Business Statistics

26 March 2014

Theme: Donor Imputation

Contents

General section
  1. Summary
  2. General description
    2.1 Introduction to donor imputation
    2.2 Random and sequential hot deck imputation
    2.3 Nearest-neighbour imputation
    2.4 Predictive mean matching
    2.5 Practical issues
  3. Design issues
  4. Available software tools
  5. Decision tree of methods
  6. Glossary
  7. References
Interconnections with other modules
Administrative section

General section

1. Summary

The objective in donor imputation is to fill in the missing values for a given unit by copying observed values of another unit, the donor. Typically, the donor is chosen in such a way that it resembles the imputed unit as much as possible on one or more background characteristics. The rationale behind this is that if the two units match (exactly or approximately) on a number of relevant auxiliary variables, it is likely that their scores on the target variable will also be similar.

2. General description¹

2.1 Introduction to donor imputation

The objective in donor imputation is to fill in the missing values for a given unit (the recipient) by copying the corresponding observed values of another unit (the donor). The term hot deck donor imputation applies when the donor comes from the same data set as the recipient. In the context of business statistics, this is the most commonly encountered form of donor imputation.

If the donor is taken from another data set, this is known as cold deck donor imputation. Most applications of cold deck imputation use data that were collected at a previous point in time. Often, the donor record is then simply an earlier observation of the recipient unit itself. This type of donor imputation is only valid for variables that can be considered more or less constant between observation times; its applicability in the context of business statistics is therefore limited. In the remainder of this module, we shall focus on hot deck imputation.

Letting $y_i$ denote the score of the $i$th unit on the target variable $y$ and using the index $d$ for a donor, we can write the generic formula for hot deck donor imputation as:

$$\tilde{y}_i = y_d. \tag{1}$$

Typically, one searches for a donor that resembles the recipient as much as possible on one or more auxiliary variables. There exist different ways to select a donor, leading to different variants of hot deck imputation. In this module, we shall describe random and sequential hot deck imputation (Section 2.2), nearest-neighbour imputation (Section 2.3), and predictive mean matching (Section 2.4). Some practical issues are discussed in Section 2.5.
In formula (1) and in the description below, we focus on imputing one target variable at a time. In practice, one often encounters records with several missing values. In that case, the standard approach is to impute all missing values in a record from the same donor. This helps to preserve the multivariate relations between the imputed variables. In fact, an important practical advantage of donor imputation compared to model-based imputation is that it can be extended to multivariate imputation in this natural way.

¹ This section is to a large extent based on Chapter 6 of Israëls et al. (2011).
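As an illustration of formula (1) applied to several target variables at once, the following minimal Python sketch copies every missing value of a recipient from one and the same donor. The record layout (a dictionary with `None` marking a missing value) and the variable names `turnover`, `costs` and `employees` are assumptions made for this example only; they do not come from the handbook.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[float]]


def impute_from_donor(recipient: Record, donor: Record) -> Record:
    """Return a copy of the recipient in which every missing (None) value
    is replaced by the donor's observed value for the same variable."""
    imputed = dict(recipient)
    for name, value in recipient.items():
        if value is None:
            if donor.get(name) is None:
                raise ValueError(f"donor has no observed value for {name!r}")
            imputed[name] = donor[name]  # formula (1): y~_i = y_d
    return imputed


# Hypothetical records: two missing values are taken from the same donor,
# while the observed value of the recipient is left untouched.
recipient = {"turnover": None, "costs": None, "employees": 12.0}
donor = {"turnover": 850.0, "costs": 610.0, "employees": 11.0}
result = impute_from_donor(recipient, donor)
```

Because both imputed values come from one observed record, the relation between them (here, costs relative to turnover) is one that actually occurs in the data.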

2.2 Random and sequential hot deck imputation

In random hot deck imputation, imputation classes are formed based on categorical auxiliary variables. For each recipient unit in a given imputation class, the group of potential donors consists of the units within the same class with $y$ observed. Of these potential donors, one is selected at random, typically through equal-probability sampling, and used to impute the recipient. Note that this procedure implies that the donor and the recipient have exactly the same values on all auxiliary variables that are used to define the imputation classes. Conditional on these auxiliary variables, the donor is selected completely at random.

Sequential hot deck imputation also requires that the donor and the recipient have identical values on the auxiliary variables, but here the data set is not explicitly split into groups. Instead, one goes over the records in the data set in order and imputes each missing value by the last previously encountered observed value for a unit with the same scores on the auxiliary variables. Thus, the recipient is imputed using as a donor the last unit with $y$ observed that belongs to the same imputation class and that comes before the recipient in the data file.

Historically, the sequential hot deck method had the advantage that it could be carried out by a computer in a very efficient manner. The algorithm requires just one pass over the data set (Kalton and Kasprzyk, 1986). With the rise of computing power, this is no longer considered a real advantage for most practical applications.

For the sequential hot deck method, the imputations obviously depend on the order of the records in the data set. The method can be applied after a random sorting of the records; this yields stochastic imputations and is sometimes called random sequential hot deck. Alternatively, deterministic imputations may be obtained by sorting the records on one or more background characteristics.
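The two procedures described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: records are dictionaries, `None` marks a missing value, and the class variables `nace` and `size` as well as the target `y` are hypothetical names. A recipient for which no donor is available is simply left missing.

```python
import random
from collections import defaultdict


def random_hot_deck(records, class_vars, target, rng):
    """Impute each missing target value with the value of a donor drawn
    with equal probability from the same imputation class."""
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[tuple(rec[v] for v in class_vars)].append(rec[target])
    out = []
    for rec in records:
        rec = dict(rec)  # work on a copy, keep the input unchanged
        if rec[target] is None:
            pool = donors.get(tuple(rec[v] for v in class_vars))
            if pool:  # stays missing if the class contains no donor
                rec[target] = rng.choice(pool)
        out.append(rec)
    return out


def sequential_hot_deck(records, class_vars, target):
    """Single pass over the file: impute each missing value with the last
    previously observed value from the same imputation class."""
    last_seen = {}
    out = []
    for rec in records:
        rec = dict(rec)
        key = tuple(rec[v] for v in class_vars)
        if rec[target] is not None:
            last_seen[key] = rec[target]
        elif key in last_seen:  # stays missing if no earlier donor exists
            rec[target] = last_seen[key]
        out.append(rec)
    return out


# Hypothetical data file with two imputation classes.
records = [
    {"nace": "A", "size": 1, "y": 10.0},
    {"nace": "A", "size": 1, "y": None},
    {"nace": "B", "size": 2, "y": 30.0},
    {"nace": "A", "size": 1, "y": 14.0},
    {"nace": "B", "size": 2, "y": None},
    {"nace": "A", "size": 1, "y": None},
]
seq = sequential_hot_deck(records, ["nace", "size"], "y")
rnd = random_hot_deck(records, ["nace", "size"], "y", random.Random(1))
```

In `seq`, the second record receives 10.0 and the last record receives 14.0, which shows how the result depends on the order of the file. Sorting or shuffling `records` before calling `sequential_hot_deck` gives the deterministic and random sequential variants, respectively.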
Whichever ordering is chosen, it is recommended to perform some form of explicit sorting before applying the sequential hot deck method, because otherwise the results may be biased due to an implicit and unforeseen ordering of the units in the file.

Typically, the standard errors of means and totals of $y$ will be inflated by random (sequential) hot deck imputation (Little and Rubin, 2002). In part, this may be due to the risk of outliers being magnified, which can be avoided by excluding outliers from the group of potential donors. More generally, it is desirable to prevent the same unit from being used as a donor for many different recipients. In random hot deck imputation, this can be achieved by using a more elaborate selection mechanism, so that a repeated use of the same donor is only allowed once all or most of the potential donors within an imputation class have had a turn.

In sequential hot deck imputation, a repeated use of the same donor may occur whenever there are several item non-respondents close together in the data file. One way to prevent this is to consider an extension of sequential hot deck imputation. Under this extension, one stores the last $K$ observed values within an imputation class (for some $K > 1$). Whenever an item non-respondent is encountered, it is imputed by choosing at random one of the $K$ potential donor values.

2.3 Nearest-neighbour imputation

In nearest-neighbour imputation, we drop the restriction that the donor and the recipient have identical scores on all auxiliary variables. Instead, the auxiliary variables are used to define a distance function $D(i,k)$ between units $i$ and $k$, where $i$ is the recipient and $k$ is a potential donor. The nearest neighbour of unit $i$ is defined as the respondent $d$ that minimises this distance function. Formally,

$$d = \arg\min_{k \in \text{obs}} D(i,k), \tag{2}$$

where obs denotes the set of units with $y$ observed, i.e., the set of potential donors.

Before going into the imputation method itself, we will briefly discuss possible choices of the distance function in formula (2). Assuming for now that the auxiliary variables $(x_1, \ldots, x_q)$ are all quantitative (but see Section 2.5), a frequently used family of distance functions is given by:

$$D_z(i,k) = \left( \sum_{j=1}^{q} |x_{ij} - x_{kj}|^z \right)^{1/z} \tag{3}$$

with $z > 0$. For $z = 2$, formula (3) yields the well-known Euclidean distance. For $z = 1$, it is just the sum of the absolute differences $|x_{ij} - x_{kj}|$; this is sometimes called the city-block or Manhattan distance. As $z$ becomes larger, formula (3) places a higher penalty on large differences for individual auxiliary variables. In fact, by letting $z$ tend to infinity in (3), we obtain the so-called minimax distance given by

$$D_\infty(i,k) = \max_{j=1,\ldots,q} |x_{ij} - x_{kj}|. \tag{4}$$

According to distance (4), the nearest neighbour should not deviate strongly from the recipient on any auxiliary variable $x_j$. Practical applications of nearest-neighbour imputation that involve distance function (3) with choices other than $z = 1$, $z = 2$, or $z \to \infty$ are rare.

A generalisation of (3) is obtained by including weight factors $\gamma_j$ that express the importance of each auxiliary variable for the purpose of finding accurate imputations:

$$D_{z,\gamma}(i,k) = \left( \sum_{j=1}^{q} \gamma_j |x_{ij} - x_{kj}|^z \right)^{1/z}. \tag{5}$$

In addition, note that the contributions of the auxiliary variables to (3) or (5) are implicitly weighted if these variables are measured on different scales. For instance, if $x_1$ represents last year's turnover in Euros and $x_2$ represents the number of employees, then the value of $D_1(i,k) = |x_{i1} - x_{k1}| + |x_{i2} - x_{k2}|$ will depend almost exclusively on the first term in practice. To prevent this, one should first standardise the auxiliary variables so that their variances are equal to 1. Alternatively, the so-called Mahalanobis distance could be used, which also takes correlations between variables into account (see, e.g., Little and Rubin, 2002); this can be seen as a generalisation of the Euclidean distance $D_2(i,k)$.
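To make the role of the scale of the auxiliary variables concrete, the sketch below implements formulas (2) to (5) in plain Python and applies them to the turnover and employees example. The function names and the numbers are illustrative assumptions. Standardisation here divides each variable by its population standard deviation computed over the recipient and all potential donors, which is one of several reasonable conventions.

```python
from statistics import pstdev


def minkowski(x_i, x_k, z=2.0, weights=None):
    """Distance (3); with weights gamma_j it becomes distance (5)."""
    if weights is None:
        weights = [1.0] * len(x_i)
    total = sum(g * abs(a - b) ** z for g, a, b in zip(weights, x_i, x_k))
    return total ** (1.0 / z)


def minimax(x_i, x_k):
    """Distance (4): largest absolute difference over the variables."""
    return max(abs(a - b) for a, b in zip(x_i, x_k))


def standardise(rows):
    """Rescale every column so that its (population) variance equals 1."""
    sds = [pstdev(col) for col in zip(*rows)]
    return [[v / s for v, s in zip(row, sds)] for row in rows]


def nearest_neighbour(x_i, donors, distance):
    """Formula (2): index of the potential donor minimising the distance."""
    return min(range(len(donors)), key=lambda k: distance(x_i, donors[k]))


def city_block(a, b):
    """Distance (3) with z = 1."""
    return minkowski(a, b, z=1.0)


# Hypothetical units: (turnover in Euros, number of employees).
recipient = [100000.0, 50.0]
donors = [[100500.0, 5.0], [102000.0, 48.0]]

# On the raw scale, turnover dominates: donor 0 is chosen despite having
# 5 employees instead of 50.
raw_choice = nearest_neighbour(recipient, donors, city_block)

# After standardisation, both variables contribute: donor 1 is chosen.
scaled = standardise([recipient] + donors)
std_choice = nearest_neighbour(scaled[0], scaled[1:], city_block)
```

The weight factors of formula (5) can be passed through the `weights` argument, for example `minkowski(a, b, z=2.0, weights=[1.0, 4.0])` to give the second variable more influence.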
In its basic form, the nearest-neighbour method imputes an item non-respondent by using its nearest neighbour as donor. This yields a deterministic imputation. As before, the underlying idea is that two units that are closely matched on relevant background characteristics [i.e., for which $D(i,d)$ has a small value] are likely to also have a similar score on the target variable.

A stochastic generalisation of nearest-neighbour imputation first selects the $K$ units that are closest to unit $i$ in terms of $D(i,k)$, i.e., the $K$ nearest neighbours, as potential donors and then draws one of these units at random. In some applications, unequal drawing probabilities are assigned to the $K$ nearest neighbours so that within this group the units with smaller values of $D(i,k)$ are more likely to

be selected as donor. Following Bankier et al. (2000), an appropriate choice of drawing probability for the $k$th potential donor is then given by:

$$p_k \propto \left( \frac{D_{\min}}{D(i,k)} \right)^{t}, \quad (k = 1, \ldots, K), \tag{6}$$

where $D_{\min} = \min_{k \in \text{obs}} D(i,k)$ denotes the distance of the nearest neighbour and $t \geq 0$ is a parameter determining the selection mechanism. Equal-probability selection is obtained as a special case of (6) with $t = 0$. The method coincides with ordinary deterministic nearest-neighbour imputation in the limit $t \to \infty$.

2.4 Predictive mean matching

Little (1988) described a variant of donor imputation known as predictive mean matching. In this imputation method, a linear regression is first performed of the target variable $y$ on some auxiliary variables $x_1, \ldots, x_q$. The regression model is fitted on the data of units without item non-response. Next, the resulting regression equation is used to obtain predicted values $\hat{y}_i$ for all records, in accordance with formula (4) in the module "Imputation – Model-Based Imputation". For item non-respondent $i$ with predicted value $\hat{y}_i$, we select as donor the item respondent $d$ for which the predicted value $\hat{y}_d$ is as close as possible to $\hat{y}_i$. Finally, the observed value $y_d$ of the donor is imputed, in accordance with formula (1) above. The latter feature makes this method a form of donor imputation rather than model-based imputation.

It should be noted that predictive mean matching is actually a special case of nearest-neighbour imputation. This is easily seen by considering the distance function $D_{\text{pmm}}(i,k) = |\hat{y}_i - \hat{y}_k|$ and choosing the donor according to formula (2). Alternatively, this distance function can be expressed as a weighted sum of differences between the auxiliary variables used in the regression (De Waal et al., 2011, p. 253).

2.5 Practical issues

Random and sequential hot deck imputation require that the auxiliary variables are categorical, because these variables are used to construct imputation classes. Quantitative auxiliary variables can be included by first deriving categorised versions of them (e.g., a size class variable based on the number of employees).
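Two of the donor selection rules from Sections 2.3 and 2.4 can be sketched as follows: the drawing probabilities of formula (6) for the $K$ nearest neighbours, and predictive mean matching. This is a hedged illustration, not the implementation used by any particular tool. It assumes that all distances are strictly positive, and for brevity the regression in the second function uses a single auxiliary variable fitted by ordinary least squares; all function names and data are hypothetical.

```python
import random


def drawing_probabilities(distances, K, t):
    """Formula (6): probabilities for the K nearest potential donors,
    proportional to (D_min / D(i,k))^t. Assumes all distances > 0."""
    if min(distances) <= 0.0:
        raise ValueError("this sketch assumes strictly positive distances")
    nearest = sorted(range(len(distances)), key=lambda k: distances[k])[:K]
    d_min = distances[nearest[0]]
    raw = [(d_min / distances[k]) ** t for k in nearest]
    total = sum(raw)
    return {k: r / total for k, r in zip(nearest, raw)}


def draw_donor(distances, K, t, rng):
    """Draw one donor index among the K nearest neighbours."""
    probs = drawing_probabilities(distances, K, t)
    candidates = list(probs)
    return rng.choices(candidates, weights=[probs[k] for k in candidates])[0]


def fit_simple_regression(x, y):
    """Ordinary least squares for y = a + b * x (one auxiliary variable)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def predictive_mean_matching(x, y):
    """Impute each missing y (None) with the observed y of the respondent
    whose predicted value is closest, i.e. D_pmm(i,k) = |yhat_i - yhat_k|."""
    obs = [i for i, v in enumerate(y) if v is not None]
    a, b = fit_simple_regression([x[i] for i in obs], [y[i] for i in obs])
    y_hat = [a + b * v for v in x]
    imputed = list(y)
    for i, v in enumerate(y):
        if v is None:
            d = min(obs, key=lambda k: abs(y_hat[i] - y_hat[k]))
            imputed[i] = y[d]  # the donor's observed value, not y_hat[i]
    return imputed


# Hypothetical distances from one recipient to four potential donors.
p_equal = drawing_probabilities([2.0, 1.0, 4.0, 8.0], K=3, t=0.0)
p_t1 = drawing_probabilities([2.0, 1.0, 4.0, 8.0], K=3, t=1.0)
donor_index = draw_donor([2.0, 1.0, 4.0, 8.0], K=3, t=1.0, rng=random.Random(7))

# Hypothetical data: the last unit is an item non-respondent.
pmm_result = predictive_mean_matching([1.0, 2.0, 3.0, 4.0, 2.2],
                                      [10.0, 19.0, 33.0, 41.0, None])
```

With $t = 0$ the three nearest donors are equally likely; with $t = 1$ the nearest donor is twice as likely as the second nearest, whose distance is twice as large. In the predictive mean matching example, the imputed value is the observed score 19.0 of the respondent with the closest prediction, rather than the regression prediction itself.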
Nearest-neighbour imputation is used mainly with quantitative auxiliary variables. It is also possible to include categorical auxiliary variables, but this requires an appropriate extension of the distance function. One way to do this is to assign, for each categorical variable separately, a distance to each possible pair of values. For an auxiliary variable $x_j$ with $m$ categories, this local distance function can be summarised in the form of an $m \times m$ matrix $A_j$. Next, we can define a global distance function of the form (3) or (5), by replacing the absolute difference $|x_{ij} - x_{kj}|$ by the value

$A_j(x_{ij}, x_{kj})$ in these expressions. Similarly, a combination of quantitative and qualitative auxiliary variables can also be handled in nearest-neighbour imputation.

An alternative way to handle a combination of quantitative and qualitative auxiliary variables is to combine the random and nearest-neighbour hot deck methods. That is, we first use the categorical variables to construct imputation classes. Next, within each imputation class, we apply the nearest-neighbour method using a distance function of quantitative variables. In this case, the donor has to match the recipient exactly on the categorical variables, but their scores on the quantitative variables may be different. The approach described in the previous paragraph, based on an extended distance function, offers more flexibility.

It is possible to take sampling weights into account in the selection of the donor; see Kalton (1983) and Andridge and Little (2009). As discussed in "Imputation – Main Module", there is no consensus of opinion on the necessity in general of incorporating sampling weights into imputation procedures. However, it is often useful to ensure that recipients are imputed from donors with similarly sized weights. Effectively, donor imputation increases the weight of a donor by adding the weights of its recipients (Kalton, 1983). Therefore, if a donor with a small weight is used to impute a recipient with a much larger weight, the influence of that donor on the survey estimates increases disproportionally; as a result, the variances of these estimates will be inflated. To prevent this, the weighting variable or the design variables that constitute the weighting model may be included as auxiliary variables in the donor selection. Andridge and Little (2009) compared the performance of hot deck imputation with and without the inclusion of sampling weights in a simulation study.

3. Design issues

4. Available software tools

Several R packages are available that can perform hot deck donor imputation, including StatMatch and mice. The Banff system by Statistics Canada performs nearest-neighbour imputation for quantitative data.
CANCEIS, another tool by Statistics Canada, offers more advanced nearest-neighbour imputation functionality for quantitative and qualitative data. It should be noted that CANCEIS is mainly aimed at social statistics, in particular the population census.

5. Decision tree of methods

6. Glossary

For definitions of terms used in this module, please refer to the separate "Glossary" provided as part of the handbook.

7. References

Andridge, R. R. and Little, R. J. (2009), The Use of Sampling Weights in Hot Deck Imputation. Journal of Official Statistics 25, 21–36.

Bankier, M., Lachance, M., and Poirier, P. (2000), 2001 Canadian Census Minimum Change Donor Imputation Methodology. Working Paper, UN/ECE Work Session on Statistical Data Editing, Cardiff.

De Waal, T., Pannekoek, J., and Scholtus, S. (2011), Handbook of Statistical Data Editing and Imputation. John Wiley & Sons, New Jersey.

Israëls, A., Kuijvenhoven, L., van der Laan, J., Pannekoek, J., and Schulte Nordholt, E. (2011), Imputation. Methods Series Theme, Statistics Netherlands, The Hague.

Kalton, G. (1983), Compensating for Missing Survey Data. Survey Research Center Institute for Social Research, The University of Michigan.

Kalton, G. and Kasprzyk, D. (1986), The Treatment of Missing Survey Data. Survey Methodology 12, 1–16.

Little, R. J. A. (1988), Missing-Data Adjustments in Large Surveys. Journal of Business & Economic Statistics 6, 287–296.

Little, R. J. A. and Rubin, D. B. (2002), Statistical Analysis with Missing Data, second edition. John Wiley & Sons, New York.

Interconnections with other modules

8. Related themes described in other modules
  1. Imputation – Main Module
  2. Imputation – Model-Based Imputation

9. Methods explicitly referred to in this module
  1.

10. Mathematical techniques explicitly referred to in this module
  1.

11. GSBPM phases explicitly referred to in this module
  1. GSBPM Sub-process 5.4: Impute

12. Tools explicitly referred to in this module
  1. Banff
  2. CANCEIS
  3. R

13. Process steps explicitly referred to in this module
  1. Imputation, i.e., determining and filling in new values for occurrences of missing or discarded values in a data file

Administrative section

14. Module code

Imputation-T-Donor Imputation

15. Version history

Version  Date        Description of changes                     Author           Institute
0.1      28-03-2013  first version                              Sander Scholtus  CBS (Netherlands)
0.2      15-07-2013  improvements based on Swedish review       Sander Scholtus  CBS (Netherlands)
0.3      07-10-2013  improvements based on Norwegian review     Sander Scholtus  CBS (Netherlands)
0.3.1    21-10-2013  preliminary release
1.0      26-03-2014  final version within the Memobust project

16. Template version and print date

Template version used: 1.0 p 4 d.d. 22-11-2012
Print date: 21-3-2014 18:16