Article. A nonparametric method to generate synthetic populations to adjust for complex sampling design features


Component of Statistics Canada Catalogue no X
Business Survey Methods Division

Article

A nonparametric method to generate synthetic populations to adjust for complex sampling design features

by Qi Dong, Michael R. Elliott and Trivellore E. Raghunathan

June 2014

Published by authority of the Minister responsible for Statistics Canada. Minister of Industry, 2014. All rights reserved. Use of this publication is governed by the Statistics Canada Open Licence Agreement (http://www.statcan.gc.ca/reference/licence-eng.html).

Survey Methodology, June 2014, Vol. 40, No. 1
Statistics Canada, Catalogue No X

A nonparametric method to generate synthetic populations to adjust for complex sampling design features

Qi Dong, Michael R. Elliott and Trivellore E. Raghunathan

Qi Dong, Netflix, Inc., 100 Winchester Cir, Los Gatos, CA. E-mail: qdong@umich.edu; Michael R. Elliott, Department of Biostatistics, University of Michigan, 1420 Washington Heights, Ann Arbor, MI 48109, and Survey Methodology Program, Institute for Social Research, University of Michigan, 426 Thompson St., Ann Arbor, MI. E-mail: mrelliot@umich.edu; Trivellore E. Raghunathan, Department of Biostatistics, University of Michigan, 1420 Washington Heights, Ann Arbor, MI 48109, and Survey Methodology Program, Institute for Social Research, University of Michigan, 426 Thompson St., Ann Arbor, MI. E-mail: teraghu@umich.edu.

Abstract

Outside of the survey sampling literature, samples are often assumed to be generated by a simple random sampling process that produces independent and identically distributed (IID) samples. Many statistical methods are developed largely in this IID world. Application of these methods to data from complex sample surveys without making allowance for the survey design features can lead to erroneous inferences. Hence, much time and effort have been devoted to developing statistical methods to analyze complex survey data and account for the sample design. This issue is particularly important when generating synthetic populations using finite population Bayesian inference, as is often done in missing data or disclosure risk settings, or when combining data from multiple surveys. By extending previous work in the finite population Bayesian bootstrap literature, we propose a method to generate synthetic populations from a posterior predictive distribution in a fashion that inverts the complex sampling design features and generates simple random samples from a superpopulation point of view, adjusting the complex data so that they can be analyzed as simple random samples. We consider a simulation study with a stratified, clustered, unequal-probability-of-selection sample design, and use the proposed nonparametric method to generate synthetic populations for the 2006 National Health Interview Survey (NHIS) and the Medical Expenditure Panel Survey (MEPS), which are stratified, clustered, unequal-probability-of-selection sample designs.

Key Words: Synthetic populations; Posterior predictive distribution; Bayesian bootstrap; Inverse sampling.

1 Introduction

Statistical methods outside the survey methodology setting have usually been developed without careful consideration for sample design, often implicitly assuming simple random samples, or, occasionally, one-stage cluster samples. Major efforts of modern survey statistics focus on extending methods to analyze complex survey data (Skinner, Holt and Smith 1989), accommodating issues such as stratification, unequal probability of selection, nonresponse bias or calibration. Hinkins, Oh and Scheuren (1997) proposed an inverse sampling design algorithm that connects survey statistics and classical statistics from another perspective. Their basic idea is to choose a subsample that has a simple random sample structure unconditionally. The subsample is often much smaller than the original sample, so they propose to repeat the process independently many times and average the results to increase the precision. They also described exact or approximate inverse sampling schemes for stratified simple random sampling, one-stage cluster sampling, and two-stage cluster sampling. However, this new idea is not used widely in practice, perhaps because it is extremely computationally intensive and the precision losses are often substantial.

Similarly, generating synthetic populations from a posterior predictive distribution of a population conditional on complex sample data in a fashion that accounts for the complex sample design is not straightforward (Little 1991). However, in recent years demand for synthetic populations has increased, in order to deal with weight trimming or winsorization problems (Lazzeroni and Little 1998; Elliott and Little 2000; Elliott 2007; Chen, Elliott and Little 2010), disclosure risk settings (Little 1993;

Raghunathan, Reiter and Rubin 2003; Reiter 2004, 2005), or combining data from multiple surveys (Raghunathan, Xie, Schenker, Parsons, Davis, Dodd and Feuer 2007; Dong 2012). Often the synthetic populations are generated under a distributional assumption (normal, binomial, Poisson), with the posterior distribution of the model parameters approximated by the asymptotic normal distribution. The mean and covariance matrix of the normal distribution are estimated after complex sampling design features are taken into account (Raghunathan et al. 2007). A major weakness of model-based methods is that if the model is seriously misspecified, it may yield invalid inferences (Little 2004). In multivariate settings, we need to consider the relationships among the variables of interest and determine an appropriate model that fits the data, which may be hard if the data contain different types of variables.

In this paper we propose a nonparametric method as a counterpart of the model-based method to generate synthetic populations. This work extends the finite population Bayesian bootstrap and related Pólya posterior models of Lo (1988), Ghosh and Meeden (1983), and Cohen (1997) to account for complex sample designs. Since it achieves the same goal as the inverse sampling technique, it can be treated as the Bayesian finite population version of inverse sampling. To make inference using this weighted finite population Bayesian bootstrap, we can either make use of the draws directly, or, for computational efficiency, use results previously derived in the disclosure risk and multiple imputation literature, since these non-parametrically-generated populations can be viewed as multiple imputations of the unobserved elements of the population.

This paper is organized as follows. Section 2 briefly discusses synthetic populations in the context of Bayesian finite population inference. Section 3 reviews and summarizes the Bayesian bootstrap method and its finite population extension, and shows that, for an unequal probability of selection sample, the distribution of synthetic populations generated under a variant of a Pólya urn scheme matches the posterior predictive distribution of a finite population Bayesian bootstrap. Section 4 presents the proposed method under stratified cluster sampling with unequal selection probabilities. Section 5 shows that inference from these non-parametrically-generated synthetic populations can be obtained using results from the disclosure risk and multiple imputation literature, where each synthetic population has zero within-imputation variance. Section 6 provides a simulation study to evaluate the performance of the nonparametric method in a repeated sampling context. Section 7 applies the method to generate synthetic populations that can be used to estimate health insurance coverage rates using the 2006 NHIS and MEPS data, and compares the result with a parametric (log-linear) modeling approach. Concluding remarks are provided in Section 8.

2 Generating synthetic populations from survey data

The basic concept of Bayesian finite population inference involves imputing the non-sampled values of the population from the posterior predictive distribution based on the observed data. Assume the population values are $Y = (Y_1, \ldots, Y_N)$ and the observed data, $Y_{obs} = (y_1, \ldots, y_n)$, are obtained in a survey with sampling indicators $I = (I_1, \ldots, I_N)$. Bayesian population inference allows for the use of a parametric model $\Pr(Y \mid \theta)$ for the population data, based on the posterior predictive distribution for the unobserved elements of the population $\Pr(Y_{nob} \mid Y_{obs})$:

$$\Pr(Y_{nob} \mid Y_{obs}) = \int \Pr(Y_{nob} \mid Y_{obs}, \theta)\, \Pr(\theta \mid Y_{obs})\, d\theta$$

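(Ericson 1969; Little 1993; Rubin 1987; Scott 1977; Skinner et al. 1989). Here we use the model $\Pr(Y \mid \theta)$ to approximate the entire population distribution $\Pr(Y)$ and average over the posterior distribution based on the sampled data, $\Pr(\theta \mid Y_{obs})$. In the case that there are design variables known for the entire population, the above model can be naturally extended by conditioning on these variables. Implicit in the derivation of the above is that the sampling indicator $I$ need not be modeled. This requires ignorable sampling (Rubin 1987) (the distribution of $I$ does not depend on unobserved data), as well as a model for the data $\Pr(Y \mid \theta)$ that is attentive to design features and robust enough to sufficiently capture all relevant aspects of the distribution of $Y$ of interest. Our goal here is to develop a method to generate draws from $\Pr(Y_{nob} \mid Y_{obs})$ that account for all the design features in $Y_{obs}$, so that draws from the posterior distribution of $Y_{nob} \mid Y_{obs}$ can be treated as a simple random sample in analysis.

3 Weighted finite population Bayesian bootstrap

3.1 Finite Population Bayesian Bootstrap (FPBB)

Assume that the (scalar) population elements $Y_i$, $i = 1, \ldots, N$, are exchangeable and can take on $K \le N$ possible values $b_1, \ldots, b_K$; thus $Y_i \sim \text{MULTI}(1; \theta_1, \ldots, \theta_K)$. Further assuming a conjugate Dirichlet prior $\theta \sim \text{DIR}(\alpha_1, \ldots, \alpha_K)$ yields (Ghosh and Meeden 1983)

$$
\begin{aligned}
P(Y_{nob} \mid y) &= P(b_1 = N_1 - n_1, \ldots, b_K = N_K - n_K \mid b_1 = n_1, \ldots, b_K = n_K) \\
&= \int \cdots \int p(Y_{nob} \mid y, \theta)\, p(\theta \mid y)\, d\theta_1 \cdots d\theta_K \\
&= \frac{\int \cdots \int p(Y_{nob} \mid \theta)\, p(y \mid \theta)\, p(\theta)\, d\theta_1 \cdots d\theta_K}{\int \cdots \int p(y \mid \theta)\, p(\theta)\, d\theta_1 \cdots d\theta_K} \\
&= \frac{\int \cdots \int \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1}\, d\theta_1 \cdots d\theta_K}{\int \cdots \int \prod_{k=1}^{K} \theta_k^{n_k + \alpha_k - 1}\, d\theta_1 \cdots d\theta_K} \\
&= \frac{\Gamma(n + \alpha_0)}{\Gamma(N + \alpha_0)} \prod_{k=1}^{K} \frac{\Gamma(N_k + \alpha_k)}{\Gamma(n_k + \alpha_k)}
\end{aligned}
\qquad (3.1)
$$

where $\alpha_0 = \sum_{k=1}^{K} \alpha_k$, $N = \sum_{k=1}^{K} N_k$, and $n_1, \ldots, n_K$ refer to the number of times each distinct value is observed in our sample $y = (y_1, \ldots, y_n)$, with $n = \sum_{k=1}^{K} n_k$. If $\alpha_k = 0$ for all $k$, then $p(Y_{nob} \mid y)$ reduces to $\frac{\Gamma(n)}{\Gamma(N)} \prod_{k=1}^{K} \frac{\Gamma(N_k)}{\Gamma(n_k)}$.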
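As a minimal sketch of what a draw from (3.1) with $\alpha_k = 0$ looks like in practice, the code below first draws $\theta$ from its Dirichlet posterior and then draws the $N - n$ unobserved units from the resulting multinomial. The function name `fpbb_counts` and its arguments are hypothetical and chosen only for illustration; the sketch assumes a scalar outcome whose observed values are hashable.

```python
import random
from collections import Counter


def fpbb_counts(sample, N, rng=None):
    """One draw of the counts of each value among the N - n unobserved units.

    Hypothetical helper: compound Dirichlet-multinomial draw for alpha_k = 0.
    """
    rng = rng or random.Random()
    n = len(sample)
    if N < n:
        raise ValueError("N must be at least the sample size n")
    counts = Counter(sample)          # n_k for each distinct observed value b_k
    values = list(counts)
    # theta | y ~ Dirichlet(n_1, ..., n_K), via normalised Gamma(n_k, 1) draws
    gammas = [rng.gammavariate(counts[v], 1.0) for v in values]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    # unobserved units | theta ~ Multinomial(N - n; theta)
    unobserved = rng.choices(values, weights=theta, k=N - n)
    return Counter(unobserved)
```

Only values seen in the sample can appear among the imputed units, which is the defining feature of the bootstrap-type posterior when the prior carries no mass.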

To ease implementation, Lo (1988) proposed making draws from the FPBB posterior predictive distribution using a Pólya urn scheme procedure. Suppose an urn contains $n$ balls, each of which carries a real number label $b_k$, $k = 1, \ldots, K$. A Pólya sample of size $m$ is selected by first selecting a ball at random from the urn, returning the selected ball to the urn together with one additional ball bearing the same label, and repeating this process until $m$ balls have been selected. It can be shown that the probability of getting $m_k$ balls of type $b_k$, $k = 1, \ldots, K$ (for any particular ordering of the draws), is given by

$$p(b_1 = m_1, \ldots, b_K = m_K) = \frac{\Gamma(n)}{\Gamma(m + n)} \prod_{k=1}^{K} \frac{\Gamma(m_k + n_k)}{\Gamma(n_k)} \qquad (3.2)$$

where $n_k$ is the number of balls of type $b_k$ originally in the urn. The distribution of the counts of each type is invariant under any permutation of the draws. Note that this corresponds directly to the posterior probability of a total of $m_1, \ldots, m_K$ elements of type $b_1, \ldots, b_K$ in a population, given that $n_1, \ldots, n_K$ elements were observed in a (simple random) sample of size $n$. Hence a FPBB replicate sample can be drawn from this Pólya posterior using the following steps:

Step 1. Draw a Pólya sample of size $m = N - n$, denoted by $y^{*}_1, \ldots, y^{*}_{N-n}$, from the urn $\{y_1, \ldots, y_n\}$; by (3.2), with $m_k = N_k - n_k$ draws of value $b_k$ for $k = 1, \ldots, K$, this corresponds to a draw of $P(Y_{nob} \mid y)$ from (3.1).

Step 2. Form the FPBB population $y_1, \ldots, y_n, y^{*}_1, \ldots, y^{*}_{N-n}$.

3.2 FPBB with unequal probabilities of selection

Cohen (1997) extended the FPBB procedure to adjust for unequal probabilities of selection. Assume $y_1, \ldots, y_n$ is a sample from a finite population $Y_1, \ldots, Y_N$ with design weights $w_1, \ldots, w_n$, where $w_i \propto 1/P(I_i = 1)$ and $I_i$ is the sampling indicator (the selection probabilities below sum to one when the weights are scaled so that $\sum_{i=1}^{n} w_i = N$). The procedure has two steps:

Step 1. Draw a sample of size $N - n$, denoted by $y^{*}_1, \ldots, y^{*}_{N-n}$, by drawing $y^{*}_k$ from $y_1, \ldots, y_n$ in such a way that $y_i$ is selected with probability

$$\frac{w_i - 1 + l_{i,k-1}\,\dfrac{N - n}{n}}{N - n + (k - 1)\,\dfrac{N - n}{n}},$$

where $w_i$ is the weight of unit $i$ and $l_{i,k-1}$ is the number of bootstrap selections of $y_i$ among $y^{*}_1, \ldots, y^{*}_{k-1}$. (The function wtpolyap in the R package polyapost can be used to obtain draws from a weighted Pólya urn.)

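Step 2. Form the FPBB population $y_1, \ldots, y_n, y^{*}_1, \ldots, y^{*}_{N-n}$.

Although Cohen (1997) did not provide a theoretical proof for this procedure, it can be obtained as a straightforward extension of the standard FPBB and Pólya urn equivalency described in Section 3.1. First, we determine the posterior distribution of the FPBB sample with unequal probabilities of selection implied by the weighted FPBB procedure. The multinomial likelihood based on our weighted sample is given by $p(y_{obs} \mid \theta) \propto \prod_{k=1}^{K} \theta_k^{w^{*}_k}$, where

$$w^{*}_k = \frac{n}{N - n} \sum_{i=1}^{n} I(y_i = b_k)\,(w_i - 1)$$

is the sum of the design weights minus one across all sampled elements with value $b_k$, $k = 1, \ldots, K$, normalized to sum to $n$. (Note that this removes subjects sampled with weights equal to one, i.e., certainty sample elements, from the likelihood, as they have no chance to be part of the unobserved portion of the population, and thus contribute no information about these unobserved elements.) Assuming an improper Dirichlet prior $p(\theta) \propto \prod_{k=1}^{K} \theta_k^{-1}$, the weighted finite population Bayesian bootstrap posterior is given by

$$
\begin{aligned}
P(Y_{nob} \mid y, w) &= P(b_1 = r_1, \ldots, b_K = r_K \mid w^{*}_1, \ldots, w^{*}_K) \\
&= \frac{\int \cdots \int p(Y_{nob} \mid \theta)\, p(y \mid \theta)\, p(\theta)\, d\theta_1 \cdots d\theta_K}{\int \cdots \int p(y \mid \theta)\, p(\theta)\, d\theta_1 \cdots d\theta_K} \\
&= \frac{\int \cdots \int \prod_{k=1}^{K} \theta_k^{r_k + w^{*}_k - 1}\, d\theta_1 \cdots d\theta_K}{\int \cdots \int \prod_{k=1}^{K} \theta_k^{w^{*}_k - 1}\, d\theta_1 \cdots d\theta_K} \\
&= \frac{\Gamma(n)}{\Gamma(N)} \prod_{k=1}^{K} \frac{\Gamma(w^{*}_k + r_k)}{\Gamma(w^{*}_k)}
\end{aligned}
\qquad (3.3)
$$

since $\sum_{k=1}^{K} r_k = N - n$ and $\sum_{k=1}^{K} w^{*}_k = n$.

Next, we show that the distribution of samples obtained from the unequal probability of selection Pólya urn scheme of Cohen (1997) is equal to the posterior distribution of the FPBB sample with unequal probabilities of selection. Given the observed data, the probability that we draw $N - n$ balls and that the first $r_1$ balls have value $b_1$ through the last $r_K$ balls have value $b_K$ is: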

$$
\begin{aligned}
P(b_1 = r_1, \ldots, b_K = r_K) &= \frac{w^{*}_1 (w^{*}_1 + 1) \cdots (w^{*}_1 + r_1 - 1)}{n (n + 1) \cdots (n + r_1 - 1)} \times \cdots \times \frac{w^{*}_K (w^{*}_K + 1) \cdots (w^{*}_K + r_K - 1)}{(n + r_1 + \cdots + r_{K-1}) \cdots (N - 1)} \\
&= \frac{\Gamma(n)}{\Gamma(N)} \prod_{k=1}^{K} \frac{\Gamma(w^{*}_k + r_k)}{\Gamma(w^{*}_k)}
\end{aligned}
$$

where the first equality follows from the fact that the distribution of the counts of each type is invariant under any permutation of the draws, as in the unweighted setting, and the second equality from the identity $\Gamma(x + 1) = x\,\Gamma(x)$ for $x > 0$. Thus, noting that

$$\frac{w_i - 1 + l_{i,k-1}\,\dfrac{N - n}{n}}{N - n + (k - 1)\,\dfrac{N - n}{n}} = \frac{\dfrac{n}{N - n}(w_i - 1) + l_{i,k-1}}{n + k - 1},$$

a draw from the unequal probability of selection Pólya urn scheme yields a draw from $P(Y_{nob} \mid y, w)$ in (3.3).

4 Nonparametric method to generate synthetic populations

In this section, we extend the finite population Bayesian bootstrap methods to a stratified, clustered, unequal probability sample design setting to develop a nonparametric method to generate synthetic populations that adjusts for the complex sampling design features. The idea is to treat the unobserved part of the population as missing data and impute it by making draws from the actual data. We do the imputation in such a fashion that the resulting draws from the posterior distribution of the population will capture the complex design features and can be used in a standard fashion to compute posterior distributions of the population quantities of interest.

4.1 Use the Bayesian bootstrap to adjust for stratification and clustering

For a stratified cluster sample, we first need to resample clusters within the strata. Denote $c$ as the total number of clusters in the actual data, $c = \sum_{h=1}^{H} c_h$, and $C$ as the number of clusters in the population, $C = \sum_{h=1}^{H} C_h$. One approach is to first apply the FPBB Pólya urn scheme to impute the unobserved clusters within each stratum, $c^{*}_1, \ldots, c^{*}_{C_h - c_h}$, which together with the observed clusters provide the clusters in stratum $h$ in the population. However, we typically do not know the number of clusters in a stratum from available public use data. Thus we suggest, as an alternative to FPBB sampling, drawing a standard Bayesian bootstrap sample of the clusters within each stratum. Considering the equivalence between the classical bootstrap and the Bayesian bootstrap, we follow Rao and Wu (1988), who suggested drawing a simple random sample with replacement (SRSWR) of $m_h$ from the $c_h$ clusters within each stratum and calculating replicate weights for each bootstrap sample as

$$w^{(l)} = \left\{ w^{*(l)}_{hik},\ h = 1, \ldots, H,\ i = 1, \ldots, c_h,\ k = 1, \ldots, N_{hi} \right\},$$

where

$$w^{*}_{hik} = w_{hik} \left\{ \left( 1 - \sqrt{\frac{m_h}{c_h - 1}} \right) + \sqrt{\frac{m_h}{c_h - 1}}\; \frac{c_h}{m_h}\; m^{*}_{hi} \right\}$$

and $m^{*}_{hi}$ denotes the number of times that cluster $i$, $i = 1, \ldots, c_h$, is selected. To ensure all the replicate weights are non-negative, $m_h \le c_h - 1$; here and below we take $m_h = c_h - 1$. Note that, when clustering is not present, we simply draw a standard Bayesian bootstrap sample from the sampled data within each stratum (when stratification is present) or from the entire sample (if stratification is not present, so that $H = 1$) and calculate the replicate weights as $w^{*}_k = w_k m^{*}_k$.

This procedure is repeated $L$ times to produce $L$ Bayesian bootstrap (BB) samples denoted by $S_1, \ldots, S_L$. This step generates $L$ Bayesian bootstrap samples, which essentially are $L$ draws from the posterior predictive distribution of the unobserved clusters given the actual data. However, the units in the $L$ Bayesian bootstrap samples still have weights and cannot be analyzed as simple random samples.

4.2 Use weighted FPBB Pólya urn scheme to adjust for weighting

Once we have $L$ BB samples with replicate weights, the second step imputes the unobserved units using the weighted FPBB Pólya urn scheme. In practice, the probability of selecting the $k$th unit, $y^{*}_k$, depends on the selection of the first $k - 1$ units, $y^{*}_1, \ldots, y^{*}_{k-1}$. In other words, to determine the probability of selecting a new unit, we have to count the number of times that each unit in the sample has been selected among the previous selections. In settings where the population size is extremely large, we need only generate synthetic populations of size $T \times n$, where $T$ is sufficiently large to overwhelm the sample size (e.g., 20-100). To further improve computational efficiency, we could also draw a moderate sized population $F$ times and then pool these $F$ populations to produce one synthetic population, $S_l$. The size of $S_l$ is then $F \times T \times n$.

Note that our method only requires knowledge of the final weights in multistage cluster samples, since all stages of unequal probabilities of sampling will be corrected by use of the weighted FPBB Pólya urn scheme. This is a particularly useful feature of the proposed method, as in many public use datasets the components of the probabilities of selection (e.g., cluster-level selection probabilities, non-response weights) are not available.

5 Inference from multiple nonparametric synthetic populations

Assume we generate $L$ synthetic populations, $S_l$, $l = 1, \ldots, L$, using the nonparametric method described in Section 4, and that our inferential target is $Q = Q(Y)$, a function of the population data (e.g., population mean, correlation, population maximum likelihood estimator of a regression parameter, etc.). We can compute $Q_l$ as the estimate of $Q$ obtained from pooling the $F$ synthetic populations that impute the unobserved units of $S_l$; since these are direct draws from the posterior predictive distribution of the population, we can compute posterior means, quantiles, and credible intervals from the corresponding empirical estimates from the draws, if $L$ is sufficiently large.
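The two-step procedure of Section 4 (resample clusters within strata and rescale the weights, then run the weighted Pólya urn of Section 3.2 on the replicate weights) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, $m_h = c_h - 1$ is used, units whose cluster is not resampled are dropped, the replicate weights are rescaled to sum to the synthetic population size, and the returned population contains the bootstrap-sample units plus $T$ times as many imputed units.

```python
import random
from collections import Counter


def bb_replicate_weights(strata, clusters, weights, rng):
    """Rao-Wu style replicate weights with m_h = c_h - 1 resampled clusters."""
    cluster_ids = {}
    for h, c in zip(strata, clusters):
        cluster_ids.setdefault(h, set()).add(c)
    times = Counter()  # m*_hi: how often cluster i of stratum h is resampled
    for h, ids in cluster_ids.items():
        if len(ids) < 2:
            raise ValueError("each stratum needs at least two clusters")
        for c in rng.choices(sorted(ids), k=len(ids) - 1):
            times[(h, c)] += 1
    # With m_h = c_h - 1 the weight formula reduces to w * c_h / (c_h - 1) * m*_hi
    return [
        w * len(cluster_ids[h]) / (len(cluster_ids[h]) - 1) * times[(h, c)]
        for h, c, w in zip(strata, clusters, weights)
    ]


def weighted_polya(weights, m, rng):
    """Indices of m weighted Polya-urn draws from n units (population N = n + m)."""
    n = len(weights)
    N = n + m
    scale = N / sum(weights)                   # assumption: rescale weights to sum to N
    mass = [w * scale - 1.0 for w in weights]  # initial urn mass w_i - 1
    if min(mass) < 0:
        raise ValueError("rescaled weights must be at least 1")
    step = m / n  # (N - n) / n is added to a unit's mass each time it is drawn
    drawn = []
    for _ in range(m):
        i = rng.choices(range(n), weights=mass)[0]
        drawn.append(i)
        mass[i] += step
    return drawn


def synthetic_population(y, strata, clusters, weights, T, rng):
    """One synthetic population built from one Bayesian bootstrap sample."""
    rep_w = bb_replicate_weights(strata, clusters, weights, rng)
    kept = [(v, w) for v, w in zip(y, rep_w) if w > 0]
    values = [v for v, _ in kept]
    idx = weighted_polya([w for _, w in kept], T * len(kept), rng)
    return values + [values[i] for i in idx]
```

Calling `synthetic_population` $L$ times with independent random states gives $L$ populations, each of which can be summarized with unweighted estimators. The urn loop costs on the order of $n \times m$ operations, so a production version would need a faster sampler.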

However, in many settings, the computational effort required to impute the population may be very large, even if the full population is not required to be synthesized. Hence an alternative approach for inference is to approximate the posterior predictive distribution of a scalar population statistic $Q$ via a $t$ distribution:

$$Q \mid S_1, \ldots, S_L \sim t_{L-1}\!\left( \bar{Q}_L,\ (1 + L^{-1})\, V_L \right),$$

where

$$\bar{Q}_L = L^{-1} \sum_{l=1}^{L} Q_l = (LF)^{-1} \sum_{l=1}^{L} \sum_{f=1}^{F} Q_{lf} \quad \text{and} \quad V_L = (L - 1)^{-1} \sum_{l=1}^{L} \left( Q_l - \bar{Q}_L \right)^2,$$

with $Q_{lf}$ the estimate from the $f$th weighted FPBB draw in $S_l$. The result follows immediately from Section 4.1 of Raghunathan et al. (2003), and is based on the standard Rubin (1987) multiple imputation combining rules, treating the unobserved units of $S$ as missing data and the sampled units as observed data. The average within-imputation variance is zero, since the entire population is being synthesized; hence the posterior variance of $Q$ is entirely a function of the between-imputation variance, and the degrees of freedom is simply given by the number of FPBB samples. (When the population is extremely large, we need only synthesize a draw sufficiently large for the average within-imputation variance to be trivial relative to the between-imputation variance $V_L$.) The result assumes that $E(Q_{lf}) = Q$ (a result guaranteed by our weighted FPBB estimator) as well as a sufficiently large sample size for Bayesian asymptotics to apply.

6 Simulation studies

In this section, we conduct two simulation studies to evaluate the repeated sampling properties of the population estimators constructed using the nonparametric method that generates synthetic populations while adjusting for the complex sampling design features. The first of these considers a one-stage, unequal probability of selection design where we vary the number of weighted FPBB draws for each synthetic population and the number of synthetic populations to assess the impact on inference. The second compares inferential properties from observed data and from the posterior distribution obtained from synthetic populations in a stratified, multistage, unequal probability of selection sample, this time fixing the posterior sample size while considering both population means and population regression parameters as targets of inference.

6.1 Single stage, unequal probability of selection sample design

We generated outcome data $Y$ in a population of $N$ subjects from a moderately skewed gamma distribution, conditional on a uniformly distributed covariate $X$:

$$X_i \sim \text{UNI}(0.05, 0.65), \quad i = 1, \ldots, N$$
$$Y_i \mid X_i = x_i \sim \text{GAMMA}(10, x_i).$$

We assume $X$ is fully observed for the population, and that the probability of selection is proportional to $X$, so that $\pi_i = n x_i \big/ \sum_{i=1}^{N} x_i$ in a without-replacement sample design as long as $n \ll N$. The estimand of interest is the population mean $\bar{Y} = N^{-1} \sum_{i=1}^{N} y_i$. Note that $Y$ and $X$ are positively correlated, so that unweighted sample means will be positively biased, and use of the design weights $w_i$ is required to obtain unbiased estimates of $\bar{Y}$.

We generated a population of size $N = 1{,}000$ from which we sampled $n = 100$; bias, empirical and estimated variance, 95% interval length, and nominal 95% coverage are then estimated from 200 independent samples from the population. We varied the total number of simulated populations $L$ as 5, 20, 100, and 1,000, and the number of FPBB draws $F$ of size $N - n$ (so that $T = 9$) as 1, 20, and 100, in a full factorial design. Variance, interval length, and interval coverage are obtained via the normal approximation; for $L = 100$ and 1,000, we also obtained variance, interval length, and interval coverage using the direct draws from the posterior predictive distribution, since a sufficient number of draws from the posterior were available to make such estimates.

Table 6.1 shows the results of the simulation study. In all cases the point estimate $\bar{Q}_L$ of the population mean was approximately unbiased, reflecting the ability of the weighted FPBB to undo the sampling weights in the generation of the synthetic population. Under the normal approximation, larger numbers of synthetic populations were associated with smaller variances and narrower interval lengths, as expected with larger numbers of degrees of freedom, although the difference between 20 and 100 was minimal, just as the $t_{20}$ distribution begins to approximate a standard normal. Finally, using only a single FPBB draw of size $N - n$ appeared to overestimate the variance and lead to overcoverage, especially for small values of $L$. Values of $L$ and $F$ of 20 or greater appeared to yield reasonable results. Use of the direct draws for $L = 100$ and 1,000 yielded variance and credible interval estimates that were very similar to those of the normal approximation, with slightly narrower interval lengths and somewhat less conservative coverage.

Table 6.1 Bias, empirical variance, mean of estimated variance, interval length and coverage of 95% nominal confidence interval of a population mean as a function of the number of synthetic populations L and the number of weighted finite Bayesian bootstraps that make up the synthetic population F. Interval length and coverage obtained via t approximation and empirically via direct simulation. One stage unequal probability of selection sample design. Results from 200 simulations.

Columns: L = 5, 20, 100, 1,000, each crossed with F = 1, 20, 100.
Rows: Bias; Emp. Variance; Est. Variance: t; Interval Length: t; % Coverage: t; Est. Variance: Empirical; Interval Length: Empirical; % Coverage: Empirical. The empirical rows are N/A for L = 5 and L = 20.
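The two kinds of interval summarized in Table 6.1 can be computed from the $L$ synthetic-population estimates along the following lines. This is a minimal sketch with hypothetical function names: `combine_t` applies the Section 5 combining rule with a caller-supplied critical value of the $t_{L-1}$ distribution (the standard library has no $t$ quantile function), and `empirical_interval` reads a central interval directly off the sorted draws using a simple order-statistic rule chosen here for illustration.

```python
import math
import statistics


def combine_t(estimates, t_crit):
    """Point estimate, variance (1 + 1/L) * V_L, and t-based interval."""
    L = len(estimates)
    if L < 2:
        raise ValueError("need at least two synthetic populations")
    q_bar = statistics.mean(estimates)
    v_l = statistics.variance(estimates)   # (L - 1)^-1 * sum (Q_l - Q_bar)^2
    total_var = (1.0 + 1.0 / L) * v_l      # within-imputation variance is zero
    half = t_crit * math.sqrt(total_var)
    return q_bar, total_var, (q_bar - half, q_bar + half)


def empirical_interval(estimates, level=0.95):
    """Central interval from the ordered draws (conservative rounding outward)."""
    xs = sorted(estimates)
    tail = (1.0 - level) / 2.0
    lower = xs[math.floor(tail * (len(xs) - 1))]
    upper = xs[math.ceil((1.0 - tail) * (len(xs) - 1))]
    return lower, upper
```

For example, with $L = 5$ the 0.975 quantile of $t_4$ is about 2.776, so `combine_t(estimates, 2.776)` gives a nominal 95% interval; the empirical interval is only meaningful when $L$ is large, as the text notes.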

6.2 Stratified, multistage, unequal probability of selection sample design

We generated a population with strata and clusters within each stratum from the following bivariate normal distribution:

$$\begin{pmatrix} X_{1hik} \\ X_{2hik} \end{pmatrix} \Bigg|\; u_{hi} \sim N\!\left( \begin{pmatrix} \mu_h + u_{hi} \\ \mu_h + u_{hi} \end{pmatrix},\ \Sigma \right),$$

where $\mu_h$, $h = 1, \ldots, 50$, denotes the stratum effect, $u_{hi} \sim N(0, 10)$ denotes the random cluster effect, $a_h \sim \text{uniform}(2, 52)$ is the number of clusters within stratum $h$, $b_{hi} \sim \text{uniform}(10, 20)$ is the number of units within cluster $i$ of stratum $h$, and $\Sigma$ is the $2 \times 2$ within-cluster covariance matrix. The population for the simulation study has 16,324 subjects.

We draw a stratified cluster sample with unequal probabilities of selection. Specifically, we select two clusters from each stratum with probabilities proportional to cluster size (PPS) given by $b_{hi} \big/ \sum_{i=1}^{a_h} b_{hi}$. Within each selected cluster, we select approximately 1/5 of the population. Thus, the probability that a unit is selected is given by

$$\pi_{hik} = \frac{2 b_{hi}}{\sum_{i=1}^{a_h} b_{hi}} \times \frac{[b_{hi}/5]}{b_{hi}}$$

for all elements in cluster $i$, with corresponding weight

$$w_{hik} = \frac{\sum_{i=1}^{a_h} b_{hi}}{2 b_{hi}} \times \frac{b_{hi}}{[b_{hi}/5]},$$

where $[b_{hi}/5]$ denotes the integer number of units sampled in the cluster. Since the number of clusters and units are random, the complex sample size is slightly different across replications, averaging approximately 770.

Because of the large sample and population size, we focus on inference using $t$ approximations. We generate $L = 100$ synthetic populations using $F = 1$ weighted FPBB samples of size $100n$. The estimands of interest are the population marginal mean for $x_1$, $\bar{X}_1 = N^{-1} \sum_{i=1}^{N} X_{1i}$, and similarly for $x_2$, and the population regression coefficients of $x_1$ on $x_2$ given by

$$B_1 = \frac{\sum_{i=1}^{N} (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum_{i=1}^{N} (X_{2i} - \bar{X}_2)^2}, \qquad B_0 = \bar{X}_1 - B_1 \bar{X}_2.$$

We drew 200 independent samples from the population and used the sample data directly to compute weighted sample means and linear regression coefficients along with associated variance estimates and

95% nominal confidence intervals using Taylor Series approximations, and compared these with the equivalent estimates obtained using the nonparametric synthetic data. Results are given in Table 6.2. (Since the marginal means have the same superpopulation value, we combine the results in Table 6.2.) Figure 6.1 displays the scatter plot of the pairs of estimated mean, intercept and slope from the actual samples and the corresponding synthetic populations along with a 45-degree line. The sampling distributions of the actual sample and synthetic population estimates closely correspond. The point estimates and standard errors for both the means and regression parameters closely correspond. The 95% confidence interval coverage rates for all three statistics also closely correspond, and are close to nominal values.

Table 6.2 Descriptive and analytic statistics estimated from the actual data and the synthetic populations in a simulation evaluation of the nonparametric method. Two-stage, unequal probability of selection stratified sample design. Results from 200 simulations.

Columns: Type; Actual Data (Estimate, SE, SD, Coverage (%)); Synthetic Populations (Estimate, SE, SD, Coverage (%)).
Rows: Mean $\bar{X}$; Intercept $B_0$; Slope $B_1$.

[Figure: three scatter plot panels (mean, intercept, slope); axes: actual sample estimates and synthetic estimates.]

Figure 6.1 Scatter plot of the descriptive and analytic statistics from the actual and synthetic populations

7 Application

In this section, we use data from the 2006 National Health Interview Survey (NHIS) and the 2006 Medical Expenditure Panel Survey (MEPS) to evaluate the performance of the nonparametric method in a stratified cluster sampling design. The National Health Interview Survey (NHIS) is a nationwide, face-to-face health survey based on a stratified multistage design, with oversamples of black, Hispanic, and elderly populations. For confidentiality purposes, the true stratification and primary sampling unit (PSU) variables are not publicly released; instead pseudo-strata and PSUs (two per stratum) are released. The MEPS is a subsample of the previous year's NHIS sample, and retains the same stratified multistage design.

Both NHIS and MEPS ask respondents whether they are covered by any health insurance and, if so, what type of health insurance they are using (private versus government-sponsored such as Medicare or Medicaid). We estimate overall health insurance coverage rates as well as coverage rates in subpopulations defined by demographic variables such as gender, race, income level, or combinations thereof: specifically, we estimate health insurance coverage for males, non-Hispanic whites, and non-Hispanic whites with household income between $25,000 and $35,000 per year. We delete the cases with item-missing values and focus our simulation on the complete cases. This results in 20,147 and 20,893 cases in the NHIS and MEPS data respectively.

7.1 Estimation of health insurance coverage from the NHIS and MEPS

In this simulation study, we use the nonparametric method to adjust for the stratified cluster sampling used by the 2006 NHIS and MEPS and generate synthetic populations that can be analyzed as simple random samples. We also consider a model-based approach for generating synthetic populations using a log-linear model for the health insurance status by six independent demographic variables: gender, race, census region, education level, age (categorical), and income level (categorical). Then we evaluate the method by comparing the estimates of the health insurance coverage rate for the whole population and selected subdomains obtained from both the nonparametric and log-linear model synthetic populations to those obtained from the actual data.

7.1.1 Generating nonparametric synthetic populations

Using the nonparametric method developed in Section 4, we generate 200 synthetic populations for each survey. Specifically, we generate $B = 200$ BB samples and for each BB sample, we generate $F = 10$ FPBB of size $5n$. Thus, each synthetic population is 50 times as big as the actual sample (1,007,350 for NHIS, 1,044,650 for MEPS). Each synthetic population is analyzed as a simple random sample and the estimates are combined as described in Section 5.

7.1.2 Generating synthetic populations via log-linear models

In the common situation that the survey data of interest are in the form of a multidimensional contingency table, a log-linear model might be considered as a parametric approach to generate draws from a posterior predictive distribution. For simplicity of exposition, assume $Y$ is the variable of our interest with $m$ levels, and $Z$ is a design variable with $n$ levels (e.g., gender, race, etc.) whose marginal

distribution is known for the population. Assume $\pi_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, represents the cell proportion of the $ij$th cell, with $\sum_{i=1}^{m} \sum_{j=1}^{n} \pi_{ij} = 1$. A fully saturated log-linear model is given by (Agresti 2002):

$$\log \pi_{ij} = \lambda_0 + \lambda^{Z}_{j} + \lambda^{Y}_{i} + \lambda^{ZY}_{ij}, \quad i = 1, \ldots, m,\ j = 1, \ldots, n,$$

where $\log \pi_{ij}$ is the log of the probability that one observation falls in cell $ij$ of the contingency table, $\lambda^{Z}_{j}$ is the main effect for $Z$, $\lambda^{Y}_{i}$ is the main effect for $Y$ and $\lambda^{ZY}_{ij}$ is the interaction effect for $Z$ and $Y$. This model includes all possible one-way and two-way effects and thus is saturated, as it has the same number of effects as cells in the contingency table. To avoid over-fitting the data in the example, we can consider non-saturated models that exclude some or all of the interaction terms, choosing the model based on likelihood ratio tests or AIC or BIC criteria.

The synthetic populations can be generated from the posterior predictive distribution from the model. However, when the data are collected under a complex sampling design, we are not aware of standard statistical software that can produce both the point estimate and covariance estimate of the regression coefficients. Instead, we have to use a jackknife replication method to adjust for stratification, clustering and weighting. Specifically, the parametric synthetic populations can be generated from the following steps:

1. Estimate coefficients and covariance matrix:

Under the selected model (assume the two-dimensional saturated model here just for illustration), estimate the coefficients $\lambda = \left( \lambda_0, \lambda^{Z}_{j}, \lambda^{Y}_{i}, \lambda^{ZY}_{ij},\ i = 1, \ldots, m,\ j = 1, \ldots, n \right)$ and the covariance matrix of the estimates $\hat{\lambda} = \left( \hat{\lambda}_0, \hat{\lambda}^{Z}, \hat{\lambda}^{Y}, \hat{\lambda}^{ZY} \right)$ after taking into account the complex design features using jackknife repeated replication (JRR):

For each replication, withdraw one cluster, and inflate the weights for the respondents in the other clusters within the same stratum by $c_h / (c_h - 1)$ (replication weights), where $c_h$, $h = 1, \ldots, H$, denotes the number of clusters within stratum $h$. Assume we have $C = \sum_{h=1}^{H} c_h$ clusters in total; then we have $C$ replications.

For each replication, use the replication weights to fit the log-linear model and obtain the maximum likelihood estimates (MLE) of the coefficients. Specifically, use the replication weights to calculate the size of each cell of the contingency table, which is then used to fit the log-linear model. We denote the MLE for the $r$th replication in stratum $h$ by a column vector $\hat{\lambda}^{(hr)}$, $r = 1, \ldots, c_h$. Notice that $\lambda$ is an $mn$ by 1 column vector, which we denote by $\lambda = (\lambda_0, \lambda_1, \ldots, \lambda_{mn-1})'$. Similarly, $\hat{\lambda}^{(hr)}$, $r = 1, \ldots, c_h$, $h = 1, \ldots, H$, are also $mn$ by 1 column vectors, denoted by $\hat{\lambda}^{(hr)} = \left( \hat{\lambda}^{(hr)}_0, \hat{\lambda}^{(hr)}_1, \ldots, \hat{\lambda}^{(hr)}_{mn-1} \right)'$.

16 42 Dong et al.: A nonparametrc metod to generate syntetc pop. to adust for complex samplng desgn features Z Y ZY Te MLE of te coeffcents,,,,,, m,,, n can be H c obtaned by ˆ ˆ MLE. r r C t estmate of te pq p, q,, mn element s te covarance between te wc s gven by: 0 For te mn by mn covarance matrx, te ackknfe replcaton H c c ˆ r ˆ ˆ r ˆ p p q q c r, t p and t q coeffcents, were ˆ p H c ˆ C and r r p ˆ q H c ˆ r q C. Ts gves us te correct varance estmate of ˆ MLE. r 2. Approxmate te posteror dstrbuton of te coeffcents: t Let T denote te Colesky decomposton suc tat TT ˆ MLE random normal devates and defne ˆ MLE Tz. 3. Impute te unobserved values of te populaton: cov. Generate a vector z of Suppose L draws,,,, L are made from te approxmate posteror dstrbuton of. For eac l,, L,,,,,,, m,,, n, l X l Y l XY l l 0 we can generate one syntetc table usng te assumed model: l l X l Y l XY l 0 log,,, m,,, n. Once te cell proportons are determned, we can generate te syntetc table of any sze. Te results below are based on a seven-dmenson contngency table (see Table 7. for te specfc covarate categores). BIC measures ndcated tat a model wt all 2-way but no 3-way nteractons provded te most parsmonous ft. Table 7. Varables and response categores for te 2006 NHIS and MEPS used n log-lnear model. Varables of Interest Response Categores Age : [8; 24]; 2: [25; 34]; 3: [35; 44]; 4: [45; 54]; 5: [55; 64]; 6: >= 65 Census Regon Educaton Gender : Norteast; 2: Mdwest; 3: Sout; 4: West : Less tan g scool; 2: Hg scool; 3: Some college; 4: College : Male; 2: Female Healt Insurance Coverage : Any Prvate Insurance; 2: Publc Insurance; 3: Unnsured Income : (0; 0,000); 2: [0,000; 5,000); 3: [5,000; 20,000); 4: [20,000; 25,000); 5: [25,000; 35,000); 6: [35,000; 75,000); 7: >= 75,000 Race : Hspanc; 2: Non-Hspanc Wte; 3: Non-Hspanc Black; 4: Non-Hspanc All oter race groups Statstcs Canada, Catalogue No X
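Steps 1 to 3 of the log-linear procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the saturated model only, and it works with the log cell proportions directly (a reparametrization of the $\lambda$ terms) rather than fitting the $\lambda$ coefficients. The function names, the toy data, and the small ridge added before the Cholesky factorization are all hypothetical; the ridge is needed here because the jackknife covariance of a saturated table is rank-deficient.

```python
import numpy as np

def log_cell_props(cell, w, n_cells):
    """Log of weighted cell proportions (saturated fit; assumes no empty cell)."""
    counts = np.bincount(cell, weights=w, minlength=n_cells)
    return np.log(counts / counts.sum())

def jrr_estimate(cell, w, stratum, cluster, n_cells):
    """Step 1: delete one cluster at a time and refit with replication weights."""
    reps, factors = [], []
    for h in np.unique(stratum):
        in_h = stratum == h
        ids = np.unique(cluster[in_h])
        c_h = len(ids)
        for r in ids:
            drop = in_h & (cluster == r)
            w_rep = w.astype(float).copy()
            w_rep[drop] = 0.0                        # withdraw cluster r
            w_rep[in_h & ~drop] *= c_h / (c_h - 1)   # inflate the rest of stratum h
            reps.append(log_cell_props(cell, w_rep, n_cells))
            factors.append((c_h - 1) / c_h)
    reps, factors = np.asarray(reps), np.asarray(factors)
    beta_hat = reps.mean(axis=0)                     # average over the C replicates
    dev = reps - beta_hat
    cov = (dev * factors[:, None]).T @ dev           # JRR covariance matrix
    return beta_hat, cov

def draw_synthetic_tables(beta_hat, cov, n_draws, table_size, rng):
    """Steps 2 and 3: draw beta = beta_hat + T z, then sample a table of any size."""
    k = len(beta_hat)
    ridge = 1e-8 * np.trace(cov) / k + 1e-12         # hypothetical jitter (assumption)
    T = np.linalg.cholesky(cov + ridge * np.eye(k))
    tables = []
    for _ in range(n_draws):
        beta = beta_hat + T @ rng.standard_normal(k)
        pi = np.exp(beta - beta.max())
        pi /= pi.sum()                               # renormalize to proportions
        tables.append(rng.multinomial(table_size, pi))
    return np.asarray(tables)

# Toy data (entirely made up): 2 strata, 3 clusters each, a 2 x 3 table.
rng = np.random.default_rng(0)
n_obs, m_levels, n_levels = 600, 3, 2
stratum = rng.integers(0, 2, size=n_obs)
cluster = rng.integers(0, 3, size=n_obs)
cell = rng.integers(0, m_levels * n_levels, size=n_obs)
w = rng.uniform(1.0, 3.0, size=n_obs)

beta_hat, cov = jrr_estimate(cell, w, stratum, cluster, m_levels * n_levels)
tables = draw_synthetic_tables(beta_hat, cov, n_draws=5, table_size=10_000, rng=rng)
print(tables[0].reshape(n_levels, m_levels))
```

For a non-saturated model, `log_cell_props` would be replaced by a weighted log-linear fit returning the coefficient vector; the replication loop and the Cholesky draw stay the same.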

7.3 Results

The results are summarized in Table 7.2. For the total population and the larger subpopulations, the point estimates (posterior means) of health insurance rates are the same for both the nonparametric and log-linear approaches, and are almost identical to those obtained from the actual data after complex sampling design features are accounted for. Both methods yield synthetic populations with slightly higher (posterior) variances than the actual data, reflecting the information loss in the synthesis. In the NHIS, the loss for the nonparametric estimator averaged a little over 20% and was slightly greater than for the log-linear model, which averaged around 10%. Both had losses of about 10% over the actual data in MEPS.

However, for the smaller subpopulation (non-Hispanic whites earning $25,000-$35,000 per year), the log-linear model produced biased results, because it did not include all possible interactions. The nonparametric method yields estimates almost identical to those obtained from the actual data after complex sampling design features are accounted for. The log-linear model also substantially underestimated the variance of insurance coverage by 30-40% in these cells, versus an overestimation in the nonparametric approach of 10-40%.

Table 7.2 Estimates from actual data and from the synthetic populations (nonparametric and log-linear model) for the 2006 NHIS and MEPS.
[Table 7.2 layout. Rows: for each domain (Whole Population; Male; Non-Hispanic White; Non-Hispanic White & Income [25,000; 35,000)), the proportion and the variance for each coverage type (Private, Public, Uninsured). Columns: Actual Data (Complex Design), Synthetic Populations: Nonparametric, and Synthetic Populations: Log-linear Model, each reported for NHIS and MEPS. The numeric entries are not recoverable from this copy.]

8 Discussion

In this paper, we propose and evaluate a nonparametric method to generate synthetic populations. This method adjusts for the complex sampling design features without assuming any model for the observed data, so it is robust to model misspecification. Also, unlike model-based methods that need to develop separate imputation models for different variables of interest, the nonparametric method only uses the design variables to generate synthetic populations and thus is not variable-specific.

We considered the repeated sampling properties of our nonparametric synthetic estimators in a univariate gamma and bivariate normal setting, estimating means, slopes, and intercepts. Point estimates were unbiased, intervals had approximately nominal coverage, and losses of efficiency relative to the actual data were trivial. We also considered a real world setting, generating a predictive distribution for the 2006 NHIS and MEPS and estimating rates and associated variance estimates of health insurance coverage using both the nonparametric method and a fully parametric log-linear modeling approach. When the model fits the data well, the model-based method is more efficient than the nonparametric method. However, when the assumed model does not fit the data well, as was the case in certain small domains, the model-based method may produce invalid inference. In such situations, the nonparametric method is robust to model misspecification.

In addition to robustness to model misspecification, another advantage is that the nonparametric method only uses the design variables such as stratum, cluster and weight to impute the unobserved part of the population. Unlike model-based methods, it does not need to model the complicated relationships among the variables of interest, which becomes impossible if there are item missing values in the actual data. The synthetic populations generated by the nonparametric method still preserve the item missing values in the actual data. This potentially fills a gap in the multiple imputation area, in that existing imputation methods typically ignore the complex sampling design features in the data and impute the missing values as if they are simple random samples.

A related advantage is that, while design variables are used in the nonparametric generation of the synthetic populations, the synthetic populations themselves do not need to contain them, since they can be analyzed as simple random samples. Hence, disclosure risk associated with release of design variables can be eliminated (De Waal and Willenborg 1997; Mitra and Reiter 2006; Reiter and Mitra 2009).

A fourth practical advantage of the nonparametric method is that it is easier to implement in existing statistical software packages because it focuses on the design variables; thus specific strategies for various types of variables and data structures do not need to be developed.

Because use of the weighted FPBB does not require information about the number of clusters in the population or conditional probabilities of selection at each stage of selection in a multistage sample setting, we use an approximate Bayesian bootstrap method to adjust for stratification and clustering. We view this as advantageous in many ways, since public use datasets typically do not break out weights for each stage of the sample. However, it does have the disadvantage that, to ensure positive replicate weights, the Bayesian bootstrap method produces fewer clusters within strata than in the actual data. In the setting where the probabilities of selection are known for all stages of the sample, it seems likely that the weighted FPBB can be implemented at each stage, with the population of unobserved clusters and the population of elements within each cluster imputed in a two-stage fashion, paralleling Meeden (1999) just as the one-stage FPBB parallels Ghosh and Meeden (1983). This remains an area for future research.
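As a reminder of what the one-stage FPBB does, the sketch below implements the unweighted Pólya urn scheme that underlies it: a unit is drawn at random from the urn and returned together with a duplicate, until the urn holds a full population. It is only an illustration under simplifying assumptions. It ignores the survey weights, so it is not the weighted FPBB of Section 3, and the function name and toy values are hypothetical.

```python
import numpy as np

def polya_urn_population(sample, pop_size, rng):
    """Grow a sample of size n into a synthetic population of size pop_size.

    Unweighted Polya urn: each draw picks a unit uniformly from the current
    urn and adds one more copy of it, so units drawn early become more likely
    to be drawn again.
    """
    sample = np.asarray(sample)
    urn = list(range(len(sample)))           # indices of units currently in the urn
    for _ in range(pop_size - len(sample)):
        pick = urn[rng.integers(len(urn))]   # uniform draw from the urn
        urn.append(pick)                     # return it along with a duplicate
    return sample[np.asarray(urn)]

# Toy usage with made-up values: expand 5 observations to a population of 25.
rng = np.random.default_rng(42)
observed = np.array([2.1, 3.4, 0.7, 5.0, 1.8])
population = polya_urn_population(observed, pop_size=25, rng=rng)
print(population.mean())
```

The observed units always remain in the synthetic population, and only the unobserved part is imputed, which matches the description of imputing the unobserved part of the population.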

Acknowledgements

This research was supported by NCI grant R0CA290. The authors wish to thank the Editor, Associate Editor, and two anonymous reviewers for their comments. We are especially indebted to the reviewer who helped us to better understand and explain the links between the finite population Bayesian bootstrap and Pólya posterior discussed in Section 3.

References

Agresti, A. (2002). Categorical Data Analysis. New York: John Wiley & Sons, Inc.

Chen, Q., Elliott, M.R. and Little, R.J.A. (2010). Bayesian penalized spline model-based inference for finite population proportion in unequal probability sampling. Survey Methodology, 36, 1.

Cohen, M.P. (1997). The Bayesian bootstrap and multiple imputation for unequal probability sample designs. Proceedings of the Survey Research Methods Section, American Statistical Association.

de Waal, A.G., and Willenborg, L.C.R.J. (1997). Statistical disclosure control and sampling weights. Journal of Official Statistics, 13.

Dong, Q. (2012). Combining Information from Multiple Complex Surveys. Unpublished Thesis.

Elliott, M.R. (2007). Bayesian weight trimming for generalized linear regression models. Survey Methodology, 33, 1.

Elliott, M.R., and Little, R.J.A. (2000). Model-based approaches to weight trimming. Journal of Official Statistics, 16.

Ericson, W.A. (1969). Subjective Bayesian modeling in sampling finite populations. Journal of the Royal Statistical Society, B31.

Ghosh, M., and Meeden, G. (1983). Estimation of the variance in finite population sampling. Sankhyā: The Indian Journal of Statistics, B45.

Hinkins, S., Oh, H.L. and Scheuren, F. (1997). Inverse sampling design algorithms. Survey Methodology, 23, 1, 11-21.

Lazzeroni, L.C., and Little, R.J.A. (1998). Random effects models for smoothing poststratification weights. Journal of Official Statistics, 14.

Little, R.J.A. (1991). Inference with survey weights. Journal of Official Statistics, 7.

Little, R.J.A. (1993). Statistical analysis of masked data. Journal of Official Statistics, 9.

Little, R.J.A. (2004). To model or not to model? Competing modes of inference for finite population sampling. Journal of the American Statistical Association, 99.

Lo, A.Y. (1988). A Bayesian bootstrap for a finite population. Annals of Statistics, 16.

Meeden, G. (1999). A noninformative Bayesian approach for two-stage cluster sampling. Sankhyā: The Indian Journal of Statistics, B61.

Mitra, R., and Reiter, J.P. (2006). Adjusting survey weights when altering identifying design variables via synthetic data. Privacy in Statistical Databases: Lecture Notes in Computer Science, 4302.

Raghunathan, T.E., Reiter, J.P. and Rubin, D.B. (2003). Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19, 1-16.

Raghunathan, T.E., Xie, D.W., Schenker, N., Parsons, V.L., Davis, W.W., Dodd, K.W. and Feuer, D.J. (2007). Combining information from two surveys to estimate county-level prevalence rates of cancer risk factors and screening. Journal of the American Statistical Association, 102.

Rao, J.N.K., and Wu, C.F.J. (1988). Resampling inference with complex survey data. Journal of the American Statistical Association, 83.

Reiter, J.P. (2004). Simultaneous use of multiple imputation for missing data and disclosure limitation. Survey Methodology, 30, 2.

Reiter, J.P. (2005). Releasing multiply imputed, synthetic public use microdata: An illustration and empirical study. Journal of the Royal Statistical Society, A168.

Reiter, J.P., and Mitra, R. (2009). Estimating risks of identification disclosure in partially synthetic data. Journal of Privacy and Confidentiality, 1, 1, Article 6.

Rubin, D.B. (1987). Multiple Imputation for Non-Response in Surveys. New York: John Wiley & Sons, Inc.

Scott, A.J. (1977). Large sample posterior distributions in finite populations. The Annals of Mathematical Statistics, 42, 1113-1117.

Skinner, C., Holt, D. and Smith, T. (1989). Analysis of Complex Surveys. New York: John Wiley & Sons, Inc.
