An Application of PROC NLP to Survey Sample Weighting

Talbot Michael Katz, Analytic Data Information Technologies, New York, NY

ABSTRACT

The classic weighting formula for survey respondents compensates for differences between each cell's proportion of respondents and its proportion of the target population. Such weighting can also be applied to cells based on variables of interest (beyond the experimental design). If even one cell has no responses, the entire weighting has to be reconsidered. An optimal reapportionment that attempts to preserve row and column marginals is proposed, with a PROC NLP implementation.

Keywords: PROC NLP, nonlinear programming, weight adjustment, nonresponse.

INTRODUCTION

SAS software provides several tools for the design of surveys and the analysis of survey data. But even well-planned and well-executed surveys can suffer from nonresponse. Both the PROC SURVEYMEANS and PROC SURVEYREG documentation for SAS/STAT software contain the following passage in their sections on missing values: "Once data collection is complete, you can use imputation to replace missing values with acceptable values, and you can use sampling weight adjustments to compensate for nonresponse. You should complete this data preparation and adjustment before you analyze your data with PROC SURVEY[REG/MEANS]." [1]

Several methods of weighting adjustment are already in use. One of the simplest multiplies each weight by the sum of base weights over all sampled units divided by the sum of base weights over responders [2]. Some of the more sophisticated methods use auxiliary data to build predictive models for the probability of (non)response [3]. The appropriate weighting method may depend on the data available and the goals of the analysis.
The method proposed here is useful when the cells of interest are based on levels of two or more variables, and the goal is to keep the marginal weights of the respondents as close as possible to proportional to the marginal population sums.

PROPORTIONAL WEIGHTING

Suppose we start with a population of size $P$ and take a sample of size $S$. Suppose that the population can be split into subgroups $P_i$, $i = 1, \ldots, n$, and the sample splits into corresponding subgroups $S_i$. In a perfect world, $S_i / S = P_i / P$ for each $i$; then each sample subgroup has the correct proportion, and each individual in the sample can be given weight 1. In a still-sunny-but-slightly-less-than-perfect world, each $S_i > 0$; then each individual in group $i$ can be assigned a weight of

$w_i = (P_i / P) \cdot (S / S_i)$.

These weights can be used in ANOVA or other modeling to extrapolate back to the original population. This is classic proportional weighting. With this formula the individual weights add up to the sample size $S$, and the weighted share of each group equals its population share $P_i / P$.

Here is an easy example. Suppose the initial population is $P = 1000$, with $P_1 = 400$, $P_2 = 300$, $P_3 = 200$, $P_4 = 100$. Let $S = 100$, with $S_1 = 20$, $S_2 = 10$, $S_3 = 20$, $S_4 = 50$. Then $w_1 = 2$, $w_2 = 3$, $w_3 = 1$, $w_4 = 0.2$. Intuitively, groups 1 and 2 are low in the sample, so their individuals weigh more than 1; group 4 is high in the sample, so its members weigh less than 1; group 3 has the same proportion of the sample as it does of the general population, so its members have unit weight.

Proportional weighting breaks down if even a single cell is empty. Even if you decide to give the empty cell a weight of zero, the remaining weighted individuals no longer add up to the full total, because the empty cell's share of the population is left out. The easiest way to save proportional weighting in the presence of empty cells is to combine the empty cells with non-empty cells, if practical. Here is another easy example, with the same original population as the first example, but in this case $S_1 = 20$, $S_2 = 30$, $S_3 = 0$, $S_4 = 50$.
If we can combine groups 3 and 4, then the weights are $w_1 = 2$, $w_2 = 1$, $w_{3\text{-}4} = 0.6$. It is not always practical to combine cells, especially when the cells are created by values of two or more underlying variables (such as in a multifactorial design).
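The two examples above can be reproduced with a short DATA step. This is a minimal sketch, not part of the original paper; the data set name `cells` and the variable names `example`, `pop_count`, `samp_count`, and `weighted_count` are illustrative assumptions.

```sas
/* Illustrative names only: cells, example, pop_count, samp_count are assumptions. */
data cells;
   input example group $ pop_count samp_count;
   pop_total  = 1000;                 /* P */
   samp_total = 100;                  /* S */
   /* w_i = (P_i / P) * (S / S_i); every cell listed here is non-empty */
   weight = (pop_count / pop_total) * (samp_total / samp_count);
   weighted_count = weight * samp_count;   /* cell's contribution to the weighted total */
   datalines;
1 1   400 20
1 2   300 10
1 3   200 20
1 4   100 50
2 1   400 20
2 2   300 30
2 3-4 300 50
;
run;

/* Print the weights, with the weighted counts totaled within each example. */
proc print data = cells noobs;
   by example;
   sum weighted_count;
run;
```

In both examples the weighted counts total 100, the sample size, which is a quick check that the weights were computed from the right cell counts.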

EMPTY CELLS CREATED BY TWO OR MORE VARIABLES

Consider a population of 1000 workers who are classified in two ways: as employees (E) or contractors (C), and as SAS users (S) or the Unenlightened (U). These classifications can produce four subgroups, e.g., as follows:

            E      C    Total
   S      400    300      700
   U      200    100      300
   Total  600    400     1000

Then a sample of size 60 would get the following perfect weighting:

            E      C    Total
   S       24     18       42
   U       12      6       18
   Total   36     24       60

What if one of the actual sample quadrants is zero? Suppose the bottom right quadrant, CU, is zero. Then:

Combining CU and CS keeps the correct EC split, 36-24, but gives an SU split of 48-12.
Combining CU and EU keeps the correct SU split, 42-18, but gives an EC split of 42-18.
Combining CU and ES is hard to justify, and gives an EC split of 42-18 and an SU split of 48-12.

If the cells are proportionally reweighted, each cell would be multiplied by 60 / 54, giving cell ES a weight of 26.67, EU a weight of 13.33, and CS a weight of 20. This gives an EC split of 40-20 and an SU split of 46.67-13.33, sort of a compromise between the first two combinations above.

Another possible resolution would be to try to reweight each non-empty cell as close as possible to its proportionate value. This could be done by solving a least squares minimization. In the example above, we would minimize

$(24 - ES)^2 + (18 - CS)^2 + (12 - EU)^2$

subject to $ES + CS + EU = 60$. Substituting $ES = 24 + d_1$, $CS = 18 + d_2$, $EU = 12 + d_3$, this transforms to minimizing

$d_1^2 + d_2^2 + d_3^2$

subject to $d_1 + d_2 + d_3 = 6$. The solution to this is $d_1 = d_2 = d_3 = 2$, an even spread of the missing cell's weight to the other cells. This would result in an EC split of 40-20 and an SU split of 46-14, a slightly better compromise than proportional reweighting in this case.
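The even-spread solution can be checked with a Lagrange multiplier. The following derivation is a sketch added for illustration; the symbols $\mathcal{L}$, $\lambda$, $n$, and $M$ are introduced here and do not appear in the original paper.

```latex
% Minimize d_1^2 + d_2^2 + d_3^2 subject to d_1 + d_2 + d_3 = 6.
\mathcal{L}(d_1, d_2, d_3, \lambda)
  = d_1^2 + d_2^2 + d_3^2 - \lambda \, (d_1 + d_2 + d_3 - 6)

% Stationarity in each d_k gives the same value for every cell:
\frac{\partial \mathcal{L}}{\partial d_k} = 2 d_k - \lambda = 0
  \quad\Longrightarrow\quad d_k = \frac{\lambda}{2}, \qquad k = 1, 2, 3

% Substituting into the constraint:
3 \cdot \frac{\lambda}{2} = 6
  \quad\Longrightarrow\quad \lambda = 4, \qquad d_1 = d_2 = d_3 = 2

% The same argument with n non-empty cells and missing weight M gives
d_k = \frac{M}{n}, \qquad k = 1, \ldots, n
```

Because the objective is a sum of squares and the constraint is linear, this stationary point is the unique minimum.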
The cell-by-cell least squares reapportionment example above generalizes to any number of missing cells. The solution that gets the non-missing cells as close as possible to their proportionate values, in the least squares sense, is to spread the proportionate weight of the missing cells evenly among the remaining cells. This can be done in SAS without using anything as fancy as PROC NLP!

Proportional reweighting and even spread as practiced above always affect all the non-empty cells. A more targeted approach would be to leave unmodified the cells that share no values of the underlying variables with the empty cells, and only reweight the "guilty" cells that share variable values with the empty cells. In our example, the CS and EU cells are guilty and the ES cell is not. Guilty proportional reweighting would multiply the two guilty cells by 36 / 30, giving an EC split of 38.4-21.6 and an SU split of 45.6-14.4. Guilty even spread gives an EC split of 39-21 and an SU split of 45-15. The two guilty reapportionments are about equal to each other, and slightly better than the global reapportionments.

In this example, there actually is a reapportionment solution that perfectly maintains the marginals: ES = 18, CS = 24, EU = 18. However, it throws off the relative proportions of the individual cells more than the above solutions do. Also, in some cases, there is no perfect solution that maintains the marginals. If ES were the empty quadrant in the sample, instead of CU, then a perfect marginal solution would have to satisfy:

$CS + CU + EU = 60$, $CS + CU = 24$, $CU + EU = 18$; this would require $CU = -18$, which is impossible.

LEAST SQUARES MARGINALS MAINTENANCE

Least squares minimization can be applied to any combination of cells, not just the individual ones. In particular, least squares can be used to try to get the closest approximation to the individual variable marginal splits. For the ES = 0 problem in the previous example, minimize

$(42 - CS)^2 + (36 - EU)^2 + (18 - (EU + CU))^2 + (24 - (CS + CU))^2$

subject to $EU + CU + CS = 60$, with no cell allowed to go negative. The solution to this is CS = 33, EU = 27, CU = 0. Not very comforting, but this is a pretty extreme situation.

When there is a missing cell in a 2x2 table, there usually will be a unique solution to the least squares problem for the splits on the two individual variables. For larger problems, there may not be a unique solution. For example, in the 2x2 case above with no missing cells, the least squares set-up for the SU and EC splits is to minimize

$(42 - (CS + ES))^2 + (36 - (EU + ES))^2 + (18 - (EU + CU))^2 + (24 - (CS + CU))^2$

subject to $EU + ES + CU + CS = 60$. The true values, ES = 24, CS = 18, EU = 12, CU = 6, solve this exactly, but so do ES = 20, CS = 22, EU = 16, CU = 2, and infinitely many other combinations. 2x2 cases can be done by hand, but what can handle more complex minimizations?

PROC NLP

SAS has PROC NLP, a nonlinear optimizer, in the SAS/OR software module. It was introduced as an experimental release with SAS 6.08, was placed into production with SAS 6.09, and has been included with each subsequent release of SAS/OR. The archetypal problem is least squares minimization with linear constraints (of which the sample reweighting problem is an example), but since release 6.11 nonlinear constraints are also allowed. Please note that SAS/IML software also has nonlinear programming capabilities, and PROC NLIN in SAS/STAT uses some of the same techniques.
SAS 9 has completely revamped the SAS/OR tool set for release 9.2, and while PROC NLP remains available, the preferred method will be to use PROC NLPC or PROC OPTQP.

Here is PROC NLP syntax for the simple 2x2 example above where ES = 0.

   PROC NLP OUTEST = nlpout1 TECHNIQUE = CONGRA /* NOPRINT */;
      MIN objval;
      PARMS cs = 20, cu = 20, eu = 20;
      objval = (42 - cs)**2 + (36 - eu)**2 + (18 - (eu + cu))**2 + (24 - (cs + cu))**2;
      BOUNDS cs cu eu >= 0;
      LINCON 60 = cs + cu + eu;
   RUN;

PROC NLP Syntax Notes:

OUTEST contains the optimization solution, including optimal parameter values, the objective function value, and the right-hand sides of constraints.
TECHNIQUE: several solution techniques are available, most (not all) requiring derivative information on the objective function (the user can supply this independently, as in PROC NLIN, but doesn't always need to). CONGRA is conjugate gradient, which converges relatively easily.
PARMS: initial parameter values for the search.
LINCON: linear constraints.
BOUNDS: can have upper and lower bounds.

Here is an alternative syntax for the same problem:

   PROC NLP OUTEST = nlpout1 TECHNIQUE = CONGRA /* NOPRINT */;
      LSQ fc fe fs fu;
      PARMS cs = 20, cu = 20, eu = 20;
      fc = (24 - (cs + cu));
      fe = (36 - eu);
      fs = (42 - cs);
      fu = (18 - (eu + cu));
      BOUNDS cs cu eu >= 0;
      LINCON 60 = cs + cu + eu;
   RUN;

The key pieces of the PROC NLP set-up are the target values, the parameter variables, and the initial parameter variable values. The target values (42, 18, 36, 24 in the example above) are the proportional pieces of the sample for the groups of interest; in our case, there is a separate group for each individual trait variable level. There is one parameter variable for each non-empty Cartesian cell in the sample. The initial parameter values can be chosen in many ways; one way is to start with the actual sample counts in each cell.

To make this all useful, the task is to start with the initial data and go through the following steps:

For both the population and the sample, compute the total counts, the Cartesian cell counts, and the counts for each individual variable level.
Use the counts to determine the number of NLP parameters and the initial and target values, and generate the NLP step.
Translate the results of the NLP step into weights, and merge back with the initial data.
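The last step in the list, turning cell-level NLP results into respondent weights, can be sketched as follows. This is an illustration only, not the paper's code. The cell totals 33, 0, and 27 are the ES = 0 solution quoted in the text, but the respondent counts per cell (30, 10, 20) and all data set and variable names (`cell_solution`, `sample`, `emp_type`, `sas_use`, `resp_id`) are hypothetical.

```sas
/* Cell-level solution for the ES = 0 example (x values from the text),
   paired with HYPOTHETICAL respondent counts per cell. */
data cell_solution;
   length emp_type sas_use $ 1;
   input emp_type $ sas_use $ x count;
   weight = x / count;            /* cell total divided by cell count */
   datalines;
C S 33 30
C U 0 10
E U 27 20
;
run;

/* Build a HYPOTHETICAL respondent-level sample: one row per respondent. */
data sample;
   set cell_solution (keep = emp_type sas_use count);
   do resp_id = 1 to count;
      output;
   end;
   drop count;
run;

proc sort data = sample;
   by emp_type sas_use;
run;

/* Attach each cell's weight to its respondents (many-to-one merge). */
data sample_weighted;
   merge sample (in = in_sample)
         cell_solution (in = in_cell keep = emp_type sas_use weight);
   by emp_type sas_use;
   if in_sample and in_cell;
run;

/* The respondent weights should add back up to the sample total of 60. */
proc means data = sample_weighted sum;
   var weight;
run;
```

Dividing each cell total by its respondent count is the same conversion the macro below applies when it computes the weight as COL1 / COUNT.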
A SAMPLE MACRO FOR THE LEAST SQUARES MARGINAL MAINTENANCE REWEIGHTING

This macro has several input parameters, including:

   inlibp                  population input data set library
   indsp                   population input data set name
   inlibs                  sample input data set library
   indss                   sample input data set name
   outlibs                 output data set library
   outdst                  trait value data set name
   outdss                  match-weights-to-sample data set name
   work                    work library, or other library in which to save intermediate data sets
   numtrait                number of traits (variables) determining cells
   trait1, trait2, ...     names of trait variables
   letter1, letter2, ...   short names of trait variables
   numctr                  number of character trait variables (list them first)
   ncids                   total sample count
   wlb                     weight lower bound
   wub                     weight upper bound
   techneek                optimization technique for PROC NLP
   wgtvar                  weight variable name

   * FIND POPULATION TRAIT VALUE PERCENTAGES (TARGETS OF REWEIGHTING SCHEME);
   %LET maxnval = 0; %* largest number of individual trait values;
   %DO i = 1 %TO &numtrait.;

      PROC FREQ DATA = &inlibp..&indsp. NOPRINT;
         TABLES &&trait&i. / OUT = &work..ptr&i.;
      RUN;

      DATA _NULL_;
         SET &work..ptr&i. END = &last.;
         RETAIN mintgt &ncids.; * ncids is total sample count;
         CALL SYMPUT("tv&i._" || COMPRESS(_N_), COMPRESS(&&trait&i.)); * individual trait value;
         target = PERCENT * &ncids. / 100;
         IF target < mintgt THEN DO;
            mintgt = target;
         END;
         CALL SYMPUT("tp&i._" || COMPRESS(_N_), COMPRESS(target)); * target percentage;
         IF &last. THEN DO;
            CALL SYMPUT("nv&i.", COMPRESS(_N_)); * number of values of trait;
            CALL SYMPUT("mintgt", COMPRESS(mintgt)); * minimum target value;
         END;
      RUN;

      %IF &&nv&i. > &maxnval. %THEN %DO;
         %LET maxnval = &&nv&i.;
      %END;
   %END; %* trait i freq;

   %LET ntnv = %SYSEVALF(&numtrait. * &maxnval.); %* upper bound on number of NLP parameters;

   * FIND ALL CELLS REPRESENTED IN SAMPLE;
   PROC FREQ DATA = &inlibs..&indss. NOPRINT;
      TABLES &trait1.
      %DO i = 2 %TO &numtrait.;
         * &&trait&i.
      %END; %* trait i;
      / OUT = &work..smpcel1;
   RUN;

   * FIND WHICH VARIABLES GO WITH WHICH TRAIT VALUES (ONE VARIABLE PER UNIQUE CELL);
   DATA &outlibs..&outdst.;
      SET &work..smpcel1 END = &last.;
      ARRAY vc{1:&numtrait.,1:&maxnval.} vc1 - vc&ntnv.; * array to count number of variables which go with each trait value;
      RETAIN vc1 - vc&ntnv. 0;
      DROP i j vc1 - vc&ntnv. wlbc wubc;
      CALL SYMPUT("xi" || COMPRESS(_N_), COMPRESS(COUNT)); * use actual cell counts as initial variable values;
      wlbc = &wlb. * COUNT;
      wubc = &wub. * COUNT;
      CALL SYMPUT("wl" || COMPRESS(_N_), COMPRESS(wlbc)); * to get proper lower bound on weight, have lower bound on cell variable be weight lower bound times cell count;
      CALL SYMPUT("wu" || COMPRESS(_N_), COMPRESS(wubc)); * to get proper upper bound on weight, have upper bound on cell variable be weight upper bound times cell count;
      %DO i = 1 %TO &numtrait.;
         %LET li = &&letter&i.;
         SELECT (&&trait&i.);

            %DO j = 1 %TO &&nv&i.;
               WHEN
               %IF &i. LE &numctr. %THEN %DO;
                  ("&&&tv&i._&j.")
               %END; %* assume char variables listed first;
               %ELSE %DO;
                  (&&&tv&i._&j.)
               %END;
               DO; %* create list of vars with level j for trait i;
                  vc{&i.,&j.} + 1;
                  CALL SYMPUT("x&li._&j._" || COMPRESS(vc{&i.,&j.}), COMPRESS(_N_));
               END;
            %END; %* j 1 to nvi;
            OTHERWISE;
         END;
      %END; %* i 1 to numtrait;
      IF &last. THEN DO;
         CALL SYMPUT("numxvar", COMPRESS(_N_)); * number of variables;
         DO i = 1 TO &numtrait.;
            DO j = 1 TO &maxnval.;
               CALL SYMPUT("vc" || COMPRESS(i) || "_" || COMPRESS(j), COMPRESS(vc{i,j}));
            END; * j;
         END; * i;
      END;
   RUN;

   * SET UP PROC NLP;
   PROC NLP OUTEST = &work..nlpout1 NOPRINT TECHNIQUE = &techneek.;
      MIN objval;
      PARMS x1 = &xi1.
      %DO i = 2 %TO &numxvar.;
         , x&i. = &&xi&i.
      %END; %* i 2 to numxvar;
      ;
      BOUNDS
      %LET numxvar1 = %SYSEVALF(&numxvar. - 1);
      %DO i = 1 %TO &numxvar1.;
         &&wl&i. <= x&i. <= &&wu&i.,
      %END; %* i 1 to numxvar1;
      &&wl&i. <= x&i. <= &&wu&i.;
      %LET notfirst = 0;
      LINCON &ncids. = x1
      %DO i = 2 %TO &numxvar.;
         + x&i.
      %END; %* i 2 to numxvar;
      ;
      objval =
      %DO i = 1 %TO &numtrait.;
         %LET li = &&letter&i.;
         %DO j = 1 %TO &&nv&i.;
            %IF &&&vc&i._&j. %THEN %DO; %* term irrelevant if no sample cells exist for it;
               %IF &notfirst. %THEN %DO; %* plus sign to add successive terms after first;
                  +

               %END;
               %ELSE %DO;
                  %LET notfirst = 1;
               %END;
               %LET wtij = &&trwt&i.;
               &wtij. * (&&&tp&i._&j. - (x&&&x&li._&j._1.
               %DO k = 2 %TO &&&vc&i._&j.;
                  + x&&&x&li._&j._&k.
               %END; %* k 2 to vci_j;
               ))**2
            %END; %* vci_j > 0;
         %END; %* j 1 to nvi;
      %END; %* i 1 to numtrait;
      ;
   RUN;

   * EXTRACT SOLUTION FOR MATCHING WITH TRANSLATION SET;
   PROC TRANSPOSE DATA = &work..nlpout1 (DROP = _NAME_ WHERE = (_TYPE_ = "PARMS")) OUT = &work..nlparms1;
      VAR x1 - x&numxvar.;
   RUN;

   * MATCH SOLUTION WITH TRANSLATION SET AND SOLVE FOR WEIGHTS;
   DATA &outlibs..&outdst.;
      MERGE &work..nlparms1 &outlibs..&outdst.; * merge had better be one to one;
      DROP COL1 wlb wlbc wub wubc;
      wlb = &wlb.;
      wlbc = &wlb. * COUNT;
      wub = &wub.;
      wubc = &wub. * COUNT;
      IF COL1 < wlbc THEN DO;
         &wgtvar. = &wlb.;
      END;
      ELSE IF COL1 > wubc THEN DO;
         &wgtvar. = &wub.;
      END;
      ELSE DO;
         &wgtvar. = COL1 / COUNT;
      END;
   RUN;

   * SORT SAMPLE TO MATCH WITH TRANSLATION SET AND APPLY WEIGHTS;
   PROC SORT DATA = &inlibs..&indss. OUT = &work..indsrt1;
      BY
      %DO i = 1 %TO &numtrait.;
         &&trait&i.
      %END; %* i 1 to numtrait;
      ;
   RUN;

   * MATCH WEIGHTS TO SAMPLE;
   DATA &outlibs..&outdss.;
      MERGE &work..indsrt1 (IN = ins)
            &outlibs..&outdst. (IN = ino KEEP = &wgtvar.
            %DO i = 1 %TO &numtrait.;

               &&trait&i.
            %END; %* i 1 to numtrait;
            ) END = &last.;
      BY
      %DO i = 1 %TO &numtrait.;
         &&trait&i.
      %END; %* i 1 to numtrait;
      ;
      DROP ctm ctn cts cto;
      IF ins THEN DO;
         IF ino THEN DO;
            ctm + 1;
            OUTPUT;
         END;
         ELSE DO;
            cts + 1; * should be zero;
         END;
      END;
      ELSE IF ino THEN DO;
         cto + 1; * should be zero;
      END;
      ELSE DO;
         ctn + 1; * must be zero;
      END;
      IF &last. THEN DO;
         PUT "ctm = " ctm;
         PUT "cts = " cts;
         PUT "cto = " cto;
         PUT "ctn = " ctn;
      END;
   RUN;

CONCLUSIONS

We have seen that reweighting to handle empty cells confronts us with many possible choices; different ones may be desirable depending upon circumstances. Many of the reweighting schemes are easy to apply. We showed that under many conditions, it is possible for a more sophisticated reweighting scheme to preserve the marginal distribution. This involves minimizing a quadratic objective function, and may best be accomplished with the assistance of nonlinear optimization software, such as PROC NLP in SAS/OR.

REFERENCES

[1]
[2] Department of Energy, 1995 Commercial Buildings Energy Consumption Survey.
[3] S.L. Vartivarian and R. Little, "Weighting Adjustments for Unit Nonresponse with Multiple Outcome Variables," University of Michigan Department of Biostatistics Working Paper Series, 2003.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Talbot Michael Katz
Analytic Data Information Technologies
229 East 21st Street, #2
New York NY
topkatz@msn.com
