Using Taylor's Linearization Technique in StEPS to Estimate Variances for Non-Linear Survey Estimators


Roger L. Goodwin, U.S. Bureau of the Census, Washington, DC 20233
Katherine J. Thompson, U.S. Bureau of the Census, Washington, DC 20233

Abstract: Estimating variances of non-linear functions involving two or more random variables is a challenging problem in survey sample variance estimation. A common approach to solving this problem is to linearize such functions using Taylor series methods, then estimate the variance of the linearized function using (1) a vector of first-order derivatives evaluated at the point estimates and (2) the variance-covariance matrix of all the function variables. This document describes the SAS macro used to implement the Taylor's Linearization Technique for variance estimation in the U.S. Census Bureau's Standard Economic Processing System (StEPS). All StEPS input parameters and calculated estimates are stored in SAS data sets using standard file formats. There are several implementation considerations associated with using these standard files, which are discussed in detail in the paper. Our macro uses BASE/SAS to evaluate the derivatives at the point estimates, then builds the variance-covariance matrix in PROC IML. The evaluated derivatives are read into PROC IML, from which we obtain the variance estimates using simple matrix multiplication. This paper is intended for people with an interest in BASE/SAS, SAS macros, and PROC IML for calculating variance estimates.

KEYWORDS: variance estimation, non-linear functions, SAS macro, SAS data steps, PROC IML

Background on Taylor's Linearization Technique: Let f be a non-linear function of two or more random variables, d be a column vector of derivatives of f evaluated at point estimates of the means of the function variables, and S be the variance-covariance matrix of all the function variables.
The Taylor Linearization Technique approximates the variance of f evaluated over the function variables with the expression d`Sd (Wolter, 1985; Sarndal, Swensson, and Wretman, 1992). If f is the ratio of two random variables X and Y, then there is a simple expression for the Taylor linearized variance, namely

   VAR(X/Y) ≈ (X/Y)² [ VAR(X)/X² + VAR(Y)/Y² − 2·COV(X,Y)/(X·Y) ].

This formula is hard-coded into %TAYLOR; all other non-linear functions require derivative formulas along with the associated point estimates, standard errors, and covariances. Ratio estimates are the most common type of non-linear estimator published by the Census Bureau's economic surveys. Hard-coding the formula for ratio estimates reduces implementor burden, since the user does not have to key in (and verify) expressions for several sets of derivatives, and it also decreases the potential for specification errors.

What Is StEPS? The Standardized Economic Processing System (StEPS) is a generalized survey processing system used in the Economic Directorate of the U.S. Census Bureau to process over 100 current economic surveys (Tasky and Ahmed, 1999). It is written entirely in SAS and operates in a UNIX environment. StEPS contains integrated modules for data-collection support, editing, data review and correction, imputation, calculation of estimates and variances, and system administration. The estimation and variance module consists of a set of SAS macros, each of which performs a specific estimation function (Sigman, 2000). StEPS users control estimation via scripts, which are SAS programs that invoke existing StEPS estimates and variances macros in a user-specified sequence. %TAYLOR is one of the StEPS estimates and variances macros.

StEPS stores macro-data in estimation results files (ERFs). See Figure 1. One ERF corresponds to one table, which is the result of StEPS performing calculations on analysis variables for individual values of categorical BY variables.
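Both routes to the ratio variance, the hard-coded shortcut and the general d`Sd form, can be illustrated with a short language-independent sketch (Python here, with invented numbers; %TAYLOR itself does all of this in SAS):

```python
# Taylor variance of a ratio X/Y two ways: the hard-coded shortcut that
# %TAYLOR uses for ratios, and the general d`Sd form. All numbers below
# are invented purely for illustration.
x, y = 150.0, 600.0
var_x, var_y, cov_xy = 9.0, 16.0, 2.5

# Hard-coded formula:
# VAR(X/Y) ~ (X/Y)^2 * [ VAR(X)/X^2 + VAR(Y)/Y^2 - 2*COV(X,Y)/(X*Y) ]
var_ratio = (x / y) ** 2 * (var_x / x ** 2 + var_y / y ** 2
                            - 2 * cov_xy / (x * y))

# General form: d holds the partial derivatives of f(X, Y) = X/Y
# evaluated at the point estimates; S is the variance-covariance matrix.
d = [1 / y, -x / y ** 2]
S = [[var_x, cov_xy],
     [cov_xy, var_y]]
var_taylor = sum(d[i] * S[i][j] * d[j] for i in range(2) for j in range(2))

assert abs(var_ratio - var_taylor) < 1e-12   # the two computations agree
```

The agreement of the two computations is exactly the check described later in the paper for verifying the non-ratio code path.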
This paper reports the results of research and analysis undertaken by Census Bureau staff. It has undergone a Census Bureau review more limited in scope than that given to official Census Bureau publications. This report is released to inform interested parties of ongoing research and to encourage discussion of work in progress.

An ERF contains bookkeeping variables such as the date and time of last modification, the name of the program that made the last modification, survey name, statistical period, etc. In addition to the bookkeeping variables, an ERF contains these key data fields:

ITEM1 and ITEM2: the names of the analysis variable(s). For example, a total would only need to use the ITEM1 column. A covariance would need to use both the ITEM1 and ITEM2 columns.

BY1, BY2, ...: every combination of every level of the BY variables in the table (including a total, denoted by .A).

TYPE: a string describing the type of estimate (e.g. an estimate of a total, denoted by EST; a standard error, denoted by STDERR; a covariance, denoted by COV; or a coefficient of variation, denoted by CV). The TYPE2 and TYPE3 columns further describe the estimate, such as whether it is an unadjusted or adjusted estimate and whether it is a total or a ratio.

NVALUE: the calculated value of the estimate listed in the ITEM1 column.

Where do StEPS estimation macros, such as %TAYLOR, get information on what to estimate and what data to use in the estimation? In part (at least for %TAYLOR), StEPS estimation macros read two files: the estimation specification file (ESF) and the estimation formula file (EFF). Both files are organized by table number. The EFF stores SAS expressions and SAS code used by the estimation macros. The ESF contains all other parameters used in the estimation modules. See Sigman (2000) for more details on the ESF and EFF. %TAYLOR uses the following member names to identify the input files for each table:

ERxxxx: an estimation results file for table xxxx containing all necessary point estimates, standard errors of point estimates, and covariances of the function variable(s).

ESF: an estimation specifications file containing specifications for BY variables and derived estimates (estimates that are functions of linear estimates) and associated derivatives.
If the expression for a derivative is less than 30 characters long, then the ESF also contains the SAS expression for the derivative function. See Figure 2.

EFF: an estimation formulas file containing the expressions for all other derivative functions. See Figure 3.

%TAYLOR is not limited to processing just one table at a time. A macro call to %TAYLOR invokes a %DO loop that processes every table the user specifies via the global macro variables TABLE1, TABLE2, ..., TABLEn.

Introduction to the Code: Set-Up

The first part of %TAYLOR splits up the derived estimates in a given table into two different processing sets: 1) a set of all ratio estimates, and 2) a set of all other non-linear estimates. The hard-coded formula for calculating the variance is applied to the ratio estimates, and Taylor's Linearization Technique is applied to the other non-linear estimates. The ESF data set name is stored in the macro variable &PARMS. The variable TABLE is used to select a particular table. The type of information stored in each ESF record is determined by the contents of OBJ_TYPE, as follows:

BY: a list of the BY (classification) variables.
TOTALS: a list of the survey-specific analysis variables.
DERIVE: a list of variables and expressions derived from the list of TOTALS variables. Derived estimates are calculated from totals estimates.

Depending on the contents of the OBJ_TYPE variable, the ESF variables VAL1, VAL2, VAL3, VAL4, CHAR1, CHAR2, STRING, etc. will be populated differently. See Figure 2 and Figure 4 for some examples of OBJ_TYPE = DERIVE records. In the following code, the macro variable RATIO is set to zero if no ratios exist, and is greater than zero if ratios exist.

/* create a dataset for the table that contains all ratio
   derived estimates. these records all have char1=E,
   char2 ne blank, and the first word in string = RATIO */
proc sql;
   create view ratio as
   select trim(scan(p.string,2)) as num,
          trim(scan(p.string,3)) as den,
          p.char3, p.val2, p.val3, p.val4
   from &parms as p

3 Figure. An Example of an Estimation Results File (ERF) Figure. An Example of the OBJ_TYPE = DERIVE Records of an ESF for Ratios Figure 3. En Example of an Estimation Formula File (EFF) Figure 4. An Example of the OBJ_TYPE = DERIVE Records of an ESF that Contains Derivatives

   where p.table eq upcase("&&table&tabno")
     and trim(scan(p.string,1)) eq "RATIO"
     and upcase(trim(p.obj_type)) eq "DERIVE"
     and p.char2 ne " "
     and p.char1 eq "E";
quit;

proc sql noprint;
   select count(*) into :ratio from ratio;
quit;

%if &ratio gt 0 %then %do;
   /* Use the hard-coded formula for calculating
      the variances of ratios */
%end;

For ratios, the standard errors and CVs are calculated for each BY level using the hard-coded formula. The code for the ratio portion of the program is quite straightforward. Consequently, the rest of the paper will deal exclusively with the code for calculating variances of non-ratio, non-linear functions of random variables.

After processing ratios (if any), %TAYLOR next looks for other non-linear, non-ratio functions. The following code sets the macro variable OTHER to 1 if there are non-linear, non-ratio functions of random variables.

/* look for non-linear functions excluding ratios */
%let other=0;
data _null_;
   set &parms;
   if (upcase(table) = upcase("&&table&tabno") and
       upcase(char1) = 'D' and
       trim(scan(string,1)) ne "RATIO")
   then call symput('other', '1');
run;

%if &other gt 0 %then %do;
   /* use Taylor's Linearization Technique for
      non-linear (non-ratio) estimators */
%end;

If the macro variable OTHER is 0, then either there are no non-linear functions or the non-linear functions are all ratios. We verified that the OTHER portion of the program was correct by calculating the variances of some ratios via the hard-coded formula and via Taylor's method. The results matched exactly.

Reading the Derivatives from the ESF and EFF:

A derivative can reside either in the ESF or the EFF. If the derivative is less than 30 characters long, it resides in the ESF. See Figure 4, observations 1 thru 5. If the derivative is 30 characters or longer, it resides in the EFF. %TAYLOR knows to look in the EFF by scanning the STRING variable in the ESF for the word CODE. For example, suppose you have a non-linear function f of four random variables X1, X2, X3, and X4, with one derivative of f taken with respect to each of the four variables.

As described earlier, the ESF variables VAL1, VAL2, VAL3, VAL4, CHAR1, CHAR2, STRING, etc. will be populated differently for these records than for the ratio records. See Figure 2 and Figure 4.

Figure 4 contains an example of an ESF with such a function and its derivatives. The derivatives are stored in OBJ_TYPE = DERIVE records of the ESF that have CHAR1 = D. Derivatives are defined in terms of variables specified on OBJ_TYPE = TOTALS records or on OBJ_TYPE = DERIVED records that have CHAR1 = E. Note that the same function name (stored in the VAL1 column; in this case it is just F) is used for multiple derivatives. In general, if a function is made up of two random variables, then the same function name is used for those two derivatives. If a function is made up of three random variables, then the same function name is used for those

three derivatives, and so on. The flag D in the CHAR1 field is used to distinguish this type of ESF record as a specification for a derivative. The VAL2 field contains the variable that the function was differentiated with respect to. The %TAYLOR macro does not assume any particular ordering of the derivatives in the ESF.

/* Read in the derivatives from the ESF and EFF files. */
data formulas;
   set &parms (where = (upcase(table) = upcase("&&table&tabno")
                        and upcase(char1) = 'D'));
   keep val1 val2 char3 string;
   i + 1;
   call symput("form"||trim(left(put(i,5.0))), upcase(trim(left(string))));
   call symput("va"||trim(left(put(i,5.0))), val2);
   call symput("ra"||trim(left(put(i,10.0))), trim(left(val1)));
   call symput("ra2"||trim(left(put(i,10.0))),
               trim(left(val1))||trim(left(put(i,5.))));
   /* the following variable is created to solve the 8-character
      limitation of variable names. */
   call symput("oth"||trim(left(put(i,5.0))), "o"||trim(left(put(i,5.))));
   /* store the last value of the index i */
   call symput("evar", trim(left(put(i,10.))));
run;

If the derivatives are in the EFF (not in the ESF), then the word CODE will appear in the STRING field of the ESF. The following SAS code checks for derivatives in the EFF. The matching keys to the EFF in the code below are:

1) The table number: field name TABLE.
2) The function name: field name VAL1.
3) The variable name of the wrt derivative: field name VAL2.

The following short macro, %GETMORE, was written to overcome SAS's objection to having %DO loops in open code.

/* get derivatives from the EFF file if any */
%macro getmore;
   %do i = 1 %to &evar;
      %if %bquote(%trim(%left(&&form&i))) = CODE %then %do;
         data _null_;
            set &moreform (where=(val1="&&ra&i" and val2="&&va&i"
                           and upcase(table) = %upcase("&&table&tabno")));
            call symput("form"||trim(left(put(&i,5.0))),
                        upcase(scan(code_,2,'=')));
         run;
      %end;  /* of if-then-do condition */
   %end;     /* of i do loop */
%mend getmore;
%getmore;

The %GETMORE macro replaces the word CODE (read from the ESF) with the appropriate derivative expression.
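The ESF/EFF lookup can be mimicked in a few lines. The following Python sketch stands in for the SAS data sets; the records, the function F, and its derivative expressions are invented for illustration:

```python
# Sketch of the ESF/EFF split: short derivative expressions live in the
# "ESF" records; records whose STRING field says CODE point to the "EFF".
# All structures, names, and expressions here are illustrative stand-ins.
esf = [  # (function name, wrt-variable, STRING field)
    ("F", "X1", "X2/(X3*X4)"),      # short expression: stored inline
    ("F", "X2", "CODE"),            # long expression: look it up in the EFF
]
eff = {("F", "X2"): "X1/(X3*X4)"}   # keyed by (function, wrt-variable)

def resolve(func, wrt):
    """Return the derivative expression, following CODE into the EFF."""
    for f, v, s in esf:
        if (f, v) == (func, wrt):
            return eff[(f, v)] if s == "CODE" else s
    raise KeyError((func, wrt))

# Evaluate the resolved derivatives at point estimates, as %EQTN does
# inside a data step (eval() stands in for the generated SAS statements).
pts = {"X1": 10.0, "X2": 20.0, "X3": 4.0, "X4": 5.0}
d = [eval(resolve("F", v), {}, pts) for v in ("X1", "X2")]
print(d)   # -> [1.0, 0.5]
```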
Evaluating the Derivatives at Given Point Estimates:

After reading the derivatives, to simplify the subsequent BY statements, we decided to create one giant BY variable, MYBY, which concatenates the value of each BY variable at every BY level. For example, suppose you have two BY variables, BY1 = NAICS and BY2 = STATE, as shown in Table 1:

Table 1. Two BY Variables with Various Levels

  BY1 = NAICS   BY2 = STATE
  35000         VA
  35000         MD
  35000         PA
  36000         VA
  36000         MD
  36000         PA

  600           VA
  600           MD
  600           PA

Instead of typing the following %DO loop for each data step:

data dsn;
   set dsn;
   by %do i = 1 %to &count; &&by&i %end; ;

create the MYBY variable in a data step as follows:

data table;
   set dsn;
   keep %do i = 1 %to &count; &&by&i %end; myby;
   myby = trim(left(put(by1, 10.0)))
          %do i = 2 %to &count;
             || trim(left(put(by&i, 10.0)))
          %end; ;

which would yield Table 2:

Table 2. MYBY Look-up Table

  BY1 = NAICS   BY2 = STATE   MYBY
  35000         VA            35000VA
  35000         MD            35000MD
  35000         PA            35000PA
  36000         VA            36000VA
  36000         MD            36000MD
  36000         PA            36000PA
  600           VA            600VA
  600           MD            600MD
  600           PA            600PA

Some notes: The PUT statement will work on both numeric variables and character variables. The MYBY table will be needed at the very end of the %TAYLOR macro to put the results into the ERF (which has a standard format that includes BY1, BY2, etc.).

As you will see, macro %SPLIT, which is described next, adds another column to the MYBY look-up table because SAS v6.12 limited data set names to 8 characters. Even with the expansion of data set name lengths in SAS v8, we still cannot guarantee that the BY1, BY2, etc. values will be less than 32 characters long when concatenated together.

Next, the macro %SPLIT divides the estimates in the ERF into smaller data sets, one for each value of the MYBY variable, since each MYBY level must have its own vector of evaluated derivatives and its own variance-covariance matrix. Ideally, we would have liked to name these smaller data sets after the MYBY level (e.g. B35000VA; see Destiny, 1998), but we could not, because MYBY can be longer than 8 characters (a SAS v6.12 limitation). Once the estimates have been divided up, the functions read in a previous data step are evaluated. The %SPLIT macro has two input parameters:

1) the data set name containing the point estimates (the ERF), and
2) the BY variable names (encoded as MYBY).

The data set resolved by &DSN was created when reading the point estimates.
We must map the concatenated BY values back to the appropriate detached BY variables.

/* Divide the original data set into smaller data sets, one for
   each "by" value. The macro has been modified from Macros in
   SAS Software to accommodate evaluating the derivatives at
   point estimates. Reference page 5 of Macros in SAS Software. */
%macro split(inputds, byvar);
%global numobs;
data _null_;
   set by_table end=eof;
   if eof then call symput('numobs', put(_n_, 5.));
data %do i = 1 %to &numobs; b&i %end; ;
   set &inputds;
   %let else=;

%do i = 1 %to &numobs;
   &else if &byvar = "&&b&i" then output b&i;
   %let else = else;
%end;  /* of i do loop */

/* Create macro variables for the estimates. Use two indices on the
   macro variable name: 1) identify the data set, 2) identify the
   variable */
%do j = 1 %to &numobs;
data b&j;
   set b&j end=eof;
   if _n_ = 1 then i=0;
   i + 1;
   call symput("var"||trim(left(put(&j,5.0)))||trim(left(put(i,5.0))),
               item1);
   call symput("val"||trim(left(put(&j,5.0)))||trim(left(put(i,5.0))),
               trim(put(nvalue, 14.5)));
   if eof then call symput("nvar"||trim(left(put(&j,5.0))), put(_n_,5.0));
%end;  /* of j do loop */

The point estimates and the derivatives for each MYBY level are put into the same data step using macro variables. The point estimates are assigned first for each variable in the derivative. Next, the derivatives are evaluated (e.g. VA1 = 1/200000). The derivatives were arbitrarily named VA1, VA2, VA3, etc. (the order in which the differentiation variables were read in is what matters). This is done for each B1, B2, B3, ... data set created in the previous code. Finally, the data sets are transposed for easy PROC IML manipulation.

/* put the point estimates in a data step with the derivatives
   for execution. */
%do j = 1 %to &numobs;
data b&j;
   set b&j;
   /* initialize the variables */
   %do k = 1 %to &&nvar&j;
      &&var&j&k = &&val&j&k;
   %end;
   /* eqtn is a macro to place the derivatives in the data step */
   %eqtn;
proc transpose data=b&j out=b&j;
%end;  /* of j do loop */
%mend split;

%split(&dsn, myby);

Each data set created by macro %SPLIT (e.g. B1, B2, B3, ...) contains the evaluated derivatives for each function and the variable that was differentiated with respect to. These data sets will be read into PROC IML to form the d vectors for calculating variances. Whatever survey-specific variable the derivative of the function was taken with respect to must appear in the covariance matrix in the same order.
The d vector (in PROC IML) contains the derivatives evaluated at the point estimates for each MYBY level for all of the functions. Thus, the d vector gets separated into smaller vectors, one for each function, in PROC IML. Reading the variances and covariances from a StEPS ERF is very similar to reading the point estimates.

What Happens to the MYBY Table? The MYBY table gets updated as follows:

Table 3. Updated MYBY Look-up Table

  BY1 = NAICS   BY2 = STATE   MYBY      Data Set Name
  35000         VA            35000VA   B1
  35000         MD            35000MD   B2
  35000         PA            35000PA   B3
  36000         VA            36000VA   B4
  36000         MD            36000MD   B5
  36000         PA            36000PA   B6
  600           VA            600VA     B7
  600           MD            600MD     B8
  600           PA            600PA     B9

Filling in the Data Set Name column is very easy. So far in %TAYLOR, no sorting has been done on the ERF or the ESF; those two files are in the exact same order as the last user left them. The only sorting in %TAYLOR occurs at the very end, to update the ERF with the standard errors and CVs via a matched merge.
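The MYBY construction and the grouping that %SPLIT performs can be sketched as follows (Python, with invented records; in %TAYLOR this is done with SAS macro loops and data steps):

```python
# Sketch of the MYBY idea: concatenate the BY values into one key,
# split the estimates into one group per key, and keep a look-up table
# that maps each key back to its detached BY values via a short group
# name (B1, B2, ...). The records below are illustrative, not a real ERF.
records = [
    {"NAICS": 35000, "STATE": "VA", "NVALUE": 1.2},
    {"NAICS": 35000, "STATE": "MD", "NVALUE": 3.4},
    {"NAICS": 36000, "STATE": "VA", "NVALUE": 5.6},
]
by_vars = ["NAICS", "STATE"]

groups, lookup = {}, []
for rec in records:
    myby = "".join(str(rec[v]).strip() for v in by_vars)   # e.g. "35000VA"
    if myby not in groups:
        groups[myby] = []
        # B1, B2, ... play the role of the short data set names
        lookup.append({"MYBY": myby, "DATASET": "B%d" % len(groups)})
    groups[myby].append(rec)

print(sorted(groups))        # -> ['35000MD', '35000VA', '36000VA']
print(lookup[0]["DATASET"])  # -> 'B1'
```

The look-up table plays the same role as Table 3: it is what lets the results be matched back to the standard BY1, BY2, ... columns of the ERF at the end.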

PROC IML: Building the Covariance Matrix

Building the covariance matrix involves ensuring that the ordering in the d vector corresponds with the ordering of the survey-specific variables in the covariance matrix S. The data sets created in the %SPLIT macro are brought into PROC IML with the USE command. A macro %DO loop executes PROC IML on each of the data sets created in the %SPLIT macro. There are &numobs data sets to be processed, one for each level of the BY variable. The data set names are accessed by the macro variable B&i.

Each ERF contains the same column names. Generally speaking, however, the column names are not in the same order from one ERF to the next. Thus, the column names must be read into PROC IML. The indices of the ITEM1, ITEM2, and TYPE column names must be stored in scalar variables; those scalar variables will be used as indices when building the covariance matrix. The CC vector contains the indices to the point estimates to be used in calculating the standard error. By creating the CC vector, the user can key in the derivative expressions in any order under StEPS. The vector R contains the names of the variables differentiated with respect to (i.e. the VAL2 StEPS variable). The vector Z contains the standard errors or covariances (actual numbers in this one) of the variables that were differentiated with respect to.

/* Fill the main diagonal of the covariance matrix */
m = nrow(w);             /* the matrix W contains non-numeric ERF data */
S = J(&evar, &evar, 0);  /* this will be the covariance matrix with
                            actual numbers in it */
B = J(&evar, &evar, " ");
/* The matrix B is used to store the variable names used on the
   diagonal of S. From that, the off-diagonal elements of B are then
   set. Without these names, it is impossible to fill in the rest of
   the S matrix. */
do p = 1 to m;       /* loop thru the ERF column names for matches */
   do q = 1 to k-1;  /* k is 1 plus the number of diagonal elements */
      if ((trim(w[p, item1]) = trim(r[cc[q]])) &
          (trim(w[p, type]) = "STDERR") &
          (trim(ratio) = trim(y[cc[q]]))) then do;
         S[q, q] = z[p]**2;    /* put the variance on the diagonal */
         B[q, q] = r[cc[q]];   /* store diagonal variable names */
         do a = 1 to q-1;
            B[q, a] = r[cc[a]];  /* store off-diagonal variable names */
         end;  /* of a loop */
      end;  /* of if-then-do statement */
   end;  /* of q loop */
end;  /* of p loop */

The technique is very simple to do by hand. Let's say you took derivatives of the function f with respect to the variables X1, X2, X3, and X4, and you wish to form a covariance matrix from a column of standard errors and covariances. The dimensions of the covariance matrix are 4 x 4. When the DO p and DO q nested loops execute, we have:

      | X1          |
  B = |    X2       |
      |       X3    |
      |          X4 |

When the DO a loop executes, we have:

      | X1          |
  B = | X1 X2       |
      | X1 X2 X3    |
      | X1 X2 X3 X4 |

Thus, the covariance matrix S should look something like:

      | VAR(X1)                                       |
  S = | COV(X1,X2)  VAR(X2)                           |
      | COV(X1,X3)  COV(X2,X3)  VAR(X3)               |
      | COV(X1,X4)  COV(X2,X4)  COV(X3,X4)  VAR(X4)   |

The matrix S contains the variances and covariances of the variables involved in the derivatives. Note that the IF-THEN conditions use the scalar variables that identify the columns for the ITEM1 (variable names of estimates), ITEM2 (more variable names of estimates, when appropriate), and TYPE (type of estimate, e.g. standard error, covariance, etc.) non-numeric data in the ERF. The conditions look slightly complicated because, for example, COV(X1, X2) = COV(X2, X1), and no particular ordering is assumed.

do k = 2 to q-1;      /* loop thru the rows of B */
   do p = 1 to k;     /* loop thru the columns of B */
      do a = 1 to m;  /* loop thru the rows of W. W contains
                         non-numeric ERF data */
         if (((trim(b[k, p]) = trim(w[a, item1])) &
              (trim(b[k, k]) = trim(w[a, item2]))) |
             ((trim(b[k, k]) = trim(w[a, item1])) &
              (trim(b[k, p]) = trim(w[a, item2])))) &
            (trim(w[a, type]) = "COV") then do;
            /* fill-in the lower portion covariances */
            S[p, k] = z[a];
            /* fill-in the upper portion covariances */
            S[k, p] = z[a];
         end;  /* of the if-then-do statement */
      end;  /* of a do loop */
   end;  /* of p do loop */
end;  /* of k do loop */

The Final Calculation:

With the d vector and the S matrix in hand, the final calculation in PROC IML becomes:

value = sqrt(d`*S*d);

PROC IML recognizes the asterisk (*) as matrix multiplication and the back-single-quote (`) as transposition. VALUE is stored in a vector with the other standard errors and output to a SAS data set after all the standard errors have been calculated. Forming the covariance matrix S is done for every level of the MYBY variable for every non-linear function (excluding ratios). After calculating the standard errors in PROC IML, the results are merged back into an ERF. %TAYLOR also looks in the ERF for the associated derived estimate of each non-linear function; if present, %TAYLOR calculates the coefficients of variation (CVs) and merges them into the ERF.

Concluding Remarks:

The Manufacturing Energy Consumption Survey (MECS) and the Plant Capacity Utilization (PCU) Survey were the first two surveys to use %TAYLOR. MECS used %TAYLOR to calculate standard errors and CVs of ratios. The ratio specifications were entered into StEPS by a survey statistician with very little mathematical/statistical experience; as outlined in the Introduction, no derivatives were required. PCU used %TAYLOR to calculate standard errors and CVs of year-to-year change in utilization rates. Their input functions were differences of ratios reflecting year-to-year change.
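The order-independent assembly of S and the final sqrt(d`Sd) computation described above can be pulled together in a short sketch (Python, with invented names and numbers; %TAYLOR does this in PROC IML):

```python
import math

# Sketch of the covariance-matrix assembly: ERF-style records carry
# (ITEM1, ITEM2, TYPE, value) in no particular order, and S must line up
# with the order of the differentiation variables in d. Every name and
# number below is invented for illustration.
order = ["X1", "X2", "X3"]            # order of the wrt-variables in d
d = [0.5, -0.25, 0.1]                 # evaluated derivatives
erf = [
    ("X2", "",   "STDERR", 3.0),
    ("X1", "",   "STDERR", 2.0),
    ("X3", "",   "STDERR", 1.0),
    ("X2", "X1", "COV",    0.4),      # COV(X2, X1) = COV(X1, X2)
    ("X3", "X1", "COV",    0.2),
    ("X2", "X3", "COV",    0.1),
]

n = len(order)
idx = {v: i for i, v in enumerate(order)}
S = [[0.0] * n for _ in range(n)]
for item1, item2, typ, val in erf:
    if typ == "STDERR":
        S[idx[item1]][idx[item1]] = val ** 2   # variance on the diagonal
    elif typ == "COV":                         # symmetric fill, either order
        i, j = idx[item1], idx[item2]
        S[i][j] = S[j][i] = val

# The final calculation, value = sqrt(d` S d):
value = math.sqrt(sum(d[i] * S[i][j] * d[j]
                      for i in range(n) for j in range(n)))
print(round(value, 6))
```

Because the covariance records are matched by name in either (ITEM1, ITEM2) order, no sorting of the input records is required, which mirrors the double condition in the IML code above.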
Each of those functions required 4 derivatives and 6 covariances prior to processing (in addition to point estimates and variances of the survey variables involved). There were a total of 48 levels of MYBY and, consequently, 48 small B1, B2, etc. data sets created in %SPLIT.

References:

Ahmed, Shirin A. and Tasky, D. L. (2000), "An Overview of the Standard Economic Processing System (StEPS)," The Second International Conference on Establishment Surveys.

Destiny Corporation (1998), Macros in SAS Software, Wethersfield, CT: Destiny Corporation.

Sarndal, Carl-Erik, Swensson, Bengt, and Wretman, Jan (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.

Sigman, Richard (2000), "Estimation and Variance Estimation in a Standard Economic Processing System," The Second International Conference on Establishment Surveys.

Tasky, Deborah, Linonis, A., Ankers, S., Hallam, D., Atmayer, L., and Chew, D. (1999), "Get in Step with StEPS: Standard Economic Processing System," Proceedings of the North East SAS Users Group Conference.

Wolter, Kirk M. (1985), Introduction to Variance Estimation, New York: Springer-Verlag.

Contact Information:

Roger L. Goodwin
U.S. Bureau of the Census
4700 Silver Hill Road BLDG
Suitland, MD

Katherine J. Thompson
U.S. Bureau of the Census
4700 Silver Hill Road BLDG
Suitland, MD


More information

Sampling Financial Records Using SurveySelect

Sampling Financial Records Using SurveySelect Paper 3240-2015 Sampling Financial Records Using SurveySelect Roger L. Goodwin, US Government Printing Office ABSTRACT This paper presents an application of the procedure SurveySelect. The objective is

More information

A Side of Hash for You To Dig Into

A Side of Hash for You To Dig Into A Side of Hash for You To Dig Into Shan Ali Rasul, Indigo Books & Music Inc, Toronto, Ontario, Canada. ABSTRACT Within the realm of Customer Relationship Management (CRM) there is always a need for segmenting

More information

Keeping Track of Database Changes During Database Lock

Keeping Track of Database Changes During Database Lock Paper CC10 Keeping Track of Database Changes During Database Lock Sanjiv Ramalingam, Biogen Inc., Cambridge, USA ABSTRACT Higher frequency of data transfers combined with greater likelihood of changes

More information

Get Started Writing SAS Macros Luisa Hartman, Jane Liao, Merck Sharp & Dohme Corp.

Get Started Writing SAS Macros Luisa Hartman, Jane Liao, Merck Sharp & Dohme Corp. Get Started Writing SAS Macros Luisa Hartman, Jane Liao, Merck Sharp & Dohme Corp. ABSTRACT The SAS Macro Facility is a tool which lends flexibility to your SAS code and promotes easier maintenance. It

More information

KEYWORDS Metadata, macro language, CALL EXECUTE, %NRSTR, %TSLIT

KEYWORDS Metadata, macro language, CALL EXECUTE, %NRSTR, %TSLIT MWSUG 2017 - Paper BB15 Building Intelligent Macros: Driving a Variable Parameter System with Metadata Arthur L. Carpenter, California Occidental Consultants, Anchorage, Alaska ABSTRACT When faced with

More information

Tales from the Help Desk 6: Solutions to Common SAS Tasks

Tales from the Help Desk 6: Solutions to Common SAS Tasks SESUG 2015 ABSTRACT Paper BB-72 Tales from the Help Desk 6: Solutions to Common SAS Tasks Bruce Gilsen, Federal Reserve Board, Washington, DC In 30 years as a SAS consultant at the Federal Reserve Board,

More information

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, 00142 Roma, Italy e-mail: pimassol@istat.it 1. Introduction Questions can be usually asked following specific

More information

PharmaSUG Paper TT11

PharmaSUG Paper TT11 PharmaSUG 2014 - Paper TT11 What is the Definition of Global On-Demand Reporting within the Pharmaceutical Industry? Eric Kammer, Novartis Pharmaceuticals Corporation, East Hanover, NJ ABSTRACT It is not

More information

How to Keep Multiple Formats in One Variable after Transpose Mindy Wang

How to Keep Multiple Formats in One Variable after Transpose Mindy Wang How to Keep Multiple Formats in One Variable after Transpose Mindy Wang Abstract In clinical trials and many other research fields, proc transpose are used very often. When many variables with their individual

More information

To conceptualize the process, the table below shows the highly correlated covariates in descending order of their R statistic.

To conceptualize the process, the table below shows the highly correlated covariates in descending order of their R statistic. Automating the process of choosing among highly correlated covariates for multivariable logistic regression Michael C. Doherty, i3drugsafety, Waltham, MA ABSTRACT In observational studies, there can be

More information

%MISSING: A SAS Macro to Report Missing Value Percentages for a Multi-Year Multi-File Information System

%MISSING: A SAS Macro to Report Missing Value Percentages for a Multi-Year Multi-File Information System %MISSING: A SAS Macro to Report Missing Value Percentages for a Multi-Year Multi-File Information System Rushi Patel, Creative Information Technology, Inc., Arlington, VA ABSTRACT It is common to find

More information

The %let is a Macro command, which sets a macro variable to the value specified.

The %let is a Macro command, which sets a macro variable to the value specified. Paper 220-26 Structuring Base SAS for Easy Maintenance Gary E. Schlegelmilch, U.S. Dept. of Commerce, Bureau of the Census, Suitland MD ABSTRACT Computer programs, by their very nature, are built to be

More information

PharmaSUG 2013 CC26 Automating the Labeling of X- Axis Sanjiv Ramalingam, Vertex Pharmaceuticals, Inc., Cambridge, MA

PharmaSUG 2013 CC26 Automating the Labeling of X- Axis Sanjiv Ramalingam, Vertex Pharmaceuticals, Inc., Cambridge, MA PharmaSUG 2013 CC26 Automating the Labeling of X- Axis Sanjiv Ramalingam, Vertex Pharmaceuticals, Inc., Cambridge, MA ABSTRACT Labeling of the X-axis usually involves a tedious axis statement specifying

More information

Identifying Duplicate Variables in a SAS Data Set

Identifying Duplicate Variables in a SAS Data Set Paper 1654-2018 Identifying Duplicate Variables in a SAS Data Set Bruce Gilsen, Federal Reserve Board, Washington, DC ABSTRACT In the big data era, removing duplicate data from a data set can reduce disk

More information

STATION

STATION ------------------------------STATION 1------------------------------ 1. Which of the following statements displays all user-defined macro variables in the SAS log? a) %put user=; b) %put user; c) %put

More information

Basic SQL Processing Prepared by Destiny Corporation

Basic SQL Processing Prepared by Destiny Corporation Basic SQL Processing Prepared by Destiny Corporation SQLStatements PROC SQl consists often statements: from saved.computeg- l.select 2.vAlIDATE 3.DESCRIBE 4.CREATE S.DROP a.update 7.INSERT B.DElETE 9.ALTER

More information

Your Own SAS Macros Are as Powerful as You Are Ingenious

Your Own SAS Macros Are as Powerful as You Are Ingenious Paper CC166 Your Own SAS Macros Are as Powerful as You Are Ingenious Yinghua Shi, Department Of Treasury, Washington, DC ABSTRACT This article proposes, for user-written SAS macros, separate definitions

More information

SD10 A SAS MACRO FOR PERFORMING BACKWARD SELECTION IN PROC SURVEYREG

SD10 A SAS MACRO FOR PERFORMING BACKWARD SELECTION IN PROC SURVEYREG Paper SD10 A SAS MACRO FOR PERFORMING BACKWARD SELECTION IN PROC SURVEYREG Qixuan Chen, University of Michigan, Ann Arbor, MI Brenda Gillespie, University of Michigan, Ann Arbor, MI ABSTRACT This paper

More information

So Much Data, So Little Time: Splitting Datasets For More Efficient Run Times and Meeting FDA Submission Guidelines

So Much Data, So Little Time: Splitting Datasets For More Efficient Run Times and Meeting FDA Submission Guidelines Paper TT13 So Much Data, So Little Time: Splitting Datasets For More Efficient Run Times and Meeting FDA Submission Guidelines Anthony Harris, PPD, Wilmington, NC Robby Diseker, PPD, Wilmington, NC ABSTRACT

More information

Arthur L. Carpenter California Occidental Consultants, Oceanside, California

Arthur L. Carpenter California Occidental Consultants, Oceanside, California Paper 028-30 Storing and Using a List of Values in a Macro Variable Arthur L. Carpenter California Occidental Consultants, Oceanside, California ABSTRACT When using the macro language it is not at all

More information

SAS/STAT 13.1 User s Guide. The NESTED Procedure

SAS/STAT 13.1 User s Guide. The NESTED Procedure SAS/STAT 13.1 User s Guide The NESTED Procedure This document is an individual chapter from SAS/STAT 13.1 User s Guide. The correct bibliographic citation for the complete manual is as follows: SAS Institute

More information

Statistics, Data Analysis & Econometrics

Statistics, Data Analysis & Econometrics ST001 A SAS MACRO FOR THE CONTROLLED ROUNDING OF ONE- AND TWO-DIMENSIONAL TABLES OF REAL NUMBERS Robert D. Sands Bureau of the Census Keywords: Controlled Rounding, Tabular data, Transportation algorithm

More information

Statistical matching: conditional. independence assumption and auxiliary information

Statistical matching: conditional. independence assumption and auxiliary information Statistical matching: conditional Training Course Record Linkage and Statistical Matching Mauro Scanu Istat scanu [at] istat.it independence assumption and auxiliary information Outline The conditional

More information

PROGRAMMING ROLLING REGRESSIONS IN SAS MICHAEL D. BOLDIN, UNIVERSITY OF PENNSYLVANIA, PHILADELPHIA, PA

PROGRAMMING ROLLING REGRESSIONS IN SAS MICHAEL D. BOLDIN, UNIVERSITY OF PENNSYLVANIA, PHILADELPHIA, PA PROGRAMMING ROLLING REGRESSIONS IN SAS MICHAEL D. BOLDIN, UNIVERSITY OF PENNSYLVANIA, PHILADELPHIA, PA ABSTRACT SAS does not have an option for PROC REG (or any of its other equation estimation procedures)

More information

Sorting big datasets. Do we really need it? Daniil Shliakhov, Experis Clinical, Kharkiv, Ukraine

Sorting big datasets. Do we really need it? Daniil Shliakhov, Experis Clinical, Kharkiv, Ukraine PharmaSUG 2015 - Paper QT21 Sorting big datasets. Do we really need it? Daniil Shliakhov, Experis Clinical, Kharkiv, Ukraine ABSTRACT Very often working with big data causes difficulties for SAS programmers.

More information

186 Statistics, Data Analysis and Modeling. Proceedings of MWSUG '95

186 Statistics, Data Analysis and Modeling. Proceedings of MWSUG '95 A Statistical Analysis Macro Library in SAS Carl R. Haske, Ph.D., STATPROBE, nc., Ann Arbor, M Vivienne Ward, M.S., STATPROBE, nc., Ann Arbor, M ABSTRACT Statistical analysis plays a major role in pharmaceutical

More information

Contents of SAS Programming Techniques

Contents of SAS Programming Techniques Contents of SAS Programming Techniques Chapter 1 About SAS 1.1 Introduction 1.1.1 SAS modules 1.1.2 SAS module classification 1.1.3 SAS features 1.1.4 Three levels of SAS techniques 1.1.5 Chapter goal

More information

UNIT-IV: MACRO PROCESSOR

UNIT-IV: MACRO PROCESSOR UNIT-IV: MACRO PROCESSOR A Macro represents a commonly used group of statements in the source programming language. A macro instruction (macro) is a notational convenience for the programmer o It allows

More information

Using SAS/SCL to Create Flexible Programs... A Super-Sized Macro Ellen Michaliszyn, College of American Pathologists, Northfield, IL

Using SAS/SCL to Create Flexible Programs... A Super-Sized Macro Ellen Michaliszyn, College of American Pathologists, Northfield, IL Using SAS/SCL to Create Flexible Programs... A Super-Sized Macro Ellen Michaliszyn, College of American Pathologists, Northfield, IL ABSTRACT SAS is a powerful programming language. When you find yourself

More information

Simulating Multivariate Normal Data

Simulating Multivariate Normal Data Simulating Multivariate Normal Data You have a population correlation matrix and wish to simulate a set of data randomly sampled from a population with that structure. I shall present here code and examples

More information

From Manual to Automatic with Overdrive - Using SAS to Automate Report Generation Faron Kincheloe, Baylor University, Waco, TX

From Manual to Automatic with Overdrive - Using SAS to Automate Report Generation Faron Kincheloe, Baylor University, Waco, TX Paper 152-27 From Manual to Automatic with Overdrive - Using SAS to Automate Report Generation Faron Kincheloe, Baylor University, Waco, TX ABSTRACT This paper is a case study of how SAS products were

More information

Usage of R in Offi cial Statistics Survey Data Analysis at the Statistical Offi ce of the Republic of Slovenia

Usage of R in Offi cial Statistics Survey Data Analysis at the Statistical Offi ce of the Republic of Slovenia Usage of R in Offi cial Statistics Survey Data Analysis at the Statistical Offi ce of the Republic of Slovenia Jerneja PIKELJ (jerneja.pikelj@gov.si) Statistical Offi ce of the Republic of Slovenia ABSTRACT

More information

Are you Still Afraid of Using Arrays? Let s Explore their Advantages

Are you Still Afraid of Using Arrays? Let s Explore their Advantages Paper CT07 Are you Still Afraid of Using Arrays? Let s Explore their Advantages Vladyslav Khudov, Experis Clinical, Kharkiv, Ukraine ABSTRACT At first glance, arrays in SAS seem to be a complicated and

More information

Creating Macro Calls using Proc Freq

Creating Macro Calls using Proc Freq Creating Macro Calls using Proc Freq, Educational Testing Service, Princeton, NJ ABSTRACT Imagine you were asked to get a series of statistics/tables for each country in the world. You have the data, but

More information

The NESTED Procedure (Chapter)

The NESTED Procedure (Chapter) SAS/STAT 9.3 User s Guide The NESTED Procedure (Chapter) SAS Documentation This document is an individual chapter from SAS/STAT 9.3 User s Guide. The correct bibliographic citation for the complete manual

More information

Let s Get FREQy with our Statistics: Data-Driven Approach to Determining Appropriate Test Statistic

Let s Get FREQy with our Statistics: Data-Driven Approach to Determining Appropriate Test Statistic PharmaSUG 2018 - Paper EP-09 Let s Get FREQy with our Statistics: Data-Driven Approach to Determining Appropriate Test Statistic Richann Watson, DataRich Consulting, Batavia, OH Lynn Mullins, PPD, Cincinnati,

More information

PharmaSUG Paper SP04

PharmaSUG Paper SP04 PharmaSUG 2015 - Paper SP04 Means Comparisons and No Hard Coding of Your Coefficient Vector It Really Is Possible! Frank Tedesco, United Biosource Corporation, Blue Bell, Pennsylvania ABSTRACT When doing

More information

In this paper, we will build the macro step-by-step, highlighting each function. A basic familiarity with SAS Macro language is assumed.

In this paper, we will build the macro step-by-step, highlighting each function. A basic familiarity with SAS Macro language is assumed. No More Split Ends: Outputting Multiple CSV Files and Keeping Related Records Together Gayle Springer, JHU Bloomberg School of Public Health, Baltimore, MD ABSTRACT The EXPORT Procedure allows us to output

More information

Matching Rules: Too Loose, Too Tight, or Just Right?

Matching Rules: Too Loose, Too Tight, or Just Right? Paper 1674-2014 Matching Rules: Too Loose, Too Tight, or Just Right? Richard Cadieux, Towers Watson, Arlington, VA & Daniel R. Bretheim, Towers Watson, Arlington, VA ABSTRACT This paper describes a technique

More information

Paper Appendix 4 contains an example of a summary table printed from the dataset, sumary.

Paper Appendix 4 contains an example of a summary table printed from the dataset, sumary. Paper 93-28 A Macro Using SAS ODS to Summarize Client Information from Multiple Procedures Stuart Long, Westat, Durham, NC Rebecca Darden, Westat, Durham, NC Abstract If the client requests the programmer

More information

INTRODUCTION TO SAS HOW SAS WORKS READING RAW DATA INTO SAS

INTRODUCTION TO SAS HOW SAS WORKS READING RAW DATA INTO SAS TO SAS NEED FOR SAS WHO USES SAS WHAT IS SAS? OVERVIEW OF BASE SAS SOFTWARE DATA MANAGEMENT FACILITY STRUCTURE OF SAS DATASET SAS PROGRAM PROGRAMMING LANGUAGE ELEMENTS OF THE SAS LANGUAGE RULES FOR SAS

More information

Create a Format from a SAS Data Set Ruth Marisol Rivera, i3 Statprobe, Mexico City, Mexico

Create a Format from a SAS Data Set Ruth Marisol Rivera, i3 Statprobe, Mexico City, Mexico PharmaSUG 2011 - Paper TT02 Create a Format from a SAS Data Set Ruth Marisol Rivera, i3 Statprobe, Mexico City, Mexico ABSTRACT Many times we have to apply formats and it could be hard to create them specially

More information

Program Validation: Logging the Log

Program Validation: Logging the Log Program Validation: Logging the Log Adel Fahmy, Symbiance Inc., Princeton, NJ ABSTRACT Program Validation includes checking both program Log and Logic. The program Log should be clear of any system Error/Warning

More information

%Addval: A SAS Macro Which Completes the Cartesian Product of Dataset Observations for All Values of a Selected Set of Variables

%Addval: A SAS Macro Which Completes the Cartesian Product of Dataset Observations for All Values of a Selected Set of Variables %Addval: A SAS Macro Which Completes the Cartesian Product of Dataset Observations for All Values of a Selected Set of Variables Rich Schiefelbein, PRA International, Lenexa, KS ABSTRACT It is often useful

More information

A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY

A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY PharmaSUG 2014 - Paper BB14 A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY ABSTRACT Clinical Study

More information

Checking for Duplicates Wendi L. Wright

Checking for Duplicates Wendi L. Wright Checking for Duplicates Wendi L. Wright ABSTRACT This introductory level paper demonstrates a quick way to find duplicates in a dataset (with both simple and complex keys). It discusses what to do when

More information

Top Coding Tips. Neil Merchant Technical Specialist - SAS

Top Coding Tips. Neil Merchant Technical Specialist - SAS Top Coding Tips Neil Merchant Technical Specialist - SAS Bio Work in the ANSWERS team at SAS o Analytics as a Service and Visual Analytics Try before you buy SAS user for 12 years obase SAS and O/S integration

More information

Using SAS software to shrink the data in your applications

Using SAS software to shrink the data in your applications Paper 991-2016 Using SAS software to shrink the data in your applications Ahmed Al-Attar, AnA Data Warehousing Consulting LLC, McLean, VA ABSTRACT This paper discusses the techniques I used at the Census

More information

The Proc Transpose Cookbook

The Proc Transpose Cookbook ABSTRACT PharmaSUG 2017 - Paper TT13 The Proc Transpose Cookbook Douglas Zirbel, Wells Fargo and Co. Proc TRANSPOSE rearranges columns and rows of SAS datasets, but its documentation and behavior can be

More information

Paper An Automated Reporting Macro to Create Cell Index An Enhanced Revisit. Shi-Tao Yeh, GlaxoSmithKline, King of Prussia, PA

Paper An Automated Reporting Macro to Create Cell Index An Enhanced Revisit. Shi-Tao Yeh, GlaxoSmithKline, King of Prussia, PA ABSTRACT Paper 236-28 An Automated Reporting Macro to Create Cell Index An Enhanced Revisit When generating tables from SAS PROC TABULATE or PROC REPORT to summarize data, sometimes it is necessary to

More information

Automatic Indicators for Dummies: A macro for generating dummy indicators from category type variables

Automatic Indicators for Dummies: A macro for generating dummy indicators from category type variables MWSUG 2018 - Paper AA-29 Automatic Indicators for Dummies: A macro for generating dummy indicators from category type variables Matthew Bates, Affusion Consulting, Columbus, OH ABSTRACT Dummy Indicators

More information

Appendix B BASIC MATRIX OPERATIONS IN PROC IML B.1 ASSIGNING SCALARS

Appendix B BASIC MATRIX OPERATIONS IN PROC IML B.1 ASSIGNING SCALARS Appendix B BASIC MATRIX OPERATIONS IN PROC IML B.1 ASSIGNING SCALARS Scalars can be viewed as 1 1 matrices and can be created using Proc IML by using the statement x¼scalar_value or x¼{scalar_value}. As

More information

SAS seminar. The little SAS book Chapters 3 & 4. April 15, Åsa Klint. By LD Delwiche and SJ Slaughter. 3.1 Creating and Redefining variables

SAS seminar. The little SAS book Chapters 3 & 4. April 15, Åsa Klint. By LD Delwiche and SJ Slaughter. 3.1 Creating and Redefining variables SAS seminar April 15, 2003 Åsa Klint The little SAS book Chapters 3 & 4 By LD Delwiche and SJ Slaughter Data step - read and modify data - create a new dataset - performs actions on rows Proc step - use

More information

Know What You Are Missing: How to Catalogue and Manage Missing Pieces of Historical Data

Know What You Are Missing: How to Catalogue and Manage Missing Pieces of Historical Data Know What You Are Missing: How to Catalogue and Manage Missing Pieces of Historical Data Shankar Yaddanapudi, SAS Consultant, Washington DC ABSTRACT In certain applications it is necessary to maintain

More information

Out of Control! A SAS Macro to Recalculate QC Statistics

Out of Control! A SAS Macro to Recalculate QC Statistics Paper 3296-2015 Out of Control! A SAS Macro to Recalculate QC Statistics Jesse Pratt, Colleen Mangeot, Kelly Olano, Cincinnati Children s Hospital Medical Center, Cincinnati, OH, USA ABSTRACT SAS/QC provides

More information

Smoking and Missingness: Computer Syntax 1

Smoking and Missingness: Computer Syntax 1 Smoking and Missingness: Computer Syntax 1 Computer Syntax SAS code is provided for the logistic regression imputation described in this article. This code is listed in parts, with description provided

More information

Geocoding Crashes in Limbo Carol Martell and Daniel Levitt Highway Safety Research Center, Chapel Hill, NC

Geocoding Crashes in Limbo Carol Martell and Daniel Levitt Highway Safety Research Center, Chapel Hill, NC Paper RIV-09 Geocoding Crashes in Limbo Carol Martell and Daniel Levitt Highway Safety Research Center, Chapel Hill, NC ABSTRACT In North Carolina, crash locations are documented only with the road names

More information

Lab 9. Julia Janicki. Introduction

Lab 9. Julia Janicki. Introduction Lab 9 Julia Janicki Introduction My goal for this project is to map a general land cover in the area of Alexandria in Egypt using supervised classification, specifically the Maximum Likelihood and Support

More information

A Quick and Gentle Introduction to PROC SQL

A Quick and Gentle Introduction to PROC SQL ABSTRACT Paper B2B 9 A Quick and Gentle Introduction to PROC SQL Shane Rosanbalm, Rho, Inc. Sam Gillett, Rho, Inc. If you are afraid of SQL, it is most likely because you haven t been properly introduced.

More information

Mapping Clinical Data to a Standard Structure: A Table Driven Approach

Mapping Clinical Data to a Standard Structure: A Table Driven Approach ABSTRACT Paper AD15 Mapping Clinical Data to a Standard Structure: A Table Driven Approach Nancy Brucken, i3 Statprobe, Ann Arbor, MI Paul Slagle, i3 Statprobe, Ann Arbor, MI Clinical Research Organizations

More information

%MAKE_IT_COUNT: An Example Macro for Dynamic Table Programming Britney Gilbert, Juniper Tree Consulting, Porter, Oklahoma

%MAKE_IT_COUNT: An Example Macro for Dynamic Table Programming Britney Gilbert, Juniper Tree Consulting, Porter, Oklahoma Britney Gilbert, Juniper Tree Consulting, Porter, Oklahoma ABSTRACT Today there is more pressure on programmers to deliver summary outputs faster without sacrificing quality. By using just a few programming

More information

Symbol Table Generator (New and Improved) Jim Johnson, JKL Consulting, North Wales, PA

Symbol Table Generator (New and Improved) Jim Johnson, JKL Consulting, North Wales, PA PharmaSUG2011 - Paper AD19 Symbol Table Generator (New and Improved) Jim Johnson, JKL Consulting, North Wales, PA ABSTRACT In Seattle at the PharmaSUG 2000 meeting the Symbol Table Generator was first

More information

Validation Summary using SYSINFO

Validation Summary using SYSINFO Validation Summary using SYSINFO Srinivas Vanam Mahipal Vanam Shravani Vanam Percept Pharma Services, Bridgewater, NJ ABSTRACT This paper presents a macro that produces a Validation Summary using SYSINFO

More information

David S. Septoff Fidia Pharmaceutical Corporation

David S. Septoff Fidia Pharmaceutical Corporation UNLIMITING A LIMITED MACRO ENVIRONMENT David S. Septoff Fidia Pharmaceutical Corporation ABSTRACT The full Macro facility provides SAS users with an extremely powerful programming tool. It allows for conditional

More information

Foundations and Fundamentals. SAS System Options: The True Heroes of Macro Debugging Kevin Russell and Russ Tyndall, SAS Institute Inc.

Foundations and Fundamentals. SAS System Options: The True Heroes of Macro Debugging Kevin Russell and Russ Tyndall, SAS Institute Inc. SAS System Options: The True Heroes of Macro Debugging Kevin Russell and Russ Tyndall, SAS Institute Inc., Cary, NC ABSTRACT It is not uncommon for the first draft of any macro application to contain errors.

More information

Paper DB2 table. For a simple read of a table, SQL and DATA step operate with similar efficiency.

Paper DB2 table. For a simple read of a table, SQL and DATA step operate with similar efficiency. Paper 76-28 Comparative Efficiency of SQL and Base Code When Reading from Database Tables and Existing Data Sets Steven Feder, Federal Reserve Board, Washington, D.C. ABSTRACT In this paper we compare

More information

Coders' Corner. Paper Scrolling & Downloading Web Results. Ming C. Lee, Trilogy Consulting, Denver, CO. Abstract.

Coders' Corner. Paper Scrolling & Downloading Web Results. Ming C. Lee, Trilogy Consulting, Denver, CO. Abstract. Paper 71-25 Scrolling & Downloading Web Results Ming C. Lee, Trilogy Consulting, Denver, CO Abstract Since the inception of the INTERNET and Web Browsers, the need for speedy information to make split

More information

Statistics and Data Analysis. Common Pitfalls in SAS Statistical Analysis Macros in a Mass Production Environment

Statistics and Data Analysis. Common Pitfalls in SAS Statistical Analysis Macros in a Mass Production Environment Common Pitfalls in SAS Statistical Analysis Macros in a Mass Production Environment Huei-Ling Chen, Merck & Co., Inc., Rahway, NJ Aiming Yang, Merck & Co., Inc., Rahway, NJ ABSTRACT Four pitfalls are commonly

More information

rpms: An R Package for Modeling Survey Data with Regression Trees

rpms: An R Package for Modeling Survey Data with Regression Trees rpms: An R Package for Modeling Survey Data with Regression Trees Daniell Toth U.S. Bureau of Labor Statistics Abstract In this article, we introduce the R package, rpms (Recursive Partitioning for Modeling

More information

A Useful Macro for Converting SAS Data sets into SAS Transport Files in Electronic Submissions

A Useful Macro for Converting SAS Data sets into SAS Transport Files in Electronic Submissions Paper FC07 A Useful Macro for Converting SAS Data sets into SAS Transport Files in Electronic Submissions Xingshu Zhu and Shuping Zhang Merck Research Laboratories, Merck & Co., Inc., Blue Bell, PA 19422

More information

Because We Can: Using SAS System Tools to Help Our Less Fortunate Brethren John Cohen, Advanced Data Concepts, LLC, Newark, DE

Because We Can: Using SAS System Tools to Help Our Less Fortunate Brethren John Cohen, Advanced Data Concepts, LLC, Newark, DE SESUG 2015 CC145 Because We Can: Using SAS System Tools to Help Our Less Fortunate Brethren John Cohen, Advanced Data Concepts, LLC, Newark, DE ABSTRACT We may be called upon to provide data to developers

More information

Recognition of Tokens

Recognition of Tokens Recognition of Tokens Lecture 3 Section 3.4 Robb T. Koether Hampden-Sydney College Mon, Jan 19, 2015 Robb T. Koether (Hampden-Sydney College) Recognition of Tokens Mon, Jan 19, 2015 1 / 21 1 A Class of

More information

Two useful macros to nudge SAS to serve you

Two useful macros to nudge SAS to serve you Two useful macros to nudge SAS to serve you David Izrael, Michael P. Battaglia, Abt Associates Inc., Cambridge, MA Abstract This paper offers two macros that augment the power of two SAS procedures: LOGISTIC

More information

Statistical Analysis of MRI Data

Statistical Analysis of MRI Data Statistical Analysis of MRI Data Shelby Cummings August 1, 2012 Abstract Every day, numerous people around the country go under medical testing with the use of MRI technology. Developed in the late twentieth

More information

PharmaSUG Paper AD06

PharmaSUG Paper AD06 PharmaSUG 2012 - Paper AD06 A SAS Tool to Allocate and Randomize Samples to Illumina Microarray Chips Huanying Qin, Baylor Institute of Immunology Research, Dallas, TX Greg Stanek, STEEEP Analytics, Baylor

More information

- 1 - Fig. A5.1 Missing value analysis dialog box

- 1 - Fig. A5.1 Missing value analysis dialog box WEB APPENDIX Sarstedt, M. & Mooi, E. (2019). A concise guide to market research. The process, data, and methods using SPSS (3 rd ed.). Heidelberg: Springer. Missing Value Analysis and Multiple Imputation

More information

A Macro Application on Confidence Intervals for Binominal Proportion

A Macro Application on Confidence Intervals for Binominal Proportion A SAS @ Macro Application on Confidence Intervals for Binominal Proportion Kaijun Zhang Sheng Zhang ABSTRACT: FMD K&L Inc., Fort Washington, Pennsylvanian Confidence Intervals (CI) are very important to

More information

A Simple Time Series Macro Scott Hanson, SVP Risk Management, Bank of America, Calabasas, CA

A Simple Time Series Macro Scott Hanson, SVP Risk Management, Bank of America, Calabasas, CA A Simple Time Series Macro Scott Hanson, SVP Risk Management, Bank of America, Calabasas, CA ABSTRACT One desirable aim within the financial industry is to understand customer behavior over time. Despite

More information

A Combined Encryption Compression Scheme Using Chaotic Maps

A Combined Encryption Compression Scheme Using Chaotic Maps BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 2 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0016 A Combined Encryption Compression

More information
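With f, d, and S defined as above, the Taylor linearization variance estimate is the quadratic form v(f) = d′Sd — the evaluated derivative vector pre- and post-multiplying the variance-covariance matrix. The following is a minimal sketch of that matrix computation, written in Python rather than the PROC IML code the paper describes, purely for illustration; the function name, the ratio-estimator example, and all numbers are invented for the example.

```python
def taylor_variance(d, S):
    """Linearized variance estimate v(f) = d' S d, where d is the vector of
    partial derivatives of f evaluated at the point estimates and S is the
    variance-covariance matrix of the function variables."""
    n = len(d)
    return sum(d[i] * S[i][j] * d[j] for i in range(n) for j in range(n))

# Illustrative numbers for a ratio estimator R = x / y, a common
# non-linear function of two random variables.  The derivative vector is
# (dR/dx, dR/dy) = (1/y, -x/y**2), evaluated at the point estimates.
x, y = 10.0, 5.0
d = [1.0 / y, -x / y ** 2]        # = [0.2, -0.4]
S = [[4.0, 1.0],                  # hypothetical variance-covariance matrix
     [1.0, 9.0]]
v = taylor_variance(d, S)
```

In PROC IML the same estimate is a one-line matrix expression (`v = d` * S * d;` with `d` a column vector), which is the simple matrix multiplication the abstract refers to.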