PROC JACKREG - A SAS PROCEDURE FOR JACKKNIFE REGRESSION
Panayiotis Hambi, West Virginia University


Daniel M. Chilko, West Virginia University
Gerry Hobbs, West Virginia University

ABSTRACT

PROC JACKREG allows the user to calculate robust estimates of regression coefficients in simple or multiple regression problems by use of the Jackknife method. See Mosteller & Tukey for a good description of the Jackknife. PROC JACKREG may also be used as a validation tool: the least squares regression equation is validated by comparing each observation to the equation generated when that observation is left out.

Output includes the individual regression coefficients at each step of the leave-one-out process with their corresponding R-square values. A further table displays the final Jackknife coefficients and the ordinary least squares coefficients, with their estimated standard deviations, t-statistics and P-values.

Output data sets may be created optionally upon request. One of these data sets may contain the individual regression coefficients at each step of the leave-one-out process (OUTEST=). The other data set may contain the predicted values and residuals for the ordinary least squares model, for the Jackknife estimates, and for the leave-one-out process.

INTRODUCTION

Suppose that a response variable Y can be predicted by a linear combination of some regressor variables, say X1 and X2. Then you can estimate the parameters in the equation:

   Y = B0 + B1 x X1 + B2 x X2 + error

To fit this model with JACKREG specify:

   PROC JACKREG;
   MODEL Y = X1 X2;

Procedure JACKREG uses the principle of least squares to produce estimates that are the "best" linear unbiased estimates under the classical statistical assumptions. JACKREG also calculates the Jackknife estimates for the model.

The Jackknife estimates are defined as follows. Let b be an estimate for a parameter B based on a sample of size N. Now delete one of the observations, say the j-th, recalculate the least squares estimate, and denote the new estimate b(-j). Delete, in turn, each of the observations so that there are N estimates, each based on N-1 observations. Define the pseudo-values as N times the ordinary least squares estimate minus N-1 times b(-j), where N is the sample size:

   pseudo-value(j) = N x b - (N-1) x b(-j)

The Jackknife estimates are the average of the pseudo-values.

SPECIFICATIONS

The following statements are used with the JACKREG procedure:

   PROC JACKREG options;
   MODEL dependent = regressors / options;
   OUTPUT OUT=SASdataset keyword=name;
   BY variables;

The procedure must always be accompanied by a MODEL statement to specify the regression model. An OUTPUT statement may be specified anywhere after the PROC statement to request an output data set. The purpose of each statement is:

The MODEL statement specifies the dependent and independent variables in the regression model and the options to be used.

The OUTPUT statement requests an output data set and names the variables to contain predicted values, residuals and other output values.

The BY statement specifies variables to define subgroups for the analysis.

PROC JACKREG Statement

   PROC JACKREG options;

The following options may be specified on the PROC JACKREG statement:

DATA=SASdataset
   Specifies the input data set the procedure will operate on. If this option is omitted, the procedure uses the most recently created SAS data set.

OUTEST=SASdataset
   Requests that the leave-one-out regression coefficient estimates be output to this data set. If you want to create a permanent SAS data set, you must specify a two-level name (see Chapter 12, 'SAS Data Sets', in SAS User's Guide: Basics, 1982, for more information on permanent SAS data sets).

NOPRINT
   Suppresses the normal printed output. Using this option, you can create SAS data sets to be used for further analysis in subsequent procedures without generating any printed output. Using this option is equivalent to specifying NOPRINT on the MODEL statement.

MODEL Statement

   MODEL dependent = regressors / options;

After the keyword MODEL, the dependent (response) variable is specified, followed by an equal sign and the regressor (independent) variables. Variables specified in the MODEL statement must be numeric variables in the data set to be analyzed.

The following options may appear in the MODEL statement. The slash separates the independent variables from the options. If no options are specified the slash is not needed.

GROUP=n
   Specifies the size of the group of observations to be left out when calculating the pseudo-values. n may be any integer between 1 and the number of observations in the input data set. If this option is omitted it defaults to one.

INVERSE
   Requests that the inverse of the crossproducts matrix, INV(X'X), be printed. If this option is omitted the inverse is not printed.

NOPRINT
   Suppresses the normal output of the procedure. Using this option is the same as specifying the NOPRINT option on the PROC statement.

USSCP
   Prints the uncorrected sums of squares and crossproducts matrix X'X.

NOINT
   Requests that the intercept parameter not be included in the model. By default the intercept is included.

OUTPUT Statement

   OUTPUT OUT=SASdataset
          PREDICTED=variable_name
          RESIDUALS=variable_name
          JPREDICTED=variable_name
          JRESIDUALS=variable_name
          VPREDICTED=variable_name
          VRESIDUALS=variable_name;

The variable names used must conform to the SAS naming conventions. The OUTPUT statement specifies that JACKREG create a new SAS data set. Predicted and residual values, as well as all the variables in the original data set, are included in the new data set. If you want to create a permanent SAS data set, you must specify a two-level name (see Chapter 12, 'SAS Data Sets', in SAS User's Guide: Basics, 1982, for more information on permanent SAS data sets). If only the keyword OUTPUT is specified without any options, a copy of the input data set is created.

The options below may be used in the OUTPUT statement.

OUT=
   Gives the name of the new data set. If OUT= is omitted, the new data set is named using SAS's DATAn convention.

PREDICTED=
   Gives the name of the variable whose values are the predicted values of the dependent variable, using the ordinary least squares technique.

RESIDUALS=
   Gives the name of the variable whose values are the residuals, calculated as ACTUAL minus PREDICTED, using the ordinary least squares technique.

JPREDICTED=
   Gives the name of the variable whose values are the predicted values of the dependent variable, using the Jackknife estimates.

JRESIDUALS=
   Gives the name of the variable whose values are the residuals, calculated as ACTUAL minus JPREDICTED, using the Jackknife estimates.

VPREDICTED=
   Gives the name of the variable whose values are the predicted values of the dependent variable, using the leave-one-out estimates: the equation fitted without an observation is evaluated at the independent variable values of the observation that was left out. This may be used for validating the model.

VRESIDUALS=
   Gives the name of the variable whose values are the residuals, calculated as ACTUAL minus VPREDICTED. This may be used for validating the model.

The above options may be specified in any order and any combination. Any unique abbreviation of the keywords is valid.
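The pseudo-value definition from the INTRODUCTION and the three families of output variables (PREDICTED, JPREDICTED, VPREDICTED and their residuals) can be illustrated outside SAS. The following is a minimal sketch in Python, assuming a delete-one jackknife (GROUP=1) and a model with an intercept column supplied explicitly. The function names `ols`, `predict`, `jackknife` and `output_rows` and the small data set are made up for illustration; this is not the source of PROC JACKREG, and the paper does not say how the procedure solves the least squares problem internally.

```python
from statistics import mean


def ols(X, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[k] for r in X) for k in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * t for a, t in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(row, coef):
    """Fitted value for one observation (row includes the intercept column)."""
    return sum(x * c for x, c in zip(row, coef))


def jackknife(X, y):
    """Return OLS, leave-one-out, and jackknife coefficients."""
    n = len(y)
    b = ols(X, y)
    # b(-j): least squares estimates with observation j deleted.
    loo = [ols(X[:j] + X[j + 1:], y[:j] + y[j + 1:]) for j in range(n)]
    # Pseudo-values: N * b - (N - 1) * b(-j), one vector per deleted observation.
    pseudo = [[n * bi - (n - 1) * bj for bi, bj in zip(b, bl)] for bl in loo]
    # Jackknife estimate: average of the pseudo-values, coefficient by coefficient.
    jack = [mean(col) for col in zip(*pseudo)]
    return b, loo, jack


def output_rows(X, y):
    """One dict per observation, keyed by the OUTPUT statement keywords."""
    b, loo, jack = jackknife(X, y)
    rows = []
    for j, (x, actual) in enumerate(zip(X, y)):
        p, jp, vp = predict(x, b), predict(x, jack), predict(x, loo[j])
        rows.append({
            "PREDICTED": p, "RESIDUALS": actual - p,
            "JPREDICTED": jp, "JRESIDUALS": actual - jp,
            "VPREDICTED": vp, "VRESIDUALS": actual - vp,
        })
    return rows


# Made-up illustration data: intercept column plus one regressor.
X = [[1.0, float(x)] for x in range(1, 7)]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 9.0]
for row in output_rows(X, y):
    print({k: round(v, 3) for k, v in row.items()})
```

In this sketch the validation residual of each observation is never smaller in absolute value than its ordinary residual, because the observation plays no part in the fit used to predict it.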

BY Statement

   BY variables;

A BY statement may be used with PROC JACKREG to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use the SORT procedure with a similar BY statement to sort the data, or, if appropriate, use the BY statement options NOTSORTED or DESCENDING. For more information, see the discussion of the BY statement in Chapter 8, 'Statements Used in the PROC Step', in SAS User's Guide: Basics.

DETAILS

Missing Values

If an observation has a missing value for any of the variables in the analysis, that observation is omitted from the analysis.

Printed Output

The JACKREG procedure produces the following printed output by default:

A1. ESTIMATES, the corresponding estimated regression coefficients using the ordinary least squares technique.

A2. The names of the independent variables included in the model.

A3. STD_ERROR, the standard error of the estimated parameters.

A4. T_RATIO, which is the ratio of the estimated regression coefficients to the standard error of the estimates.

A5. The source of variation OLS SSE, which is the residual variation that is not accounted for by the model.

A6. R-SQUARE, which measures how much variation in the dependent variable can be accounted for by the model using the ordinary least squares technique. R-SQUARE, which can range from 0 to 1, is the ratio of the sum of squares for the model divided by the total sum of squares. In general, the larger the value of R-SQUARE, the better the model's fit.

A7. The value given in the table for PROB>|T| answers the question 'If the parameter is really equal to zero, what is the probability of getting a larger absolute value of T?' Thus, a very small value for this probability indicates that the parameter is not likely to equal zero, and therefore that the independent variable contributes significantly to the model.

B1. ESTIMATES, the corresponding Jackknife regression coefficients, calculated as the average of the pseudo-values. Each pseudo-value is the difference between the weighted ordinary least squares coefficients and the weighted coefficients of the regression equation fitted after a group of observations is removed from the analysis.

B2. The names of the independent variables included in the equation.

B3. STD_ERROR, the standard error of the Jackknife regression coefficients.

B4. T_RATIO, which is the ratio of the Jackknife regression coefficients to the standard error of the estimates.

B5. JACK SSE, the sum of the squares of the residuals when the Jackknife equation is fitted to the data set.

B6. R-SQUARE, which measures how much variation in the dependent variable can be accounted for by the model using the Jackknife technique. R-SQUARE, which can take any value, is the ratio of the sum of squares for the Jackknife model divided by the corrected total sum of squares.

B7. The value given in the table for PROB>|T| answers the question 'If the parameter is really equal to zero, what is the probability of getting a larger absolute value of T?' Thus, a very small value for this probability indicates that the parameter is not likely to equal zero, and therefore that the independent variable contributes significantly to the model.

8. The estimated regression coefficients after leaving out a group of observations.

9. R-SQUARE, calculated as the ratio of the sum of squares for the model divided by the total sum of squares, after a group of observations is removed.

Example

The following example has been used many times. See Cook and Weisberg (1982) for a list of the raw data. The observations were collected in 1975 and have to do with weather modification, specifically the use of silver iodide to increase rainfall. Twenty-four days in the summer of 1975 were judged suitable for cloud seeding. On twelve of those days the clouds were seeded and on twelve they were not. The response, Y, is the measured amount of rain that fell in the targeted area in a 6 hour period on a given day. The independent variables are:

   C = % cloud cover in the experimental area
   P = total rainfall 1 hour prior to seeding
   E = echo motion (a binary variable having to do with the radar patterns)
   A = action (seeding = 1, no seeding = 0)

Further, a suitability measure, S-Ne, is used as an independent variable. Suitable days for seeding are defined as those days on which the suitability measure is greater than or equal to 1.5.

In the sample output the standard information concerning OLS estimates is given along with similar information for the jackknife estimates. It may readily be seen that the estimates of the regression coefficients are quite a bit different when we use the jackknife method than they are using OLS. Also, the R-SQUARE value associated with the jackknife model is much smaller than the corresponding OLS measure (in fact it is negative). What we learn from this information is that the data set and the fitted model need to be scrutinized further.

This suggestion is given further credence in the second part of the output. There it can be seen that some coefficients change rather markedly when certain observations are left out. For example, several of the coefficients change in a significant way when observation 2 is left out. It should also be noted that the leave-one-out R-SQUARE value for observation 2 is much larger than the overall R-SQUARE value. This means that the model fits much better if observation 2 is left out and that perhaps we should pay some special attention to that observation. Note also that if either observation 1 or observation 24 is left out, the value of R-SQUARE is decreased substantially. This suggests that those observations are perhaps remote in independent variable space. As such they may be 'high leverage' points. That is, they are situated in such a way as to have a high potential for influencing one or more parameter estimates.

In the third part of the output predicted values and residuals are given. By comparing the columns headed 'R_RAIN' (regular residuals) and 'J_RESID' (jackknife residuals) we can see which observations are better fitted by OLS than by the jackknife method and vice versa. In particular we note that the jackknife method 'misses' the second data point by quite a long way (the residual is 9.57). It is sometimes claimed that jackknife methods fit the bulk of the data well but pay scant attention to outliers, while OLS methods do the opposite.

The 'V_RESID' and 'V_PRED' columns are particularly useful as validation tools. The predicted value corresponds to what we get when each observation is left out of the estimation process and the values of its independent variables are then plugged in to the resulting equation. A couple of things are immediately clear. First, observation 2 is most unusual. Second, these residuals are, on average, larger than the ordinary residuals. We should expect data points used in the estimation process to fit the derived models better than data not used in that process. Since the validation residuals are formed by taking the difference between a predicted value and an observation which is independent of the predictor set, we have a more honest assessment of how well we might expect 'future' observations to fit the given model.

Other authors have shown the need for transformations on this data set. We by no means suggest that either of the models presented here is particularly useful. We simply use them to exhibit the output from PROC JACKREG. In our experience jackknife regressions provide useful diagnostic information (particularly data criticism) which can be incorporated into the overall model building process. No regression scheme should be considered as a magical black box which produces 'correct' results.

Jackknifing is a calculation intensive process and for intermediate to large data sets may be impractical even with good hardware. PROC JACKREG contains an option whereby groups of observations can be left out at each step. The procedure leaves out blocks of consecutive observations, so the user can arrange the data in such a way as to determine the groups which are left out as part of the jackknife strategy. In most instances the user would rather have the groups determined at random. One scheme for accomplishing this task involves using a random number generator and PROC SORT. Briefly, what needs to be done is to associate a random number with each observation and then sort the data set by the values of the random number. For a data set named OLD we can define a new variable called NEWVAR and proceed as follows.

   DATA NEW;
      SET OLD;
      NEWVAR = UNIFORM(0);   /* random number attached to each observation */
   PROC SORT;
      BY NEWVAR;             /* sorting by it shuffles the observations */

The above statements put the data set into an essentially 'random order' so that the GROUP option utilizes randomly selected groups.

[Sample output; numeric values not reproduced here.]

ORDINARY LEAST SQUARES
DEP VARIABLE: Y
   Columns: VARIABLE, ESTIMATES, STD_ERROR, T_RATIO, PROB>|T|
   Rows: XO, A, S, C, P, E
   OLS SSE, R-SQUARE

JACKKNIFE REGRESSION COEFFICIENTS
DEP VARIABLE: Y
   Columns: VARIABLE, ESTIMATES, STD_ERROR, T_RATIO, PROB>|T|
   Rows: XO, A, S, C, P, E
   JACK SSE, R-SQUARE

LEAVE-ONE-OUT REGRESSION COEFFICIENTS
   Columns: LEAVE_OUT, XO, A, S, C, P, E, R-SQUARE
   Rows: one per GROUP left out (24 groups)

RESIDUALS AND PREDICTED VALUES
   Columns: P_RAIN, R_RAIN, J_PRED, J_RESID, V_PRED, V_RESID
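The random-ordering scheme described for the GROUP option (attach a random number to each observation, sort by it, then leave out consecutive blocks) can also be sketched outside SAS. The Python below is an illustration under stated assumptions: the names `random_order` and `consecutive_groups` and the `seed` argument are hypothetical, and letting the final block be shorter when the group size does not divide the number of observations is a choice made here, since the paper does not say how PROC JACKREG treats that case.

```python
import random


def random_order(rows, seed=None):
    """Shuffle observations by sorting on an attached uniform random number."""
    rng = random.Random(seed)
    # Counterpart of NEWVAR = UNIFORM(seed): one random key per observation.
    keyed = [(rng.random(), row) for row in rows]
    # Counterpart of PROC SORT; BY NEWVAR;
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed]


def consecutive_groups(rows, size):
    """Split the ordered observations into consecutive blocks of the given size."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


# 24 observation numbers, as in the cloud seeding example, in groups of 4.
observations = list(range(1, 25))
shuffled = random_order(observations, seed=1975)
for k, group in enumerate(consecutive_groups(shuffled, 4), start=1):
    print("group", k, "leaves out observations", sorted(group))
```

Each block printed here corresponds to one set of observations deleted together at one step of the grouped jackknife.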
