Session 8. Statistical Analysis Using GAUSS Applications


Contents

1. Descriptive Statistics
   Example: Frequencies
   Example: Histogram
2. Linear Regression
   Linear regression Options
   Practical Session 8a
3. Quantal Response
   Multinomial logit 1
   Multinomial logit 2
   Options for Ordered (see src directory)
   Example for ordered logit
   Example for ordered probit
   Example for probit
   Example for Poisson
   Example for Qtest
   Practical Session 8b

1. Descriptive Statistics

The Descriptive Statistics module is a set of procedures that generate basic sample statistics for the variables in a given GAUSS data set. These statistics describe the numerical characteristics of the random variables and provide information for further statistical analysis.

In GAUSS you can write functions or procedures that your programs can access as if they were GAUSS functions. To keep control over which procedure names are available, GAUSS uses the library keyword. The gauss and user libraries are always active, while all other libraries must be activated with the library command, as in:

    library dstat;

This statement opens the dstat library, which contains functions for the description of data. Note that library statements are not cumulative: a subsequent library statement deactivates previously activated libraries (except for gauss and user).

Example: Frequencies

    library dstat;
    dstatset;
    title = "fr1.e: without weights";
    dataset = "d:/gauss/examples/freq";
    output file = fr1.out reset;
    { cats,ncats,freqs } = freq(dataset,1 2 3);

Example: Histogram

    library dstat,pgraph;
    dstatset;
    graphset;
    title = "fr4.e: Using GETFREQ and HISTF";
    dataset = "d:/gauss/examples/freq";
    output file = fr4.out reset;
    miss = 1;
    { cats,ncats,freqs } = freq(dataset,1 2 3);
    { f,c } = getfreq(1,cats,ncats,freqs);
    /* If miss were 0 and missing was a category with cases, you must run
           c = packr(c~f);
           f = c[.,2];
           c = c[.,1];
       before running histf(f,c); */
    histf(f,c);

2. Linear Regression

    library lr;
    dataset = "d:/gauss/examples/scigau";
    datalist ^dataset;
    dep = { pub3 };
    indep = { const,pub1,cit1,cit3 };
    call lreg(dataset,dep,indep,0);
    indep = { pub1,cit1,cit3 };
    end;

Linear regression Options

LREG

Purpose: To compute ordinary least squares coefficients.

Format: Q = LREG(dataset,depvar,indvars,Restrict)

Input:
    dataset -- string, name of GAUSS data set.
    depvar -- character, name of the dependent variable.
        Example: depvar = { consume };
    indvars -- character vector of all independent variable names. If a constant term is needed, specify "CONST" in the indvars list.
        Example: rhs = { const,p,plag,income };
    Restrict -- string, constraint information on the parameters, used to perform restricted estimation. The syntax of Restrict is as follows:
        Restrict="rest1, rest2,..., restn";
        More than one restriction is allowed provided the restrictions are separated by commas. Each restriction must be written as a linear equation with all variables on the left-hand side and the constant on the right-hand side (e.g., x1+x2=1).

        Variables shown in each restriction must be variables on the right-hand side of the model. Restrictions in the Restrict argument must be consistent and not redundant; otherwise error messages will be given. Note that only the parameters associated with the variables are restricted, not the variables in the model themselves.

        Examples of some VALID Restrict arguments:
            1) Restrict="x3+x4+x5=1";
            2) Restrict="constant=0, x4=0, x5=0, x6=0";

        Examples of some INVALID Restrict arguments:
            1) Restrict="x2-x3=0,x3-x4=0,x2-x4=0";
            2) Restrict="x2-x3=0,x2-x4=0,x3-x4=1";

        Example 1 is redundant (the third restriction follows from the first two) and example 2 is inconsistent (the first two restrictions imply x3-x4=0, which contradicts x3-x4=1).

Output:
    Q -- a "COMPACT" output vector containing all calculated statistics. See the manual for more details on extracting information from it. Variables contained in Q are:
        nms -- names of the regressors.
        b -- regression coefficients.
        vc -- variance-covariance matrix of b.
        se -- standard errors of b.
        s2 -- variance of the error.
        cx -- correlation matrix of b.
        rsq -- coefficient of determination.
        rbsq -- adjusted R-squared.
        dw -- Durbin-Watson statistic.
        nobs -- number of observations.
        xtx -- cross-moment matrix of X.
        sse -- residual sum of squares.

Globals:
    _lregcol -- scalar. If 1, perform collinearity diagnostics. See the manual for more details. Default = 0.
    _lreghc -- scalar. If 1, the heteroskedastic-consistent covariance matrix estimator will be calculated. Default = 0.
    _lregres -- string, a file name to request influence diagnostics. Statistics generated from the diagnostics are saved under this file name. Besides the diagnostic statistics, the predicted values, dependent variable and independent variables are also saved. They are saved in the following order:

        COL.  NAME       DESCRIPTION
        1     RES        Residuals = (observed - predicted)
        2     HAT        Hat Matrix Values
        3     SRES       Standardized Residuals
        4     RSTUDENT   Studentized Residuals
        5     COOK       Cook Influence Statistics
        6     YHAT       Predicted Values
        7     <depname>  Dependent Variable
        8+    <indname>  Independent Variables

    _lrpcor -- scalar. If 1, print the correlation matrix of coefficients. Default = 0.
    _lrpcov -- scalar. If 1, print the covariance matrix of coefficients. Default = 0.
    range -- a 2 x 1 vector. Specifies the range of the data set to be used in estimation. The first element specifies the beginning observation while the second element specifies the ending observation. Example: range = { 100,200 }. Default is { 0,0 }, which uses the whole data set.
    output -- scalar. If nonzero, results are printed. Default = 2.
    weight -- string, name of the weight variable. By default, unweighted least squares will be calculated. The weights are assumed to be inversely proportional to the error variances and greater than zero. Details are given in the manual.
    title -- string, message printed at the top of the results. Default = "";

Practical Session 8a

1. Repeat the above analysis and include all other variables in the model, with and without the constant. Remove the non-significant ones. By changing the options, impose a restriction on the covariates, perform collinearity diagnostics, calculate the heteroskedastic-consistent covariance matrix estimator, and print the covariance matrix of coefficients.

2. The GAUSS data set solve3.dat contains the time taken to solve four block design problems by 24 fifth-grade children. EFT is the score on the embedded figures test, a measure of difficulty in abstracting the logical structure of a problem from its context, and COR_GRP is the (group) classification by type of problem presented first, i.e. those solved by row (group 1) or formation strategy. The solv3.prg program fits a linear model to this data and tests for heteroskedasticity. Modify this program and fit a linear model with interaction only.

3. Quantal Response

Quantal response models are special regression models in which the dependent variable is qualitative in some way. The common structure of these models is to relate the conditional probability of each response to some exogenous variables. Using simple regression procedures such as OLS to estimate a quantal response model is inappropriate because of heteroskedasticity and other statistical problems.

The quantal response module is a statistical package that provides a set of procedures for estimating these models. It offers the following procedures for different quantal response model specifications:

    1. LOGIT    Estimates the multinomial logit model.
    2. ORDERED  Estimates the ordered logit or ordered probit model.
    3. PROBIT   Estimates the binomial probit model.
    4. PSNREG   Estimates the Poisson regression model.
    5. QTEST    Performs a linear hypothesis test of a logit or probit model.

Here are some examples.

Multinomial logit 1

LGTALD2.E: Logit analysis of the Aldrich and Nelson (p. 63) data for a dichotomous dependent variable.

    library quantal;
    quantset;
    output file = lgtald2.out reset;
    _qrcatnm = { GRD=A, GRD=B, GRD=C };
    dsn = "d:/gauss/examples/aldnel";
    datalist ^dsn;
    dv = { abc };
    iv = { gpa, tuce, psi };
    { vnam,b,vc,n,pct,mn,sd,fit,df,tol } = logit(dsn,dv,iv);
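To make concrete how a multinomial logit model relates response probabilities to the exogenous variables, here is a minimal sketch in Python (standard library only). It is not the LOGIT procedure itself: the function name and the coefficient values are hypothetical, and taking the last category as the reference is an assumption made for illustration.

```python
import math

def multinomial_logit_probs(x, coefs):
    """Return P(y = j | x) for j = 1..NCAT, with the last category as reference.

    x     -- regressor values, starting with 1.0 for the constant
    coefs -- NCAT-1 coefficient lists, one per non-reference category
    """
    # Linear predictor for each non-reference category; the reference gets 0.
    scores = [sum(b * v for b, v in zip(beta, x)) for beta in coefs] + [0.0]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    expd = [math.exp(s - m) for s in scores]
    total = sum(expd)
    return [e / total for e in expd]

# Hypothetical coefficients (NOT estimates from the aldnel data).
coefs = [[-1.0, 0.5, 0.02, 0.8],   # category 1 versus category 3
         [0.3, 0.1, -0.01, 0.2]]   # category 2 versus category 3
x = [1.0, 3.0, 25.0, 1.0]          # constant, gpa, tuce, psi
probs = multinomial_logit_probs(x, coefs)
print(probs)                       # three probabilities that sum to 1
```

Because the probabilities are bounded between 0 and 1 and depend nonlinearly on the regressors, the error variance changes with x, which is one reason OLS is inappropriate for such models.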

Multinomial logit 2

LGTDEL5.E: Logit analysis of the NORC data using DATALOOP to delete the fifth category.

    library quantal;
    quantset;
    datalist d:/gauss/examples/norc;
    dataloop d:/gauss/examples/norc tem;
    delete depvar == 5;
    endata;
    datalist tem;
    output file = lgtdel5.out reset;
    title = "lgtdel5.e: NORC data. Deleting fifth category.";
    dsn = "tem";
    dv = 1;
    iv = { 2, 3, 4 };
    _qrcatnm = { Menial, Blue_Col, Craft, Whte_Col, Prof };
    { vnam,b,vc,n,pct,mn,sd,fit,df,tol } = logit(dsn,dv,iv);

Options for Ordered (see src directory)

ordered.src - Ordered Logit and Probit Analysis

(C) Copyright Aptech Systems, Inc. All Rights Reserved. This Software Product is PROPRIETARY SOURCE CODE OF APTECH SYSTEMS, INC. This File Header must accompany all files using any portion, in whole or in part, of this Source Code. In addition, the right to create such files is strictly limited by Section 2.A. of the GAUSS Applications License Agreement accompanying this Software Product. If you wish to distribute any portion of the proprietary Source Code, in whole or in part, you must first obtain written permission from Aptech Systems.

Purpose: To estimate the ordered probit or logit model using a GAUSS data set. By default the ordered probit model is estimated. The ordered logit model is estimated by setting _QRLOGIT to 1.

Format: { vnames,b,vc,ndtran,pct,meanx,sdx,fit,df,tol } = ORDERED(dataset,depvar,indvars);

Input:
    dataset -- string, name of data file.
    depvar -- string, name of dependent variable
        - or - scalar, index of dependent variable.
        The value of depvar will be truncated before analysis. Thus, 1.4 is treated as category 1.
    indvars -- Kx1 character vector, names of independent variables
        - or - Kx1 numeric vector, indices of independent variables.
        The program adds one variable for the constant term.

Defaults are provided for the following global input variables. They can be ignored unless you need control over the other options provided by this procedure.

WARNING: If you change the defaults in a command file, the new values will apply in the next program you run using ORDERED unless you change them back. This can be done by running QUANTSET.

    altnam -- provides alternative names for the variables.
        if 0 (default), the original names of the variables are used.
        if a ((1+NIVAR)x1) character vector, the first name in this vector will be used to label the dependent variable and the remaining NIVAR names will be used to label the independent variables.
    miss -- global scalar, default 1.
        if 0, there are no missing values (fastest).
        if 1, do listwise deletion: drop an observation if there are any missing values among the independent and dependent variables.
    output -- global scalar, default 1.
        if 1, sends results to the output device (including the screen).
        if 0, no information is sent to output.
    range -- 2x1 vector. The range of records in the data set used for analysis. The first element is the starting row index, the second element is the ending row index. Default is the whole data set.

    row -- global scalar, default 0.
        if 0, the number of rows to read per iteration of the read loop is calculated by the program.
        if not 0, the specified number of rows will be read.
    _dtsel -- global scalar, default 0.
        if 0, all cases are selected for analysis.
        if Kx3, cases are selected into samples according to specified conditions. See DTRAN for details.
    tol -- global scalar controlling the iterations. tol indicates the maximum difference between estimates of the coefficients in two adjacent iterations.
    _qrcatnm -- NCATx1 character vector of names of outcome categories
        - or - default scalar 0, in which case the names CAT1, CAT2,... are used.
    _qrfit -- global scalar, default 0.
        if 1, print detailed goodness of fit measures, including a table of observed and predicted outcomes.
        if 0, only print chi-square, -2*log-likelihood and percent correctly predicted.
    _qriter -- global scalar, default 0.
        if 0, do not print information on iterations.
        if 1, send detailed information on iterations to the screen but not to the output device.
        if 2, send detailed information on iterations to the output device.
    _qrlogit -- global scalar, default 0.
        if 1, the ordered logit model is estimated.
        if 0, the ordered probit model is estimated.
    _qrpred -- global scalar, default 0.
        if 0, predicted values will not be written to disk.
        if not 0, predicted probabilities for each outcome category are written to file ^_qrpred with NCAT+1 variables. The first NCAT are PRED1,PRED2,...,PREDNCAT. The last variable is the variable defined by depvar.
    _qrpredn -- string, name of data set for predicted values. The default name is "_qrpred".
    _qrstart -- global scalar, default 0.
        if 0, do not use user-supplied start values.
        if not 0, the user should provide a (NCAT-1+NIVAR) vector of start values. First provide start values for the intercepts, then for the slopes.
    _qrstat -- global scalar, default 0.
        if 0, do not print descriptive statistics.
        if 1, print descriptive statistics.

ORDERED uses the method of scoring for estimation, with squeezes. Squeezes are controlled with these globals:

    _qrsqtol -- global scalar, default .01. When the proportional change in the likelihood function is smaller than _qrsqtol, or the change in the likelihood function is in the wrong direction, take a squeeze.
    _qrnsqz0 -- global scalar, default 0.
        if 0, squeezes will not be computed until changes in the likelihood function from one iteration to the next become small.
        if not 0, the program will take up to that number of squeezes per iteration, starting with the first iteration. Since squeezes take time and are less effective when estimates are far from the converged values, it is generally best to leave this as 0.
    _qrsqz -- global scalar, default 0.
        if 0, don't take squeezes until the change in the likelihood function is small.
        if 1, consider taking squeezes from the first iteration.
    _qrnsqz1 -- global scalar, default 10. When squeezes begin, this is the maximum number of squeezes that will be taken before proceeding to the next iteration.

    _qrmiter -- maximum number of iterations, default =

Output:
    vnames -- a (K+2)x1 character vector containing the names of the variables in the model. The order is: depvar "CONSTANT" indvars.
    b -- an NPARM=(NCAT-1)*(K+1) vector of parameter estimates in the order: intercepts var1 var2...vark. For each variable the parameters are in the order comparing the first category to NCAT, the second to NCAT,... up to NCAT-1 to NCAT. See below for details.
        If errors are encountered, a message will be sent to the error log and b will contain a scalar error code. This code appears as missing unless it is translated with the command scalerr(b). The codes are defined as:
            1   data file not found
            2   found undefined variables
            30  system singular
            31  too few nonmissing observations
            71  number of categories of dependent variable is less than 2
            72  one of the outcome categories has no cases
            73  an independent variable has no variation
            74  can't open file for predicted values
            75  out of disk space
            77  all cases were deleted
            78  singular matrix encountered during iterations
            79  wrong number of start values specified
    vc -- NPARMxNPARM variance-covariance matrix for the parameters in b.
    ndtran -- 2x1 vector of observations. Element 1 contains the number of cases read from dataset; element 2 contains the number of cases left after deletion of missing cases controlled by miss, i.e. the number of cases used in the analysis.
    pct -- the percent of cases in each of the outcome categories, arranged in order from lowest to highest.
    meanx -- the means, based on nused cases, of the independent variables in the order in indvars.
    sdx -- the standard deviations, based on nused cases, of the independent variables in the order in indvars.
    fit -- 4x1 vector of goodness of fit measures. Element 1 is the likelihood ratio chi-square assessing the overall fit of the model; element 2 is -2 times the log likelihood function evaluated at the estimated values; element 3 is -2 times the log likelihood function evaluated with the slopes fixed to zero; element 4 is the percentage of correct predictions from the model.
    df -- the degrees of freedom associated with lrx2.
    tol -- the tolerance reached. If convergence was obtained, this value must be less than the tol global.

Remarks: See the manual for details on the model.

Library: QUANTAL

See Also: LOGIT, PROBIT, DTRAN

Example for ordered logit

OLPRED.E: Ordinal Logit analysis of the NORC data. Saving predicted values to disk.

    library quantal;
    #include quantal.ext;
    quantset;
    output file = olpred.out reset;
    title = "olpred.e: Ordinal Logit Analysis of the NORC";
    _qrlogit = 1;
    _qriter = 0;
    _qrpred = 1;
    _qrpredn = "olpred";
    _qrfit = 1;
    _qrstat = 1;
    dataset = "d:/gauss/examples/norc";
    depvar = { DEPVAR };
    indvars = { EXPER, EDUC, WHITE, FBLUE };
    call ordered(dataset,depvar,indvars);

Example for ordered probit

OPNORC.E: Ordered Probit analysis of the NORC data.

    library quantal;
    quantset;
    output file = opnorc.out reset;
    title = "opnorc.e: Ordered Probit analysis of NORC data on occupation";
    dsn = "d:/gauss/examples/norc";
    row = { 0, 100 };    /* read only first 100 cases */
    dv = 1;
    iv = { 2, 3, 4 };
    _qrcatnm = { "Menial", "Blue_Col", "Craft", "Whte_Col", "Prof" };
    _qrlogit = 0;
    call ordered(dsn,dv,iv);

Example for probit

PBTNEWT.E: Probit analysis of Aldrich and Nelson Data (pg. 62) Using Newton-Raphson

    library quantal;
    #include quantal.ext;
    output file = pbtnewt.out reset;
    quantset;
    title = "pbtnewt.e: Aldrich and Nelson Data (pg. 62) using Newton-Raphson";
    _qriter = 1;     /* iteration results: 0 no, 1 view, 2 print */
    _qrstat = 1;     /* 0 for no desc stats */
    _qrfit = 1;      /* 1 to print detailed goodness of fit measures */
    _qrcatnm = { B_or_C, A };
    _qrpred = 0;     /* save predicted values */
    _pbtnewt = 1;
    weight = { gpa };
    fnm = "d:/gauss/examples/aldnel";
    dv = { 5 };
    iv = { 2, 3, 4 };
    { vnam,b,vc,n,pct,mn,sd,lrx2,df,tol } = PROBIT(fnm,dv,iv);

Example for Poisson

PSNREG.E: example for PSNREG

    library quantal;
    #include quantal.ext;
    quantset;
    _qrstat = 1;
    let dep = wars;
    let ind = age party unem;
    dataset = "d:/gauss/examples/sample";
    output file = psnreg.out reset;
    call psnreg(dataset,dep,ind);
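The handout does not say how PSNREG computes its estimates. As a rough illustration of what fitting a Poisson regression involves, the following Python sketch (standard library only) fits log E[y] = b0 + b1*x by Newton-Raphson, which coincides with the method of scoring for this model. The function name, the restriction to a single regressor and the data are all assumptions made for the example.

```python
import math

def poisson_fit(x, y, iters=25):
    """Fit log E[y] = b0 + b1*x by Newton-Raphson; return (b0, b1)."""
    b0 = math.log(sum(y) / len(y))   # start at the intercept-only fit
    b1 = 0.0
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Information matrix X' diag(mu) X, here 2 x 2.
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve H * step = g with the closed-form 2 x 2 inverse.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0 += s0
        b1 += s1
        if abs(s0) + abs(s1) < 1e-10:
            break
    return b0, b1

# Made-up counts for illustration (not the "sample" data set).
x = [0, 1, 2, 3, 4]
y = [1, 2, 4, 7, 12]
b0, b1 = poisson_fit(x, y)
print(b0, b1)
```

At the solution the score equations hold, so the fitted means reproduce the total count and the x-weighted total count of the data.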

Example for Qtest

QRTEST.E: Test of linear hypothesis of logit model

    library quantal;
    #include quantal.ext;
    quantset;
    _qrcatnm = { GRD=A, GRD=B, GRD=C };
    dsn = "d:/gauss/examples/aldnel";
    dv = { abc };
    iv = { gpa, tuce, psi };
    { vnam,b,vc,n,pct,mn,sd,fit,df,tol } = logit(dsn,dv,iv);
    output file = qrtest.out reset;
    test1 = "gpa:2 + tuce:2 = 0";
    { wald1 } = qtest(vnam,b,vc,test1);
    test2 = "gpa:1-2.5tuce:1 = 2, tuce:1 + psi:1 = 0, 3gpa:2 + 2tuce:1 - psi:2 = 2";
    { wald2 } = qtest(vnam,b,vc,test2);

Practical Session 8b

In the data set bronc.dat, RES indicates whether the subject has bronchitis (RES = 1) or not (RES = 0), CIG is the amount of cigarette consumption and POLL is the pollution level in that household. Do the following:

1. Categorise CIG into four categories: 0, less than or equal to 3, less than or equal to 8, and more than 8.
2. Categorise POLL into the intervals (0,55], (55,57.5], (57.5,60], (60,62.5], (62.5,65] and more than 65.
3. Fit a logistic model to this data and find the significant covariates.
4. You may need to consult the bronc.prg program.

The data in the file CLAIMS.dat give the number of policyholders PONO of an insurance company who were exposed to risk, and the number CLAIM of car insurance claims made in the third quarter of 1973 by these policyholders, arranged as a contingency table cross-classified by three four-level factors: DIST, the district in which the policyholder lived; CAR, the insurance group into which the car was placed; and AGE, the age of the policyholder. The first 16 observations are in DIST1, the second 16 are in DIST2, and so on. The first four observations are in CAR1, the second four observations are in CAR2, and so on. The first observation is in AGE1, the second in AGE2, the third in AGE3, the fourth in AGE4, and the fifth is again in AGE1.

1. Rearrange the data and construct dummy variables for the above categories.
2. Fit a Poisson regression model for the number of claims (CLAIM).
3. Use and modify the CLAIM.prg program.
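The layout described for CLAIMS.dat means the three factor levels can be recovered from the row position alone. The Python sketch below shows one way to do that and to build dummy variables. It assumes a complete 4 x 4 x 4 table of 64 rows in the stated order and uses level 1 of each factor as the reference category; both the helper names and that choice of reference are assumptions, and this is not the CLAIM.prg program.

```python
def factor_levels(i):
    """Return (dist, car, age), each in 1..4, for 0-based row index i."""
    dist = i // 16 + 1        # DIST changes every 16 rows
    car = (i // 4) % 4 + 1    # CAR changes every 4 rows, cycling within DIST
    age = i % 4 + 1           # AGE cycles 1,2,3,4 fastest
    return dist, car, age

def dummies(level, nlevels=4):
    """Indicator columns for levels 2..nlevels; level 1 is the reference."""
    return [1 if level == k else 0 for k in range(2, nlevels + 1)]

# One row of nine dummy variables per cell of the contingency table.
rows = []
for i in range(64):
    d, c, a = factor_levels(i)
    rows.append(dummies(d) + dummies(c) + dummies(a))

print(factor_levels(4))   # fifth observation: DIST1, CAR2, AGE1
print(rows[4])
```

Dropping one level per factor avoids exact collinearity with the constant term when the dummies enter the regression together.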
