addesc Add a variable description to the key file CCDmanual0.docx


The function adds a variable description to the key file. This is useful when a new variable is created whose description is not yet in the key file. The description is then available for use in dools output.

    addesc(nvbs, nvbsdes, dsn=NULL)

    nvbs      name of variable
    nvbsdes   description of nvbs
    dsn       name of the data set the variable is based upon (EA, LRB, SCCS, WNAI; ea, lrb, sccs, wnai)

The function appends the description to the key file.

CSVwrite Write object to *.csv file

The function writes an object, with elements capable of being coerced to a dataframe, to a csv file. It is used to write the output from dools to a file that can be read by a spreadsheet.

    CSVwrite(a1, a2, a3=FALSE)

    a1   Object to be written; typically output from function dools
    a2   The base name of the *.csv file (do not include the .csv extension)
    a3   Should the object be appended to the existing file (default=FALSE)

No values are returned in the R environment; the only changes are to the specified *.csv file.

Set the option a3=TRUE to append the output of object a1 to an existing file with base name a2. The default overwrites any existing csv file with base name a2.

CSVwrite is like the write.csv function, except that it can append values to an existing csv file and can write the elements of a list to a csv file.
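The append-or-overwrite behaviour described for CSVwrite can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper name csv_write_sketch and its argument names are hypothetical, and it uses write.table because write.csv does not support appending.

```r
# Hypothetical helper illustrating append-capable CSV output for a list.
# It is NOT the CSVwrite function from the manual.
csv_write_sketch <- function(obj, base, append = FALSE) {
  path <- paste0(base, ".csv")
  # Default behaviour: start from an empty file (overwrite).
  if (!append && file.exists(path)) file.remove(path)
  # Treat a single data frame or matrix as a one-element list.
  if (is.data.frame(obj) || !is.list(obj)) obj <- list(obj)
  for (i in seq_along(obj)) {
    nm <- names(obj)[i]
    # Write the element name as a title line, when there is one.
    if (!is.null(nm) && nzchar(nm)) cat(nm, "\n", file = path, append = TRUE, sep = "")
    # write.table warns when column names are appended; that is intended here.
    suppressWarnings(
      write.table(as.data.frame(obj[[i]]), file = path, sep = ",",
                  append = TRUE, col.names = NA, qmethod = "double")
    )
    cat("\n", file = path, append = TRUE)  # blank line between elements
  }
  invisible(NULL)
}

# Usage: write a small list twice, the second time appending.
out  <- list(Model = data.frame(coef = c(1.5, 2)), N = 10)
base <- file.path(tempdir(), "myOutput")
csv_write_sketch(out, base)                 # overwrites any existing file
csv_write_sketch(out, base, append = TRUE)  # adds a second copy below the first
```

Each list element is coerced with as.data.frame and written in turn, so the resulting file can be opened in a spreadsheet with one block per element.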

domi Produce multiple imputed data sets

The function produces multiple imputed data sets from SCCS data, using methods from the mice package.

    smi<-domi(eavs=NULL,lrbvs=NULL,sccsvs=NULL,wnaivs=NULL,nimp=10,maxit=7)

    EAvs     character string containing names of variables from EA dataset
    LRBvs    character string containing names of variables from LRB dataset
    SCCSvs   character string containing names of variables from SCCS dataset
    WNAIvs   character string containing names of variables from WNAI dataset
    nimp     the number of imputed data sets to create (default=10)
    maxit    the number of iterations used to estimate imputed data (default=7)

The function domi returns a dataframe containing the number of imputed datasets specified by the nimp option. The datasets are stacked one atop the other, and indexed by the variable .imp.

This function imputes several new datasets, using covariates for each variable to create a conditional distribution of estimates for each missing value, and then replacing the missing value with a draw from the distribution; as a result, each of the imputed datasets will typically have slightly different values for the estimated cells.

The key to successful imputation is to have good covariates for each variable. The function domi begins the search for good covariates by grouping each variable in a cluster of collinear variables. For each cluster, the best covariates are selected from a set of variables with no missing values, including both network lag variables (based on geographic distance, language, and ecology) and climate and ecology variables.

The first four arguments are lists of variable names from the four ethnographic data sets (EA, LRB, SCCS, and WNAI). These will be the data used in model building. One should include all data one thinks might be useful, but nothing more, since additional variables add to the time the procedure takes to run.

The fifth argument is the number of imputed datasets to create: between 5 and 10 imputed data sets are considered adequate, but there is no harm in choosing more; the default is 10. The final argument is the number of iterations to perform in creating each imputed dataset; the default is 7.

It is not usually necessary to examine the returned dataframe; it is used in estimating the model, but is not in itself that interesting. Nevertheless, some output is automatically written to the console as the function executes, to provide information about the clusters to which the variables have been assigned and the covariates selected for each cluster.

For each cluster, the names of the members are printed, along with the method used for imputation (in most cases pmm, predictive mean matching; variables without missing values are indicated by empty quotes). Prefixes l, e, and d indicate spatial lags for, respectively, linguistic, ecological, and geographic proximity. Additionally, those variables that could not be imputed, due to perfect multicollinearity, are indicated as each cluster is processed.

Squared terms are then created for those variables with at least three unique values, and with maximum values below [value missing]. The squared variables are indicated by the sq suffix on the original variable name (e.g., SCCS.v72sq is the square of SCCS.v72).

The last step is to identify those variables that are perfectly collinear with a linear combination of other variables; users should consider dropping some of these, so that the problem of perfect multicollinearity does not crop up during estimation.

Based on the methods proposed by Malcolm M. Dow and E. Anthon Eff.
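The shape of the dataframe returned by domi (nimp completed datasets stacked one atop the other and indexed by .imp) can be illustrated with a toy example in base R. This is a minimal sketch under stated assumptions: it replaces each missing value with a draw from a simple regression-based conditional distribution, which stands in for, and is much cruder than, the predictive mean matching performed by the mice package. The data and variable names are invented.

```r
set.seed(1)

# Toy data with two missing values in y (invented for illustration).
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2.1, NA, 6.2, 7.9, NA, 12.1))
nimp <- 3

# Conditional distribution of y given the covariate x, from complete cases.
fit  <- lm(y ~ x, data = toy)
miss <- is.na(toy$y)
pred <- predict(fit, newdata = toy[miss, , drop = FALSE])
s    <- summary(fit)$sigma

# Build nimp completed datasets; each missing cell gets its own random draw.
stacked <- do.call(rbind, lapply(seq_len(nimp), function(m) {
  d <- toy
  d$y[miss] <- pred + rnorm(sum(miss), sd = s)
  data.frame(.imp = m, d)   # index each completed dataset by .imp
}))

dim(stacked)          # nimp * nrow(toy) rows
table(stacked$.imp)   # one block of rows per imputation
```

Observed values are identical in every block, while the imputed cells differ slightly from one block to the next, which is the variation that later estimation steps combine across imputations.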

dools Estimate OLS model on multiply imputed data

The function estimates an unrestricted and restricted OLS model, with network lag term, providing common diagnostics.

    h<-dools(smi, depvar, indpv, rindpv=NULL, othexog=NULL, dw=TRUE, lw=TRUE, stepw=FALSE, relimp=FALSE, slmtests=FALSE)

    smi        a multiply imputed dataset, created by the function domi
    depvar     the name of the dependent variable (must be in smi)
    indpv      the names of the independent variables for the unrestricted model (must be in smi)
    rindpv     names of restricted model independent variables (must be in indpv; when the default of NULL is used, the restricted model independent variables are the same as those of the unrestricted model, minus the last variable)
    othexog    names of additional exogenous variables (must be in smi; will be added to a list of 21 variables; default is NULL)
    dw         Should geographic proximity be used in constructing composite weight matrix (default=TRUE)
    lw         Should linguistic proximity be used in constructing composite weight matrix (default=TRUE)
    stepw      Should stepwise regression be done to show most-selected variables from unrestricted model (default=FALSE)
    relimp     Should relative importance be calculated for independent variables of restricted model (default=FALSE)
    slmtests   Should spatial lag tests be run for the four weight matrices (default=FALSE)

Returns a list with 11 elements:

    DependVarb     Identification of dependent variable
    URmodel        Coefficient estimates from the unrestricted model (includes standardized coefficients and VIFs)
    Rmodel         Coefficient estimates from the restricted model
    RmodelRobust   Coefficient estimates from the restricted model with robust SEs
    Diagnostics    Regression diagnostics for the restricted model (RESET test; Wald test on model restrictions; Breusch-Pagan heteroskedasticity test; Shapiro-Wilk test for normality of residuals; Hausman tests for endogeneity of independent variables)
    OtherStats     Other statistics: composite weight matrix weights (see details); R² for all models (model creating instrument for network lag term; restricted model; unrestricted model); number of imputations; number of observations
    DescripStats   Descriptive statistics for variables in unrestricted model
    dfbetas        Influential observations for dfbetas (see details)
    totry          Character string of variables that were most significant in the unrestricted model as well as additional variables that proved significant using the add1 function on the restricted model
    didwell        Character string of variables that were most significant in the unrestricted model
    interacts      Character string of interaction variables that proved significant using the add1 function on the restricted model

Users can choose two kinds of proximity/similarity weight matrices for constructing a network lag term: geographic and linguistic. In most cases, users should choose both (the defaults). The optimal composite weight matrix, constructed as the weighted sum of the weight matrices, is that which maximizes unrestricted model R². The network lag term is entered in each model as the variable Wy.

The dfbetas are scaled changes in coefficient estimates caused by adding an observation to the model. Only the most influential dfbetas are output.

The stepwise procedure can provide additional insight on which independent variables provide the best model fit. Since the imputed datasets differ slightly from each other, the variables selected by a stepwise procedure typically differ slightly for each imputed dataset. If the stepw=TRUE option is chosen, a column labeled stepkept will be added to the table reporting unrestricted model results. The column reports the number of times the independent variable was retained in the model by a stepwise procedure using both forward and backward selection.

The add1 function tests whether the members of a list of variables prove significant when added singly to a model. The list of variables includes all numeric variables in the imputed dataset, as well as squared terms of variables currently in the unrestricted regression. Variables proving significant in over 80 percent of the imputations are returned in the character string totry.

Relative importance is a method of assigning R² to each independent variable. The method repeatedly estimates a model, first with one independent variable, then with two, and so on, and calculates the change in R² as each variable is introduced. The order of entry is changed, and the process repeated, to consider all possible orders of entry. The relative importance measure is the average change in R² across all these different models. With large numbers of independent variables, the calculations are prohibitively slow. Setting relimp=TRUE will calculate the relative importance of independent variables in the restricted model, and report these in the column labeled relimp.

Based on the methods proposed by Malcolm M. Dow and E. Anthon Eff.
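The idea of a composite weight matrix chosen to maximize R² can be sketched in base R as a grid search over the mixing weight. This is a simplified illustration under stated assumptions, not the procedure used by dools: the matrices D and L are random stand-ins for the geographic and linguistic proximity matrices, the data are simulated, and the lag term Wy enters the regression directly rather than through the instrument mentioned under OtherStats.

```r
set.seed(42)
n <- 30

# Zero the diagonal, then scale each row to sum to one.
row_norm <- function(m) { diag(m) <- 0; m / rowSums(m) }

D <- row_norm(matrix(runif(n * n), n, n))  # stand-in for geographic proximity
L <- row_norm(matrix(runif(n * n), n, n))  # stand-in for linguistic proximity

# Simulated dependent and independent variable (invented for illustration).
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)

# Try composite matrices a*D + (1-a)*L and record the model R-squared.
grid <- seq(0, 1, by = 0.1)
r2 <- sapply(grid, function(a) {
  W  <- a * D + (1 - a) * L
  Wy <- as.vector(W %*% y)                 # network lag term
  summary(lm(y ~ x + Wy))$r.squared
})

best   <- grid[which.max(r2)]              # weight on D with the highest R-squared
W_best <- best * D + (1 - best) * L
```

Because both inputs are row-normalized and the weights sum to one, the composite matrix W_best is itself row-normalized, so each value of Wy is a weighted average of the dependent variable over the other societies.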
    library(mice)
    library(foreign)
    library(stringr)
    library(psych)
    library(AER)
    library(relaimpo)
    library(geosphere)
    library(spdep)

    # --bring in functions and data--
    load(url("..."))  # the URL of DEz2.Rdata is missing here
    ls()  #-can see the objects contained in DEz2.Rdata

    #--list and modify variables for use in model--
    # --make new variables--
    xcd$sccs.valchild<-(xcd$sccs.v473+xcd$sccs.v474+xcd$sccs.v475+xcd$sccs.v476)
    # --create descriptions for new variables--
    addesc("sccs.valchild","degree to which society values children")
    addesc("wy","network lag term")
    # --create new dummy variables--
    xcd<-cbind(xcd,mkdummy("sccs","v899",1))

    # --identify variables to keep for model building--
    ev<-c("v30","v78")
    lv<-c("group2","hunting","gatherin","fishing","huntfil2",
          "war1","reven","nomov","dismov","store","subdiv2")
    sv<-c("v1685","v72","v234","v236","v238","v1648","v899d1",
          "valchild","v1260","v79","v80","v81","v872","v871")
    wv<-c("v284","v285","v286","v288","v289","v135")

    # --make imputed data--
    smi<-domi(eavs=ev,lrbvs=lv,sccsvs=sv,wnaivs=wv,nimp=5,maxit=5)
    names(smi)  #--can see which variables are available
    smi$lrb.lngroup2<-log(smi$lrb.group2)
    xcd$lrb.lngroup2<-log(xcd$lrb.group2)
    addesc("lrb.lngroup2","natural log of LRB.group2")

    # --identify role of variables in model--
    dv<-"lrb.lngroup2"
    riv<-uiv<-c("sccs.v21","wnai.v135","lrb.revensq","lrb.subdiv2","lrb.war1sq")
    # the imputed dataset is passed as the first argument
    h<-dools(smi,depvar=dv,indpv=c("sccs.v1260",uiv),
             rindpv=riv,othexog=NULL,dw=TRUE,lw=TRUE,
             stepw=TRUE,relimp=TRUE,slmtests=FALSE)
    print(h)

    # --print output to csv file--
    CSVwrite(h,"myOutput",FALSE)

keyf keyfile dataset

The data.frame keyf contains information about variables from four ethnographic datasets: EA, LRB, SCCS, and WNAI.

Format

    rownames      Variable names from the data.frame xcd
    variable      Variable names as given within the ethnographic dataset (EA, LRB, SCCS, or WNAI)
    type          Variable type (ordinal or categorical)
    description   Variable description
    NOTmissing    Number of non-missing values for variable
    class         Variable class (character or numeric)
    nuniqvals     Number of unique data values for variable
    FNOTmissing   For the factor version of the ethnographic dataset: Number of non-missing values for variable
    Fclass        For the factor version of the ethnographic dataset: Variable class (character, factor, integer, or numeric)
    FnUniqVals    For the factor version of the ethnographic dataset: Number of unique data values for variable
    db            Source ethnographic dataset (EA, LRB, SCCS, or WNAI). GIS data is indicated as gisx.
    levels        Factor levels for variables defined as factors in the factor version (and with fewer than 20 factor levels)

    head(keyf)

mkdummy Make dummy variable and store a description in key file

The function makes a dummy variable from a variable in the data.frame xcd, and creates a description stored in the data.frame keyf.

    mkdummy(dsn, vv, val)

    dsn   name of an ethnographic dataset (EA, LRB, SCCS, or WNAI)
    vv    name of a variable from the specified ethnographic dataset
    val   the value of variable vv for which the dummy equals one

The function returns a variable named dsn.vvdval, which equals one when xcd$dsn.vv==val, and equals zero otherwise.

The main reason to use this function is that it automatically appends a description for the dummy variable to the key file, which is then available for use in dools output. The description is created using the variable name from the key file and the description of the value from the levels variable in the data.frame keyf.
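The naming rule for dummy variables (dsn.vv followed by d and the value) can be sketched in base R. This is an illustration under stated assumptions, not the mkdummy function itself: the helper mkdummy_sketch is hypothetical, it takes the data frame as an explicit argument instead of reading xcd, it does not write a description to the key file, and how the real function treats missing values is not stated in the manual (here they stay missing).

```r
# Hypothetical helper showing the dsn.vvdval naming rule only.
mkdummy_sketch <- function(data, dsn, vv, val) {
  src <- paste(dsn, vv, sep = ".")              # e.g. "SCCS.v899"
  out <- data.frame(as.numeric(data[[src]] == val))
  names(out) <- paste0(src, "d", val)           # e.g. "SCCS.v899d1"
  out
}

# Toy data frame standing in for xcd (values invented for illustration).
toy <- data.frame(SCCS.v899 = c(1, 2, 1, 3, NA))
toy <- cbind(toy, mkdummy_sketch(toy, "SCCS", "v899", 1))
toy
```

The new column equals one where SCCS.v899 is 1 and zero elsewhere, and its name carries both the source variable and the value it flags.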

mkwtmat Make and format three weight matrices for the societies in data.frame xcd

The function makes and formats three weight matrices (geographic, linguistic, and ecological) for the societies in data.frame xcd.

    mkwtmat()

The function returns three matrices:

    ddm   Geographic proximity, based on the latitude and longitude fields in data.frame xcd. Each cell is the inverted squared distance between the row society and column society. The diagonal is set to zero, and then the rows are normalized so that their sum equals one.
    eem   Ecological proximity, based on the Euclidean distance between societies in the 22-dimensional space defined by 19 climate variables, two altitude variables, and one measure of net primary productivity (all variables scaled to standard normal before distances are calculated). Each cell is exp(-d), where d is the distance between the row society and column society. The diagonal is set to zero, and then the rows are normalized so that their sum equals one.
    llm   Linguistic proximity between each row and column society. This matrix is not created, but only row normalized.

Since the geographic and ecological matrices are relatively fast to compute, but very large, it is more efficient to create them than to load an already constructed matrix. The linguistic matrix, on the other hand, takes a very long time to compute, but is small (many fewer unique values) and is therefore loaded with the other data and only row-normalized in this function.

The function is run one time in the domi function, making the matrices available both in the function and in the general environment.

xcd Cross cultural dataset

The data.frame xcd contains the variables from four ethnographic datasets: EA, LRB, SCCS, and WNAI.

The number of societies represented in each of the datasets is 1267 (EA), 339 (LRB), 186 (SCCS), and 172 (WNAI), for a total of 1964 records in the four datasets. However, some societies appear in more than one dataset (1090 appear only in one; 257 appear in two; 108 appear in three; and nine appear in all four), so there are 1464 unique societies. The data.frame xcd therefore contains 1464 observations and 2916 variables: 111 from EA; 262 from LRB; 2055 from SCCS; 440 from WNAI; and 48 that are drawn from GIS data.

Format

For each variable drawn from an ethnographic dataset, the variable name is XX.vv, where XX is the name of the ethnographic dataset and vv is the name of the variable in that dataset. For example, variable v207 from SCCS is named SCCS.v207.

    dim(xcd)
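The counts quoted for xcd and the XX.vv naming convention can be checked with a few lines of base R. This is a small sketch using only the figures stated above; the helper split_name is hypothetical and simply splits a name at its first period.

```r
# Societies per dataset, and societies appearing in exactly k datasets.
n_by_dataset <- c(EA = 1267, LRB = 339, SCCS = 186, WNAI = 172)
n_in_k       <- c(`1` = 1090, `2` = 257, `3` = 108, `4` = 9)

total_records        <- sum(n_by_dataset)                        # 1964
unique_societies     <- sum(n_in_k)                              # 1464
records_from_overlap <- sum(as.integer(names(n_in_k)) * n_in_k)  # 1964 again

# Variables per source, including the GIS-derived ones.
n_vars <- c(EA = 111, LRB = 262, SCCS = 2055, WNAI = 440, GIS = 48)
sum(n_vars)                                                      # 2916

# Hypothetical helper: split an XX.vv name into dataset and variable.
split_name <- function(v) {
  c(dataset  = sub("\\..*$", "", v),      # text before the first period
    variable = sub("^[^.]*\\.", "", v))   # text after the first period
}
split_name("SCCS.v207")
```

Counting each society once per dataset in which it appears reproduces the 1964 records, which confirms that the overlap figures and the per-dataset totals agree.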
