Lab 10 Regression IV

Rodney Hood
5 years ago

ggplot2 package:

Dave presented an analysis of a data set on body fat, which I would like to use to show features I think are worth knowing about in the ggplot2 (and associated) packages. Look at the code and results below.

    # ggplot2 features
    Body_fat_complete <- read.csv("c:/users/user/desktop/body_fat_complete.csv")
    bf <- Body_fat_complete
    names(bf)

We have 14 variables we can relate (7 per plot), and we do it 2 at a time using the ggpairs() command located in the GGally package. Download the help file on the command from our Moodle page, if you want to investigate its properties.

    library(ggplot2)
    library(corrplot)
    library(GGally)   # package names are case-sensitive
    ggpairs(bf[,1:7])
    ggpairs(bf[c(1,8:14)])


Note that the main diagonal has the shape of each variable's density (with row or column label), and the scatter plot of each pair of variables, when folded along the main diagonal, will lie on top of the corresponding correlation value. So, for example, the bottom left scatter plot has its correlation value listed in the top right, etc. The scatter plots visually show which pairs correlate well, and the numbers verify that observation. A density plot that is truncated before it spans the full range of x-axis values indicates outliers. So, for example, the density plot of ankle only goes about halfway across the x axis, and the hip variable only goes about 4/5 of the way across; these 2 variables have outliers, which can be seen relatively easily in the respective scatter plots. This visualization matrix is quite handy for getting initial impressions of relationships.

Below is some code showing various uses for ggplot2.

    # ggplot2 examples
    library(ggplot2)
    # create factors with value labels
    mtcars$gear <- factor(mtcars$gear, levels=c(3,4,5),
                          labels=c("3gears","4gears","5gears"))
    mtcars$am <- factor(mtcars$am, levels=c(0,1),
                        labels=c("automatic","manual"))
    mtcars$cyl <- factor(mtcars$cyl, levels=c(4,6,8),
                         labels=c("4cyl","6cyl","8cyl"))
    # Kernel density plots for mpg
    # grouped by number of gears (indicated by color)


    qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5),
          main="Distribution of Gas Mileage", xlab="Miles Per Gallon",
          ylab="Density")

    # Scatterplot of mpg vs. hp for each combination of gears and cylinders
    # in each facet, transmission type is represented by shape and color
    qplot(hp, mpg, data=mtcars, shape=am, color=am, facets=gear~cyl,
          size=I(3), xlab="Horsepower", ylab="Miles per Gallon")

    # Separate regressions of mpg on weight for each number of cylinders
    # Note: recent ggplot2 releases may ignore method= and formula= when they
    # are passed through qplot(); the usual alternative is
    # ggplot(mtcars, aes(wt, mpg, color=cyl)) + geom_point() +
    #   geom_smooth(method="lm", formula=y~x)
    qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"), method="lm",
          formula=y~x, color=cyl, main="Regression of MPG on Weight",
          xlab="Weight", ylab="Miles per Gallon")

    # Boxplots of mpg by number of gears
    # observations (points) are overlaid and jittered
    qplot(gear, mpg, data=mtcars, geom=c("boxplot", "jitter"), fill=gear,
          main="Mileage by Gear Number", xlab="", ylab="Miles per Gallon")


Also, with the following code

    library(ggplot2)
    p <- qplot(hp, mpg, data=mtcars, shape=am, color=am, facets=gear~cyl,
               main="Scatterplots of MPG vs. Horsepower",
               xlab="Horsepower", ylab="Miles per Gallon")
    # White background and black grid lines
    p + theme_bw()
    # Large brown bold italics labels
    # and legend placed at top of plot
    p + theme(axis.title=element_text(face="bold.italic", size=12, color="brown"),
              legend.position="top")

we have results shown below.

This just scratches the surface of what the ggplot2 package does, but it does illustrate the fancy graphing capabilities R can produce with this package.

More on model reduction:

Dave has illustrated other model criteria we should look at when trying to optimize a model. These include looking at the following model statistics:

  R^2
  adjusted R^2
  mean squared error: MSE = SSE/(n-p), where n = sample size, p = number of estimated model parameters
  Akaike's Information Criterion (AIC)
  Bayesian Information Criterion (BIC)
  Predicted Error Sum of Squares (PRESS)
  K-fold cross-validation (CV)

As you reduce models automatically using the various step() functions in the various packages, or compute these values individually for each model in your reduction process, you should probably compute each of these listed items so that you can judge which model best suits your predictability concerns. Dave has discussed these things in lecture, along with their strengths and weaknesses. We will give an example of reducing a model, where we use these statistics and associated graphs.

sat.csv data example:

Below is a view of the sat.csv data set, containing various indicators of high school kids who took the SAT exam in the various states. Let us look at some criteria. The categories are:

  takers   percentage of eligible students who took the exam
  income   median income of families of test takers
  years    average number of years of study in social science, natural sciences, and humanities by test takers
  public   percentage of test takers in public schools
  expend   state expenditures in hundreds of dollars per student
  rank     median percentile ranking of test takers within their schools

Let us first look at the scatter plots and correlation matrix.

    # sat example
    sat <- read.csv("c:/users/user/desktop/sat.csv")
    satdata <- sat[, 2:8]
    library(corrplot)
    library(GGally)
    names(satdata)
    ggpairs(satdata)

We want a model where SAT scores (sat) are the response. Looking at the scatter plots below, I think we could get a better relationship if we had log(takers) (the green arrow) instead of takers (the red arrow) plotted against sat, since the scatter plot and correlation for log(takers) against sat look better.


Output is shown below.

Our full model (model.full) will be

  sat = β0 + β1 ln(takers) + β2 income + β3 years + β4 public + β5 expend + β6 rank

We will want to look at the following reduced models:

  model.1 will be sat = β0 + β1 ln(takers) + β2 income + β3 years + β5 expend + β6 rank
  model.2 will be sat = β0 + β1 ln(takers) + β3 years + β5 expend + β6 rank
  model.3 will be sat = β0 + β1 ln(takers) + β3 years + β5 expend
  model.4 will be sat = β0 + β1 ln(takers) + β5 expend
  model.5 will be sat = β0 + β1 ln(takers)

I am not saying that we would necessarily want to reduce in this order. I want to reduce in this order to demonstrate how the various statistical values change.

    model.full <- lm(sat ~ logtakers + income + years + public + expend + rank,
                     data=satdata)
    summary(model.full)
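The lm() call refers to a column named logtakers, and the step that creates it does not appear in the text. A minimal sketch of that step, assuming the column is simply the natural log of takers; the toy data frame and its values are made up for illustration and are not the sat.csv data:

```r
# Toy stand-in for satdata (made-up values, NOT the real sat.csv)
toy <- data.frame(
  sat    = c(1080, 1060, 1000, 950, 900, 880),
  takers = c(2, 4, 10, 25, 50, 65)
)

# Add the log-transformed predictor; log() is the natural log in R
toy$logtakers <- log(toy$takers)

# The new column can then be used by name in lm()
fit <- lm(sat ~ logtakers, data = toy)
coef(fit)
```

With the real data, the corresponding line would presumably be satdata$logtakers <- log(satdata$takers), run before model.full is fit.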

Below are the model summaries of the other reduced models.

    model.1 <- update(model.full, . ~ . - public)
    summary(model.1)
    model.2 <- update(model.1, . ~ . - income)
    summary(model.2)
    model.3 <- update(model.2, . ~ . - rank)
    summary(model.3)
    model.4 <- update(model.3, . ~ . - years)
    summary(model.4)
    model.5 <- update(model.4, . ~ . - expend)
    summary(model.5)
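As a quick check of what update() does to a model formula, here is a small sketch on the built-in mtcars data. The data set and the model names m.full and m.1 are stand-ins chosen only so the code runs without sat.csv:

```r
# Stand-in full model on built-in data
m.full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# ". ~ . - qsec" keeps the response and every current term except qsec
m.1 <- update(m.full, . ~ . - qsec)

# Predictor terms that remain after the update
attr(terms(m.1), "term.labels")

# Nested models can be compared directly with an F test
anova(m.1, m.full)
```

The dot on each side of the tilde means "whatever was there before", which is why the same pattern can be chained from model.1 through model.5.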

The resulting R^2 and adjusted R^2 values from these models are stored in the vectors shown below.

    rsquare <- c(.8919,.8918,.8917,.8827,.8675,.8108)
    rsquare.adj <- c(.8769,.8795,.8821,.875,.8619,.8068)

Now, let us get the AIC and BIC of all models.

    n <- 50 # number of states (observational units) in data set
    aic.full <- extractAIC(model.full)
    aic.full
    bic.full <- extractAIC(model.full, k=log(n))
    bic.full
    aic.1 <- extractAIC(model.1)
    bic.1 <- extractAIC(model.1, k=log(n))
    aic.2 <- extractAIC(model.2)
    bic.2 <- extractAIC(model.2, k=log(n))
    aic.3 <- extractAIC(model.3)
    bic.3 <- extractAIC(model.3, k=log(n))
    aic.4 <- extractAIC(model.4)
    bic.4 <- extractAIC(model.4, k=log(n))
    aic.5 <- extractAIC(model.5)
    bic.5 <- extractAIC(model.5, k=log(n))

Let us store these values of AIC and BIC in vectors. (extractAIC() returns the equivalent degrees of freedom first and the criterion second, hence the [2].)

    vec.aic <- c(aic.full[2], aic.1[2], aic.2[2], aic.3[2], aic.4[2], aic.5[2])
    vec.bic <- c(bic.full[2], bic.1[2], bic.2[2], bic.3[2], bic.4[2], bic.5[2])
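The R^2 values above were typed in by hand, and each AIC and BIC was extracted one model at a time. The same numbers can be collected in one pass. The sketch below does this for stand-in models on mtcars (used here only so the code is self-contained), and also checks the formula extractAIC() uses for lm fits, n*log(RSS/n) + k*edf:

```r
# Stand-in sequence of nested models (mtcars, not the SAT data)
models <- list(
  full = lm(mpg ~ wt + hp + qsec, data = mtcars),
  m1   = lm(mpg ~ wt + hp,        data = mtcars),
  m2   = lm(mpg ~ wt,             data = mtcars)
)
n <- nrow(mtcars)

# One row per model: R^2, adjusted R^2, MSE, AIC, BIC
crit <- data.frame(
  rsquare     = sapply(models, function(m) summary(m)$r.squared),
  rsquare.adj = sapply(models, function(m) summary(m)$adj.r.squared),
  mse         = sapply(models, function(m) sum(resid(m)^2) / df.residual(m)),
  aic         = sapply(models, function(m) extractAIC(m)[2]),
  bic         = sapply(models, function(m) extractAIC(m, k = log(n))[2])
)
crit

# For lm fits, extractAIC() returns c(edf, n*log(RSS/n) + k*edf)
m   <- models$full
rss <- sum(resid(m)^2)
edf <- n - df.residual(m)
by.hand <- n * log(rss / n) + 2 * edf
all.equal(by.hand, extractAIC(m)[2])
```

These values differ from those of AIC() and BIC() by an additive constant that is the same for every model fit to the same data, so the ranking of models is unaffected.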


Now, let us find the PRESS statistic for each model.

    library(DAAG)
    press.full <- press(model.full)
    press.full
    press.1 <- press(model.1)
    press.2 <- press(model.2)
    press.3 <- press(model.3)
    press.4 <- press(model.4)
    press.5 <- press(model.5)
    vec.press <- c(press.full, press.1, press.2, press.3, press.4, press.5)

Now, let us compute CV.

    library(cvTools)
    cv.full <- repCV(model.full, K=10, R=20, seed=723)
    cv.1 <- repCV(model.1, K=10, R=20, seed=723)
    cv.2 <- repCV(model.2, K=10, R=20, seed=723)
    cv.3 <- repCV(model.3, K=10, R=20, seed=723)
    cv.4 <- repCV(model.4, K=10, R=20, seed=723)
    cv.5 <- repCV(model.5, K=10, R=20, seed=723)
    cv.full; cv.1; cv.2; cv.3; cv.4; cv.5

So, we can compute a cv vector from this information.
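For intuition about what press() and repCV() report, both quantities can be computed in base R. This is a sketch on mtcars stand-in data, not the DAAG or cvTools implementation. PRESS uses the leave-one-out identity e_i/(1 - h_ii). The helper kfold.rmse() is a hypothetical name; it does a single K-fold pass and returns the root mean squared prediction error, which I believe is also the default cost in repCV() (treat that as an assumption).

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# PRESS: sum of squared leave-one-out prediction errors,
# obtained from ordinary residuals and hat values without refitting
press.stat <- sum((resid(fit) / (1 - hatvalues(fit)))^2)

# Same quantity by brute force: refit n times, leaving one row out each time
loo.err <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit.i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit.i, newdata = mtcars[i, ])
})
all.equal(press.stat, sum(loo.err^2))

# Single K-fold cross-validation pass (hypothetical helper, base R only)
kfold.rmse <- function(formula, data, K = 10, seed = 723) {
  set.seed(seed)
  # Randomly assign each row to one of K folds of near-equal size
  fold <- sample(rep(seq_len(K), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  err <- numeric(nrow(data))
  for (k in seq_len(K)) {
    test <- fold == k
    fit.k <- lm(formula, data = data[!test, ])   # fit on the other folds
    err[test] <- data[[response]][test] -
      predict(fit.k, newdata = data[test, ])     # predict the held-out fold
  }
  sqrt(mean(err^2))
}

kfold.rmse(mpg ~ wt + hp, mtcars)
kfold.rmse(mpg ~ wt, mtcars)
```

Setting K equal to the number of rows turns the helper into leave-one-out cross-validation, in which case its result equals sqrt(PRESS/n). Repeating the K-fold pass with different random splits and averaging is what the R=20 argument asks repCV() to do.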

    # enter the six CV values printed above, in model order
    cv.vec <- c( , , , , , )

Now, let us construct a table with our results.

    vec.model <- c("full.model", "model.1", "model.2", "model.3", "model.4", "model.5")
    vec.title <- c("model", "R^2", "R^2adj", "AIC", "BIC", "PRESS", "CV")
    # cbind() with a character column coerces every entry to character
    table1 <- cbind(vec.model, rsquare, rsquare.adj, vec.aic, vec.bic, vec.press, cv.vec)
    table.final <- rbind(vec.title, table1)
    table.final

Resulting table is below.

Pondering the various statistics in the table, I would probably pick model.3 (or, if I wanted more simplicity, model.4) as my overall most efficient model among the 6 choices.

Homework [1]: Among the basic data sets of R is the data set called state.x77. Find an optimum reduced model, using life expectancy (Life Exp) as the response variable. A partial picture of the data set information is shown below,

and can be viewed by the following commands in R.

    ?state.x77
    state.x77

Pictures are shown below. Be sure to show descriptives and some (many) of the criteria Dave and I have shown you, as well as give about 50 words concluding/justifying your final model pick.

Homework [2]: Read very carefully the multregtutorial.pdf file, noting the author's use of the following R commands when doing multiple regression:

  glm( ) for generalized linear models (stats package)
  gam( ) for generalized additive models (gam package)
  lme( ) and lmer( ) for linear mixed-effects models (nlme and lme4 packages)
  nls( ) and nlme( ) for nonlinear models (stats and nlme packages)
  various data frame and labeling commands
  par() and lm() command uses
  notes on interactions
  update(. ~ . - factor) command
  anova() and aov() commands and contrasts
  model.int <- update(model5, . ~ .^3) use and meaning
  step() backward, forward, both: uses and weaknesses
  plot(model), plot(model, 1), plot(model, 2), plot(model, 3) etc., displays
  confint(model) displays
  information on extractor functions from the lm() command
  usage of predict(lm.out, list(Solar.R=200, Wind=11, Temp=80, Month=6), interval="conf")
  comments about partial correlations

There is nothing to report here, only things to learn from this tutorial by W. B. King.
