Linear model in R. Valeska Andreozzi


Linear model in R

Valeska Andreozzi
valeska.andreozzi at fc.ul.pt
Centro de Estatística e Aplicações da Universidade de Lisboa
Faculdade de Ciências da Universidade de Lisboa
Lisbon, 2012

Contents

1 Pearson correlation
2 Spearman correlation
3 Linear model
  3.1 Example
  3.2 Fitting the linear model
  3.3 Formula
  3.4 Summary
  3.5 Confidence interval
  3.6 Model comparison
  3.7 Covariate selection
  3.8 Residual analysis
  3.9 Prediction
  3.10 Effect plots
  3.11 Extracting values

1 Pearson correlation

> library(ISwR)
> data(thuesen)
> View(thuesen)
> plot(thuesen)
> cor(thuesen$blood.glucose, thuesen$short.velocity)
[1] NA
> cor(thuesen$blood.glucose, thuesen$short.velocity, use = "complete.obs")
[1]

The first call returns NA because thuesen contains a missing value; use = "complete.obs" restricts the computation to the complete pairs.

[Figure: scatterplot of short.velocity against blood.glucose]

Another example

> ?longley
> library(car)
> scatterplotMatrix(longley)
> cor(longley)

[Figure: scatterplot matrix of the longley variables GNP.deflator, GNP, Unemployed, Armed.Forces, Population, Year and Employed]
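The effect of use = "complete.obs" can be checked by hand. The sketch below is only an illustration: it assumes the built-in airquality data set as a stand-in for thuesen (so that it runs without the ISwR package), and the object names are hypothetical.

```r
# Built-in data with missing values; an assumption for illustration,
# the notes themselves use thuesen from the ISwR package.
x <- airquality$Ozone   # contains NA
y <- airquality$Temp    # complete

r_default  <- cor(x, y)                        # NA: missing values propagate
r_complete <- cor(x, y, use = "complete.obs")  # only rows where both are observed

# Same computation after dropping the incomplete pairs by hand
ok <- complete.cases(x, y)
r_manual <- cor(x[ok], y[ok])

print(r_default)
print(all.equal(r_complete, r_manual))
```

Dropping incomplete pairs is convenient, but it silently changes the sample size, so it is worth checking sum(ok) before interpreting the coefficient.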

Output of cor(longley): the 7 x 7 correlation matrix with rows and columns GNP.deflator, GNP, Unemployed, Armed.Forces, Population, Year and Employed.

2 Spearman correlation

> cor(thuesen$blood.glucose, thuesen$short.velocity,
+     use = "complete.obs", method = "spearman")
[1]

3 Linear model

Online references

3.1 Example

To identify factors associated with birth weight, researchers collected the information listed in the table of variables below. Use these data to fit a multiple linear regression and address the objective of the study.

> bp <- read.table("lowbwtdata.dat", header = TRUE)
> dim(bp)
[1]
> names(bp) <- tolower(names(bp))
> head(bp)

Description                                             Codes/Values                      Variable
Identification code                                     ID number                         ID
Low birth weight                                        1 = BWT <= 2500 g,                LOW
                                                        0 = BWT > 2500 g
Age of mother                                           Years                             AGE
Weight of mother at last menstrual period               Pounds                            LWT
Race                                                    1 = White, 2 = Black, 3 = Other   RACE
Smoking status during pregnancy                         0 = No, 1 = Yes                   SMOKE
History of premature labor                              0, 1, 2, ...                      PTL
History of hypertension                                 0 = No, 1 = Yes                   HT
Presence of uterine irritability                        0 = No, 1 = Yes                   UI
Number of physician visits during the first trimester   0, 1, 2, ...                      FTV
Birth weight                                            Grams                             BWT

Columns shown by head(bp):
  id low age lwt race smoke ptl ht ui ftv bwt

Telling R that the variables are categorical

> bp$race <- factor(bp$race)
> bp$smoke <- factor(bp$smoke)
> bp$ht <- factor(bp$ht)
> bp$ui <- factor(bp$ui)
> bp$low <- factor(bp$low)

To find out which level of each factor is the reference category:

> contrasts(bp$race)
> contrasts(bp$smoke)

> contrasts(bp$ht)
> contrasts(bp$ui)
> contrasts(bp$low)

Changing the scale of the response variable to kg

> bp$bwt <- bp$bwt/1000

Summary of the data

> summary(bp)
       id          low        age             lwt          race    smoke
 Min.   :  4.0   0:130   Min.   :14.00   Min.   :         1:96    0:115
 1st Qu.:        1: 59   1st Qu.:        1st Qu.:         2:26    1: 74
 Median :123.0           Median :23.00   Median :         3:67
 Mean   :121.1           Mean   :23.24   Mean   :
 3rd Qu.:                3rd Qu.:        3rd Qu.:140.0
 Max.   :226.0           Max.   :45.00   Max.   :250.0
      ptl        ht       ui        ftv           bwt
 Min.   :      0:177   0:161   Min.   :      Min.   :
 1st Qu.:      1: 12   1: 28   1st Qu.:      1st Qu.:2.414
 Median :                      Median :      Median :2.977
 Mean   :                      Mean   :      Mean   :
 3rd Qu.:                      3rd Qu.:      3rd Qu.:3.475
 Max.   :                      Max.   :      Max.   :4.990

> as.data.frame(table(bp$ptl))

  Var1 Freq
> as.data.frame(table(bp$ftv))
  Var1 Freq

3.2 Fitting the linear model

> bp.lm1 <- lm(bwt ~ age + lwt + race + ftv, data = bp)
> bp.lm1

Call:
lm(formula = bwt ~ age + lwt + race + ftv, data = bp)

Coefficients:
(Intercept)  age  lwt  race2  race3  ftv

3.3 Formula

+    includes main effects: A + B
:    includes interactions: A:B
*    includes main effects and interactions: A*B = A + B + A:B
I()  includes arithmetic terms: I(A^2)

Examples

> fit1 <- lm(bwt ~ age * race, data = bp)
> fit1

Call:
lm(formula = bwt ~ age * race, data = bp)

Coefficients:
(Intercept)  age  race2  race3  age:race2  age:race3
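The formula operators listed above can be inspected without fitting anything, by asking R which terms a formula expands to. This is a small sketch on a made-up data frame; d, A, B and labels_of are hypothetical names used only for illustration.

```r
# Toy data frame (hypothetical), only used to build design matrices
d <- data.frame(y = c(1, 2, 3, 4, 5, 6),
                A = c(1, 2, 3, 4, 5, 6),
                B = factor(c("a", "b", "a", "b", "a", "b")))

# The "term.labels" attribute lists the terms a formula expands to
labels_of <- function(f) attr(terms(f), "term.labels")

print(labels_of(y ~ A + B))       # main effects only
print(labels_of(y ~ A:B))         # interaction only
print(labels_of(y ~ A * B))       # main effects plus interaction
print(labels_of(y ~ A + I(A^2)))  # arithmetic protected by I()

# A * B and A + B + A:B describe the same model
same <- identical(labels_of(y ~ A * B), labels_of(y ~ A + B + A:B))
print(same)

# Columns of the design matrix that lm() would build for A * B
print(colnames(model.matrix(y ~ A * B, data = d)))
```

Without I(), the caret in A^2 is read as a formula operator (crossing to the given degree) rather than as arithmetic, which is why the squared term has to be wrapped.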

> fit2 <- lm(bwt ~ age + I(age^2), data = bp)
> fit2

Call:
lm(formula = bwt ~ age + I(age^2), data = bp)

Coefficients:
(Intercept)  age  I(age^2)

3.4 Summary

> summary(bp.lm1)

Call:
lm(formula = bwt ~ age + lwt + race + ftv, data = bp)

Residuals:
Min  1Q  Median  3Q  Max

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                     e-13  ***
age
lwt                                                   *
race2                                                 **
race3                                                 *
ftv
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 183 degrees of freedom
Multiple R-squared:  ,  Adjusted R-squared:
F-statistic:  on 5 and 183 DF,  p-value:

3.5 Confidence interval

> confint(bp.lm1)
             2.5 %  97.5 %
(Intercept)
age
lwt
race2
race3
ftv
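For a linear model, the limits returned by confint() are the estimate plus or minus a t quantile times the standard error. The sketch below reproduces them by hand; it assumes the built-in trees data set instead of the birth-weight data, and the object names are illustrative.

```r
# Stand-in model on built-in data (an assumption; not the birth-weight model)
fit_trees <- lm(Volume ~ Girth + Height, data = trees)

est   <- coef(fit_trees)                         # point estimates
se    <- sqrt(diag(vcov(fit_trees)))             # standard errors
tcrit <- qt(0.975, df = df.residual(fit_trees))  # t quantile for a 95% interval

ci_manual  <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
ci_builtin <- confint(fit_trees)

print(ci_manual)
print(all.equal(unname(ci_manual), unname(ci_builtin)))
```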

3.6 Model comparison

> fit0 <- lm(bwt ~ age + race, data = bp)
> fit1 <- lm(bwt ~ age * race, data = bp)
> anova(fit0, fit1, test = "F")
Analysis of Variance Table

Model 1: bwt ~ age + race
Model 2: bwt ~ age * race
  Res.Df  RSS  Df  Sum of Sq  F  Pr(>F)

3.7 Covariate selection

Stepwise procedure

> bw.mod1 <- glm(bwt ~ age + lwt + race + smoke + ht + ftv, data = bp)
> summary(bw.mod1)

Call:
glm(formula = bwt ~ age + lwt + race + smoke + ht + ftv, data = bp)

Deviance Residuals:
Min  1Q  Median  3Q  Max

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                     e-15  ***
age
lwt                                                   **
race2                                                 **
race3                                                 **
smoke1                                                ***
ht1                                                   *
ftv
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be )

    Null deviance:  on 188 degrees of freedom
Residual deviance:  on 181 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 2
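The model above is fitted with glm() rather than lm(). With the default gaussian family the two functions estimate the same model, which the following sketch checks on the built-in trees data (a stand-in chosen so the code runs without the birth-weight file; the object names are illustrative).

```r
fit_lm  <- lm(Volume ~ Girth + Height, data = trees)
fit_glm <- glm(Volume ~ Girth + Height, data = trees)  # family = gaussian by default

# Same coefficients
same_coef <- all.equal(coef(fit_lm), coef(fit_glm))

# The residual deviance of the gaussian glm is the residual sum of squares
same_rss <- all.equal(deviance(fit_glm), sum(residuals(fit_lm)^2))

# The dispersion parameter is the squared residual standard error
same_disp <- all.equal(summary(fit_glm)$dispersion, summary(fit_lm)$sigma^2)

print(c(same_coef, same_rss, same_disp))
```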

> mod.both <- step(bw.mod1, direction = "both")
Start:  AIC=
bwt ~ age + lwt + race + smoke + ht + ftv

         Df  Deviance  AIC
- ftv
- age
<none>
- ht
- lwt
- smoke
- race

Step:  AIC=
bwt ~ age + lwt + race + smoke + ht

         Df  Deviance  AIC
- age
<none>
+ ftv
- ht
- lwt
- smoke
- race

Step:  AIC=
bwt ~ lwt + race + smoke + ht

         Df  Deviance  AIC
<none>
+ age
+ ftv
- ht
- lwt
- smoke
- race

> mod.both

Call:  glm(formula = bwt ~ lwt + race + smoke + ht, data = bp)

Coefficients:
(Intercept)  lwt  race2  race3  smoke1  ht1

Degrees of Freedom: 188 Total (i.e. Null);  183 Residual
Null Deviance:
Residual Deviance:        AIC:

Backward procedure

> mod.back <- step(bw.mod1, direction = "backward")
Start:  AIC=
bwt ~ age + lwt + race + smoke + ht + ftv

         Df  Deviance  AIC
- ftv
- age
<none>
- ht
- lwt
- smoke
- race

Step:  AIC=
bwt ~ age + lwt + race + smoke + ht

         Df  Deviance  AIC
- age
<none>
- ht
- lwt
- smoke
- race

Step:  AIC=
bwt ~ lwt + race + smoke + ht

         Df  Deviance  AIC
<none>
- ht
- lwt
- smoke
- race

> mod.back

Call:  glm(formula = bwt ~ lwt + race + smoke + ht, data = bp)

Coefficients:
(Intercept)  lwt  race2  race3  smoke1  ht1

Degrees of Freedom: 188 Total (i.e. Null);  183 Residual
Null Deviance:
Residual Deviance:        AIC:
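Each table printed by step() lists the AIC of the models reachable by changing one term, and the search moves to the candidate with the smallest AIC until no change improves it. A minimal sketch of one backward step, assuming the built-in mtcars data and an arbitrary set of covariates chosen only for illustration:

```r
# Hypothetical full model on built-in data (not the birth-weight model)
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# One backward step by hand: AIC after deleting each single term
print(drop1(full))

# The automatic search; trace = 0 suppresses the step-by-step tables
sel <- step(full, direction = "backward", trace = 0)
print(formula(sel))

# step() only moves to models with a smaller AIC, as measured by extractAIC()
print(extractAIC(full)[2])
print(extractAIC(sel)[2])
```

For lm objects the AIC reported by step() comes from extractAIC(), which differs from AIC() by an additive constant, so only differences between models are comparable.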

Forward procedure

> bw.nulo <- glm(bwt ~ 1, data = bp)
> mod.forw <- step(bw.nulo, scope = list(upper = ~ age + lwt + race + smoke + ht + ftv),
+                  direction = "forward")
Start:  AIC=
bwt ~ 1

         Df  Deviance  AIC
+ race
+ smoke
+ lwt
+ ht
<none>
+ age
+ ftv

Step:  AIC=
bwt ~ race

         Df  Deviance  AIC
+ smoke
+ lwt
+ ht
<none>
+ age
+ ftv

Step:  AIC=
bwt ~ race + smoke

         Df  Deviance  AIC
+ lwt
+ ht
<none>
+ ftv
+ age

Step:  AIC=
bwt ~ race + smoke + lwt

         Df  Deviance  AIC
+ ht
<none>
+ age
+ ftv

Step:  AIC=

bwt ~ race + smoke + lwt + ht

         Df  Deviance  AIC
<none>
+ age
+ ftv

> mod.forw

Call:  glm(formula = bwt ~ race + smoke + lwt + ht, data = bp)

Coefficients:
(Intercept)  race2  race3  smoke1  lwt  ht1

Degrees of Freedom: 188 Total (i.e. Null);  183 Residual
Null Deviance:
Residual Deviance:        AIC:

3.8 Residual analysis

Computing the residuals

> res <- rstandard(mod.both, type = "deviance")
> layout(matrix(c(1, 2, 3, 4), 2, 2))
> plot(mod.both)

3.9 Prediction

Consider the model

> fit <- lm(bwt ~ lwt + race + smoke + ht, data = bp)
> summary(fit)

Call:
lm(formula = bwt ~ lwt + race + smoke + ht, data = bp)

Residuals:
Min  1Q  Median  3Q  Max

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  < 2e-16  ***

lwt                                                   **
race2                                                 **
race3                                                 **
smoke1                                                ***
ht1                                                   *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 183 degrees of freedom
Multiple R-squared:  ,  Adjusted R-squared:
F-statistic:  on 5 and 183 DF,  p-value: 9.859e-07

Consider a new observation with the following characteristics: lwt = 170, race = 3, smoke = 1, ht = 1. What is the expected birth weight of a child whose mother has these characteristics?

> new <- data.frame(lwt = 170, race = "3", smoke = "1", ht = "1")
> predict(fit, new, se.fit = TRUE)
$fit

$se.fit
[1]

$df
[1] 183

$residual.scale
[1]

Obtain a 95% prediction interval for this new child.

> pred.w.plim <- predict(fit, new, interval = "prediction")
> pred.w.plim
  fit  lwr  upr
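The components returned by predict(..., se.fit = TRUE) are enough to rebuild the prediction interval: its half-width combines the standard error of the fitted mean (se.fit) with the residual standard deviation (residual.scale). The sketch below assumes the built-in cars data and a hypothetical new observation, so it runs without the birth-weight file.

```r
# Stand-in model and new observation (assumptions for illustration)
fit_cars <- lm(dist ~ speed, data = cars)
new_car  <- data.frame(speed = 15)

p     <- predict(fit_cars, new_car, se.fit = TRUE)
tcrit <- qt(0.975, df = p$df)

# Uncertainty of the mean plus the variability of a single new response
half_width <- tcrit * sqrt(p$se.fit^2 + p$residual.scale^2)
pi_manual  <- c(fit = unname(p$fit),
                lwr = unname(p$fit - half_width),
                upr = unname(p$fit + half_width))

pi_builtin <- predict(fit_cars, new_car, interval = "prediction")

print(pi_manual)
print(all.equal(unname(pi_manual), unname(pi_builtin[1, ])))
```

Dropping the residual.scale term from the square root gives the narrower interval for the mean response, which is what interval = "confidence" reports.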

Obtain a 95% confidence interval for the mean birth weight of children whose mothers have these same characteristics.

> pred.w.clim <- predict(fit, new, interval = "confidence")
> pred.w.clim
  fit  lwr  upr

3.10 Effect plots

> library(effects)
> plot(effect("lwt", fit))

[Figure: lwt effect plot (bwt against lwt)]

> plot(effect("race", fit))

[Figure: race effect plot (bwt against race)]

> plot(allEffects(fit), ask = FALSE)

[Figure: effect plots for lwt, race, smoke and ht, each with bwt on the vertical axis]

3.11 Extracting values

> fitted(fit)        # fitted values
> coefficients(fit)  # model coefficients
> names(fit)         # names of the components of the fit object
> is.list(fit)
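Because a fitted model is a list, the extractor functions above mostly return components that can also be reached with $. A short sketch, assuming the built-in cars data as a stand-in for the birth-weight model:

```r
fit_cars <- lm(dist ~ speed, data = cars)

print(is.list(fit_cars))  # an lm object is a list
print(names(fit_cars))    # components available through $

# Extractor functions and direct access give the same values
same_coef <- all.equal(coef(fit_cars), fit_cars$coefficients)
same_fit  <- all.equal(fitted(fit_cars), fit_cars$fitted.values)

# Residuals are the observed response minus the fitted values
same_res <- all.equal(unname(residuals(fit_cars)),
                      unname(cars$dist - fitted(fit_cars)))

print(c(same_coef, same_fit, same_res))
```

The extractor functions are usually preferable to $ because they also work for other model classes, where the internal component names may differ.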
