Exploratory Data Analysis September 6, 2005 Exploratory Data Analysis p. 1/16
Some Things to Look for with EDA
- skewness in distributions
- non-constant variability
- nonlinearity
- need for transformations
- outliers
- unknown groups or clusters
Gain insight into the data
Check assumptions for more formal statistical models
Graphical Views
1. Univariate: histograms, density curves, boxplots, quantile-quantile plots
2. Bivariate: scatter plots with trend lines, side-by-side boxplots
3. Several variables: scatter plot matrices, lattice or trellis plots, 3-dimensional plots, dynamic plots
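A minimal sketch of the four univariate views, using only base R graphics. The built-in precip vector (average precipitation for US cities) is a stand-in chosen for illustration; it is not the course data.

```r
# Built-in 'precip' vector used as a stand-in for a course variable.
x <- precip

op <- par(mfrow = c(2, 2))        # arrange the four views in a 2 x 2 grid
hist(x)                           # histogram
plot(density(x))                  # kernel density curve
boxplot(x, horizontal = TRUE)     # boxplot
qqnorm(x); qqline(x)              # normal quantile-quantile plot with reference line
par(op)                           # restore the previous layout
```

Skewness shows up as a long tail in the histogram and density curve, an off-center box in the boxplot, and curvature in the quantile-quantile plot.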
.First() Function
To use the HH code, we need to
1. download the hh files from the course calendar link
2. download the First.R file
3. edit the First.R code to add the path for the hh files
4. install packages for R (abind, lattice, multcomp, mvtnorm): run the GUI version of R and use the "install packages from CRAN" option
5. load the .First function
   > source("First.R")
6. run the function (needed in this session only; if you save your workspace, R runs .First() automatically at startup)
   > .First()
Creating a Dataframe in R
The hh function specifies the path for all HH files
> usair = read.table(hh("datasets/usair.dat"))
> names(usair)
[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7"
> colnames(usair)=c("so2","temp","mfgfirms","popn
Notes:
1. header=FALSE (the default) indicates that the file has no header (variable name) line
2. the names function extracts the variable names of a dataframe
3. colnames can be used to assign more meaningful names
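A minimal sketch of the same steps that runs without the HH files. The inline table, its values, and the three column names are made up for illustration, and the text= argument stands in for the hh("datasets/usair.dat") path.

```r
# Hypothetical stand-in for a headerless data file: three rows, three columns.
raw <- "11 68.0 200
14 60.5 95
9 55.2 410"

# header = FALSE is the default for read.table, so the columns get default names.
dat <- read.table(text = raw)
names(dat)                       # "V1" "V2" "V3"

# Assign more meaningful variable names (illustrative names only).
colnames(dat) <- c("so2", "temp", "mfgfirms")
str(dat)                         # check the names and the column types
```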
Reading Data
read.csv    comma-separated values format
read.fwf    fixed-width format (useful for 4.1!)
read.delim  tab-delimited files
See help(read.table) for options such as setting the character string for NAs, column separators, skipping lines, etc.
See also scan()
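A short sketch of the three readers on made-up inline data, so it runs without any files. The variable names, the values, and the use of -99 as a missing-value code are assumptions for illustration only.

```r
# Made-up data in three layouts; text= and textConnection() stand in for file paths.
csv_txt <- "id,score\n1,4.5\n2,-99\n3,6.1"
tab_txt <- "id\tscore\n1\t4.5\n2\t5.0\n3\t6.1"
fwf_txt <- c("0014.5", "0025.0", "0036.1")   # id in columns 1-3, score in columns 4-6

# Comma separated; treat the code -99 as missing.
csv_dat <- read.csv(text = csv_txt, na.strings = "-99")

# Tab delimited; like read.csv, the first line is read as a header by default.
tab_dat <- read.delim(text = tab_txt)

# Fixed width: no separators, so the field widths must be given.
fwf_dat <- read.fwf(textConnection(fwf_txt), widths = c(3, 3),
                    col.names = c("id", "score"))
```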
Scatter Plots
- bivariate
  plot(x,y)
  plot(y ~ x)       # note the use of a model formula
- all possible pairwise scatter plots
  plot(dataframe)
  pairs(dataframe)
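A runnable sketch of both calling styles. The built-in trees data frame (Girth, Height, Volume) is a stand-in for the course data, and the lowess trend line is one possible choice of smoother.

```r
# Built-in 'trees' data as a stand-in.
x <- trees$Girth
y <- trees$Volume

plot(x, y)                      # x first, then y
plot(y ~ x)                     # model formula: response ~ predictor, same picture
trend <- lowess(x, y)           # smooth trend through the points
lines(trend)                    # add it to the current plot

plot(trees)                     # data frame with 3 columns: all pairwise scatter plots
pairs(trees)                    # the same scatter plot matrix, requested directly
```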
pairs()
pairs(usair)
pairs(usair, panel=panel.smooth)                  # add a smoother to each plot
pairs(so2 ~ ., panel=panel.smooth, data=usair)    # use a model formula
Hartigan's original version of a scatterplot matrix had histograms on the diagonal. We first need to define a function panel.hist for the diagonal panels.
Defining a function
panel.hist = function(x,...) {
  # save the user coordinates and restore them when the function exits
  usr <- par("usr"); on.exit(par(usr = usr))
  # keep the x range, set the y range to [0, 1.5]
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nb <- length(breaks)
  # rescale the counts so the tallest bar has height 1
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nb],0,breaks[-1],y,col="cyan",...)
}
SPLOM with histogram
> pairs(so2 ~ temp + mfgfirms + popn + wind + precip + raindays,
        data=usair, panel=panel.smooth, diag.panel=panel.hist)
> pairs(log(so2) ~ log(temp) + log(mfgfirms) + log(popn) + log(wind) +
        log(precip) + log(raindays),
        data=usair, panel=panel.smooth, diag.panel=panel.hist)
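A self-contained sketch of the same idea that runs without usair. It repeats a compact panel.hist so the block works on its own, and it uses the built-in trees data as a stand-in; the choice of variables is for illustration only.

```r
# Diagonal panel: histogram rescaled so the tallest bar has height 1.
panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr = usr))   # restore user coordinates on exit
  par(usr = c(usr[1:2], 0, 1.5))               # keep x range, set y range to [0, 1.5]
  h <- hist(x, plot = FALSE)
  nb <- length(h$breaks)
  rect(h$breaks[-nb], 0, h$breaks[-1], h$counts / max(h$counts), col = "cyan", ...)
}

# Scatter plot matrix with smoothers off the diagonal and histograms on it.
pairs(~ Girth + Height + Volume, data = trees,
      panel = panel.smooth, diag.panel = panel.hist)

# Same matrix on the log scale, to compare the shapes before and after transforming.
pairs(~ log(Girth) + log(Height) + log(Volume), data = trees,
      panel = panel.smooth, diag.panel = panel.hist)
```

Comparing the two matrices side by side is a quick check on whether a log transformation reduces skewness in the histograms and curvature in the smoothers.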
Trellis Plots
Trellis plots (S-Plus) and Lattice plots in R also create layouts for multiple plots. A trellis of plots is generated as a sequence of plots that are then arranged in rows, columns and pages. The sequence is determined by the conditioning factors in the formula
~ X
Y ~ X
Y ~ X | Z
Y ~ X | Z*W
where Z and W are factors or shingles, Y is on the y-axis, and X is on the x-axis.
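A small sketch of the four formula shapes using lattice. The built-in mtcars data is a stand-in, and treating cyl and am as the conditioning factors Z and W is an assumption made for illustration.

```r
library(lattice)

# Built-in 'mtcars' data; cyl and am are converted to factors for conditioning.
cars2 <- transform(mtcars,
                   cyl = factor(cyl),
                   am  = factor(am, labels = c("automatic", "manual")))

p1 <- densityplot(~ mpg, data = cars2)            # ~ X
p2 <- xyplot(mpg ~ wt, data = cars2)              # Y ~ X
p3 <- xyplot(mpg ~ wt | cyl, data = cars2)        # Y ~ X | Z: one panel per level of cyl
p4 <- xyplot(mpg ~ wt | cyl * am, data = cars2)   # Y ~ X | Z*W: one panel per cyl-by-am combination

# lattice functions return "trellis" objects; print() draws them.
print(p4)
```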
Getting started
library(lattice)
help(lattice)
help(xyplot)
example(xyplot)
Ladder of Powers
The ladder function of HH is built on the lattice package
> ladder(so2 ~ temp, data=usair, main="ladder of Powers for SO2 and Tempe
Explore Box-Cox power transformations of y (and x):
\[ \mathrm{power}(y, p) = \begin{cases} \dfrac{y^{p} - 1}{p} & (p \neq 0) \\ \log(y) & (p = 0) \end{cases} \]
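The power family can be written directly in R as a rough sketch. The name bc_power is illustrative and is not a function from HH; the sample values are made up.

```r
# Box-Cox power transformation; 'bc_power' is an illustrative name, not an HH function.
bc_power <- function(y, p) {
  stopifnot(all(y > 0))                    # defined here for positive y only
  if (p == 0) log(y) else (y^p - 1) / p
}

y <- c(1, 2, 5, 10)                        # made-up positive values
bc_power(y, 1)                             # equals y - 1: a shift, shape unchanged
bc_power(y, 0)                             # equals log(y)
bc_power(y, 1e-6)                          # very close to log(y)
```

Subtracting 1 and dividing by p is what makes log(y) the limiting case as p approaches 0, so the ladder changes smoothly from one power to the next.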
Ladder of Powers with Boxplots and QQ Plots
1. create new function ladder.1d(x) from code in hh/graph/code/graph.f10.r
2. ladder.1d(usair$so2)
[Figure: two rows of panels, "Boxplot with Powers" and "Normal quantiles with Powers", each with one panel per power: y^-1, y^-0.5, y^0, y^0.5, y^1, y^2]
Box-Cox Function
A more formal way to find a power transformation is to use the Box-Cox function
library(MASS) # more formal method to estimate power
boxcox(so2 ~ temp, data=usair)
boxcox(so2 ~ log(temp), data=usair)
boxcox(so2 ~ sqrt(temp), data=usair)
boxcox(so2 ~ log(temp) + log(mfgfirms) + log(popn) + log(wind) + log(precip) + log(raindays), data=usair)
Find the value of the power that maximizes the likelihood of normality
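A sketch of reading the estimate off the boxcox output rather than off the plot. The built-in trees data and the model Volume ~ Girth are stand-ins for usair; the grid of powers is an arbitrary choice.

```r
library(MASS)

# Built-in 'trees' data as a stand-in; plotit = FALSE returns the curve without drawing it.
bc <- boxcox(Volume ~ Girth, data = trees,
             lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# bc$x holds the candidate powers and bc$y the log-likelihood at each one.
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: powers whose log-likelihood is within
# qchisq(0.95, 1) / 2 of the maximum.
in_ci <- bc$y > max(bc$y) - qchisq(0.95, 1) / 2
ci <- range(bc$x[in_ci])

lambda_hat
ci
```

In practice a convenient power inside the interval (such as 0, 0.5, or 1) is often preferred to the exact maximizer, since it is easier to interpret.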
SO2 data
[Figure: log-likelihood plotted against the power λ from -2 to 2, with the 95% interval marked]
- Choose a power near the maximum or in the interval
- Assumes a particular model formulation!