Exploratory Data Analysis September 6, 2005

Exploratory Data Analysis September 6, 2005 Exploratory Data Analysis p. 1/16

Some Things to Look for with EDA
- skewness in distributions
- non-constant variability
- nonlinearity
- need for transformations
- outliers
- unknown groups or clusters
Gain insight into the data
Check assumptions for more formal statistical models

Graphical Views
1. Univariate: histograms, density curves, boxplots, quantile-quantile plots
2. Bivariate: scatter plots with trend lines, side-by-side boxplots
3. Several variables: scatter plot matrices, lattice or trellis plots, 3-dimensional plots, dynamic plots
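The univariate views listed above can be sketched with base R graphics. This is only an illustration: the built-in airquality data is a stand-in (an assumption, since the usair data used in these slides is not bundled with R).

```r
# Stand-in data (assumption): Ozone readings from the built-in airquality data,
# with missing values dropped.
x <- airquality$Ozone[!is.na(airquality$Ozone)]

op <- par(mfrow = c(2, 2))            # 2 x 2 grid of plots
hist(x, main = "Histogram")
plot(density(x), main = "Density curve")
boxplot(x, horizontal = TRUE, main = "Boxplot")
qqnorm(x); qqline(x)                  # normal quantile-quantile plot with reference line
par(op)                               # restore the previous layout
```

A long right tail in the histogram and upward curvature in the quantile-quantile plot are the kind of skewness that suggests a transformation.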

.First() Function
To use the HH code, we need to
1. download the hh files from the course calendar link
2. download the First.R file
3. edit the First.R code to add the path for the hh files
4. install packages for R (abind, lattice, multcomp, mvtnorm): run the GUI version of R and use the "install packages from CRAN" option
5. load the .First function
> source("First.R")
6. run the function (needed for this session only if you save your workspace, since R then calls .First() automatically at startup)
> .First()

Creating a Dataframe in R
The hh function specifies the path for all HH files
> usair = read.table(hh("datasets/usair.dat"))
> names(usair)
[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7"
colnames(usair)=c("so2","temp","mfgfirms","popn
Notes:
1. header=F (the default) indicates that the file has no header (variable name) line
2. the names function extracts the variable names of a dataframe
3. colnames can be used to assign more meaningful names
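The read-then-rename steps can be tried without the HH files by reading from a text string. In this sketch the three data rows are made-up values, usair_demo is a hypothetical name, and the seven column names are an assumption inferred from the formulas used on the later slides.

```r
# Three made-up rows in a 7-column layout (illustrative values only, not real data).
txt <- "10 70.3 213 582 6.0  7.05  36
13 61.0  91 132 8.2 48.52 100
12 56.7 453 716 8.7 20.66  67"

usair_demo <- read.table(text = txt)   # header = FALSE is the default
names(usair_demo)                      # default names "V1" ... "V7"

# Assumed names, taken from the model formulas on the later slides.
colnames(usair_demo) <- c("so2", "temp", "mfgfirms", "popn",
                          "wind", "precip", "raindays")
str(usair_demo)
```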

Reading Data
read.csv    Comma separated values format
read.fwf    Fixed width format (useful for 4.1!)
read.delim  Tab delimited files
See help(read.table) for options, such as setting the character string for NAs, column separators, skipping lines, etc.
See also scan()
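A small sketch of the three readers and of the read.table options mentioned above. All input here is made up for illustration, and the object names (csv, tab, fwf, tbl) are hypothetical.

```r
# read.csv: comma separated, first line is a header by default
csv <- read.csv(text = "city,so2\nA,10\nB,NA")

# read.delim: tab delimited, first line is a header by default
tab <- read.delim(text = "city\tso2\nA\t10\nB\t13")

# read.fwf: fixed-width fields; widths gives the number of characters per field
fwf_file <- tempfile()
writeLines(c("A  10", "B  13"), fwf_file)
fwf <- read.fwf(fwf_file, widths = c(1, 4), col.names = c("city", "so2"))

# read.table options: skip a leading line, set the separator,
# and declare which string marks a missing value
tbl <- read.table(text = "some title line\nA;10\nB;-99",
                  skip = 1, sep = ";", na.strings = "-99",
                  col.names = c("city", "so2"))
```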

Scatter Plots
bivariate
  plot(x, y)
  plot(y ~ x)    Note use of model formula
all-possible pairwise scatter plots
  plot(dataframe)
  pairs(dataframe)
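As a rough illustration of these calls, the sketch below uses the built-in mtcars data as a stand-in (an assumption; any numeric data frame would do) and adds a lowess trend line to the bivariate plot.

```r
# Two equivalent ways to draw the same scatter plot
plot(mtcars$wt, mtcars$mpg)            # plot(x, y)
plot(mpg ~ wt, data = mtcars)          # plot(y ~ x), model formula interface

# Add a smooth trend line to the current plot
trend <- lowess(mtcars$wt, mtcars$mpg)
lines(trend, col = "red")

# All pairwise scatter plots: plot() on a data frame dispatches to pairs()
vars <- mtcars[, c("mpg", "wt", "hp")]
plot(vars)
pairs(vars)
```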

pairs()
pairs(usair)
pairs(usair, panel=panel.smooth)    Add a smoother to each plot
pairs(so2 ~ ., panel=panel.smooth, data=usair)    use a model formula
Hartigan's original version of a scatterplot matrix had histograms on the diagonal. We need to first define a function panel.hist for the diagonal panels.

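Defining a function
panel.hist = function(x, ...) {
  # save the panel's user coordinates and restore them on exit
  usr <- par("usr"); on.exit(par(usr = usr))
  # keep the x range, rescale y to 0-1.5 so the bars leave room for the label
  par(usr = c(usr[1:2], 0, 1.5) )
  # compute the histogram without drawing it
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nb <- length(breaks)
  # scale the counts so the tallest bar has height 1
  y <- h$counts; y <- y/max(y)
  # draw one rectangle per bin
  rect(breaks[-nb], 0, breaks[-1], y, col="cyan", ...)
}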

SPLOM with histogram
> pairs(so2 ~ temp + mfgfirms + popn + wind + precip + raindays, data=usair, panel=panel.smooth, diag.panel=panel.hist)
> pairs(log(so2) ~ log(temp) + log(mfgfirms) + log(popn) + log(wind) + log(precip) + log(raindays), data=usair, panel=panel.smooth, diag.panel=panel.hist)
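The same scatterplot matrix with histograms on the diagonal can be tried on data that ships with R. This sketch assumes the built-in airquality data as a stand-in for usair, and it defines its own copy of a panel.hist-style function so that it runs by itself.

```r
# Diagonal panel: histogram scaled to fit inside the panel
panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr = usr))   # restore coordinates on exit
  par(usr = c(usr[1:2], 0, 1.5))               # same x range, y from 0 to 1.5
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nb <- length(breaks)
  y <- h$counts / max(h$counts)                # tallest bar has height 1
  rect(breaks[-nb], 0, breaks[-1], y, col = "cyan", ...)
}

# Stand-in data (assumption): complete cases of four airquality variables
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

# Raw scale, then log scale, each with smoothers off the diagonal
pairs(Ozone ~ Solar.R + Wind + Temp, data = aq,
      panel = panel.smooth, diag.panel = panel.hist)
pairs(log(Ozone) ~ log(Solar.R) + log(Wind) + log(Temp), data = aq,
      panel = panel.smooth, diag.panel = panel.hist)
```

Comparing the two matrices shows whether logging pulls in the skewed histograms and straightens the smoothers.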

Trellis Plots
Trellis plots (S-Plus) and Lattice plots in R also create layouts for multiple plots. A trellis of plots is generated as a sequence of plots that are then arranged in rows, columns and pages. The sequence is determined by the conditioning factors in the formula
  ~ X
  Y ~ X
  Y ~ X | Z
  Y ~ X | Z*W
where Z and W are factors or shingles, Y is on the y-axis, and X is on the x-axis
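The four formula shapes can be illustrated with the lattice package. The built-in mtcars data is a stand-in (an assumption), and the object names p1 to p4 and hp_sh are hypothetical.

```r
library(lattice)

# ~ X : univariate display of one variable
p1 <- densityplot(~ mpg, data = mtcars)

# Y ~ X | Z : one panel per level of a conditioning factor
p2 <- xyplot(mpg ~ wt | factor(cyl), data = mtcars)

# A shingle: overlapping intervals of a continuous variable used for conditioning
hp_sh <- equal.count(mtcars$hp, number = 3, overlap = 0.25)
p3 <- xyplot(mpg ~ wt | hp_sh, data = mtcars)

# Y ~ X | Z*W : panels for each combination of two conditioning factors
p4 <- xyplot(mpg ~ wt | factor(cyl) * factor(am), data = mtcars)

# Lattice functions return objects; printing them draws the plots
print(p1); print(p2); print(p3); print(p4)
```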

Getting started
library(lattice)
help(lattice)
help(xyplot)
example(xyplot)

Ladder of Powers
The ladder function of HH is built on the lattice package
> ladder(so2 ~ temp, data=usair, main="ladder of Powers for SO2 and Tempe
Explore Box-Cox power transformations of y (and x):
\[
\mathrm{power}(y, p) =
\begin{cases}
\dfrac{y^{p} - 1}{p} & (p \neq 0) \\
\log(y) & (p = 0)
\end{cases}
\]
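The power transformation defined above can be written directly as a short R function. This is a minimal sketch of the formula on the slide, not the HH source code. The name power follows the slide, and the test vector y is made up.

```r
# Box-Cox power transformation: (y^p - 1) / p for p != 0, log(y) for p == 0
power <- function(y, p) {
  if (p == 0) log(y) else (y^p - 1) / p
}

y <- c(1, 2, 4, 8)
power(y, 1)      # y - 1: a shift only, shape unchanged
power(y, 0.5)    # square-root scale
power(y, 0)      # log scale
power(y, -1)     # reciprocal scale, 1 - 1/y, so the order of y is preserved
```

Dividing by p is what makes the family continuous: as p approaches 0, (y^p - 1)/p approaches log(y).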

Ladder of Powers with Boxplots and QQ Plots
1. create new function ladder.1d(x) from code in hh/graph/code/graph.f10.r
2. ladder.1d(usair$so2)
[Figure: two rows of panels, "Boxplot with Powers" and "Normal quantiles with Powers", each with one panel per power: y^-1, y^-0.5, y^0, y^0.5, y^1, y^2]
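A display of this kind can be roughed out in base graphics without the HH code. This is a stand-in, not ladder.1d itself. The helper ladder_value is hypothetical, the airquality data is an assumed substitute for usair$so2, and negating the negative powers is an assumption chosen so that every panel keeps the ordering of the original data.

```r
powers <- c(-1, -0.5, 0, 0.5, 1, 2)

# Simple power ladder: log(y) at p = 0, -y^p for negative p, y^p otherwise
ladder_value <- function(y, p) {
  if (p == 0) log(y) else sign(p) * y^p
}

# Stand-in data (assumption): Ozone readings with missing values dropped
y <- airquality$Ozone[!is.na(airquality$Ozone)]

op <- par(mfrow = c(2, length(powers)), mar = c(3, 3, 2, 1))
for (p in powers) {                        # top row: boxplots
  boxplot(ladder_value(y, p), main = paste0("y^", p))
}
for (p in powers) {                        # bottom row: normal quantile plots
  z <- ladder_value(y, p)
  qqnorm(z, main = paste0("y^", p)); qqline(z)
}
par(op)
```

The power whose boxplot looks most symmetric and whose quantile plot is closest to a straight line is a reasonable candidate transformation.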

Box-Cox Function
A more formal way to find a power transformation is to use the Box-Cox function
library(MASS)    # more formal method to estimate power
boxcox(so2 ~ temp, data=usair)
boxcox(so2 ~ log(temp), data=usair)
boxcox(so2 ~ sqrt(temp), data=usair)
boxcox(so2 ~ log(temp) + log(mfgfirms) + log(popn) + log(wind) + log(precip) + log(raindays), data=usair)
Find the value of the power that maximizes the likelihood of normality
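Besides drawing the profile log-likelihood, boxcox returns the grid of powers and their log-likelihoods, so the maximizing power and an approximate 95% interval can be read off numerically. The sketch below assumes the built-in airquality data as a stand-in for usair. The names bc, lambda_hat and ci are hypothetical.

```r
library(MASS)

# Stand-in data (assumption): complete cases of two airquality variables
aq <- na.omit(airquality[, c("Ozone", "Temp")])

# Profile log-likelihood over a grid of powers; also draws the plot
bc <- boxcox(Ozone ~ Temp, data = aq, lambda = seq(-2, 2, by = 0.05))

# Power with the largest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: powers whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])

lambda_hat
ci
```

In practice a convenient power inside the interval (such as 0, 0.5 or 1) is often preferred to the exact maximizer, and the answer depends on the model formula supplied.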

SO2 data
[Figure: Box-Cox profile log-likelihood plotted against the power, for powers from -2 to 2, with the 95% level marked]
Choose a power near the maximum or within the interval
Assumes a particular model formulation!