Data Visualization. Andrew Jaffe Instructor

Similar documents
Module 10. Data Visualization. Andrew Jaffe Instructor

Intro to R Graphics Center for Social Science Computation and Research, 2010 Stephanie Lee, Dept of Sociology, University of Washington

An Introduction to R Graphics

INTRODUCTION TO R. Basic Graphics

Graphics - Part III: Basic Graphics Continued

Plotting Complex Figures Using R. Simon Andrews v

data visualization Show the Data Snow Month skimming deep waters

Statistical Programming with R

An Introduction to R 2.2 Statistical graphics

AA BB CC DD EE. Introduction to Graphics in R

Graphics in R Ira Sharenow January 2, 2019

Basics of Plotting Data

Statistics 251: Statistical Methods

Large data. Hadley Wickham. Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University.

Types of Plotting Functions. Managing graphics devices. Further High-level Plotting Functions. The plot() Function

Importing and visualizing data in R. Day 3

An introduction to ggplot: An implementation of the grammar of graphics in R

ggplot2 basics Hadley Wickham Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University September 2011

Analysing Spatial Data in R: Vizualising Spatial Data

Introduction to R for Epidemiologists

Using Built-in Plotting Functions

Introduction to R for Beginners, Level II. Jeon Lee Bio-Informatics Core Facility (BICF), UTSW

User manual forggsubplot

CIND123 Module 6.2 Screen Capture

DSCI 325: Handout 18 Introduction to Graphics in R

Table of Contents. Preface... ix

Stat405. Displaying distributions. Hadley Wickham. Thursday, August 23, 12

R Workshop Module 3: Plotting Data Katherine Thompson Department of Statistics, University of Kentucky

Package EnQuireR. R topics documented: February 19, Type Package Title A package dedicated to questionnaires Version 0.

Exploratory Data Analysis - Part 2 September 8, 2005

Advanced Statistics 1. Lab 11 - Charts for three or more variables. Systems modelling and data analysis 2016/2017

Statistical Programming Camp: An Introduction to R

Az R adatelemzési nyelv

Welcome to Workshop: Making Graphs

Introduction to R: Day 2 September 20, 2017

R Graphics. Paul Murrell. The University of Auckland. R Graphics p.1/47

Package sciplot. February 15, 2013

R Bootcamp Part I (B)

Graphics #1. R Graphics Fundamentals & Scatter Plots

Lab 1 Introduction to R

You submitted this quiz on Sat 17 May :19 AM CEST. You got a score of out of

Visualising Spatial Data UCL Geography R Spatial Workshop Tuesday 9 th November 2010 rspatialtips.org.uk

Introduction to R. UCLA Statistical Consulting Center R Bootcamp. Irina Kukuyeva September 20, 2010

Basic Statistical Graphics in R. Stem and leaf plots 100,100,100,99,98,97,96,94,94,87,83,82,77,75,75,73,71,66,63,55,55,55,51,19

Introduction to R. Biostatistics 615/815 Lecture 23

Practical 2: Plotting

Lab5A - Intro to GGPLOT2 Z.Sang Sept 24, 2018

Introduction to R and the tidyverse. Paolo Crosetto

Advanced Econometric Methods EMET3011/8014

Advanced Graphics with R

plot(seq(0,10,1), seq(0,10,1), main = "the Title", xlim=c(1,20), ylim=c(1,20), col="darkblue");

Mixed models in R using the lme4 package Part 2: Lattice graphics

Stat 849: Plotting responses and covariates

Introduction to Graphics with ggplot2

Stat 849: Plotting responses and covariates

IST 3108 Data Analysis and Graphics Using R Week 9

Package beanplot. R topics documented: February 19, Type Package

Data Visualization in R

A (very) brief introduction to R

Graphing Bivariate Relationships

> glucose = c(81, 85, 93, 93, 99, 76, 75, 84, 78, 84, 81, 82, 89, + 81, 96, 82, 74, 70, 84, 86, 80, 70, 131, 75, 88, 102, 115, + 89, 82, 79, 106)

LAB #1: DESCRIPTIVE STATISTICS WITH R

Working with Charts Stratum.Viewer 6

Statistical transformations

A set of rules describing how to compose a 'vocabulary' into permissible 'sentences'

Package ggextra. April 4, 2018

An introduction to WS 2015/2016

Package lvplot. August 29, 2016

Data visualization with ggplot2

Homework 1 Excel Basics

The following presentation is based on the ggplot2 tutotial written by Prof. Jennifer Bryan.

Organizing and Summarizing Data

Package rafalib. R topics documented: August 29, Version 1.0.0

Package arphit. March 28, 2019

Recap From Last Time: Today s Learning Goals. Today s Learning Goals BIMM 143. Data visualization with R. Lecture 5. Barry Grant

R Visualizing Data. Fall Fall 2016 CS130 - Intro to R 1

Practice for Learning R and Learning Latex

Package waterfall. R topics documented: February 15, Type Package. Version Date Title Waterfall Charts in R

The Basics of Plotting in R

03 - Intro to graphics (with ggplot2)

Graphics in R. Jim Bentley. The following code creates a couple of sample data frames that we will use in our examples.

Lecture 11: Publication Graphics. Eugen Buehler November 28, 2016

Topics for today Input / Output Using data frames Mathematics with vectors and matrices Summary statistics Basic graphics

Install RStudio from - use the standard installation.

Configuring Figure Regions with prepplot Ulrike Grömping 03 April 2018

ggplot2 for beginners Maria Novosolov 1 December, 2014

Package superheat. February 4, 2017

Using the CRM Pivot Tables

PRESENTING DATA. Overview. Some basic things to remember

R Workshop Guide. 1 Some Programming Basics. 1.1 Writing and executing code in R

STAT 540: R: Sections Arithmetic in R. Will perform these on vectors, matrices, arrays as well as on ordinary numbers

Visualization of large multivariate datasets with the tabplot package

Plotting with Rcell (Version 1.2-5)

Introduction Basics Simple Statistics Graphics. Using R for Data Analysis and Graphics. 4. Graphics

CSC 1315! Data Science

Visualizing Data: Customization with ggplot2

A system for statistical analysis. Instructions for installing software. R, R-studio and the R-commander

LECTURE NOTES FOR ECO231 COMPUTER APPLICATIONS I. Part Two. Introduction to R Programming. RStudio. November Written by. N.

GS Analysis of Microarray Data

BIMM-143: INTRODUCTION TO BIOINFORMATICS (Lecture 5)

Transcription:

Module 9 Data Visualization Andrew Jaffe Instructor

Basic Plots We covered some basic plots previously, but we are going to expand the ability to customize these basic graphics first. 2/45

Read in Data > death = read.csv("http://biostat.jhsph.edu/~ajaffe/files/indicatordeadkids35.csv", + as.is=true,header=true, row.names=1) > print(death[1:2, 1:5]) X1760 X1761 X1762 X1763 X1764 Afghanistan NA NA NA NA NA Albania NA NA NA NA NA We see that the column names were years, and R doesn't necessarily like to read in a column name that starts with a number and puts an X there. We'll just take off that X and get the years. year = as.integer(gsub("x","",names(death))) head(year) ## [1] 1760 1761 1762 1763 1764 1765 3/45

Basic Plots > plot(as.numeric(death["sweden",])~year) 4/45

Basic Plots The y-axis label isn't informative, and we can change the label of the y-axis using ylab (xlab for x), and main for the main title/label. > plot(as.numeric(death["sweden",])~year, + ylab="# of deaths per family", main = "Sweden") 5/45

Basic Plots Let's drop any of the projections and keep it to year 2012, and change the points to blue. > plot(as.numeric(death["sweden",])~year, + ylab="# of deaths per family", main = "Sweden", + xlim = c(1760,2012), pch = 19, cex=1.2,col="blue") 6/45

Basic Plots Using scatter.smooth plots the points and runs a loess smoother through the data. > scatter.smooth(as.numeric(death["sweden",])~year,span=0.2, + ylab="# of deaths per family", main = "Sweden",lwd=3, + xlim = c(1760,2012), pch = 19, cex=0.9,col="grey") 7/45

Basic Plots par(mfrow=c(1,2)) tells R that we want to set a parameter (par function) named mfrow (number of plots - 1 row, 2 columns) so we can have 2 plots side by side (Sweden and the UK) > par(mfrow=c(1,2)) > scatter.smooth(as.numeric(death["sweden",])~year,span=0.2, + ylab="# of deaths per family", main = "Sweden",lwd=3, + xlim = c(1760,2012), pch = 19, cex=0.9,col="grey") > scatter.smooth(as.numeric(death["united Kingdom",])~year,span=0.2, + ylab="# of deaths per family", main = "United Kingdom",lwd=3, + xlim = c(1760,2012), pch = 19, cex=0.9,col="grey") 8/45

Basic Plots We can set the y-axis to be the same. > par(mfrow=c(1,2)) > yl = range(death[c("sweden","united Kingdom"),]) > scatter.smooth(as.numeric(death["sweden",])~year,span=0.2,ylim=yl, + ylab="# of deaths per family", main = "Sweden",lwd=3, + xlim = c(1760,2012), pch = 19, cex=0.9,col="grey") > scatter.smooth(as.numeric(death["united Kingdom",])~year,span=0.2, + ylab="", main = "United Kingdom",lwd=3,ylim=yl, + xlim = c(1760,2012), pch = 19, cex=0.9,col="grey") 9/45

Bar Plots Stacked Bar Charts are sometimes wanted to show distributions of data > ## Stacked Bar Charts > cars = read.csv("http://biostat.jhsph.edu/~ajaffe/files/kagglecarauction.csv",as.is=t) > counts <- table(cars$isbadbuy, cars$vehicleage) > barplot(counts, main="car Distribution by Age and Bad Buy Status", + xlab="vehicle Age", col=c("darkblue","red"), + legend = rownames(counts)) 10/45

Bar Plots prop.table allows you to convert a table to proportions (depends on margin - either row percent or column percent) > ## Use percentages (column percentages) > barplot(prop.table(counts, 2), main="car Distribution by Age and Bad Buy Status", + xlab="vehicle Age", col=c("darkblue","red"), + legend = rownames(counts)) 11/45

Bar Plots Using the beside argument in barplot, you can get side-by-side barplots. > # Stacked Bar Plot with Colors and Legend > barplot(counts, main="car Distribution by Age and Bad Buy Status", + xlab="vehicle Age", col=c("darkblue","red"), + legend = rownames(counts), beside=true) 12/45

Graphics parameters Set within most plots in the base 'graphics' package: pch = point shape, http://voteview.com/symbols_pch.htm cex = size/scale xlab, ylab = labels for x and y axes main = plot title lwd = line density col = color cex.axis, cex.lab, cex.main = scaling/sizing for axes marks, axes labels, and title 13/45

Devices By default, R displays plots in a separate panel. From there, you can export the plot to a variety of image file types, or copy it to the clipboard. However, sometimes its very nice to save many plots made at one time to one pdf file, say, for flipping through. Or being more precise with the plot size in the saved file. R has 5 additional graphics devices: bmp(), jpeg(), png(), tiff(), and pdf() The syntax is very similar for all of them: pdf("filename.pdf", width=8, height=8) # inches plot() # plot 1 plot() # plot 2 # etc dev.off() Basically, you are creating a pdf file, and telling R to write any subsequent plots to that file. Once you are done, you turn the device off. Note that failing to turn the device off will create a pdf file that is corrupt, that you cannot open. 14/45

Boxplots, revisited These are one of my favorite plots. They are way more informative than the barchart + antenna... > with(chickweight, boxplot(weight ~ Diet, outline=false)) > points(chickweight$weight ~ jitter(as.numeric(chickweight$diet),0.5)) 15/45

Formulas Formulas have the format of y ~ x and functions taking formulas have a data argument where you pass the data.frame. You don't need to use $ or referencing when using formulas: boxplot(weight ~ Diet, data=chickweight, outline=false) 16/45

ggplot2 ggplot2 is a package of plotting that is very popular and powerful. > library(ggplot2) > qplot(factor(diet), y= weight, data = ChickWeight, geom = "boxplot") 17/45

Boxplots revisited again We can do the same plot, by just saying we want a boxplot and points (and jitter the points) > qplot(factor(diet), y= weight, data = ChickWeight, geom = c("boxplot", "point"), + position = c('identity', "jitter")) 18/45

Histograms again We can do histograms again using hist. Let's do histograms of weight at all time points for the chick's weights. We reiterate how useful these are to show your data. > hist(chickweight$weight, breaks=20) 19/45

Multiple Histograms > qplot(x= weight, fill = Diet, data = ChickWeight, geom = c("histogram")) stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this. 20/45

Multiple Histograms Alpha refers tot he opacity of the color, less is > qplot(x= weight, fill = Diet, data = ChickWeight, geom = c("histogram"), alpha=i(.7)) stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this. 21/45

Multiple Densities We cold also do densities > qplot(x= weight, fill = Diet, data = ChickWeight, geom = c("density"), alpha=i(.7)) 22/45

Multiple Densities > qplot(x= weight, colour = Diet, data = ChickWeight, geom = c("density"), alpha=i(.7)) 23/45

Multiple Densities You can take off the lines of the bottom like this > qplot(x= weight, colour = Diet, data = ChickWeight, geom = c("line"), stat="density") 24/45

Spaghetti plot We can make a spaghetti plot by telling ggplot we want a "line", and each line is colored by Chick. > qplot(x=time, y=weight, colour = Chick, + data = ChickWeight, geom = "line") 25/45

Spaghetti plot: Facets In ggplot2, if you want separate plots for something, these are referred to as facets. > qplot(x=time, y=weight, colour = Chick, facets = ~ Diet, + data = ChickWeight, geom = "line") 26/45

Spaghetti plot: Facets We can turn off the legend (referred to a "guide" in ggplot2). (Note - there is different syntax with the +) > qplot(x=time, y=weight, colour = Chick, facets = ~ Diet, + data = ChickWeight, geom = "line") + guides(colour=false) 27/45

Colors R relies on color 'palettes'. > palette("default") > plot(1:8, 1:8, type="n") > text(1:8, 1:8, lab = palette(), col = 1:8) 28/45

Colors The default color palette is pretty bad, so you can try to make your own. > palette(c("darkred","orange","blue")) > plot(1:3,1:3,col=1:3,pch =19,cex=2) 29/45

Colors It's actually pretty hard to make a good color palette. Luckily, smart and artistic people have spent a lot more time thinking about this. The result is the 'RColorBrewer' package RColorBrewer::display.brewer.all() will show you all of the palettes available. You can even print it out and keep it next to your monitor for reference. The help file for brewer.pal() gives you an idea how to use the package. You can also get a "sneak peek" of these palettes at: www.colorbrewer2.com. You would provide the number of levels or classes of your data, and then the type of data: sequential, diverging, or qualitative. The names of the RColorBrewer palettes are the string after 'pick a color scheme:' 30/45

Colors > palette("default") > with(chickweight, plot(weight ~ Time, pch = 19, col = Diet)) 31/45

> library(rcolorbrewer) > palette(brewer.pal(5,"dark2")) > with(chickweight, plot(weight ~ Time, pch = 19, col = Diet)) 32/45

> library(rcolorbrewer) > palette(brewer.pal(5,"dark2")) > with(chickweight, plot(weight ~ jitter(time,amount=0.2), + pch = 19, col = Diet),xlab="Time") 33/45

Adding legends The legend() command adds a legend to your plot. There are tons of arguments to pass it. x, y=null: this just means you can give (x,y) coordinates, or more commonly just give x, as a character string: "top","bottom","topleft","bottomleft","topright","bottomright". legend: unique character vector, the levels of a factor pch, lwd: if you want points in the legend, give a pch value. if you want lines, give a lwd value. col: give the color for each legend level 34/45

> palette(brewer.pal(5,"dark2")) > with(chickweight, plot(weight ~ jitter(time,amount=0.2), + pch = 19, col = Diet),xlab="Time") > legend("topleft", paste("diet",levels(chickweight$diet)), + col = 1:length(levels(ChickWeight$Diet)), + lwd = 3, ncol = 2) 35/45

Coloring by variable > load("data/charmcirc.rda") > palette(brewer.pal(7,"dark2")) > dd = factor(dat$day) > with(dat, plot(orangeaverage ~ greenaverage, pch=19, col = as.numeric(dd))) > legend("bottomright", levels(dd), col=1:length(dd), pch = 19) 36/45

Coloring by variable > dd = factor(dat$day, levels=c("monday","tuesday","wednesday","thursday", + "Friday","Saturday","Sunday")) > with(dat, plot(orangeaverage ~ greenaverage, pch=19, col = as.numeric(dd))) > legend("bottomright", levels(dd), col=1:length(dd), pch = 19) 37/45

More powerful graphics There are two very common packages for making very nice looking graphics. lattice: http://lmdvr.r-forge.r-project.org/figures/figures.html ggplot2: http://docs.ggplot2.org/current/index.html 38/45

Lattice > library(lattice) > xyplot(weight ~ Time Diet, data = ChickWeight) 39/45

Lattice > densityplot(~weight Diet, data = ChickWeight) 40/45

Lattice > rownames(dat2) = dat2$date > mat = as.matrix(dat2[975:nrow(dat2),3:6]) > levelplot(t(mat), aspect = "fill") 41/45

Lattice > theseq = seq(0,max(mat), by=50) > my.col <- colorramppalette(brewer.pal(5,"greens"))(length(theseq)) > levelplot(t(mat), aspect = "fill",at = theseq,col.regions = my.col,xlab="route",ylab="date") 42/45

Lattice > library(rcolorbrewer) > tmp=death[grep("s$", rownames(death)), 200:251] > yr = gsub("x","",names(tmp)) > theseq = seq(0,max(tmp,na.rm=true), by=0.05) > my.col <- colorramppalette(brewer.pal(5,"reds"))(length(theseq)) > levelplot(t(tmp), aspect = "fill",at = theseq,col.regions = my.col, + scales=list(x=list(label=yr, rot=90, cex=0.7))) 43/45

ggplot2 Useful links: http://docs.ggplot2.org/0.9.3/index.html http://www.cookbook-r.com/graphs/ 44/45

45/45