Stat405. Displaying distributions. Hadley Wickham. Thursday, August 23, 12

Similar documents
Large data. Hadley Wickham. Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University.

ggplot2 basics Hadley Wickham Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University September 2011

Statistical transformations

Ggplot2 QMMA. Emanuele Taufer. 2/19/2018 Ggplot2 (1)

ggplot2: elegant graphics for data analysis

Lecture 4: Data Visualization I

The diamonds dataset Visualizing data in R with ggplot2

An Introduction to R Graphics

Introductory Tutorial: Part 1 Describing Data

Plotting with Rcell (Version 1.2-5)

Graphical critique & theory. Hadley Wickham

Introduction to Graphics with ggplot2

Facets and Continuous graphs

03 - Intro to graphics (with ggplot2)

Visualization of large multivariate datasets with the tabplot package

Pragmatic R for Biologists 10/22/10

Getting started with ggplot2

A set of rules describing how to compose a 'vocabulary' into permissible 'sentences'

An introduction to ggplot: An implementation of the grammar of graphics in R

User manual forggsubplot

Package ggsubplot. February 15, 2013

Intro to R for Epidemiologists

Visualizing Data: Freq. Tables, Histograms

ggplot2 for beginners Maria Novosolov 1 December, 2014

Intoduction to data analysis with R

Lab 4A: Foundations for statistical inference - Sampling distributions

CSC 1315! Data Science

Ch. 1.4 Histograms & Stem-&-Leaf Plots

Data Visualization Using R & ggplot2. Karthik Ram October 6, 2013

LondonR: Introduction to ggplot2. Nick Howlett Data Scientist

MATH 117 Statistical Methods for Management I Chapter Two

Data Visualization. Andrew Jaffe Instructor

The following presentation is based on the ggplot2 tutotial written by Prof. Jennifer Bryan.

1 The ggplot2 workflow

Introduction to Minitab 1

Data Visualization in R

Stat 849: Plotting responses and covariates

Stat 849: Plotting responses and covariates

EXCEL SKILLS. Selecting Cells: Step 1: Click and drag to select the cells you want.

data visualization Show the Data Snow Month skimming deep waters

3. Solve the following. Round to the nearest thousandth.

Making Science Graphs and Interpreting Data

Foundations for statistical inference - Sampling distribu- tions

Styles, Style Sheets, the Box Model and Liquid Layout

Chapter 2: Graphical Summaries of Data 2.1 Graphical Summaries for Qualitative Data. Frequency: Frequency distribution:

Statistics Lecture 6. Looking at data one variable

Econ 2148, spring 2019 Data visualization

Chapter 2: Descriptive Statistics (Part 1)

Visualizing ASH. John Beresniewicz NoCOUG 2018

Designing an Energy Drink Can Teacher Materials

Rstudio GGPLOT2. Preparations. The first plot: Hello world! W2018 RENR690 Zihaohan Sang

How to create a theograph. A step-by-step guide

Install RStudio from - use the standard installation.

Combo Charts. Chapter 145. Introduction. Data Structure. Procedure Options

Orientation Assignment for Statistics Software (nothing to hand in) Mary Parker,

2. The histogram. class limits class boundaries frequency cumulative frequency

Introduction to R and the tidyverse. Paolo Crosetto

Frequency Distributions and Graphs

Introduction to Data Visualization

Homework 8. ` ``{r my-small-plot, fig.cap="small plot", fig.width=2, fig.height=2} ggplot(...) + geom_point() +... ` ``

Lecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 2.1- #

Welcome to class! Put your Create Your Own Survey into the inbox. Sign into Edgenuity. Begin to work on the NC-Math I material.

Our Changing Forests Level 2 Graphing Exercises (Google Sheets)

Data Visualization in R

Package gggenes. R topics documented: November 7, Title Draw Gene Arrow Maps in 'ggplot2' Version 0.3.2

1.2. Pictorial and Tabular Methods in Descriptive Statistics

Data Presentation using Excel

MKTG 460 Winter 2019 Solutions #1

1 Introduction to Using Excel Spreadsheets

Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis

Data 100. Lecture 5: Data Cleaning & Exploratory Data Analysis

CPSC 535 Assignment 1: Introduction to Matlab and Octave

NOTES TO CONSIDER BEFORE ATTEMPTING EX 1A TYPES OF DATA

Data input & output. Hadley Wickham. Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University.

Maps & layers. Hadley Wickham. Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University.

2.1: Frequency Distributions

Section 2-2 Frequency Distributions. Copyright 2010, 2007, 2004 Pearson Education, Inc

Lesson 18-1 Lesson Lesson 18-1 Lesson Lesson 18-2 Lesson 18-2

a. divided by the. 1) Always round!! a) Even if class width comes out to a, go up one.

Practical 4: The Integrate & Fire neuron

Properties of Data. Digging into Data: Jordan Boyd-Graber. University of Maryland. February 11, 2013

Statistics 251: Statistical Methods

Figure 1: The PMG GUI on startup

Information Visualization. SWE 432, Fall 2016 Design and Implementation of Software for the Web

Statistical Tables and Graphs

Features List. Updated on 27-September-2018 (Hiddime-1.8.3)

Chapter 2 Assignment (due Thursday, April 19)

Package lemon. September 12, 2017

D-SCOPE The indispensable analysis tool for all diamonds

Stat405. More about data. Hadley Wickham. Tuesday, September 11, 12

Survey of Math: Excel Spreadsheet Guide (for Excel 2016) Page 1 of 9

Area and Perimeter EXPERIMENT. How are the area and perimeter of a rectangle related? You probably know the formulas by heart:

Importing and visualizing data in R. Day 3

Visualizing Data: Customization with ggplot2

Package ggrepel. September 30, 2017

Chapter 2: Understanding Data Distributions with Tables and Graphs

1 Introduction. 1.1 What is Statistics?

Intro to Stata for Political Scientists

Chapter 2 Assignment (due Thursday, October 5)

Understanding prototype fidelity What is Digital Prototyping? Introduction to various digital prototyping tools

Transcription:

Stat405 Displaying distributions Hadley Wickham

1. The diamonds data 2. Histograms and bar charts 3. Homework

Diamonds

Diamonds data ~54,000 round diamonds from http://www.diamondse.info/ Carat, colour, clarity, cut Total depth, table, depth, width, height Price

x table width z depth = z / diameter table = table width / x * 100

Recall Write down five ways to inspect the diamonds dataset. You have one minute!

Histogram & bar charts

Histograms and barcharts Used to display the distribution of a variable Categorical variable bar chart Continuous variable histogram

# With only one variable, qplot guesses that # you want a bar chart or histogram qplot(cut, data = diamonds) qplot(carat, data = diamonds) # Change binwidth: qplot(carat, data = diamonds, binwidth = 1) qplot(carat, data = diamonds, binwidth = 0.1) qplot(carat, data = diamonds, binwidth = 0.01) resolution(diamonds$carat) last_plot() + xlim(0, 3)

Always experiment with the bin width!

qplot(table, data = diamonds, binwidth = 1) # To zoom in on a plot region use xlim() and ylim() qplot(table, data = diamonds, binwidth = 1) + xlim(50, 70) qplot(table, data = diamonds, binwidth = 0.1) + xlim(50, 70) qplot(table, data = diamonds, binwidth = 0.1) + xlim(50, 70) + ylim(0, 50) # Note that this type of zooming discards data # outside of the plot regions. See #?coord_cartesian() for an alternative

Additional variables As with scatterplots can use aesthetics or faceting. Using aesthetics creates pretty, but ineffective, plots. The following examples show the difference, when investigation the relationship between cut and depth.

4000 3000 count 2000 1000 0 56 58 60 62 64 66 68 70 qplot(depth, data = diamonds, depth binwidth = 0.2)

4000 3000 count 2000 cut Fair Good Very Good Premium Ideal 1000 0 qplot(depth, data = diamonds, binwidth = 0.2, 56 58 60 62 64 66 68 70 fill = cut) + xlim(55, depth 70)

4000 3000 count 2000 cut Fair Good Very Good Premium Ideal 1000 Fill is the aesthetic 0 for fill colour qplot(depth, data = diamonds, binwidth = 0.2, 56 58 60 62 64 66 68 70 fill = cut) + xlim(55, depth 70)

Fair Good Very Good 2500 2000 1500 1000 500 count 0 Premium Ideal 2500 2000 1500 1000 500 0 qplot(depth, 56 58 60 62 data 64 66= 68diamonds, 70 56 58 60 binwidth 62 64 66 68 = 70 0.2) 56 58 + 60 62 64 66 68 70 xlim(55, 70) + facet_wrap(~ depth cut)

Your turn Explore the distribution of price. What is a good binwidth to use? (Hint: How many bins will a binwidth of 1 give you?) Practice zooming in on regions of interest. How does price vary with colour, cut, or clarity?

Fair Good Very Good 6000 5000 4000 3000 2000 1000 count 0 6000 Premium Ideal 5000 4000 3000 2000 1000 0 0 5000 10000 15000 0 5000 10000 15000 0 5000 10000 15000 qplot(price, data = diamonds, binwidth price = 500) + facet_wrap(~ cut)

Fair Good Very Good 6000 5000 4000 3000 2000 1000 0 count 6000 5000 4000 3000 2000 1000 0 Premium Ideal What makes it difficult to compare the shape of the distributions? Brainstorm for 1 minute. 0 5000 10000 15000 0 5000 10000 15000 0 5000 10000 15000 qplot(price, data = diamonds, binwidth price = 500) + facet_wrap(~ cut)

Problems Each histogram far away from the others, but we know stacking is hard to read use another way of displaying densities Varying relative abundance makes comparisons difficult rescale to ensure constant area

# Large distances make comparisons hard qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut) # Stacked heights hard to compare qplot(price, data = diamonds, binwidth = 500, fill = cut) # Much better - but still have differing relative abundance qplot(price, data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut) # Instead of displaying count on y-axis, display density #.. indicates that variable isn't in original data qplot(price,..density.., data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut) # To use with histogram, you need to be explicit qplot(price,..density.., data = diamonds, binwidth = 500, geom = "histogram") + facet_wrap(~ cut)

Aside: coding strategy At the end of each interactive session, you want a summary of everything you did. Two options: 1. Copy from the history panel. 2. Build up the important bits as you go. (recommended)

Homework

Asking questions You have two minutes to write down as many interesting questions as you can about the diamonds data. Share them with your neighbour. Try not to make them too simple (what s the biggest diamond) or too complex.

Homework Create three plots that reveal something interesting about the diamonds or mpg data. Throwing out 90% of the data can be a valid strategy. Include code, plot and paragraph of text. For one plot, show the process of iteration by which you reached the final answer.

Grading You will be graded out of five on three dispositions of a good data analyst: curiosity, scepticism and organisation. The same rubric will be used throughout the semester, so it is very hard to get a 4 or 5 on your first homework.

Common mistakes Too much in one plot. Broad and shallow instead of deep. Failed to cite resources used if you get an image from the web, provide the url. Incorrect aspect ratio. Failed to deal with overplotting.