Data Visualization. Module 7

Similar documents
Facets and Continuous graphs

Statistical transformations

An Introduction to R Graphics

PRESENTING DATA. Overview. Some basic things to remember

Getting started with ggplot2

Plotting with Rcell (Version 1.2-5)

Intro to R for Epidemiologists

Data Visualization Using R & ggplot2. Karthik Ram October 6, 2013

Session 3 Nick Hathaway;

The following presentation is based on the ggplot2 tutotial written by Prof. Jennifer Bryan.

Lab5A - Intro to GGPLOT2 Z.Sang Sept 24, 2018

# Call plot plot(gg)

social data science Data Visualization Sebastian Barfort August 08, 2016 University of Copenhagen Department of Economics 1/86

DATA VISUALIZATION WITH GGPLOT2. Coordinates

The diamonds dataset Visualizing data in R with ggplot2

8. MINITAB COMMANDS WEEK-BY-WEEK

Introduction to R and the tidyverse. Paolo Crosetto

Data Import and Formatting

Introduction to Graphics with ggplot2

Data visualization with ggplot2

Graphical critique & theory. Hadley Wickham

ggplot2 for beginners Maria Novosolov 1 December, 2014

ggplot in 3 easy steps (maybe 2 easy steps)

Package ggmosaic. February 9, 2017

Introduction to Data Visualization

Package ggsubplot. February 15, 2013

User manual forggsubplot

Analysis of Variance. Module 8

The first thing we ll need is some numbers. I m going to use the set of times and drug concentration levels in a patient s bloodstream given below.

Data Visualization in R

Data Handling: Import, Cleaning and Visualisation

03 - Intro to graphics (with ggplot2)

Session 5 Nick Hathaway;

The Average and SD in R

Creating elegant graphics in R with ggplot2

Lecture 4: Data Visualization I

1 The ggplot2 workflow

ggplot2 basics Hadley Wickham Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University September 2011

Package reghelper. April 8, 2017

Data Visualization in R

Lecture 09. Graphics::ggplot I R Teaching Team. October 1, 2018

An introduction to ggplot: An implementation of the grammar of graphics in R

Large data. Hadley Wickham. Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University.

Econ 2148, spring 2019 Data visualization

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression

Maps & layers. Hadley Wickham. Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University.

Install RStudio from - use the standard installation.

QUIZ 4 (CHAPTER 17) SOLUTIONS MATH 252 FALL 2008 KUNIYUKI SCORED OUT OF 125 POINTS MULTIPLIED BY % POSSIBLE

EXST 7014, Lab 1: Review of R Programming Basics and Simple Linear Regression

Importing and visualizing data in R. Day 3

Package gggenes. R topics documented: November 7, Title Draw Gene Arrow Maps in 'ggplot2' Version 0.3.2

Mathematics of Data. INFO-4604, Applied Machine Learning University of Colorado Boulder. September 5, 2017 Prof. Michael Paul

x 2 + 3, r 4(x) = x2 1

Bar Charts and Frequency Distributions

Homework 1 Excel Basics

Organizing and Summarizing Data

6 Using Technology Wisely

Rstudio GGPLOT2. Preparations. The first plot: Hello world! W2018 RENR690 Zihaohan Sang

Learning Packet THIS BOX FOR INSTRUCTOR GRADING USE ONLY. Mini-Lesson is complete and information presented is as found on media links (0 5 pts)

Introduction to R. Andy Grogan-Kaylor October 22, Contents

1 Introduction to Using Excel Spreadsheets

A set of rules describing how to compose a 'vocabulary' into permissible 'sentences'

Lab 4. Recall, from last lab the commands Table[], ListPlot[], DiscretePlot[], and Limit[]. Go ahead and review them now you'll be using them soon.

An introduction to plotting data

Adding CSS to your HTML

Object-Oriented Thinking

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Automation.

Hints for Exercise 4: Recursion

Rockefeller College MPA Excel Workshop: Clinton Impeachment Data Example

Bivariate Linear Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 04, 2017

Learning Objectives for Data Concept and Visualization

Data Science. Data Analyst. Data Scientist. Data Architect

Week 7: The normal distribution and sample means

Visualizing the World

6.001 Notes: Section 6.1

Introducing Categorical Data/Variables (pp )

The Normal Curve. June 20, Bryan T. Karazsia, M.A.

Hi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides

6.001 Notes: Section 15.1

New and Improved Formula Editor in JMP 13

PROFESSOR: Last time, we took a look at an explicit control evaluator for Lisp, and that bridged the gap between

Data analysis case study using R for readily available data set using any one machine learning Algorithm

Lesson 8 Introduction to Quadratic Functions

Stat 849: Plotting responses and covariates

Survey of Math: Excel Spreadsheet Guide (for Excel 2016) Page 1 of 9

Activity Guide APIs and Using Functions with Parameters

Excel Tips and FAQs - MS 2010

Stat 849: Plotting responses and covariates

XGBoost: The Art and Science of Communicating Machine Learning Algorithms. Amy Szadziewska, Peak 6 th February 2018

Color. Today. part 2. How to Read a Research Paper Components of a Well-written Research Paper 3 Readings for Today

GRAPHING BAYOUSIDE CLASSROOM DATA

S4C03, HW2: model formulas, contrasts, ggplot2, and basic GLMs

Stat405. Displaying distributions. Hadley Wickham. Thursday, August 23, 12

Section Graphs and Lines

Data Manipulation. Module 5

An Introduction to R. Ed D. J. Berry 9th January 2017

LAST UPDATED: October 16, 2012 DISTRIBUTIONS PSYC 3031 INTERMEDIATE STATISTICS LABORATORY. J. Elder

Hash Tables (Cont'd) Carlos Moreno uwaterloo.ca EIT

Software Testing Prof. Meenakshi D Souza Department of Computer Science and Engineering International Institute of Information Technology, Bangalore

Transcription:

Data Visualization http://datascience.tntlab.org Module 7

Today s Agenda A Brief Reminder to Update your Software A walkthrough of ggplot2 Big picture New cheatsheet, with some familiar caveats Geometric Objects Aesthetics Statistics Coordinates Facets Themes

Before We Begin We've now been using R for many weeks New versions of R have been released since we started Any time you install a new library that throws a version warning, update. Go back to the R website and re-download After installing R, redownload R Studio After installing R Studio, update your packages using update.packages(ask=f) If you find packages disappear/stop working, trace through error messages and reinstall using install.packages() 3

Big Picture on Visualization with ggplot2 Visualization is arguably the single most important step in data analysis Allows you to check assumptions quickly and intuitively Quickly communicate to other people on your project what you've learned Does the same for stakeholders ggplot2, like the rest of tidyverse, tries to guess what you are trying to do and glosses over some of the details Base-R does no glossing, which is why we're focusing on ggplot2 ggplot2 syntax allows you to very quickly try out multiple visualizations quickly, and keeps a record of precisely where those visualizations came from Remember the data pipeline in R scripting 4

What Does ggplot2 Get Us? ggplot2 handles a lot of the "under the hood" mathematics that you would normally need to do in R Remember that to run base-r's barplot(), we had to use the table() command first to create a table of frequencies Try to create a basic barplot with both ggplot() and barplot() Default visualization settings with ggplot2 are much nicer looking than base- R, and it's much easier to modify those settings It also gets us to think like data scientists in terms of visualization, an area where social science is typically horrible 5

ggplot2's Grammar of Graphics Each ggplot is a base ggplot object, defined with ggplot(). That object contains its own aesthetics plus geometric objects (geoms) that also have their own aesthetics, by default inherited from higher parent objects. If you want a geom to have varying properties, this needs to be specified in a parent object. If you want all instances of a geom to be the same, specify it within the geom. More specific always overrides less specific. Step 1: Create a ggplot object, specify data, and define its fundamental aesthetics (e.g., which axis is which) Step 2: Add geometric objects you want (geoms) and their own aesthetics Use + to combine features. For multiline coding, end with + Call a ggplot object directly to display it. 6

Getting Data into a ggplot2-friendly Format Although just about any structure of your data may be functional, you typically want one variable of values for each DV plus grouping factors For univariate statistics Isolate your outcome values as a single column If your data are longitudinal, restructure so that all DV measurements are in a single variable Isolate your grouping variables as unique columns If your data are split by condition, convert raw condition codes into their main effect representations (e.g., in a between-subjects 2x2, conditions 1 2 3 4 should be converted into two new factor variables with descriptive values). For multivariate statistics Isolate each DV as a column Still isolate your grouping variables as unique columns Always convert to correct types (numeric, factor, etc) at before starting to code a ggplot() 7

General Aesthetics of a ggplot Common Aesthetics (but there are more) x and y, your axes color, which is line color fill, which is fill color size, which is marker size shape, which is marker type weight, which is line thickness alpha, which refers to "alpha channel", i.e., transparency linetype, which is line pattern labels, which refers to any text Different geoms inherit different aesthetics, so decide on your geoms first Notice that this is the opposite order taken when writing code Can be helpful to sketch it on paper first Within a geom, aesthetics may be attributes instead (no aes() function) 8

New Cheatsheet! https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2- cheatsheet.pdf 9

Defining Specific Colors Web colors (the basis for many color systems) are traditionally coded in hexadecimal (i.e., base-16) 0 1 2 3 4 5 6 7 8 9 A B C D E F So, 00 = 0, 10 = 16, 20 = 32, FF = 255 Windows has a helpful calculator for this. Colors are represented as three channels of 8-bit additive color marked with # called a hex triplet: Red, Green, Blue (RGB) #000000 = 0% R, 0% G, 0% B, so black #FF0000 = 100% R, 0% G, 0% B, so red #C0C0C0 = 75% R, 75% G, 75% B, so 25% grey #FF00FF = 100% R, 0% G, 100% B, so fuchsia http://htmlcolorcodes.com/ can be helpful for finding specific colors But remember colors can also be referred to by name, as strings. 10

Scale Aesthetics scale functions change the appearance of the x or y scale, and are followed by continuous, discrete, date, datetime, log10, time, or sqrt scale_x_ scale_y_ scale_color_ scale_fill_ scale_shape_ scale_linetype_ Further scale modifiers, which all take vectors; but you usually want to work in the coordinates layer instead if you need these, since they literally slice apart the figure rather than just the part of the figure you see limits breaks expand 11

Other Useful Aesthetics jitter is useful for exploratory plots but should generally not be used in published papers unless you are working with "big data" and indicate what you're doing very clearly position is useful to reposition pieces of data (as a group) within a figure, e.g., to switch from a simple bar to a stacked bar stack, which is position by stacking dodge, which is position by offset fill, which rescales values as proportions 12

Reusing Aesthetics Don't do this ggplot(my_df, aes(x=var1, y=var2)) + geom_point(col="#abcdef", size=4, shape=3) ggplot(my_df, aes(x=var1, y=var3)) + geom_point(col="#abcdef", size=4, shape=3) ggplot(my_df, aes(x=var1, y=var4)) + geom_point(col="#abcdef", size=4, shape=3) ggplot(my_df, aes(x=var1, y=var5)) + geom_point(col="#abcdef", size=4, shape=3) Maybe do this my_geom_point <- geom_point(col="#abcdef", size=4, shape=3) ggplot(my_df, aes(x=var1, y=var2)) + my_geom_point ggplot(my_df, aes(x=var1, y=var3)) + my_geom_point ggplot(my_df, aes(x=var1, y=var4)) + my_geom_point ggplot(my_df, aes(x=var1, y=var5)) + my_geom_point Better yet, restructure your data using gather() from dplyr and use facet_grid() 13

Statistics Layer Peeks under the hood a bit of ggplot2 You will not often need to dip into the statistics layer, but you might need it for specific projects Examples: adding regions of significance to a regression scatterplot or adding a specific type of error bar to a graph You can generally do things like this in two ways Calculating the equivalent statistics yourself as new variables and plotting those variables Adding a stat_ layer: stat_summary() for summarizing y data given x, stat_function() for predicting y from x, and stat_qq() for Q-Q plots 14

Deconstructing stat_summary() Calls ggplot(iris, aes(x = Species, y = Sepal.Length)) + stat_summary(fun.y = mean, geom = "point") + stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1) What is happening here? stat_summary() summarizes observed y given x Thus, these stat_summary() summarize Sepal.Length given values of Species In the first, simple means are displayed as points for the mean of Species (y) In the second, we need to provide a range of 3 values for geom_errorbar() defined as x, ymin and ymax (see?geom_errorbar) A function from the Hmisc package, smean.sdl() will do this for us, but they won't be named what we need (see difference between smean.sdl() and mean_sdl() output) mean_sdl() essentially just renames the elements of the output list smean.sdl() defaults to +/- 2 SDs but can be overridden with the mult argument Because we're inside another function, the arguments must be passed as a list 15

Coordinates Layer Many different coordinate systems but you will rarely need anything other than a Cartesian coordinate system coord_cartesian(xlim = c(1,2), ylim = c(3,4)) Specifies limits for axes Data science default ideology is to display all data with as much detail as possible, common to exploratory approaches Very useful for us, e.g., to specify a 1 5 range of y for a Likert-type scale Might be able to accomplish the same thing with coord_equal() and some attributes 16

Facets Layer Create multiple figures with the same aesthetics and geoms from a single base figure Uses formula notation, which we'll deal with in much more detail in the next few weeks y ~ x1 + x2 == "univariate linear model of y with predictors x1 and x2" y ~. == "model of y with all other variables in the data frame as predictors" In the case of facets, formula notation specifies how the splitting is done rowvar ~.. ~ columnvar rowvar ~ columnvar # same order as df/matrix So, for example: ggplot(mydf, aes(x = condition, y = score)) + facet_grid(. ~ gender) 17

Themes Layer Use themes to control text and shape formatting after the figure is finalized; use aesthetics first if possible Change formatting of parts of a figure element_text(), element_line(), element_rect(), and element_blank() (which is really transparent) Look at?theme for the full list Best practice is to define a theme separately and then apply it, e.g., mytheme <- theme( legend.background = element_rect(fill="#543215"), plot.title = element_text(size = 4)) ggplot(mydf, aes(x=condition, y=value)) + theme(mytheme) 18

Margins, Spacing and Positioning Margins, spacing and positioning are typically done by distance with either "units" or presets (like legend.position = "right") What a "unit" appears as visually depends on your coordinates layer; a ggplot unit is defined as "one axis length from origin", so c(.8,.8) == 80% up each axis Units can be arbitrarily created or converted via other functions (like unit()) This is done because R figures are vector graphics instead of raster graphics (original image by Freepik) 19

Quick Plots Uses ggplot2 aesthetics but in a single command and with a lot of defaults set for you You just need a "data=" somewhere to specify the dataset Not really needed and you probably shouldn't use it It was designed to ease the transition between base-r's plot() and the ggplot() grammar HOWEVER ggpairs() is a quick and dirty visualization tool which you should definitely use from the GGally library You may see references to a function called plotmatrix but this has been dropped from ggplot2 20