1 Lecture 2 STATS/CME 195 Matteo Sesia Stanford University Spring 2018

2 Contents The diamonds dataset Visualizing data in R with ggplot2

3 The diamonds dataset

4 The tibble package
The tibble package is part of the core tidyverse.

    library(tidyverse)

Tibbles are data frames, tweaked to make life a little easier. A tibble:
- never changes the type of the inputs (e.g. it does not convert strings to factors)
- never changes the names of variables
- only recycles inputs of length 1
- never creates row.names()

Subsetting is a little different in tibbles: use [[ ]] or $ to extract a column as a vector; single brackets [ ] always return another tibble.
You can read more about these features with vignette("tibble").
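The behaviours listed above can be checked on a small example. This is only a sketch: the toy columns (id, label, group) are made up for illustration and do not come from the lecture.

```r
library(tibble)

# Hypothetical toy data, not from the lecture.
df <- data.frame(id = 1:3, label = c("a", "b", "c"))
tb <- tibble(id = 1:3, label = c("a", "b", "c"), group = "x")  # length-1 input is recycled

class(tb$label)   # strings stay character
tb[["id"]]        # [[ ]] extracts a plain vector
tb[, "id"]        # single brackets still return a tibble
df[, "id"]        # a base data frame drops to a vector here

# Inputs of other lengths are not recycled:
# tibble(id = 1:3, pair = 1:2) would raise an error.
```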

5 The diamonds dataset
Contains prices and other attributes of almost 54,000 diamonds. Included in ggplot2, which is loaded with the tidyverse.

    data(diamonds)
    diamonds
    ## # A tibble: 53,940 x 10
    ##    carat cut   color clarity depth table price     x     y     z
    ##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
    ## ... (10 rows)
    ## # ... with 53,930 more rows

More information with ?diamonds. Spreadsheet view in RStudio with View(diamonds).

6 Introduction to ggplot2

7 The ggplot2 package
The ggplot2 package is part of the core tidyverse.

    library(tidyverse)

It is the most elegant and versatile tool for graphically visualizing data in R, offering a coherent system (or grammar) for building graphs.
R also has some basic built-in graphics, but we do not use them in this course. Compared with base graphics, ggplot2 offers:
- a higher level of abstraction
- plots broken into layers
- beautiful graphics
- excellent documentation
- a large user base

"Base graphics are good for drawing pictures; ggplot2 graphics are good for understanding the data." (Wickham, 2012)

8 Building blocks of a ggplot2 graph
A ggplot2 graph is built up from a few basic elements:

Element     Symbol   Description
Data                 The raw data that you want to plot.
Geometries  geom_    The geometric shapes that will represent the data.
Aesthetics  aes()    Aesthetics of the geometric and statistical objects, such as color, size, shape and position.
Scales      scale_   Maps between the data and the aesthetic dimensions, such as data range to plot width or factor values to colors.

The ggplot() function is used to initialize the basic graph structure. You need to add extra components to generate a graph: specify the different parts of a plot, and add them using the + operator.

9 Creating a ggplot
Create a scatterplot with weight (carat) on the x axis and price on the y axis.

    ggplot(diamonds, aes(x=carat, y=price)) +
      geom_point()

10 Plots as objects
Whenever ggplot() is called, an object is created.

    p <- ggplot(diamonds, aes(x=carat, y=price)) +
      geom_point()
    p

11 Saving plots
Now that you have your beautiful plot, you may want to save it as an image.
ggsave() is a convenient function for saving a plot. By default, it saves the last plot that you displayed, using the size of the current graphics device. It also guesses the type of graphics device from the file extension.

    ggsave(filename, plot = last_plot(), device = NULL, path = NULL,
           scale = 1, width = NA, height = NA, units = c("in", "cm", "mm"),
           dpi = 300, limitsize = TRUE, ...)

The device argument can be either a device function (e.g. png) or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (Windows only).
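As a minimal sketch of a typical call, the code below saves a scatterplot as a PNG file. The file name, the use of a temporary directory, and the size and resolution values are all assumptions chosen for illustration.

```r
library(ggplot2)

# A plot object like the one built on the earlier slides.
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# Hypothetical output location: a temporary directory, so nothing is overwritten.
out_file <- file.path(tempdir(), "diamonds_scatter.png")

# Passing the plot explicitly avoids relying on last_plot();
# the ".png" extension selects the PNG device.
ggsave(out_file, plot = p, width = 6, height = 4, units = "in", dpi = 150)

file.exists(out_file)
```

Giving width and height explicitly makes the saved image independent of the size of whatever graphics window happens to be open.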

12 Aesthetic mappings

13 What are aesthetic mappings?
Aesthetic means "something you can see". Examples include:
- position (i.e., on the x and y axes)
- color ("outside" color)
- fill ("inside" color)
- shape (of points)
- linetype
- size

Each type of geom accepts only a subset of all aesthetics.
You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset.

14 Adding a color aesthetic
We can color the points based on clarity by adding another aesthetic.

    ggplot(diamonds, aes(x=carat, y=price, color=clarity)) +
      geom_point()

15 Adding a color aesthetic (2)
How does the quality of the color affect the price?

    ggplot(diamonds, aes(x=carat, y=price, color=color)) +
      geom_point()

16 Adding a shape aesthetic

    ggplot(diamonds, aes(x=carat, y=price, color=clarity, shape=cut)) +
      geom_point()

17 Facets

18 Facets for a categorical variable
Another way of adding categorical variables is to split your plot into facets.

    ggplot(diamonds, aes(x=carat, y=price, color=clarity)) +
      geom_point() +
      facet_wrap(~ cut)

19 Facets for combinations of variables
You can facet your plot on the combination of two variables.

    ggplot(diamonds, aes(x=carat, y=price)) +
      geom_point() +
      facet_wrap(cut ~ clarity, nrow=5)

20 Exercise
The mpg dataset contains fuel economy data for 38 popular models of car.

    mpg
    ## # A tibble: 3 x 11
    ##   manufacturer model displ  year   cyl trans drv     cty   hwy fl    c
    ##   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <
    ## ... (3 rows)

What plots does the following code make? What does . do?

    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy)) +
      facet_grid(drv ~ .)

    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy)) +
      facet_grid(. ~ cyl)

21 Geometric objects

22 What are geometric objects?
Geometric objects are the marks we put on the plot. Examples:
- points (geom_point, for scatter plots, dot plots, etc.)
- lines (geom_line, for time series; geom_smooth, for trend lines, etc.)
- boxplot (geom_boxplot, for boxplots)

A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator.
For a list of available geometric objects:

    help.search("geom_", package = "ggplot2")
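The line geoms mentioned above can be sketched on a time series. The example assumes the economics dataset that ships with ggplot2; that dataset is not used elsewhere in this lecture and is chosen here only because diamonds has no time variable.

```r
library(ggplot2)

# economics is a monthly US time series bundled with ggplot2; it is used
# here only for illustration and does not appear in the lecture.
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +     # first geom: the raw series
  geom_smooth()     # second geom: a trend line drawn on top
p_line
```

Both geoms inherit the same x and y mappings from ggplot(), which is why neither needs its own aes() call.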

23 Adding a smoothing trend

    ggplot(diamonds, aes(x=carat, y=price)) +
      geom_point() +
      geom_smooth()

The shaded area represents uncertainty in this smoothing curve.

24 Adding a smoothing trend (2)
With a color aesthetic, ggplot will create one smoothing curve for each color.

    ggplot(diamonds, aes(x=carat, y=price, color=clarity)) +
      geom_point() +
      geom_smooth()

25 Changing the smoothing method

    ggplot(diamonds, aes(x=carat, y=price, color=clarity)) +
      geom_point() +
      geom_smooth(method="lm")

Type help(geom_smooth, ggplot2) for more options.

26 Specifying geometric object aesthetics
Aesthetics can also be specified for a single geometric object.

    ggplot(diamonds, aes(x=carat, y=price)) +
      geom_point(aes(color=clarity)) +
      geom_smooth()

27 Aesthetic mappings vs fixed aesthetics
Set the same color and transparency for all observations.

    ggplot(diamonds, aes(x=carat, y=price)) +
      geom_point(color="darkred", alpha=0.2)
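The difference between setting and mapping an aesthetic is easiest to see side by side. The sketch below builds both variants and uses layer_data() to inspect the colors that ggplot2 actually computed; the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat, y = price))

# Fixed aesthetic: set outside aes(), applied literally to every point.
p_fixed <- base + geom_point(color = "darkred", alpha = 0.2)

# Mapped aesthetic: inside aes(), "darkred" is treated as a data value
# (a one-level category), so the color scale chooses the actual color
# and a legend is drawn.
p_mapped <- base + geom_point(aes(color = "darkred"))

unique(layer_data(p_fixed)$colour)   # the literal color that was set
unique(layer_data(p_mapped)$colour)  # a default palette color instead
```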

28 Histograms

    ggplot(diamonds, aes(x=price)) +
      geom_histogram()

29 Customizing histograms
You can specify the number of bins or the bin width.

    ggplot(diamonds, aes(x=price)) +
      geom_histogram(bins=10)
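The slide shows the bins argument; the bin width alternative could look like the sketch below. The value 500 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Bin width given in the units of the x variable (here: 500 US dollars).
p_width <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

# Inspect the bins that were actually computed.
bins <- layer_data(p_width)
head(bins[, c("xmin", "xmax", "count")])
```

Fixing the width rather than the number of bins keeps the bins comparable if the range of the data changes.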

30 Bar charts
A discrete analogue of a histogram is the bar chart. The geom_bar counts the number of instances of each discrete class.

    ggplot(diamonds, aes(x=clarity)) +
      geom_bar()

31 Exercise
Make this plot.

32 Boxplots
Boxplots graphically depict groups of numerical data through their quartiles.

    ggplot(diamonds, aes(x=clarity, y=carat)) +
      geom_boxplot()

33 Position adjustments
Position adjustments are used to adjust the position of each geom. The following position adjustments are available:
- position_identity: default of most geoms
- position_jitter: adds a small amount of random variation
- position_dodge: default of geom_boxplot
- position_stack: default of geom_bar, geom_histogram
- position_fill: useful for geom_bar, geom_histogram

The position parameter can be set as follows:

    geom_point(..., position="jitter")

34 Position adjustments for scatterplots
Overplotting: many points overlap each other. Here the x variable is categorical; with continuous variables, rounding can also cause overplotting.

    p0 <- ggplot(diamonds, aes(x=cut, y=depth))
    p1 <- p0 + geom_point()
    p2 <- p0 + geom_point(position = "jitter")
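The string "jitter" uses default amounts of noise in both directions. When more control is wanted, the position_jitter() function can be passed instead. The width, seed and alpha values below are assumptions picked for illustration.

```r
library(ggplot2)

p0 <- ggplot(diamonds, aes(x = cut, y = depth))

# Jitter horizontally only, so the depth values themselves are not altered.
# width is in x-axis units (categories are 1 unit apart); seed makes the
# random offsets reproducible.
p3 <- p0 + geom_point(
  position = position_jitter(width = 0.2, height = 0, seed = 195),
  alpha = 0.1
)
p3
```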

35 Position adjustments for bar charts
The stacking is performed automatically by the position adjustment specified by the position argument.

    p0 <- ggplot(data = diamonds, aes(x=cut, fill=clarity))
    p1 <- p0 + geom_bar()
    # p2 <- EXERCISE

36 Scales

37 Aesthetic mapping vs variable scaling
aes() assigns an aesthetic to a variable; it doesn't determine how the mapping should be done. For example, aes(shape = x) or aes(color = z) do not specify what shapes or what colors should be used.
To choose colors/shapes/sizes etc. you need to modify the corresponding scale. ggplot2 includes scales for:
- position
- color and fill
- size
- shape
- line type

Scales can be modified with functions of the form:

    scale_<aesthetic>_<type>()

In RStudio, type scale_ followed by TAB to list all available scales.

38 Scales for axes
Square-root transformation on the y-axis:

    p1 <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
    p2 <- p1 + scale_y_sqrt()

39 Scales for shapes

    p1 <- ggplot(diamonds, aes(x=carat, y=price, shape=cut)) + geom_point()
    p2 <- p1 + scale_shape_manual(values = c(1:5))

40 Scales for discrete colors
To choose specific colors for discrete variables we use scale_color_manual().

    p1 <- ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point()
    color.values <- c("red", "orange", "yellow", "green", "blue")
    p2 <- p1 + scale_color_manual(values=color.values)

You can also use predefined ColorBrewer palettes with scale_color_brewer().
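A short sketch of the scale_color_brewer() alternative follows. The choice of the "Set1" palette is an assumption for illustration; any ColorBrewer palette with at least five colors would work for cut.

```r
library(ggplot2)

p1 <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()

# "Set1" is one of the qualitative ColorBrewer palettes; it has enough
# colors for the five levels of cut.
p3 <- p1 + scale_color_brewer(palette = "Set1")
p3
```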

41 Scales for continuous colors
For continuous variables we use scale_color_gradient() and specify the end-points of the color spectrum.

    p1 <- ggplot(diamonds, aes(x=carat, y=price, color=price)) + geom_point()
    # p2 <- EXERCISE

You can also scale the values of the variable corresponding to color:

    scale_color_gradient(low = "blue", high = "red", trans = "log10")

42 Manual transformations
You can also define your own transformations, e.g. position scaling.
Square-root transformation on the y-axis:

    p1 <- ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + scale_y_sqrt()
    p2 <- ggplot(diamonds, aes(x=carat, y=sqrt(price))) + geom_point()

Note that the labels on the y-axis are different: scale_y_sqrt() keeps the labels in the original units of price, while plotting sqrt(price) labels the axis in square-root units.

43 Modify axis, legend, and plot labels
Good labels are critical for making your plots accessible to a wider audience. Some of the most useful ggplot2 functions:
- labs(...)
- xlab(label)
- ylab(label)
- ggtitle(label, subtitle = NULL)

You can even display mathematical formulae, using the function expression() and the syntax of plotmath expressions. Alternatively, the R package latex2exp lets you use LaTeX to typeset math.
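As a small sketch, labs() can set the title, subtitle, axis labels and legend title in a single call. All label texts below are made up for illustration.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  labs(
    title = "Diamond price by weight",      # label text is illustrative
    subtitle = "Each point is one diamond",
    x = "Weight (carat)",
    y = "Price (US dollars)",
    color = "Clarity"                       # legend title for the color aesthetic
  )
p
```

Naming the aesthetic (here color) in labs() is what changes the corresponding legend title.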

44 Axis labels with mathematical expressions
Square-root transformation on the y-axis:

    p1 <- ggplot(diamonds, aes(x=carat, y=sqrt(price))) + geom_point()
    p2 <- p1 + labs(x = "carat", y = expression(sqrt(price)))

45 Statistical transformations

46 What are statistical transformations?
Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:
- Bar charts and histograms bin your data and then plot bin counts.
- Smoothers fit a model to your data and then plot predictions.
- Boxplots compute a robust summary of the distribution and then display it.

The algorithm used to calculate new values for a graph is called a stat.

47 geom vs. stat
You can generally use geoms and stats interchangeably. Every geom has a default stat, and every stat has a default geom.

    p1 <- ggplot(data = diamonds) + geom_bar(aes(x = cut))
    p2 <- ggplot(data = diamonds) + stat_count(aes(x = cut))

48 So why should you know about stat?
You may want to:
- Override the default stat. E.g., in a bar chart with frequencies already in the data set, use stat_identity instead of the default stat_count.
- Override the default mapping from transformed variables to aesthetics. E.g., a bar chart of proportions rather than counts.
- Draw greater attention to the statistical transformation in your code. E.g., stat_summary summarises the y values for each unique x value.

ggplot2 provides over 20 stats for you to use. You can get help in the usual way:

    help.search("stat_", package = "ggplot2")
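The three use cases above might look as follows on the diamonds data. This is a sketch under stated assumptions: the cut_counts table is a helper built here for illustration, and the after_stat() helper and the fun, fun.min and fun.max argument names require ggplot2 3.3.0 or later, which is newer than the version this lecture was written for.

```r
library(ggplot2)

# 1. Frequencies already computed: plot them as-is with stat = "identity".
#    cut_counts is a helper table built here for illustration.
cut_counts <- as.data.frame(table(cut = diamonds$cut))
p_identity <- ggplot(cut_counts, aes(x = cut, y = Freq)) +
  geom_bar(stat = "identity")

# 2. Proportions instead of counts. after_stat() needs ggplot2 >= 3.3.0;
#    group = 1 makes the proportions relative to the whole dataset.
p_prop <- ggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()

# 3. Summarise y for each x: median depth per cut, with min and max
#    (these argument names also need ggplot2 >= 3.3.0).
p_summary <- ggplot(diamonds, aes(x = cut, y = depth)) +
  stat_summary(fun = median, fun.min = min, fun.max = max)
```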

49 The layered grammar of graphics

50 A code template
At this point we have a foundation to make any type of plot.

    ggplot(data = <DATA>) +
      <GEOM_FUNCTION>(
        mapping = aes(<mappings>),
        stat = <STAT>,
        position = <POSITION>
      ) +
      <COORDINATE_FUNCTION> +
      <FACET_FUNCTION>

This template composes the grammar of graphics, a formal system for building plots.
In practice, we don't need all of these parts: ggplot2 will provide useful defaults for everything except:
- the data
- the mappings
- the geom function

We did not discuss coordinate systems: they are a little more complicated. The default is Cartesian: x and y positions act independently to determine the location of each point.
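One possible way to fill in every slot of the template is sketched below. The particular choices (a stacked bar chart of cut, filled by clarity and faceted by color) are assumptions for illustration; the stat, position and coordinate values shown are simply the defaults written out.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +                 # <DATA>
  geom_bar(                                    # <GEOM_FUNCTION>
    mapping = aes(x = cut, fill = clarity),    # <mappings>
    stat = "count",                            # <STAT> (the default for geom_bar)
    position = "stack"                         # <POSITION> (also the default)
  ) +
  coord_cartesian() +                          # <COORDINATE_FUNCTION> (the default)
  facet_wrap(~ color)                          # <FACET_FUNCTION>
p
```

Dropping the stat, position and coord_cartesian() lines gives the same plot, which is the sense in which ggplot2 supplies defaults.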

51 Summary of the workflow
At this point we have developed a general recipe for making any plot.
1. Specify the dataset
2. Transform it into the information that you want to display (stat_)
3. Choose a geometric object to represent each observation in the transformed data (geom_)
4. Select a coordinate system to place the geoms into (default: Cartesian)
5. If needed, extend the plot by adding layers
6. If needed, create multiple plots with facets

Have we missed anything?
- Scales (discussed earlier): from data values to visual properties
- Aesthetics unrelated to the data (later)
- Annotations (omitted): layers that don't inherit global settings from the plot. Used to add fixed reference data to a plot.
- Programming with ggplot2 (omitted): automating the creation of plots

52 Conclusion

53 Learn more about ggplot2
Use the help:

    ?function_name
    help(function_name, package_name)

The online package reference can be easier to search.
There are lots of other online resources. A good starting point:
- Book: R for Data Science, by Garrett Grolemund and Hadley Wickham.

54 Other R packages for plotting
Even ggplot2 cannot literally make any plot. Some kinds of plot require specialized packages.
- plotly: interactive and 3D plots, good for online publications
- gridExtra: easily combine plots into grids
- ggnet2: visualizing networks
- heatmaply: interactive heatmaps
- ggmap: retrieve maps from popular online mapping services and plot them using the ggplot2 framework

More information is available online or, as a starting point, in last year's course material.

55 Common problems
As you start to run R code, you're likely to run into problems. Don't worry - it happens to everyone.
Common things to check:
- Pair every ( with ) and every opening " with a closing "
- If a line ends with +, R waits for you to complete the expression
- + has to go at the end of the line, not the beginning

Do not be afraid to use the help:

    ?function_name
    help(function_name, package_name)

Carefully read any error messages. Another great tool is Google: try looking up the error message.
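The rule about where + goes can be illustrated with a short sketch. The failing variant is left commented out so that the block still runs.

```r
library(ggplot2)

# Correct: each line ends with +, so R knows the expression continues.
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# Incorrect (left commented out): the first line is already a complete
# expression, so R evaluates it on its own and then fails on the line
# that starts with +.
# ggplot(diamonds, aes(x = carat, y = price))
#   + geom_point()
```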

56 Next time Importing data from files
