Homework 8


Once again, update agoldst/litdata.

Chunk options (small exercise)

Use the Homework template. Your first exercise is to modify the setup chunk to load dplyr and tidyr.

It is time to be a little more sophisticated about those code chunks. Every chunk can have a name, which can be anything you want (but no spaces):

```{r my-chunk-name}
...
```

You can also include chunk options, which work exactly like named parameters to a function in R. You've already encountered some of these, like

```{r eval=F}
...
```

for a chunk which is formatted to look like code but not actually executed. If you name your chunk, then you follow the name with a comma and then include options, separated by commas:

```{r chunk-name, opt1=val1, opt2=val2, ...}
...
```

You can also apply chunk options to all the chunks in the document by including a first chunk that calls the knitr::opts_chunk$set function with named parameters. This is exactly what I have been doing all along in your homeworks. The comprehensive chunk option reference is yihui.name/knitr/options/.

Now that you are including plots, you should know about three figure-related chunk options: fig.cap, fig.width, and fig.height. The first gives the figure caption; the latter two give the dimensions of the figure in inches. If you don't like the way any particular figure looks, by all means change its dimensions:

```{r my-small-plot, fig.cap="small plot", fig.width=2, fig.height=2}
ggplot(...) + geom_point() + ...
```

By default, figures in PDF files "float," which means they might be typeset before or after the place in the text where you write the code to create them, as the software attempts to create aesthetically pleasing pages (really!). Altering this behavior is a topic for another day, so for now just remember to include captions that make clear which of your plots go with which problems.[1]

[1] Your figures are automatically numbered, and if you want to mention the figure number for a chunk called my-chunk, you can enter \ref{fig:my-chunk} in your text (not your code; this is a LaTeX macro).
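The figure options can also be set once for the whole document, in the same way as any other chunk option. Here is a minimal sketch, assuming you wanted every figure to default to 4 by 3 inches; those dimensions are illustrative values of my own, not the settings used in the homework template.

```r
# Hypothetical document-wide figure defaults (the 4 x 3 inch size is an
# assumption for illustration). A chunk that sets its own fig.width or
# fig.height overrides these values for that chunk only.
knitr::opts_chunk$set(fig.width=4, fig.height=3)
```

A call like this belongs in the first chunk of the document, alongside the other opts_chunk$set parameters.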

Visualization and descriptive statistics

As we have seen, it is a bit more straightforward to visualize quantitative data than categorical data. So, although economic data is a whole other kettle of fish, let's just work a little with the prices of the poetry translations from the Three Percent database, filtering out any Price entries that are NA or not formatted as a price:

```r
txl <- read.csv("three-percent.csv", as.is=T, encoding="UTF-8") %>%
    filter(year < 2015) %>%
    rename(language=lanuage)

txl_priced <- txl %>%
    filter(!is.na(price)) %>%
    filter(str_detect(price, "^\\d*\\.\\d\\d$")) %>%
    mutate(price=as.numeric(price))

po <- txl_priced %>%
    filter(genre == "Poetry")
```

Univariate descriptive statistics

Central tendencies

What is the typical list price? There are three common ways to characterize the central tendency of single-variable data. The most important is the mean or average. If you have data $x_i$, $i = 1 \ldots n$, the mean is

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Calculate this in R using the mean function. The mean is sensitive to large outliers (take a look at the optional trim parameter).

The median is the half-way value or 50th percentile of the data. More exactly, if the $x_i$ are arranged in ascending order, then

$$\mathrm{median}(x) = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right) & \text{if } n \text{ is even} \end{cases}$$

Again R provides the median function.

The third measure, the mode, is simply the most frequently occurring value of $x_i$. You know how to find this already!

Calculate the mean, median, and modal price and explain why they are different from one another.

Dispersion

How far does any given item deviate in price from the typical? The most important measure of dispersion is the variance. To measure dispersion, you might first think of taking the average of the differences from the mean

$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})$$

but this is always zero (why?). So, to weight negative and positive differences equally, we square these differences (another option would be to take absolute values, but this "mean absolute deviation" does not have the same nice mathematical properties). This yields the sample variance[2]

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Once again R provides a convenient function var for the variance. But the variance is in squared units, which is hard to interpret. To obtain a dispersion measure in units of the original data, we take the square root to produce the sample standard deviation,

$$s = \sqrt{s^2}$$

R's function for this is called sd.

Another useful concept for describing distribution spread is the interquartile range. Just as the median is the 50% point of the data, the first quartile is the 25% point of the data and the third quartile is the 75% point. (More generally, we can speak of percentiles. In R, you calculate these with the quantile function, first translating the percentage into a number between 0 and 1.) The IQR, which is the difference between the third and first quartiles, gives the spread of the middle 50% of the data. Yet another handy R function, IQR, calculates this for you. But it's often better to examine the quartiles themselves.

Calculate a five-number summary of the poetry prices: the minimum, first quartile, median, third quartile, and maximum.

Plotting the distribution

Plot a histogram of the prices. Set the binwidth parameter to a constant value of one dollar.

Small multiples!

Plot a histogram of the prices for each year. To facet across a single variable, follow the faceting model from class, but instead of writing facet_grid(y_var ~ x_var), write facet_wrap(~ var) (a "one-sided formula").

Is there a trend?

Use dplyr operations to calculate the mean and median poetry translation price in each year.
I get:

    Year  mean  median
    …

We might now want to investigate further a shift in the list prices (but, of course, with care, since list and retail are not the same, and we haven't accounted for inflation).

[2] $n - 1$ appears in the denominator here, rather than $n$, unless we think the $x_i$ are a population and not a sample (that is, all the $x$s in the world). $s^2$ is defined this way to produce an unbiased estimator of the population variance $\sigma^2$.
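To see how the formulas for the mean, variance, standard deviation, and quartiles line up with R's built-in functions, here is a small sketch in base R. The vector x holds made-up prices of my own (an assumption for illustration, not values from the Three Percent data), with one large outlier so that the mean and median visibly disagree.

```r
# Toy prices: hypothetical values, not taken from the Three Percent data
x <- c(12, 15, 15, 16, 18, 20, 95)
n <- length(x)

# Central tendency
x_bar <- sum(x) / n                  # the mean formula; same as mean(x)
x_med <- median(x)                   # middle value of the sorted data
tab <- table(x)                      # how often each distinct value occurs
x_mode <- as.numeric(names(tab)[which.max(tab)])  # most frequent value

# Dispersion
dev <- x - x_bar                     # differences from the mean
mean_dev <- sum(dev) / n             # zero, up to floating-point error
s2 <- sum(dev^2) / (n - 1)           # sample variance; same as var(x)
s <- sqrt(s2)                        # sample standard deviation; same as sd(x)

# Quartiles: percentages are given as numbers between 0 and 1
q <- quantile(x, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])           # same as IQR(x)
```

Comparing x_bar with x_med on this vector shows how a single large value pulls the mean upward while leaving the median where it was.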

Small duples!

Are the list price distributions for fiction and poetry translations comparable? If you try to make two bar plots of txl_priced, you will find that there are so many more fiction titles that it's hard to see the shape of the poetry distribution when you place it on the same scale. Instead, use geom_density, which produces a smoothed approximation to the empirical distribution (that is to say, it shows the proportion of the data that is near a given value). This time, when you facet, put the two plots on top of one another instead of side by side by passing facet_wrap an ncol parameter.

Bivariate description

When we have two variables for each observation, $x_i$ and $y_i$, we may be interested in the nature of their relation. Earlier in the semester we glanced at one such relation, between a word's frequency in a text and its frequency rank. Zipf posits that the former increases linearly with the inverse of the latter. Let's work from the four early-twentieth-century novels we looked at on the last homework.

Setup (copy code, no exercise)

```r
# split lines of text into lowercase word tokens
featurize <- function (ll) {
    result <- unlist(strsplit(ll, "\\W+"))
    result <- result[result != ""]
    tolower(result)
}

# read each file and stack the results: one row per word token
feature_frame <- function (fs) {
    frms <- list()
    for (j in seq_along(fs)) {
        ll <- readLines(fs[j], encoding="UTF-8")
        words <- featurize(ll)
        frms[[j]] <- data.frame(title=basename(fs[j]), feature=words,
                                stringsAsFactors=F)
    }
    do.call(rbind, frms)
}

novels <- feature_frame(file.path("e20c-novels", list.files("e20c-novels")))

novel_counts <- novels %>%
    group_by(title, feature) %>%
    summarize(count=n())
```

Zipf × 4

The dplyr expression dense_rank(desc(x)) gives a vector of ranks for the values in x, with the largest value ranked 1, the second-largest ranked 2, and so on. Construct a faceted scatterplot of the frequencies against the inverse ranks for the top twenty words in each of the four novels. Add a line of best fit to each. Mine looks like:

[Figure: "Zipf's law in four novels." Panels: blue-lagoon.txt, sheik.txt, three-weeks.txt, way-of-an-eagle.txt. x axis: 1 / rank; y axis: word count.]

Correlation

Do some of these lines fit better than others? The correlation coefficient, which I mentioned back in solution set 2, measures this association. The Pearson correlation coefficient $r$ is given by

$$r = \frac{1}{(n-1)\, s_x s_y} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

where $s_x$ and $s_y$ are the sample standard deviations of the $x_i$ and the $y_i$. R's cor function calculates this. Use summarize to find the correlation coefficients between inverse word ranks and word frequencies for the top twenty words of each of the four novels. For example, for The Blue Lagoon I obtain:

      title            r
    1 blue-lagoon.txt  …

More power

Actually, Zipf's law is usually given a little differently. It posits a power law relation between the rank and the frequency:

$$\text{frequency} = \text{rank}^{-\alpha}$$

where $\alpha$ is a positive constant. We've been looking at $\alpha = 1$. To investigate the possibility of a power law, we take the logarithm of both our x and y values. If the law holds, then we will see an approximately linear relationship, because

$$\log \text{frequency} = -\alpha \log \text{rank}$$

R's function for taking logs is called log (by default, it gives natural logarithms, log base e; the base doesn't matter). Repeat the above plots and correlation calculations. Hint: though I demonstrated scale_x_log10 in class, that is not what you need here; instead, carry out the transformation of the data and plot them on normal x and y scales. As a check, I find the following correlation for The Blue Lagoon between log rank and word frequency:

      title            r
    1 blue-lagoon.txt  …

Why is it negative? Its magnitude is pretty much the same as the earlier correlation coefficient. Visually, however, what differences do you see between the fit of the line of best fit on the log-log plot and on the inverse-rank plot?

I should add that power laws are considered very dangerous beasties by statisticians, because it's particularly easy to see one even when it isn't there. However, Zipf's law is apparently very well-attested, though still somewhat mysterious in its explanation.

A quartet

Descriptive statistics are very useful for comparing data sets to one another. The same usefulness means that, like any other abstraction, these statistics ignore differences that might matter. Here is a famous exemplification of this principle. It is so famous that R includes the data set by default every time you load it. It is Anscombe's quartet, contained in the variable anscombe, which you can print out on your console. Unfortunately anscombe is not in tidy form, so I will tidy it up for you:

```r
a_qt <- anscombe %>%
    mutate(obs=seq_along(x1)) %>%   # number rows so we can spread later
    gather("key", "value", -obs) %>%    # key is x1, x2, ... y1, y2, ...
    separate(key, c("key", "group"), 1) %>% # x1 is key=x, group=1, etc.
    spread(key, value)  # spread out x and y columns
```

For each of the four numbered groups in a_qt, calculate the mean and standard deviation of the x and y values and the correlation between x and y. Certain patterns should be apparent.
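Before going further with the quartet, it may help to check the correlation formula and the log-log argument from the Zipf exercises on data where the answer is known in advance. The sketch below is base R on synthetic data: the ranks, the constant 5000, and the exponent 1.2 are all values I made up for illustration, and the helper pearson is a hypothetical name of my own, not part of any package.

```r
# Synthetic word ranks and frequencies that follow an exact power law,
# frequency = C * rank^(-alpha). C and alpha are made-up values.
word_rank <- 1:20
alpha <- 1.2
word_freq <- 5000 * word_rank^(-alpha)

# Pearson r computed directly from the formula, using the sample standard
# deviations (the ones with n - 1 in the denominator) returned by sd()
pearson <- function (x, y) {
    n <- length(x)
    sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))
}

# frequency against inverse rank: close to linear, but not exactly
r_inverse <- pearson(1 / word_rank, word_freq)

# log frequency against log rank: exactly linear for a true power law
r_loglog <- pearson(log(word_rank), log(word_freq))

# The slope of the best-fit line on the log-log scale recovers -alpha
fit <- lm(log(word_freq) ~ log(word_rank))
slope <- unname(coef(fit)[2])
```

On real word counts the log-log points will only approximately follow a line, so expect a correlation near, but not equal to, the value this idealized case produces.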
Now make a faceted scatter plot showing the data, with the line of best fit added as well.

Plotting categorical variables

Reordering

In class, we stopped before we had made acceptable plots when one of the variables is categorical. Let us now return to the translations data txl and address this. In class we had:

```r
langs <- txl %>%
    group_by(language) %>%
    summarize(titles=n()) %>%
    top_n(10, titles)

ggplot(langs, aes(x=language, y=titles)) +
    geom_bar(stat="identity")
```

[Figure: bar chart of number of titles by language. x axis: Language (Arabic, Chinese, French, German, Italian, Japanese, Portuguese, Russian, Spanish, Swedish); y axis: titles.]

We need to modify the display order of langs$language. The scale_x_discrete function takes a named parameter, limits, which can be a vector specifying the order in which to display the levels of the factor mapped to the x aesthetic. Hint: remember order? Use this to produce a new version of the plot in which the languages are arranged in ascending order by number of titles published.

Graphical refinement

Let's polish this plot. Skim the help for the theme function. From the example under "Manipulate Axis Attributes" you should be able to figure out how to rotate the language labels to run vertically using element_text inside your call to theme.

Further refinement

But actually, it might be better to do away with the axis labels and put the labeling text on top of the bars. To get rid of axis labels, add theme(axis.text.x=element_blank()) and add a new geom to the plot, geom_text. This needs four aesthetics to work right: x and y give the position, and in fact these can be just the same as for the bars; label gives the text; and the last aesthetic is not mapped, but a constant (set it outside aes()): the vjust value.
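As a reminder of what order does, here is a minimal base R sketch on a toy data frame. The languages and counts in langs_toy are invented for illustration and are not taken from txl.

```r
# Hypothetical counts, not taken from the translation data
langs_toy <- data.frame(
    language=c("French", "Swedish", "Spanish", "Italian"),
    titles=c(40, 5, 25, 10),
    stringsAsFactors=FALSE
)

# order() returns the positions of the elements of its argument, arranged
# so that indexing with the result sorts the values in ascending order
idx <- order(langs_toy$titles)

# indexing a parallel column by idx arranges it by the sorted counts
sorted_languages <- langs_toy$language[idx]
```

Here idx picks out the rows from the fewest titles to the most, so sorted_languages is a character vector of language names in ascending order of count.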

Catching up with Jockers (sort of)

You do not have to read chapters 6–7 of Jockers, but these exercises are based on that part of the book. In chapter 6, Jockers introduces two measures of lexical variety: mean word use (61) and the type-token ratio (65). Actually these two numbers are reciprocals of one another (check that you see why). Then he notes that the type-token ratio of a text tends to be strongly related to its length. In practice problem 6.1 he suggests checking this claim for chapters of Moby-Dick, but instead, let's try it with another collection of "texts": the ECCO titles dataset.

This code will help set you up. It introduces the dplyr function do, which is a bit special: it takes an expression in terms of ., where . is a data frame representing each group; then it stacks all the results together. Most dplyr functions transform individual rows; summarize squashes each group down to one row; do, on the other hand, can turn each group into an arbitrary, and varying, number of rows. In this case the group is only a single row, one for each title, to which we assign an ID number for clarity:

```r
ecco <- read.csv("ecco-headers.csv", as.is=T, encoding="UTF-8")

ecco_titles <- ecco %>%
    select(title) %>%
    mutate(id=seq_along(title)) %>%
    group_by(id) %>%
    do({
        data_frame(id=.$id, feature=featurize(.$title))
    })
```

It's not too important if that's a bit hazy. Explore ecco_titles in your console so you can see what results.

Transform ecco_titles into a data frame ecco_ttrs, with one row per original title, and three columns: id, title_length (the number of words in the title), and ttr (the type-token ratio of the title itself). A spot check:

    ecco$title[1000]
    [1] "Three memorials on French affairs: Written in the years 1791, 1792 and By the late Right Hon. Edmun

    ecco_ttrs[1000, ]
    Source: local data frame [1 x 3]
      id title_length ttr
    …

A correlation

Now (with no for loops!)
show that the correlation between the type-token ratio and the title length is about …

A join

In chapter 7, Jockers introduces yet another measure of lexical variety, the hapax richness. This is the proportion of word types that occur only once in a given text. For example, the hapax richness of the phrase "To be or not to be" is 2/4 = 0.5 (four word types, two of which occur once). Let us take as our text each

year's worth of titles in the ECCO set. We ought to have remembered to keep the pubdate column above, but since we didn't, we have a chance to practice an important data-manipulation concept: the join. First, let's derive a data frame with the same id column and just the year of publication. You did a very similar operation in homework 5.

```r
ecco_pubdates <- ecco %>%
    select(pubdate) %>%
    filter(!is.na(pubdate)) %>%
    mutate(id=seq_along(pubdate)) %>%
    mutate(pubdate=as.numeric(str_extract(pubdate, "\\d{4}")))

# correct one erroneous date:
ecco_pubdates$pubdate[ecco_pubdates$pubdate == 1607] <- ...
```

Now we've had to filter out some missing dates. The next step is to combine ecco_pubdates with ecco_titles, making use of the fact that they both follow the same id values. This is called a join. There are multiple ways to join, but the first and most important is the inner join. If x and y are data frames that both have a var column, inner_join(x, y, by="var") produces a new data frame with all the columns of both x and y and rows as follows:

1. Group both x and y by var values.
2. Discard any row in x whose var value does not occur in y; similarly, discard any row in y whose var value does not occur in x.
3. For each matching group, create new rows with each possible combination of row from x and row from y. If the group for a given var value has n rows in x and m rows in y, the resulting group has nm rows.

Now ecco_pubdates has one row for each title, whereas ecco_titles has one row for each word in each title. Explain why

```r
pubdates_titles <- inner_join(ecco_pubdates, ecco_titles, by="id")
```

does not have quite as many rows as ecco_titles.

Visualize the time series

Now derive the proportion of word types that are hapax legomena in the entire year's worth of titles in the data set, and plot this proportion as a series of bars. This is modeled after Jockers's fig. … I get a chart that looks like this:

[Figure: "Hapax legomena in each year's worth of titles." x axis: Publication date; y axis: Hapax richness.]

We could fix the visual banding by adjusting the width aesthetic, but enough already for now.
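Two ideas from the Jockers exercises can be checked by hand on very small inputs: the lexical-variety measures on the phrase "To be or not to be," and the row-counting rules for an inner join. The sketch below sticks to base R so that it runs without any packages. In particular it uses base R's merge, which performs an inner join by default, as a stand-in for dplyr's inner_join; the data frames x and y and their contents are made up for illustration.

```r
# --- Lexical variety of a short phrase ---
# Split into lowercase word tokens, as featurize does
tokens <- tolower(unlist(strsplit("To be or not to be", "\\W+")))
counts <- table(tokens)              # one entry per word type

n_tokens <- length(tokens)           # number of words
n_types <- length(counts)            # number of distinct words
ttr <- n_types / n_tokens            # type-token ratio
mean_use <- n_tokens / n_types       # mean word use
hapax <- sum(counts == 1) / n_types  # share of types that occur only once

# --- Inner join on toy data frames ---
# "a" occurs twice in x and three times in y; "b" and "c" have no match
x <- data.frame(var=c("a", "a", "b"), x_val=1:3, stringsAsFactors=FALSE)
y <- data.frame(var=c("a", "a", "a", "c"), y_val=4:7, stringsAsFactors=FALSE)

# merge() keeps only var values found in both frames and pairs every
# matching row of x with every matching row of y
joined <- merge(x, y, by="var")
n_joined <- nrow(joined)
```

The phrase has six tokens and four types, which gives the hapax richness of 0.5 quoted above, and the join produces 2 × 3 = 6 rows, all with var equal to "a".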

Solution Set 8. Andrew Goldstone March 26, 2015

Chunk options (small exercise)

My setup chunk looks like

```{r setup, include=F, cache=F}
knitr::opts_chunk$set(comment=NA, error=T, cache=T, autodep=T)