Homework 8

Once again, update agoldst/litdata.

Chunk options (small exercise)

Use the Homework template. Your first exercise is to modify the setup chunk to load dplyr and tidyr.

It is time to be a little more sophisticated about those code chunks. Every chunk can have a name, which can be anything you want (but no spaces):

    ```{r my-chunk-name}
    ...
    ```

You can also include chunk options, which work exactly like named parameters to a function in R. You've already encountered some of these, like

    ```{r eval=F}
    ...
    ```

for a chunk which is formatted to look like code but not actually executed. If you name your chunk, then you follow the name with a comma and then include options, separated by commas:

    ```{r chunk-name, opt1=val1, opt2=val2, ...}
    ...
    ```

You can also apply chunk options to all the chunks in the document by including a first chunk that calls the knitr::opts_chunk$set function with named parameters. This is exactly what I have been doing all along in your homeworks. The comprehensive chunk option reference is yihui.name/knitr/options/.

Now that you are including plots, you should know about three figure-related chunk options: fig.cap, fig.width, and fig.height. The first gives the figure caption; the latter two give the dimensions in inches of the figure. If you don't like the way any particular figure looks, by all means change its dimensions:

    ```{r my-small-plot, fig.cap="small plot", fig.width=2, fig.height=2}
    ggplot(...) + geom_point() + ...
    ```

By default, figures in PDF files float, which means they might be typeset before or after the place in the text where you write the code to create them, as the software attempts to create aesthetically pleasing pages (really!). Altering this behavior is a topic for another day, so for now just remember to include captions that make clear which of your plots go with which problems.[1]
[1] Your figures are automatically numbered, and if you want to mention the figure number for a chunk called my-chunk, you can enter \ref{fig:my-chunk} in your text (not your code; this is a LaTeX macro).
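As a rough illustration of setting options for every chunk at once, the sketch below applies two figure defaults through knitr::opts_chunk$set. The values (4 by 3 inches) and the name fig_defaults are placeholders invented for this example, not settings taken from the Homework template. The call is guarded so the snippet still runs where knitr is not installed.

```r
# Hypothetical document-wide figure defaults (values are illustrative only).
fig_defaults <- list(fig.width = 4, fig.height = 3)

if (requireNamespace("knitr", quietly = TRUE)) {
  # Equivalent to knitr::opts_chunk$set(fig.width = 4, fig.height = 3)
  do.call(knitr::opts_chunk$set, fig_defaults)
  # Read a default back to confirm it was stored.
  print(knitr::opts_chunk$get("fig.width"))
}
```

Options written in an individual chunk header, such as fig.width=2 on a single plot, take precedence over defaults set this way.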

Visualization and descriptive statistics

As we have seen, it is a bit more straightforward to talk about visualizing quantitative data than it is to visualize categorical data. So, although economic data is a whole other kettle of fish, let's just work a little with the prices of the poetry translations from the Three Percent database, filtering out any Price entries that are NA or not formatted as a price:

    txl <- read.csv("three-percent.csv", as.is=T, encoding="UTF-8") %>%
        filter(year < 2015) %>%
        rename(language=lanuage)
    txl_priced <- txl %>%
        filter(!is.na(price)) %>%
        filter(str_detect(price, "^\\d*\\.\\d\\d$")) %>%
        mutate(price=as.numeric(price))
    po <- txl_priced %>% filter(genre == "Poetry")

Univariate descriptive statistics

Central tendencies

What is the typical list price? There are three common ways to characterize the central tendency of single-variable data. The most important is the mean or average. If you have data $x_i$, $i = 1, \ldots, n$, the mean is

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Calculate this in R using the mean function. The mean is sensitive to large outliers (take a look at the optional trim parameter).

The median is the half-way value or 50th percentile of the data. More exactly, if the $x_i$ are arranged in ascending order, then

$$\operatorname{median}(x) = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right) & \text{if } n \text{ is even} \end{cases}$$

Again R provides the median function.

The third measure, the mode, is simply the most frequently occurring value of $x_i$. You know how to find this already!

Calculate the mean, median, and modal price and explain why they are different from one another.

Dispersion

How far does any given item deviate in price from the typical? The most important measure of dispersion is the variance. To measure dispersion, you might first think of taking the average of the differences from the mean

$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})$$

but this is always zero (why?). So, to weight negative and positive differences equally, we square these differences (another option would be to take absolute values, but this mean absolute deviation does not have the same nice mathematical properties). This yields the sample[2] variance

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Once again R provides a convenient function var for the variance. But the variance is in squared units, which is hard to interpret. To obtain a dispersion measure in units of the original data, we take the square root to produce the sample standard deviation,

$$s = \sqrt{s^2}$$

R's function for this is called sd.

Another useful concept for describing distribution spread is the interquartile range. Just as the median is the 50% point of the data, the first quartile is the 25% point of the data and the third quartile is the 75% point. (More generally, we can speak of percentiles. In R, you calculate these with the quantile function, first translating the percentage into a number between 0 and 1.) The IQR, which is the difference between the third and first quartiles, gives the spread of the middle 50% of the data. Yet another handy R function, IQR, calculates this for you. But it's often better to examine the quartiles themselves.

Calculate a five-number summary of the poetry prices: the minimum, first quartile, median, third quartile, and maximum.

Plotting the distribution

Plot a histogram of the prices. Set the binwidth aesthetic to a constant value of one dollar.

Small multiples!

Plot a histogram of the prices for each year. To facet across a single variable, follow the faceting model from class, but instead of writing facet_grid(y_var ~ x_var), write facet_wrap(~ var) (a one-sided formula).

Is there a trend?

Use dplyr operations to calculate the mean and median poetry translation price in each year.
I get:

[Table with columns Year, mean, median; values not preserved.]

We might now want to investigate further a shift in the list prices (but, of course, with care, since list and retail are not the same, and we haven't accounted for inflation).

[2] $n - 1$ appears in the denominator here, rather than $n$, unless we think the $x_i$ are a population (that is, all the $x$s in the world) and not a sample. $s^2$ is defined this way to produce an unbiased estimator of the population variance $\sigma^2$.
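To make the univariate measures above concrete, here is a minimal sketch on a handful of made-up prices (the vector prices and every other variable name are invented for this example and have nothing to do with the Three Percent data). It computes the three central tendencies, checks numerically that the deviations from the mean average to zero, and compares a by-hand sample variance with var.

```r
# Toy prices (made-up numbers, not from the Three Percent data).
prices <- c(12.95, 14.00, 14.00, 15.95, 16.00, 18.00, 45.00)

price_mean   <- mean(prices)
price_median <- median(prices)

# Mode: tabulate the values and take the most frequent one.
price_counts <- table(prices)
price_mode   <- as.numeric(names(price_counts)[which.max(price_counts)])

# Deviations from the mean average to zero (up to floating-point error),
# which is why the variance squares them first.
mean_deviation <- mean(prices - price_mean)

# Sample variance by hand (n - 1 in the denominator), to compare with var().
n <- length(prices)
var_by_hand <- sum((prices - price_mean)^2) / (n - 1)

# Five-number summary via quantile(); probabilities are given on a 0-1 scale.
five_num <- quantile(prices, probs = c(0, 0.25, 0.5, 0.75, 1))

print(c(mean = price_mean, median = price_median, mode = price_mode))
print(c(var_by_hand = var_by_hand, var = var(prices), sd = sd(prices)))
print(five_num)
```

The single large value in the toy vector is there on purpose: it is worth comparing how far it pulls the mean relative to the median, and trying mean(prices, trim = 0.2) to see the effect of the trim parameter.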

Small duples!

Are the list price distributions for fiction and poetry translations comparable? If you try to make two bar plots of txl_priced, you will find that there are so many more fiction titles that it's hard to see the shape of the poetry distribution when you place it on the same scale. Instead, use geom_density, which produces a smoothed approximation to the empirical distribution (that is to say, it shows the proportion of the data that is near a given value). This time, when you facet, put the two plots on top of one another instead of side by side by passing facet_wrap an ncol parameter.

Bivariate description

When we have two variables for each observation, $x_i$ and $y_i$, we may be interested in the nature of their relation. Earlier in the semester we glanced at one such relation, between a word's frequency in a text and its frequency rank. Zipf posits that the former increases linearly with the inverse of the latter. Let's work from the four early-twentieth-century novels we looked at on the last homework.

Setup (copy code, no exercise)

    # split lines of text into lowercase word tokens
    featurize <- function (ll) {
        result <- unlist(strsplit(ll, "\\W+"))
        result <- result[result != ""]
        tolower(result)
    }

    # read each file and stack one row per word, labeled by file name
    feature_frame <- function (fs) {
        frms <- list()
        for (j in seq_along(fs)) {
            ll <- readLines(fs[j], encoding="UTF-8")
            words <- featurize(ll)
            frms[[j]] <- data.frame(title=basename(fs[j]), feature=words,
                                    stringsAsFactors=F)
        }
        do.call(rbind, frms)
    }

    novels <- feature_frame(file.path("e20c-novels", list.files("e20c-novels")))

    novel_counts <- novels %>%
        group_by(title, feature) %>%
        summarize(count=n())

Zipf x 4

The dplyr expression dense_rank(desc(x)) gives a vector of ranks for the values in x, with the largest value ranked 1, the second-largest ranked 2, and so on. Construct a faceted scatterplot of the frequencies against the inverse ranks for the top twenty words in each of the four novels. Add a line of best fit to each. Mine looks like:

[Figure: "Zipf's law in four novels". Four panels: blue-lagoon.txt, sheik.txt, three-weeks.txt, way-of-an-eagle.txt. x-axis: 1 / rank; y-axis: word count.]

Correlation

Do some of these lines fit better than others? The correlation coefficient, which I mentioned back in solution set 2, measures this association. The Pearson correlation coefficient $r$ is given by

$$r = \frac{1}{(n-1)\, s_x s_y} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

where $s_x$ and $s_y$ are the sample standard deviations of the $x_i$ and the $y_i$. R's cor function calculates this. Use summarize to find the correlation coefficients between inverse word ranks and word frequencies for the top twenty words of each of the four novels. For example, for The Blue Lagoon I obtain:

    title            r
    blue-lagoon.txt  [value not preserved]

More power

Actually, Zipf's law is usually given a little differently. It posits a power law relation between the rank and the frequency:

$$\text{frequency} = \text{rank}^{-\alpha}$$

where $\alpha$ is a positive constant. We've been looking at $\alpha = 1$. To investigate the possibility of a power law, we take the logarithm of both our x and y values. If the law holds, then we will see an approximately linear relationship, because

$$\log \text{frequency} = -\alpha \log \text{rank}$$

R's function for taking logs is called log (by default, it gives natural logarithms, that is, log base e; the base doesn't matter). Repeat the above plots and correlation calculations. Hint: though I demonstrated scale_x_log10 in class, that is not what you need here; instead, carry out the transformation of the data and plot them on normal x and y scales. As a check, I find the following correlation for The Blue Lagoon between log rank and word frequency:

    title            r
    blue-lagoon.txt  [value not preserved]

Why is it negative? Its magnitude is pretty much the same as the earlier correlation coefficient. Visually, however, what differences do you see between the fit of the line of best fit on the log-log plot and on the inverse-rank plot?

I should add that power laws are considered very dangerous beasties by statisticians, because it's particularly easy to see one even when it isn't there. However, Zipf's law is apparently very well-attested, though still somewhat mysterious in its explanation.

A quartet

Descriptive statistics are very useful for comparing data sets to one another. The same usefulness means that, like any other abstraction, these statistics ignore differences that might matter. Here is a famous exemplification of this principle. It is so famous that R includes the data set by default every time you load it. It is Anscombe's quartet, contained in the variable anscombe, which you can print out on your console. Unfortunately anscombe is not in tidy form, so I will tidy it up for you:

    a_qt <- anscombe %>%
        mutate(obs=seq_along(x1)) %>%            # number rows so we can spread later
        gather("key", "value", -obs) %>%         # key is x1, x2,... y1, y2,...
        separate(key, c("key", "group"), 1) %>%  # x1 is key=x, group=1, etc.
        spread(key, value)                       # spread out x and y columns

For each of the four numbered groups in a_qt, calculate the mean and standard deviation of the x and y values and the correlation between x and y. Certain patterns should be apparent.
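For readers who want to see the tidy shape of the quartet without tidyr, here is a base-R sketch that stacks the four x/y column pairs of the built-in anscombe data into one long data frame. The name a_long and the lapply/rbind approach are choices made for this illustration. It only approximates the layout of a_qt (columns obs, group, x, y; the row order may differ) and is not meant to replace the pipeline above.

```r
# Build one data frame per group, then stack the four pieces.
a_long <- do.call(rbind, lapply(1:4, function(g) {
  data.frame(
    obs   = seq_len(nrow(anscombe)),        # row number in the original table
    group = as.character(g),                # "1", "2", "3", "4"
    x     = anscombe[[paste0("x", g)]],     # column x1, x2, x3 or x4
    y     = anscombe[[paste0("y", g)]],     # matching y column
    stringsAsFactors = FALSE
  )
}))

str(a_long)           # 44 rows: 11 observations in each of 4 groups
table(a_long$group)   # how many rows each group contributes
```

Once the data are in this long form, per-group summaries become a matter of grouping on the group column, which is the point of tidying in the first place.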
Now make a faceted scatter plot showing the data, with the line of best fit added as well.

Plotting categorical variables

Reordering

In class, we stopped before we had made acceptable plots when one of the variables is categorical. Let us now return to the translations data txl and address this. In class we had:

    langs <- txl %>%
        group_by(language) %>%
        summarize(titles=n()) %>%
        top_n(10, titles)
    ggplot(langs, aes(x=language, y=titles)) +
        geom_bar(stat="identity")

[Figure: bar chart of titles (y-axis) by Language (x-axis), with bars in alphabetical order: Arabic, Chinese, French, German, Italian, Japanese, Portuguese, Russian, Spanish, Swedish.]

We need to modify the display order of langs$language. The scale_x_discrete function takes a named parameter, limits, which can be a vector specifying the order in which to display the levels of the factor mapped to the x aesthetic. Hint: remember order? Use this to produce a new version of the plot in which the languages are arranged in ascending order by number of titles published.

Graphical refinement

Let's polish this plot. Skim the help for the theme function. From the example under Manipulate Axis Attributes you should be able to figure out how to rotate the language labels to run vertically using element_text inside your call to theme.

Further refinement

But actually, it might be better to do away with the axis labels and put the labeling text on top of the bars. To get rid of axis labels, add theme(axis.text.x=element_blank()) and add a new geom to the plot, geom_text. This needs four aesthetics to work right: x and y give the position, and in fact these can be just the same as for the bars; label gives the text; and the last aesthetic is not mapped, but a constant (set it outside aes()): the vjust value.
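The reordering idea can be tried out on a toy table before touching the real data. In the sketch below, langs_toy and its counts are made up for illustration, and lang_order is a hypothetical name. The base-R part shows how order turns a column of counts into a display order; the ggplot2 part, guarded in case the package is not installed, passes that order to scale_x_discrete and moves the labels onto the bars. The vjust value of -0.5 is just one plausible choice.

```r
# Made-up counts standing in for the langs data frame.
langs_toy <- data.frame(
  language = c("French", "German", "Spanish", "Arabic"),
  titles   = c(50, 30, 40, 10),
  stringsAsFactors = FALSE
)

# order() returns the row indices that sort titles in ascending order;
# indexing language by them gives the display order for the x axis.
lang_order <- langs_toy$language[order(langs_toy$titles)]
print(lang_order)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(langs_toy, aes(x = language, y = titles)) +
    geom_bar(stat = "identity") +
    scale_x_discrete(limits = lang_order) +          # bars in ascending order
    geom_text(aes(label = language), vjust = -0.5) + # constant vjust, outside aes()
    theme(axis.text.x = element_blank())             # drop the axis tick labels
  # print(p) draws the plot on the current graphics device
}
```

Wrapping the vector in rev() would give descending order instead.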

Catching up with Jockers (sort of)

You do not have to read chapters 6-7 of Jockers, but these exercises are based on that part of the book. In chapter 6, Jockers introduces two measures of lexical variety: mean word use (61) and the type-token ratio (65). Actually these two numbers are reciprocals of one another (check that you see why). Then he notes that the type-token ratio of a text tends to be strongly related to its length. In practice problem 6.1 he suggests checking this claim for chapters of Moby-Dick, but instead, let's try it with another collection of "texts": the ECCO titles dataset.

This code will help set you up. It introduces the dplyr function do, which is a bit special: it takes an expression in terms of ., where . is a data frame representing each group; then it stacks all the results together. Most dplyr functions transform individual rows; summarize squashes each group down to one row; do, on the other hand, can turn each group into an arbitrary, and varying, number of rows. In this case the group is only a single row, one for each title, to which we assign an ID number for clarity:

    ecco <- read.csv("ecco-headers.csv", as.is=T, encoding="UTF-8")
    ecco_titles <- ecco %>%
        select(title) %>%
        mutate(id=seq_along(title)) %>%
        group_by(id) %>%
        do({
            data_frame(id=.$id, feature=featurize(.$title))
        })

It's not too important if that's a bit hazy. Explore ecco_titles in your console so you can see what results.

Transform ecco_titles into a data frame ecco_ttrs, with one row per original title, and three columns: id, title_length (the number of words in the title), and ttr (the type-token ratio of the title itself). A spot check:

    ecco$title[1000]
    [1] "Three memorials on French affairs: Written in the years 1791, 1792 and By the late Right Hon. Edmun

    ecco_ttrs[1000, ]
    Source: local data frame [1 x 3]
      id title_length ttr
      [values not preserved]

A correlation

Now (with no for loops!)
show that the correlation between the type-token ratio and the title length is about [value not preserved].

A join

In chapter 7, Jockers introduces yet another measure of lexical variety, the hapax richness. This is the proportion of word types that occur only once in a given text. For example, the hapax richness of the phrase "To be or not to be" is 2/4 = 0.5 (four word types, two of which occur once). Let us take as our text each

year's worth of titles in the ECCO set. We ought to have remembered to keep the pubdate column above, but since we didn't, we have a chance to practice an important data-manipulation concept: the join. First, let's derive a data frame with the same id column and just the year of publication. You did a very similar operation in homework 5.

    ecco_pubdates <- ecco %>%
        select(pubdate) %>%
        mutate(id=seq_along(pubdate)) %>%  # number rows before filtering so ids match ecco_titles
        filter(!is.na(pubdate)) %>%
        mutate(pubdate=as.numeric(str_extract(pubdate, "\\d{4}")))
    # correct one erroneous date:
    ecco_pubdates$pubdate[ecco_pubdates$pubdate == 1607] <   # (rest of line not preserved)

Now we've had to filter out some missing dates. The next step is to combine ecco_pubdates with ecco_titles, making use of the fact that they both follow the same id values. This is called a join. There are multiple ways to join, but the first and most important is the inner join. If x and y are data frames that both have a var column, inner_join(x, y, by="var") produces a new data frame with all the columns of both x and y and rows as follows:

1. Group both x and y by var values.
2. Discard any row in x whose var value does not occur in y; similarly, discard any row in y whose var value does not occur in x.
3. For each matching group, create new rows with each possible combination of row from x and row from y. If the group for a given var value has n rows in x and m rows in y, the resulting group has nm rows.

Now ecco_pubdates has one row for each title, whereas ecco_titles has one row for each word in each title. Explain why

    pubdates_titles <- inner_join(ecco_pubdates, ecco_titles, by="id")

does not have quite as many rows as ecco_titles.

Visualize the time series

Now derive the proportion of word types that are hapax legomena in the entire year's worth of titles in the data set, and plot this proportion as a series of bars. This is modeled after Jockers's fig. [number not preserved]. I get a chart that looks like this:

[Figure: "Hapax legomena in each year's worth of titles". Bar chart; x-axis: Publication date; y-axis: Hapax richness.]

We could fix the visual banding by adjusting the width aesthetic, but enough already for now.
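The lexical-variety measures and the inner join from this section can both be checked on tiny inputs. The sketch below reuses the handout's featurize function on the phrase "To be or not to be", then builds two toy data frames (toy_dates and toy_words, with invented ids, years, and words) to show how unmatched ids drop out of an inner join. It uses base R's merge, which performs an inner join by default, as a stand-in for dplyr's inner_join so that the snippet needs no packages. All variable names here are assumptions of the example.

```r
# Same tokenizer as featurize() in this handout.
featurize <- function(ll) {
  result <- unlist(strsplit(ll, "\\W+"))
  result <- result[result != ""]
  tolower(result)
}

tokens      <- featurize("To be or not to be")
type_counts <- table(tokens)                 # one entry per word type

ttr            <- length(type_counts) / length(tokens)        # types / tokens
mean_word_use  <- length(tokens) / length(type_counts)        # tokens / types
hapax_richness <- sum(type_counts == 1) / length(type_counts) # once-only types / types

# Toy inner join: id 3 has no date, so its word rows do not survive.
toy_dates <- data.frame(id = c(1, 2), pubdate = c(1791, 1792))
toy_words <- data.frame(
  id      = c(1, 1, 2, 3, 3),
  feature = c("three", "memorials", "poems", "a", "sermon"),
  stringsAsFactors = FALSE
)
toy_joined <- merge(toy_dates, toy_words, by = "id")  # inner join by default

print(c(ttr = ttr, mean_word_use = mean_word_use, hapax = hapax_richness))
print(toy_joined)
```

Each id in toy_dates appears once, so every matching group contributes 1 x m rows, where m is the number of words for that id in toy_words.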

Solution Set 8. Andrew Goldstone March 26, 2015

Solution Set 8. Andrew Goldstone March 26, 2015 Solution Set 8 Andrew Goldstone March 26, 2015 Chunk options (small exercise) My setup chunk looks like ` ``{r setup, include=f, cache=f} knitr::opts_chunk$set(comment=na, error=t, cache=t, autodep=t)

More information

STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring Lecture 5 Tuesday, Feb 1 STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

More information

Averages and Variation

Averages and Variation Averages and Variation 3 Copyright Cengage Learning. All rights reserved. 3.1-1 Section 3.1 Measures of Central Tendency: Mode, Median, and Mean Copyright Cengage Learning. All rights reserved. 3.1-2 Focus

More information

CHAPTER 3: Data Description

CHAPTER 3: Data Description CHAPTER 3: Data Description You ve tabulated and made pretty pictures. Now what numbers do you use to summarize your data? Ch3: Data Description Santorico Page 68 You ll find a link on our website to a

More information

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies.

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies. Instructions: You are given the following data below these instructions. Your client (Courtney) wants you to statistically analyze the data to help her reach conclusions about how well she is teaching.

More information

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data Chapter 2 Descriptive Statistics: Organizing, Displaying and Summarizing Data Objectives Student should be able to Organize data Tabulate data into frequency/relative frequency tables Display data graphically

More information

1 Overview of Statistics; Essential Vocabulary

1 Overview of Statistics; Essential Vocabulary 1 Overview of Statistics; Essential Vocabulary Statistics: the science of collecting, organizing, analyzing, and interpreting data in order to make decisions Population and sample Population: the entire

More information

Chapter 6: DESCRIPTIVE STATISTICS

Chapter 6: DESCRIPTIVE STATISTICS Chapter 6: DESCRIPTIVE STATISTICS Random Sampling Numerical Summaries Stem-n-Leaf plots Histograms, and Box plots Time Sequence Plots Normal Probability Plots Sections 6-1 to 6-5, and 6-7 Random Sampling

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

Chapter 2 Describing, Exploring, and Comparing Data

Chapter 2 Describing, Exploring, and Comparing Data Slide 1 Chapter 2 Describing, Exploring, and Comparing Data Slide 2 2-1 Overview 2-2 Frequency Distributions 2-3 Visualizing Data 2-4 Measures of Center 2-5 Measures of Variation 2-6 Measures of Relative

More information

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order. Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good

More information

Measures of Dispersion

Measures of Dispersion Measures of Dispersion 6-3 I Will... Find measures of dispersion of sets of data. Find standard deviation and analyze normal distribution. Day 1: Dispersion Vocabulary Measures of Variation (Dispersion

More information

STA Module 2B Organizing Data and Comparing Distributions (Part II)

STA Module 2B Organizing Data and Comparing Distributions (Part II) STA 2023 Module 2B Organizing Data and Comparing Distributions (Part II) Learning Objectives Upon completing this module, you should be able to 1 Explain the purpose of a measure of center 2 Obtain and

More information

STA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II)

STA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II) STA 2023 Module 2B Organizing Data and Comparing Distributions (Part II) Learning Objectives Upon completing this module, you should be able to 1 Explain the purpose of a measure of center 2 Obtain and

More information

3. Data Analysis and Statistics

3. Data Analysis and Statistics 3. Data Analysis and Statistics 3.1 Visual Analysis of Data 3.2.1 Basic Statistics Examples 3.2.2 Basic Statistical Theory 3.3 Normal Distributions 3.4 Bivariate Data 3.1 Visual Analysis of Data Visual

More information

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures STA 2023 Module 3 Descriptive Measures Learning Objectives Upon completing this module, you should be able to: 1. Explain the purpose of a measure of center. 2. Obtain and interpret the mean, median, and

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

Chapter 3 - Displaying and Summarizing Quantitative Data

Chapter 3 - Displaying and Summarizing Quantitative Data Chapter 3 - Displaying and Summarizing Quantitative Data 3.1 Graphs for Quantitative Data (LABEL GRAPHS) August 25, 2014 Histogram (p. 44) - Graph that uses bars to represent different frequencies or relative

More information

Unit I Supplement OpenIntro Statistics 3rd ed., Ch. 1

Unit I Supplement OpenIntro Statistics 3rd ed., Ch. 1 Unit I Supplement OpenIntro Statistics 3rd ed., Ch. 1 KEY SKILLS: Organize a data set into a frequency distribution. Construct a histogram to summarize a data set. Compute the percentile for a particular

More information

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. 1 CHAPTER 1 Introduction Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. Variable: Any characteristic of a person or thing that can be expressed

More information

Measures of Central Tendency

Measures of Central Tendency Page of 6 Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean The sum of all data values divided by the number of

More information

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency Math 1 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency lowest value + highest value midrange The word average: is very ambiguous and can actually refer to the mean,

More information

Week 4: Describing data and estimation

Week 4: Describing data and estimation Week 4: Describing data and estimation Goals Investigate sampling error; see that larger samples have less sampling error. Visualize confidence intervals. Calculate basic summary statistics using R. Calculate

More information

CHAPTER 2 DESCRIPTIVE STATISTICS

CHAPTER 2 DESCRIPTIVE STATISTICS CHAPTER 2 DESCRIPTIVE STATISTICS 1. Stem-and-Leaf Graphs, Line Graphs, and Bar Graphs The distribution of data is how the data is spread or distributed over the range of the data values. This is one of

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable. 5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table

More information

2.1: Frequency Distributions and Their Graphs

2.1: Frequency Distributions and Their Graphs 2.1: Frequency Distributions and Their Graphs Frequency Distribution - way to display data that has many entries - table that shows classes or intervals of data entries and the number of entries in each

More information

Frequency Distributions

Frequency Distributions Displaying Data Frequency Distributions After collecting data, the first task for a researcher is to organize and summarize the data so that it is possible to get a general overview of the results. Remember,

More information

STP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES

STP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES STP 6 ELEMENTARY STATISTICS NOTES PART - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES Chapter covered organizing data into tables, and summarizing data with graphical displays. We will now use

More information

UNIT 1A EXPLORING UNIVARIATE DATA

UNIT 1A EXPLORING UNIVARIATE DATA A.P. STATISTICS E. Villarreal Lincoln HS Math Department UNIT 1A EXPLORING UNIVARIATE DATA LESSON 1: TYPES OF DATA Here is a list of important terms that we must understand as we begin our study of statistics

More information

Regression III: Advanced Methods

Regression III: Advanced Methods Lecture 3: Distributions Regression III: Advanced Methods William G. Jacoby Michigan State University Goals of the lecture Examine data in graphical form Graphs for looking at univariate distributions

More information

Chapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd

Chapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd Chapter 3: Data Description - Part 3 Read: Sections 1 through 5 pp 92-149 Work the following text examples: Section 3.2, 3-1 through 3-17 Section 3.3, 3-22 through 3.28, 3-42 through 3.82 Section 3.4,

More information

15 Wyner Statistics Fall 2013

15 Wyner Statistics Fall 2013 15 Wyner Statistics Fall 2013 CHAPTER THREE: CENTRAL TENDENCY AND VARIATION Summary, Terms, and Objectives The two most important aspects of a numerical data set are its central tendencies and its variation.

More information

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set.

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set. Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean the sum of all data values divided by the number of values in

More information

Measures of Central Tendency:
