Statistical Analysis in R Guest Lecturer: Maja Milosavljevic January 28, 2015


Data Exploration

Import relevant packages:

    library(grDevices)
    library(graphics)
    library(plyr)
    library(hexbin)
    library(base)
    library(stats)
    library(mosaic)
    library(datasets)

The Lahman package contains Sean Lahman's Baseball Database stored as a set of R data.frames. Today we will be using the Teams data.frame.

    library(Lahman)
    data(Teams)

What variables are in the Teams data.frame? For variable descriptions, see packages/lahman/lahman.pdf.

    names(Teams)

We can easily view the first few rows of a dataset with the head() command. This gives us an idea of what the data looks like without having to open the entire data.frame.

    head(Teams)

To get information on the class (data type/structure) of each variable and the kinds of values it takes on, we can use the str() command.

    str(Teams)

Which teams have Boston in their name?

    unique(Teams[grep("Boston", Teams$name), c("name")])

Subset on the Boston Red Sox:

    redsox = subset(Teams, name == "Boston Red Sox")

How would we find the unique values that the variable W (wins) takes on in the Red Sox dataset?

    sort(unique(redsox$W))

What about the unique values of W in the Teams dataset? Try on your own!

Summary Statistics

Let's get a little more information on the Wins variable in both the Red Sox and Teams datasets. We can easily find the minimum, maximum, median, mean, and 1st and 3rd quartiles using the summary() command. Going a little deeper, we can draw boxplots of Wins in both datasets and visually compare the summary statistics. Note that the boxplots show all of these summary statistics except the mean.

    summary(redsox$W)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    summary(Teams$W)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    boxplot(redsox$W, Teams$W, range = 0, names = c("Red Sox", "Teams"), ylab = "Wins")

[Figure: side-by-side boxplots of Wins for Red Sox and Teams]

What conclusions can we make?

Correlation Coefficient

Now let's say we're interested in discovering whether Wins is positively correlated with Home Runs (that is, whether the number of wins tends to increase as the number of home runs increases). We can start by plotting the two variables against each other:

    plot(Teams$HR, Teams$W, xlab = "Home Runs", ylab = "Wins")

Before we go on to calculate the correlation coefficient, let's make a plot that is a little more visually pleasing than the one we just produced. We are Visual Analysts after all.

    hbin <- hexbin(Teams$HR, Teams$W, xbins = 20, xlab = "HR", ylab = "W")
    plot(hbin)

[Figure: hexbin plot of W against HR, shaded by Counts]

Wins tend to increase somewhat as Home Runs increase. Let's calculate the correlation coefficient to be sure. The correlation coefficient can take on any value between -1 and 1, inclusive. A value of -1 represents a perfect negative linear relationship (one variable decreases as the other increases), a value of 0 represents no linear correlation, and a value of 1 represents a perfect positive linear relationship. Values closer to -1 or 1 indicate a stronger relationship.

Calculate the correlation coefficient:

    cor(Teams$W, Teams$HR)

Thus, we have a moderate positive relationship between Home Runs and Wins.

Data Enrichment

The Teams dataset has a lot of valuable information in it that we can use to perform various explorations. But let's say we're interested in exploring Winning Proportion and Runs Scored versus Runs Allowed, two variables that do not exist in our dataset. What do we do? We create the new variables ourselves! Don't assume that you're stuck with the variables in your dataset. You can always create and add new variables.

Winning Proportion is the number of Wins divided by Wins plus Losses. Runs Scored versus Runs Allowed is the number of Runs minus the number of Runs Allowed.

    df <- mutate(Teams, WP = W / (W + L), RunDiff = R - RA)

Now we have a new data.frame that contains the two additional variables we're interested in.
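The derived-variable step can also be sketched in base R, which may help readers who want to try the idea before installing Lahman or plyr. The toy data frame below is hypothetical: its numbers are invented for illustration and are not taken from Teams; only the column names W, L, R and RA mirror the lecture.

```r
# Hypothetical toy season records; the numbers are invented for illustration.
toy <- data.frame(
  W  = c(95, 81, 70, 102, 60),
  L  = c(67, 81, 92, 60, 102),
  R  = c(800, 720, 650, 850, 600),
  RA = c(650, 715, 760, 620, 820)
)

# transform() is a base-R counterpart of mutate() for simple derived columns.
toy <- transform(toy, WP = W / (W + L), RunDiff = R - RA)

# Correlation between the two new variables (always between -1 and 1).
r <- cor(toy$RunDiff, toy$WP)
print(toy)
print(r)
```

Because transform() evaluates its expressions inside the data frame, the column names can be used directly, just as with mutate().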

Missing Values

R stores a missing value as NA. Use is.na() to check for missing values in your data. Use the na.rm = TRUE argument when computing statistics, such as the mean, on vectors with missing values. Use na.omit() to remove all rows with missing values from a dataset.

    head(Teams$DP)
    is.na(Teams$DP)
    sum(is.na(Teams$DP))            # number of missing values
    mean(Teams$DP)                  # NA if any value is missing
    mean(Teams$DP, na.rm = TRUE)    # mean of the non-missing values
    nomissing = na.omit(Teams)
    sum(is.na(nomissing))

Quantify Uncertainties

The Teams dataset is a full, complete dataset that contains ALL of the data for the population of Baseball Statistics. This is a rarity; we often have only a random sample of the population that we are trying to get information about. In these cases, uncertainty quantification is key, so here is how we do it.

Disclaimer: The only reason we're creating a random sample is so we can learn how the bootstrap technique works. Since we already have ALL of the data in the Teams dataset, we know with certainty that the true population mean of DP is simply the mean DP from the Teams dataset.

    # create our random sample to justify the usage of the bootstrap technique
    # (the data frame is named `sample`; R still finds the sample() function when it is called)
    sample = Teams[sample(nrow(Teams), 1000), ]

Using our random sample of Baseball Statistics throughout history, we can provide information on the true population mean of Double Plays (DP). To do this we calculate a confidence interval using the sample mean DP. A confidence interval allows us to quantify our uncertainty in the sample estimate. If we end up with a narrow confidence interval, then we know that our sample estimate is a good estimate of the true DP mean. However, if the interval is wide, then we should use caution when using the sample estimate. Today, we will use the bootstrap technique to find the confidence interval.

The bootstrap technique works as follows. Let n = the number of observations you have in your dataset.

1. Create a new dataset of n observations by resampling from your original dataset with replacement.
2. Compute the statistic of interest on this sample. Here we are interested in the mean.
3. Repeat steps 1 and 2 many times and collect the results.

In step 1, why do we need to sample with replacement?

Question: How confident are we that the mean DP of this sample represents the true population mean?

Answer:

    mean(sample$DP, na.rm = TRUE)   # sample DP mean
    bstrap <- do(10000) * mean(resample(sample$DP), na.rm = TRUE)
    densityplot(~result, data = bstrap, plot.points = FALSE, xlab = "Mean Double Plays")
    qdata(c(0.025, 0.975), vals = result, data = bstrap)

We are 95% confident that the true population mean for Double Plays (DP) lies between the 2.5% and 97.5% quantiles returned by qdata(). We have quantified our uncertainty in the sample mean estimate.
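The three bootstrap steps can be sketched in base R alone, without the mosaic helpers. This is a minimal illustration under stated assumptions: the vector x is simulated (a hypothetical stand-in for a sample of DP values, not the Lahman data), and the seed and the 2000 replicates are arbitrary choices.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Hypothetical stand-in for a sample of DP values (not the Lahman data).
x <- rpois(200, lambda = 140)
n <- length(x)

# Steps 1-3: resample n values with replacement, take the mean, repeat.
boot_means <- replicate(2000, mean(sample(x, size = n, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap means.
ci <- quantile(boot_means, probs = c(0.025, 0.975))
print(mean(x))
print(ci)
```

Sampling with replace = TRUE is what makes each resample differ from the original data; without replacement every resample of size n would just be a reordering of x and every bootstrap mean would be identical.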

Statistical Models

"All statistical models are wrong, but some are useful." - George Box

Simple Linear Regression: Let's find a model for predicting Winning Proportion (WP) using Runs Scored vs. Runs Allowed (RunDiff) in our new df dataset. Note that both variables are quantitative.

There are three key assumptions for using linear regression:

1. The relationship between your response and predictor variables is linear
2. Normality of the residuals
3. Constant variance of the residuals

    df <- df[df$G > 158, ]   # let's filter on the teams that played a full season
    plot(df$RunDiff, df$WP, col = "blue", xlab = "RunDiff", ylab = "WP")   # assumption 1

[Figure: scatterplot of WP against RunDiff]

    mdl <- lm(df$WP ~ df$RunDiff)
    summary(mdl)

    ##
    ## Call:
    ## lm(formula = df$WP ~ df$RunDiff)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ##
    ## Coefficients:
    ##              Estimate  Std. Error  t value  Pr(>|t|)
    ## (Intercept)  5.001e    e                    <2e-16 ***
    ## df$RunDiff   6.460e    e                    <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error:  on 1251 degrees of freedom
    ## Multiple R-squared:  , Adjusted R-squared:
    ## F-statistic: 9169 on 1 and 1251 DF, p-value: < 2.2e-16

    resi <- residuals(mdl)
    fit <- predict(mdl)
    hist(resi)   # assumption 2

[Figure: histogram of resi, Frequency against resi]

    plot(fit, resi, ylab = "Residual", xlab = "Fitted Value")   # assumption 3

[Figure: scatterplot of Residual against Fitted Value]

Have we met our assumptions? Can we conclude that using a linear model to predict WP from RunDiff was appropriate?

Now that we have shown that using a linear model was appropriate, how do we quantify how good a model this is?

Answer: Use R² to assess the goodness-of-fit.

Since our response variable, WP, is quantitative, we can use the R² (Coefficient of Determination) value, or the adjusted R² value, to assess the goodness-of-fit. R² takes on a value between 0 and 1, inclusive, and represents the proportion of the variability in the response explained by the model. Thus, a good model will have a value close to 1.

The R² value never decreases, and almost always increases, when we add more predictors to the model. Why do you think this might be? The adjusted R² value is a modified version of R² that penalizes the model for adding additional variables. It checks that the information added to the model makes up for the added complexity of the model. Since our model has only one variable and many observations, the two values are nearly the same.

A good rule of thumb: Simpler is better. If your adjusted R² value only increases a little bit when you add a variable, you probably don't need it.

What is the goodness-of-fit of this model? Is it a good model?

    plot(df$RunDiff, df$WP, col = "blue", xlab = "RunDiff", ylab = "WP")
    abline(mdl)

Automated Feature Selection

In the previous section we found a good simple linear model for predicting WP using a single variable. But now we want to go a little deeper and add more explanatory variables to our model. How can we easily find the best set of explanatory variables to predict WP?

Answer: Automated Feature Selection!

R has a few automated feature selection techniques that you can explore on your own, such as stepwise, forward, and best subset (warning: computationally intensive for datasets with many variables). Today we'll use Backward Elimination. Backward Elimination throws all of the predictors into the model and then removes the least useful ones one at a time. R's step() function judges this by AIC: at each step it drops the predictor whose removal lowers the AIC the most, and stops when no removal helps. Make sure that your data has no missing values.

    rel = df[, c("WP", "Rank", "H", "RA", "HR", "SV", "IPouts", "HRA", "BB", "SO",
                 "SB", "CS", "HBP", "SF", "RunDiff")]
    nomissing = na.omit(rel)
    mdl.full = lm(WP ~ ., data = nomissing)
    bck = step(mdl.full, direction = "backward", trace = FALSE)
    summary(bck)

Don't forget to check your diagnostic plots!

    resi = residuals(bck)
    fit = predict(bck)
    hist(resi)
    plot(fit, resi, ylab = "Residual", xlab = "Fitted Value")

What is the goodness-of-fit of this model? Is it a good model?

Our model captures a significant portion of the variability in the response. However, let's look back at the quote by George Box stated earlier. Is our model useful? For starters, our model violates the "simpler is better" rule of thumb. If you were presenting this model to a Baseball Coach who was interested in understanding their Team's WP, would you tell them that the value is based on some fraction of their Home Runs (HR) value and some fraction of their Rank value, in addition to many other factors? No, you wouldn't, because that doesn't make any sense contextually. If we want to try narrowing down our model to fewer predictors, we can pick the variables that are the most significant (i.e. the variables that have the most asterisks to the right of them).

Just because R claims a model is the most significant statistically doesn't mean it's the most significant contextually.
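The behavior of R² versus adjusted R², and what step() does during backward elimination, can be tried on a dataset that ships with R. The sketch below uses mtcars purely as an assumption of convenience (it needs no extra packages and is unrelated to the baseball data); the choice of predictors is arbitrary.

```r
# mtcars ships with R; it stands in for the baseball data here.
small <- lm(mpg ~ wt, data = mtcars)
full  <- lm(mpg ~ wt + hp + qsec + drat + gear + carb, data = mtcars)

# R-squared cannot go down when predictors are added; adjusted R-squared can.
r2     <- c(small = summary(small)$r.squared,
            full  = summary(full)$r.squared)
adj_r2 <- c(small = summary(small)$adj.r.squared,
            full  = summary(full)$adj.r.squared)
print(r2)
print(adj_r2)

# Backward elimination: step() drops terms for as long as doing so lowers the AIC.
bck <- step(full, direction = "backward", trace = 0)
print(formula(bck))
```

Comparing formula(bck) with the full formula shows which predictors were judged not worth their added complexity; whether the surviving model makes sense in context is still a judgment call for the analyst.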

Presenting your Data with Context

Many different tools exist for analyzing your data and presenting your findings. However, if you don't present your results with context, the meaning is lost.

A great example of presenting data with context is available at: By overlaying the baseball diamond on the plot, the author is able to present the locations of Carlos Gomez's catches on the field with context.

Additional Topics: Time Series Analysis

A big assumption in linear regression is that each data point is independent of the others (the value of one does not affect the value of others). However, when time is involved, this assumption breaks down. For example, if it is below freezing today, it is likely to be below freezing tomorrow. Intuitively, we expect data points measured closer together in time to have similar response values. When we have data collected over time (time series data), we have separate statistical tools to work with.

R has a predefined dataset, co2, of time series data. The dataset consists of monthly measurements of carbon dioxide starting in 1959.

If the data you are working with is time series data, you can use stl() to get the estimated seasonal, trend, and remainder components of the original data. monthplot() will give you a plot of the estimated monthly means for the seasonal, trend, or remainder component.

    fit = stl(co2, s.window = "periodic")
    plot(fit)

[Figure: stl decomposition panels (data, seasonal, trend, remainder) plotted against time]

    monthplot(fit, choice = "seasonal")

[Figure: month plot of the seasonal component]
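To see what stl() actually returns, the components can be pulled out of the fitted object and checked against the original series. This is a small sketch using the same co2 dataset and the same call as above; the object names parts and rebuilt are introduced here only for illustration.

```r
fit <- stl(co2, s.window = "periodic")

# fit$time.series is a matrix with columns "seasonal", "trend", "remainder".
parts <- fit$time.series
print(colnames(parts))

# The decomposition is additive: the three components sum back to the data.
rebuilt <- rowSums(parts)
print(max(abs(rebuilt - as.numeric(co2))))

# With s.window = "periodic" the seasonal pattern repeats every 12 months.
print(round(parts[1:12, "seasonal"], 2))
```

The additive structure is why the remainder panel is useful as a diagnostic: anything the seasonal and trend components fail to capture ends up there.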

Further Reading

If you're interested in using a Statistical Model in your final project and want to learn more about how to handle categorical variables, logical variables, higher order terms, and the various models that exist for your data, check out:

Cannon, A. R. (2013). STAT2: Building models for a world of data. New York: W.H. Freeman.

If you're interested in presenting your R code and final report in an R Markdown file like this one, check out:

Udwin, D. & Baumer, B. (2015). R Markdown.

For more information on Confidence Intervals:

For more information on Diagnostic Plots for Regression:

For more information on Time Series Analysis:

Cryer, J. D. (2009). Time Series Analysis with Applications in R. Blackwell Publishing Ltd.

Section 2.3: Simple Linear Regression: Predictions and Inference

Section 2.3: Simple Linear Regression: Predictions and Inference Section 2.3: Simple Linear Regression: Predictions and Inference Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.4 1 Simple

More information

36-402/608 HW #1 Solutions 1/21/2010

36-402/608 HW #1 Solutions 1/21/2010 36-402/608 HW #1 Solutions 1/21/2010 1. t-test (20 points) Use fullbumpus.r to set up the data from fullbumpus.txt (both at Blackboard/Assignments). For this problem, analyze the full dataset together

More information

predict and Friends: Common Methods for Predictive Models in R , Spring 2015 Handout No. 1, 25 January 2015

predict and Friends: Common Methods for Predictive Models in R , Spring 2015 Handout No. 1, 25 January 2015 predict and Friends: Common Methods for Predictive Models in R 36-402, Spring 2015 Handout No. 1, 25 January 2015 R has lots of functions for working with different sort of predictive models. This handout

More information

EXST 7014, Lab 1: Review of R Programming Basics and Simple Linear Regression

EXST 7014, Lab 1: Review of R Programming Basics and Simple Linear Regression EXST 7014, Lab 1: Review of R Programming Basics and Simple Linear Regression OBJECTIVES 1. Prepare a scatter plot of the dependent variable on the independent variable 2. Do a simple linear regression

More information

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach i Applied Regression Modeling: A Business Approach Computer software help: SAS SAS (originally Statistical Analysis Software ) is a commercial statistical software package based on a powerful programming

More information

Section 2.1: Intro to Simple Linear Regression & Least Squares

Section 2.1: Intro to Simple Linear Regression & Least Squares Section 2.1: Intro to Simple Linear Regression & Least Squares Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.1, 7.2 1 Regression:

More information

SYS 6021 Linear Statistical Models

SYS 6021 Linear Statistical Models SYS 6021 Linear Statistical Models Project 2 Spam Filters Jinghe Zhang Summary The spambase data and time indexed counts of spams and hams are studied to develop accurate spam filters. Static models are

More information

Getting started with simulating data in R: some helpful functions and how to use them Ariel Muldoon August 28, 2018

Getting started with simulating data in R: some helpful functions and how to use them Ariel Muldoon August 28, 2018 Getting started with simulating data in R: some helpful functions and how to use them Ariel Muldoon August 28, 2018 Contents Overview 2 Generating random numbers 2 rnorm() to generate random numbers from

More information

Section 2.1: Intro to Simple Linear Regression & Least Squares

Section 2.1: Intro to Simple Linear Regression & Least Squares Section 2.1: Intro to Simple Linear Regression & Least Squares Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.1, 7.2 1 Regression:

More information

22s:152 Applied Linear Regression

22s:152 Applied Linear Regression 22s:152 Applied Linear Regression Chapter 22: Model Selection In model selection, the idea is to find the smallest set of variables which provides an adequate description of the data. We will consider

More information

Salary 9 mo : 9 month salary for faculty member for 2004

Salary 9 mo : 9 month salary for faculty member for 2004 22s:52 Applied Linear Regression DeCook Fall 2008 Lab 3 Friday October 3. The data Set In 2004, a study was done to examine if gender, after controlling for other variables, was a significant predictor

More information

Regression Lab 1. The data set cholesterol.txt available on your thumb drive contains the following variables:

Regression Lab 1. The data set cholesterol.txt available on your thumb drive contains the following variables: Regression Lab The data set cholesterol.txt available on your thumb drive contains the following variables: Field Descriptions ID: Subject ID sex: Sex: 0 = male, = female age: Age in years chol: Serum

More information

Section 4.1: Time Series I. Jared S. Murray The University of Texas at Austin McCombs School of Business

Section 4.1: Time Series I. Jared S. Murray The University of Texas at Austin McCombs School of Business Section 4.1: Time Series I Jared S. Murray The University of Texas at Austin McCombs School of Business 1 Time Series Data and Dependence Time-series data are simply a collection of observations gathered

More information

22s:152 Applied Linear Regression

22s:152 Applied Linear Regression 22s:152 Applied Linear Regression Chapter 22: Model Selection In model selection, the idea is to find the smallest set of variables which provides an adequate description of the data. We will consider

More information

Resampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016

Resampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016 Resampling Methods Levi Waldron, CUNY School of Public Health July 13, 2016 Outline and introduction Objectives: prediction or inference? Cross-validation Bootstrap Permutation Test Monte Carlo Simulation

More information

THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann

THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann Forrest W. Young & Carla M. Bann THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA CB 3270 DAVIE HALL, CHAPEL HILL N.C., USA 27599-3270 VISUAL STATISTICS PROJECT WWW.VISUALSTATS.ORG

More information

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Your Name: Section: 36-201 INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Objectives: 1. To learn how to interpret scatterplots. Specifically you will investigate, using

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Rebecca C. Steorts, Duke University STA 325, Chapter 3 ISL 1 / 49 Agenda How to extend beyond a SLR Multiple Linear Regression (MLR) Relationship Between the Response and Predictors

More information

Minitab 17 commands Prepared by Jeffrey S. Simonoff

Minitab 17 commands Prepared by Jeffrey S. Simonoff Minitab 17 commands Prepared by Jeffrey S. Simonoff Data entry and manipulation To enter data by hand, click on the Worksheet window, and enter the values in as you would in any spreadsheet. To then save

More information

Introduction to R, Github and Gitlab

Introduction to R, Github and Gitlab Introduction to R, Github and Gitlab 27/11/2018 Pierpaolo Maisano Delser mail: maisanop@tcd.ie ; pm604@cam.ac.uk Outline: Why R? What can R do? Basic commands and operations Data analysis in R Github and

More information

Statistics 251: Statistical Methods

Statistics 251: Statistical Methods Statistics 251: Statistical Methods Summaries and Graphs in R Module R1 2018 file:///u:/documents/classes/lectures/251301/renae/markdown/master%20versions/summary_graphs.html#1 1/14 Summary Statistics

More information

Regression Analysis and Linear Regression Models

Regression Analysis and Linear Regression Models Regression Analysis and Linear Regression Models University of Trento - FBK 2 March, 2015 (UNITN-FBK) Regression Analysis and Linear Regression Models 2 March, 2015 1 / 33 Relationship between numerical

More information

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools Abstract In this project, we study 374 public high schools in New York City. The project seeks to use regression techniques

More information

Solution to Bonus Questions

Solution to Bonus Questions Solution to Bonus Questions Q2: (a) The histogram of 1000 sample means and sample variances are plotted below. Both histogram are symmetrically centered around the true lambda value 20. But the sample

More information

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes.

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes. Resources for statistical assistance Quantitative covariates and regression analysis Carolyn Taylor Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC January 24, 2017 Department

More information

Nonparametric Classification Methods

Nonparametric Classification Methods Nonparametric Classification Methods We now examine some modern, computationally intensive methods for regression and classification. Recall that the LDA approach constructs a line (or plane or hyperplane)

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

An Introductory Guide to R

An Introductory Guide to R An Introductory Guide to R By Claudia Mahler 1 Contents Installing and Operating R 2 Basics 4 Importing Data 5 Types of Data 6 Basic Operations 8 Selecting and Specifying Data 9 Matrices 11 Simple Statistics

More information

Stat 579: More Preliminaries, Reading from Files

Stat 579: More Preliminaries, Reading from Files Stat 579: More Preliminaries, Reading from Files Ranjan Maitra 2220 Snedecor Hall Department of Statistics Iowa State University. Phone: 515-294-7757 maitra@iastate.edu September 1, 2011, 1/10 Some more

More information

PSS718 - Data Mining

PSS718 - Data Mining Lecture 5 - Hacettepe University October 23, 2016 Data Issues Improving the performance of a model To improve the performance of a model, we mostly improve the data Source additional data Clean up the

More information

Statistics Lab #7 ANOVA Part 2 & ANCOVA

Statistics Lab #7 ANOVA Part 2 & ANCOVA Statistics Lab #7 ANOVA Part 2 & ANCOVA PSYCH 710 7 Initialize R Initialize R by entering the following commands at the prompt. You must type the commands exactly as shown. options(contrasts=c("contr.sum","contr.poly")

More information

Practice in R. 1 Sivan s practice. 2 Hetroskadasticity. January 28, (pdf version)

Practice in R. 1 Sivan s practice. 2 Hetroskadasticity. January 28, (pdf version) Practice in R January 28, 2010 (pdf version) 1 Sivan s practice Her practice file should be (here), or check the web for a more useful pointer. 2 Hetroskadasticity ˆ Let s make some hetroskadastic data:

More information

BIO 360: Vertebrate Physiology Lab 9: Graphing in Excel. Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26

BIO 360: Vertebrate Physiology Lab 9: Graphing in Excel. Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26 Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26 INTRODUCTION Graphs are one of the most important aspects of data analysis and presentation of your of data. They are visual representations

More information

WELCOME! Lecture 3 Thommy Perlinger

WELCOME! Lecture 3 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important

More information

Simulating power in practice

Simulating power in practice Simulating power in practice Author: Nicholas G Reich This material is part of the statsteachr project Made available under the Creative Commons Attribution-ShareAlike 3.0 Unported License: http://creativecommons.org/licenses/by-sa/3.0/deed.en

More information

Introduction. About this Document. What is SPSS. ohow to get SPSS. oopening Data

Introduction. About this Document. What is SPSS. ohow to get SPSS. oopening Data Introduction About this Document This manual was written by members of the Statistical Consulting Program as an introduction to SPSS 12.0. It is designed to assist new users in familiarizing themselves

More information

Applied Statistics and Econometrics Lecture 6

Applied Statistics and Econometrics Lecture 6 Applied Statistics and Econometrics Lecture 6 Giuseppe Ragusa Luiss University gragusa@luiss.it http://gragusa.org/ March 6, 2017 Luiss University Empirical application. Data Italian Labour Force Survey,

More information

Lab #13 - Resampling Methods Econ 224 October 23rd, 2018

Lab #13 - Resampling Methods Econ 224 October 23rd, 2018 Lab #13 - Resampling Methods Econ 224 October 23rd, 2018 Introduction In this lab you will work through Section 5.3 of ISL and record your code and results in an RMarkdown document. I have added section

More information

Gelman-Hill Chapter 3

Gelman-Hill Chapter 3 Gelman-Hill Chapter 3 Linear Regression Basics In linear regression with a single independent variable, as we have seen, the fundamental equation is where ŷ bx 1 b0 b b b y 1 yx, 0 y 1 x x Bivariate Normal

More information

TI-83 Users Guide. to accompany. Statistics: Unlocking the Power of Data by Lock, Lock, Lock, Lock, and Lock

TI-83 Users Guide. to accompany. Statistics: Unlocking the Power of Data by Lock, Lock, Lock, Lock, and Lock TI-83 Users Guide to accompany by Lock, Lock, Lock, Lock, and Lock TI-83 Users Guide- 1 Getting Started Entering Data Use the STAT menu, then select EDIT and hit Enter. Enter data for a single variable

More information

Week 4: Describing data and estimation

Week 4: Describing data and estimation Week 4: Describing data and estimation Goals Investigate sampling error; see that larger samples have less sampling error. Visualize confidence intervals. Calculate basic summary statistics using R. Calculate

More information

Excel 2010 with XLSTAT

Excel 2010 with XLSTAT Excel 2010 with XLSTAT J E N N I F E R LE W I S PR I E S T L E Y, PH.D. Introduction to Excel 2010 with XLSTAT The layout for Excel 2010 is slightly different from the layout for Excel 2007. However, with

More information

Introduction to hypothesis testing

Introduction to hypothesis testing Introduction to hypothesis testing Mark Johnson Macquarie University Sydney, Australia February 27, 2017 1 / 38 Outline Introduction Hypothesis tests and confidence intervals Classical hypothesis tests

More information

STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring Lecture 5 Tuesday, Feb 1 STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

More information

Lecture 13: Model selection and regularization

Lecture 13: Model selection and regularization Lecture 13: Model selection and regularization Reading: Sections 6.1-6.2.1 STATS 202: Data mining and analysis October 23, 2017 1 / 17 What do we know so far In linear regression, adding predictors always

More information

Correlation. January 12, 2019

Correlation. January 12, 2019 Correlation January 12, 2019 Contents Correlations The Scattterplot The Pearson correlation The computational raw-score formula Survey data Fun facts about r Sensitivity to outliers Spearman rank-order

More information

Linear Methods for Regression and Shrinkage Methods

Linear Methods for Regression and Shrinkage Methods Linear Methods for Regression and Shrinkage Methods Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Linear Regression Models Least Squares Input vectors

More information

Section 3.2: Multiple Linear Regression II. Jared S. Murray The University of Texas at Austin McCombs School of Business

Section 3.2: Multiple Linear Regression II. Jared S. Murray The University of Texas at Austin McCombs School of Business Section 3.2: Multiple Linear Regression II Jared S. Murray The University of Texas at Austin McCombs School of Business 1 Multiple Linear Regression: Inference and Understanding We can answer new questions

More information

Statistics Lecture 6. Looking at data one variable

Statistics Lecture 6. Looking at data one variable Statistics 111 - Lecture 6 Looking at data one variable Chapter 1.1 Moore, McCabe and Craig Probability vs. Statistics Probability 1. We know the distribution of the random variable (Normal, Binomial)

More information

CS8803: Statistical Techniques in Robotics Byron Boots. Thoughts on Machine Learning and Robotics. (how to apply machine learning in practice)

CS8803: Statistical Techniques in Robotics Byron Boots. Thoughts on Machine Learning and Robotics. (how to apply machine learning in practice) CS8803: Statistical Techniques in Robotics Byron Boots Thoughts on Machine Learning and Robotics (how to apply machine learning in practice) 1 CS8803: Statistical Techniques in Robotics Byron Boots Thoughts

More information

Install RStudio from - use the standard installation.

Install RStudio from   - use the standard installation. Session 1: Reading in Data Before you begin: Install RStudio from http://www.rstudio.com/ide/download/ - use the standard installation. Go to the course website; http://faculty.washington.edu/kenrice/rintro/

More information
