Statistical Analysis in R Guest Lecturer: Maja Milosavljevic January 28, 2015
Data Exploration

Import Relevant Packages:

    library(grDevices)
    library(graphics)
    library(plyr)
    library(hexbin)
    library(base)
    library(stats)
    library(mosaic)
    library(datasets)

The Lahman package contains Sean Lahman's Baseball Database stored as a set of R data.frames. Today we will be using the Teams data.frame.

    library(Lahman)
    data(Teams)

What variables are in the Teams data.frame? For variable descriptions, see packages/lahman/lahman.pdf.

    names(Teams)

We can easily view the first few rows of a dataset with the head() command. This gives us an idea of what the data looks like without having to open the entire data.frame.

    head(Teams)

If we want information on the class (data type/structure) of each variable and the kinds of values it takes on, we can use the str() command.

    str(Teams)

Which teams have Boston in their name? Note that grep() is case-sensitive by default.

    unique(Teams[grep("Boston", Teams$name), c("name")])

Subset on Boston Red Sox:

    redsox = subset(Teams, name == "Boston Red Sox")

How would we find the unique values that the variable W (wins) takes on in the Red Sox dataset?
    sort(unique(redsox$W))

What about the unique values of W in the Teams dataset? Try on your own!

Summary Statistics

Let's get a little more information on the Wins variable in both the Red Sox and Teams datasets. We can easily find the minimum, maximum, median, mean, and 1st and 3rd quartiles using the summary() command. Going a little deeper, we can draw boxplots of Wins in both datasets and visually compare the summary statistics. Note that the boxplots show all summary statistics except the mean.

    summary(redsox$W)
    ## Min. 1st Qu. Median Mean 3rd Qu. Max.
    summary(Teams$W)
    ## Min. 1st Qu. Median Mean 3rd Qu. Max.
    boxplot(redsox$W, Teams$W, range = 0, names = c("Red Sox", "Teams"), ylab = "Wins")

[Figure: boxplots of Wins for Red Sox and Teams]

What conclusions can we make?

Correlation Coefficient

Now let's say we're interested in discovering whether Wins is positively correlated with Home Runs (that is, whether the number of wins increases as the number of home runs increases). We can start by plotting the two variables against each other:

    plot(Teams$HR, Teams$W, xlab = "Home Runs", ylab = "Wins")

Before we go on to calculate the correlation coefficient, let's make a plot that is a little more visually pleasing than the one we just produced. We are Visual Analysts after all.
    hbin <- hexbin(Teams$HR, Teams$W, xbins = 20, xlab = "HR", ylab = "W")
    plot(hbin)

[Figure: hexbin plot of W against HR, shaded by Counts]

There is some increase in Home Runs as Wins increases. Let's calculate the correlation coefficient to be sure. The correlation coefficient can take on any value between -1 and 1, inclusive. A value of -1 represents a perfect negative linear relationship (one variable decreases as the other increases), a value of 0 represents no linear correlation, and a value of 1 represents a perfect positive linear relationship. Values closer to -1 or 1 indicate a stronger relationship.

Calculate the correlation coefficient:

    cor(Teams$W, Teams$HR)
    ## [1]

Thus, we have a moderate positive relationship between Home Runs and Wins.

Data Enrichment

The Teams dataset has a lot of valuable information in it that we can use to perform various explorations. But let's say we're interested in exploring Winning Proportion and Runs Scored versus Runs Allowed, two variables that do not exist in our dataset. What do we do? We create the new variables ourselves! Don't assume that you're stuck with the variables in your dataset. You can always create and add new variables.

Winning Proportion is the number of Wins divided by Wins plus Losses. Runs Scored versus Runs Allowed is the number of Runs minus the number of Runs Allowed.

    df <- mutate(Teams, WP = W / (W + L), RunDiff = R - RA)

Now we have a new data.frame that contains the two additional variables we're interested in.
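The two derived variables can also be built without plyr. The sketch below is a minimal illustration on a hypothetical four-row data frame (the name toy and all of its numbers are made up, not taken from Teams); it uses base R's transform() in place of mutate() and then computes the correlation between the new columns.

```r
# Hypothetical mini data set; the values are invented for illustration only.
toy <- data.frame(
  W  = c(95, 81, 70, 60),      # wins
  L  = c(67, 81, 92, 102),     # losses
  R  = c(800, 720, 650, 600),  # runs scored
  RA = c(650, 715, 700, 760)   # runs allowed
)

# Base-R counterpart of mutate(): add the two derived columns.
toy <- transform(toy, WP = W / (W + L), RunDiff = R - RA)

# Pearson correlation between the two new variables.
r <- cor(toy$WP, toy$RunDiff)

print(toy)
print(r)
```

On the real data the same transform() call should work with Teams in place of toy, since it only relies on the W, L, R, and RA columns.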
Missing Values

R stores a missing value as NA.

- Use is.na() to check for missing values in your data.
- Use the na.rm = TRUE argument when computing statistics, such as the mean, on vectors with missing values.
- Use na.omit() to remove all rows with missing values in a dataset.

    head(Teams$DP)
    is.na(Teams$DP)
    sum(is.na(Teams$DP))           # number of missing values
    mean(Teams$DP)                 # NA, because of the missing values
    mean(Teams$DP, na.rm = TRUE)
    nomissing = na.omit(Teams)
    sum(is.na(nomissing))          # 0 after dropping incomplete rows

Quantify Uncertainties

The Teams dataset is a full, complete dataset that contains ALL of the data for the population of Baseball Statistics. This is a rarity; we often have only a random sample of the population that we are trying to get information about. In these cases, uncertainty quantification is key, so here is how we do it.

Disclaimer: The only reason we're creating a random sample is so we can learn how the bootstrap technique works. Since we already have ALL of the data in the Teams dataset, we know with certainty that the true population mean of DP is simply the mean DP from the Teams dataset.

    # create our random sample to justify the usage of the bootstrap technique
    sample = Teams[sample(nrow(Teams), 1000), ]

Using our random sample of Baseball Statistics throughout history, we can provide information on the true population mean of Double Plays (DP). To do this we calculate a confidence interval using the sample mean DP. A confidence interval allows us to quantify our uncertainty in the sample estimate. If we end up with a tight bound for our confidence interval, then we know that our sample estimate is a good estimate of the true Double Plays (DP) mean. However, if we have a wide bound, then we should use caution when using the sample estimate.

Today, we will use the bootstrap technique to find the confidence interval. The bootstrap technique works as follows. Let n = the number of observations you have in your dataset.

1. Create a new dataset of n observations by resampling from your original dataset with replacement.
2. Compute the statistic of interest on this sample. Here we are interested in the mean.
3. Repeat steps 1 and 2 many times and collect the results.

In step 1, why do we need to sample with replacement?

Question: How confident are we that the mean DP of this sample represents the true population mean?

Answer:

    mean(sample$DP, na.rm = TRUE)  # sample DP mean
    bstrap <- do(10000) * mean(resample(sample$DP), na.rm = TRUE)
    densityplot(~result, data = bstrap, plot.points = FALSE, xlab = "Mean Double Plays")
    qdata(c(0.025, 0.975), vals = result, data = bstrap)

We are 95% confident that the true population mean for Double Plays (DP) lies between the two values returned by qdata(). We have quantified our uncertainty in the sample mean estimate.
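The three bootstrap steps above can also be written in base R, without the mosaic helpers. This is a minimal sketch on a hypothetical numeric vector x (simulated values, not the DP column); the seed, the number of resamples, and the choice to drop missing values once up front are assumptions made for the illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical sample: 195 simulated values plus 5 missing values.
x <- c(rnorm(195, mean = 140, sd = 20), rep(NA, 5))
x <- x[!is.na(x)]  # drop missing values once, before resampling
n <- length(x)

# Steps 1-3: resample n values with replacement, take the mean, repeat.
boot_means <- replicate(5000, mean(sample(x, size = n, replace = TRUE)))

# Percentile-based 95% confidence interval for the mean.
ci <- quantile(boot_means, probs = c(0.025, 0.975))

print(mean(x))
print(ci)
```

The interval here is the simple percentile interval, which is the same idea as taking the 0.025 and 0.975 quantiles of the bootstrap means with qdata().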
Statistical Models

"All statistical models are wrong, but some are useful." - George Box

Simple Linear Regression: Let's find a model for predicting Winning Proportion (WP) using Runs Scored vs. Runs Allowed (RunDiff) in our new df dataset. Note that both variables are quantitative.

There are three key assumptions for using linear regression:

1. The relationship between your response and predictor variables is linear
2. Normality of the residuals
3. Constant variance of the residuals

    df <- df[df$G > 158, ]  # let's filter on the Teams that played a full season
    plot(df$RunDiff, df$WP, col = "blue", xlab = "RunDiff", ylab = "WP")  # assumption 1

[Figure: scatterplot of WP against RunDiff]

    mdl <- lm(df$WP ~ df$RunDiff)
    summary(mdl)
    ## Call:
    ## lm(formula = df$WP ~ df$RunDiff)
    ##
    ## Coefficients:
    ## (Intercept)  Pr(>|t|) <2e-16 ***
    ## df$RunDiff   Pr(>|t|) <2e-16 ***
    ##
    ## F-statistic: 9169 on 1 and 1251 DF, p-value: < 2.2e-16

[summary(mdl) output: the residual quantiles, coefficient estimates, residual standard error, and R-squared values are not legible in this copy]

    resi <- residuals(mdl)
    fit <- predict(mdl)
    hist(resi)  # assumption 2

[Figure: Histogram of resi]

    plot(fit, resi, ylab = "Residual", xlab = "Fitted Value")  # assumption 3

[Figure: Residual against Fitted Value]

Have we met our assumptions? Can we conclude that using a linear model to predict WP from RunDiff was appropriate?

Now that we have shown that using a linear model was appropriate, how do we quantify how good a model this is?

Answer: Use R² to assess the goodness-of-fit.

Since our response variable, WP, is quantitative, we can use the R² (Coefficient of Determination) value, or the adjusted R² value, to assess the goodness-of-fit. R² takes on a value between 0 and 1, inclusive, and represents the proportion of the variability in the response explained by the model. Thus, a good model will have a value close to 1.

The R² value never decreases, and almost always increases, when we add more predictors to the model. Why do you think this might be? The adjusted R² value is a modified version of R² that penalizes the model for adding additional variables. It ensures that the information added to the model makes up for the added complexity of the model. Since our model has only one variable and many observations, the two values are nearly the same.

A good rule of thumb: Simpler is better. If your adjusted R² value only increases a little bit when you add a variable, you probably don't need it.

What is the goodness-of-fit of this model? Is it a good model?

    plot(df$RunDiff, df$WP, col = "blue", xlab = "RunDiff", ylab = "WP")
    abline(mdl)  # overlay the fitted regression line

Automated Feature Selection

In the previous section we found a good simple linear model for predicting WP using a single variable. But now we want to go a little deeper and add more explanatory variables to our model. How can we easily find the best set of explanatory variables to predict WP?

Answer: Automated Feature Selection!

R has a few automated feature selection techniques that you can explore on your own, such as stepwise, forward, and best subset (warning: computationally intensive for datasets with many variables). Today we'll use Backward Elimination. Backward Elimination throws all of the predictors into the model, and then removes the ones that contribute the least. R's step() function makes that decision using AIC rather than individual p-values.

Make sure that your data has no missing values.

    rel = df[, c("WP", "Rank", "H", "RA", "HR", "SV", "IPouts", "HRA", "BB", "SO", "SB", "CS", "HBP", "SF", "RunDiff")]
    nomissing = na.omit(rel)
    mdl.full = lm(WP ~ ., data = nomissing)
    bck = step(mdl.full, direction = "backward", trace = 0)  # trace = 0 suppresses the step log
    summary(bck)

Don't forget to check your diagnostic plots!
    resi = residuals(bck)
    fit = predict(bck)
    hist(resi)
    plot(fit, resi, ylab = "Residual", xlab = "Fitted Value")

What is the goodness-of-fit of this model? Is it a good model?

Our model captures a significant portion of the variability in the response. However, let's look back at the quote by George Box stated earlier. Is our model useful? For starters, our model violates the "simpler is better" rule of thumb. If you were presenting this model to a Baseball Coach who was interested in understanding their Team's WP, would you tell them that the value is based on one coefficient times their Home Runs (HR) value and another coefficient times their Rank value, in addition to many other factors? No, you wouldn't, because that doesn't make any sense contextually.

If we want to try narrowing down our model to fewer predictors, we can pick the variables that are the most significant (i.e., the variables that have the largest number of asterisks to the right of them in the summary output).

Just because R claims a model is the most significant statistically doesn't mean it's the most significant contextually.
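To see backward elimination and the "simpler is better" trade-off on a dataset that ships with R, here is a minimal sketch that uses mtcars as a stand-in for the Teams data (the choice of mtcars and of mpg as the response is an assumption made for illustration, not part of the lecture). It fits the full model, lets step() drop predictors by AIC, and compares the number of predictors and the adjusted R² before and after.

```r
# mtcars ships with R; it stands in for the Teams data here.
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
reduced <- step(full, direction = "backward", trace = 0)

# Compare model size and adjusted R-squared before and after elimination.
n_full      <- length(coef(full)) - 1     # predictors in the full model
n_reduced   <- length(coef(reduced)) - 1  # predictors kept by step()
adj_full    <- summary(full)$adj.r.squared
adj_reduced <- summary(reduced)$adj.r.squared

print(formula(reduced))
cat("predictors:", n_full, "->", n_reduced, "\n")
cat("adjusted R-squared:", round(adj_full, 3), "->", round(adj_reduced, 3), "\n")
```

If the adjusted R² barely moves while the number of predictors drops, the smaller model is usually the easier one to explain, which is the point of the rule of thumb.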
Presenting your Data with Context

Many different tools exist for analyzing your data and presenting your findings. However, if you don't present your results with context, the meaning is lost. A great example of presenting data with context is a plot of Carlos Gomez's catches: by overlaying the baseball diamond on the plot, the author is able to present the locations of the catches on the field with context.

Additional Topics: Time Series Analysis

A big assumption in linear regression is that each data point is independent of the others (the value of one does not affect the value of others). However, when time is involved, this assumption breaks down. For example, if today it is below freezing, it is likely to be below freezing tomorrow. Intuitively, we expect data points measured closer together in time to have response values that are similar. When we have data collected over time (time series data), we have separate statistical tools to work with.

R has a predefined dataset, co2, of time series data. The dataset consists of monthly measurements of carbon dioxide starting in 1959.

If the data you are working with is time series data, you can use stl() to get the estimated seasonal, trend, and remainder components for the original data. monthplot() will give you a plot of the estimated monthly means for either the seasonal, trend, or remainder components.

    fit = stl(co2, s.window = "periodic")
    plot(fit)

[Figure: stl decomposition panels (data, seasonal, trend, remainder) against time]

    monthplot(fit, choice = "seasonal")

[Figure: monthplot of the seasonal component by month, J through D]

Further Reading

If you're interested in using a Statistical Model in your final project and want to learn more about how to handle categorical variables, logical variables, higher order terms, and the various models that exist for your data, check out:

Cannon, A. R. (2013). STAT2: Building models for a world of data. New York: W.H. Freeman.

If you're interested in presenting your R code and final report in an R Markdown file like this one, check out:

Udwin, D. & Baumer, B. (2015). R Markdown.

For more information on Confidence Intervals:

For more information on Diagnostic Plots for Regression:

For more information on Time Series Analysis:

Cryer, J. D. (2009). Time Series Analysis with Applications in R. Blackwell Publishing Ltd
More informationGoals of the Lecture. SOC6078 Advanced Statistics: 9. Generalized Additive Models. Limitations of the Multiple Nonparametric Models (2)
SOC6078 Advanced Statistics: 9. Generalized Additive Models Robert Andersen Department of Sociology University of Toronto Goals of the Lecture Introduce Additive Models Explain how they extend from simple
More informationWritten by Donna Hiestand-Tupper CCBC - Essex TI 83 TUTORIAL. Version 3.0 to accompany Elementary Statistics by Mario Triola, 9 th edition
TI 83 TUTORIAL Version 3.0 to accompany Elementary Statistics by Mario Triola, 9 th edition Written by Donna Hiestand-Tupper CCBC - Essex 1 2 Math 153 - Introduction to Statistical Methods TI 83 (PLUS)
More informationChapter 6: DESCRIPTIVE STATISTICS
Chapter 6: DESCRIPTIVE STATISTICS Random Sampling Numerical Summaries Stem-n-Leaf plots Histograms, and Box plots Time Sequence Plots Normal Probability Plots Sections 6-1 to 6-5, and 6-7 Random Sampling
More informationMinitab Study Card J ENNIFER L EWIS P RIESTLEY, PH.D.
Minitab Study Card J ENNIFER L EWIS P RIESTLEY, PH.D. Introduction to Minitab The interface for Minitab is very user-friendly, with a spreadsheet orientation. When you first launch Minitab, you will see
More informationRobust Linear Regression (Passing- Bablok Median-Slope)
Chapter 314 Robust Linear Regression (Passing- Bablok Median-Slope) Introduction This procedure performs robust linear regression estimation using the Passing-Bablok (1988) median-slope algorithm. Their
More informationS CHAPTER return.data S CHAPTER.Data S CHAPTER
1 S CHAPTER return.data S CHAPTER.Data MySwork S CHAPTER.Data 2 S e > return ; return + # 3 setenv S_CLEDITOR emacs 4 > 4 + 5 / 3 ## addition & divison [1] 5.666667 > (4 + 5) / 3 ## using parentheses [1]
More informationPackage GLDreg. February 28, 2017
Type Package Package GLDreg February 28, 2017 Title Fit GLD Regression Model and GLD Quantile Regression Model to Empirical Data Version 1.0.7 Date 2017-03-15 Author Steve Su, with contributions from:
More informationVCEasy VISUAL FURTHER MATHS. Overview
VCEasy VISUAL FURTHER MATHS Overview This booklet is a visual overview of the knowledge required for the VCE Year 12 Further Maths examination.! This booklet does not replace any existing resources that
More informationMore Summer Program t-shirts
ICPSR Blalock Lectures, 2003 Bootstrap Resampling Robert Stine Lecture 2 Exploring the Bootstrap Questions from Lecture 1 Review of ideas, notes from Lecture 1 - sample-to-sample variation - resampling
More informationPackage r2d2. February 20, 2015
Package r2d2 February 20, 2015 Version 1.0-0 Date 2014-03-31 Title Bivariate (Two-Dimensional) Confidence Region and Frequency Distribution Author Arni Magnusson [aut], Julian Burgos [aut, cre], Gregory
More informationSection 2.2: Covariance, Correlation, and Least Squares
Section 2.2: Covariance, Correlation, and Least Squares Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.1, 7.2 1 A Deeper
More informationTHE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533. Time: 50 minutes 40 Marks FRST Marks FRST 533 (extra questions)
THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533 MIDTERM EXAMINATION: October 14, 2005 Instructor: Val LeMay Time: 50 minutes 40 Marks FRST 430 50 Marks FRST 533 (extra questions) This examination
More informationA (very) brief introduction to R
A (very) brief introduction to R You typically start R at the command line prompt in a command line interface (CLI) mode. It is not a graphical user interface (GUI) although there are some efforts to produce
More informationAdvanced Econometric Methods EMET3011/8014
Advanced Econometric Methods EMET3011/8014 Lecture 2 John Stachurski Semester 1, 2011 Announcements Missed first lecture? See www.johnstachurski.net/emet Weekly download of course notes First computer
More informationNEURAL NETWORKS. Cement. Blast Furnace Slag. Fly Ash. Water. Superplasticizer. Coarse Aggregate. Fine Aggregate. Age
NEURAL NETWORKS As an introduction, we ll tackle a prediction task with a continuous variable. We ll reproduce research from the field of cement and concrete manufacturing that seeks to model the compressive
More informationSPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL
SPSS QM II SHORT INSTRUCTIONS This presentation contains only relatively short instructions on how to perform some statistical analyses in SPSS. Details around a certain function/analysis method not covered
More informationDescriptive Statistics, Standard Deviation and Standard Error
AP Biology Calculations: Descriptive Statistics, Standard Deviation and Standard Error SBI4UP The Scientific Method & Experimental Design Scientific method is used to explore observations and answer questions.
More informationUsing the DATAMINE Program
6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection
More informationfile:///users/williams03/a/workshops/2015.march/final/intro_to_r.html
Intro to R R is a functional programming language, which means that most of what one does is apply functions to objects. We will begin with a brief introduction to R objects and how functions work, and
More informationOLS Assumptions and Goodness of Fit
OLS Assumptions and Goodness of Fit A little warm-up Assume I am a poor free-throw shooter. To win a contest I can choose to attempt one of the two following challenges: A. Make three out of four free
More informationCSE 417T: Introduction to Machine Learning. Lecture 6: Bias-Variance Trade-off. Henry Chai 09/13/18
CSE 417T: Introduction to Machine Learning Lecture 6: Bias-Variance Trade-off Henry Chai 09/13/18 Let! ", $ = the maximum number of dichotomies on " points s.t. no subset of $ points is shattered Recall
More informationMulticollinearity and Validation CIVL 7012/8012
Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.
More informationCross-validation and the Bootstrap
Cross-validation and the Bootstrap In the section we discuss two resampling methods: cross-validation and the bootstrap. These methods refit a model of interest to samples formed from the training set,
More information( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value.
Calibration OVERVIEW... 2 INTRODUCTION... 2 CALIBRATION... 3 ANOTHER REASON FOR CALIBRATION... 4 CHECKING THE CALIBRATION OF A REGRESSION... 5 CALIBRATION IN SIMPLE REGRESSION (DISPLAY.JMP)... 5 TESTING
More information#NULL! Appears most often when you insert a space (where you should have comma) to separate cell references used as arguments for functions.
Appendix B Excel Errors Under certain circumstances, even the best formulas can appear to have freaked out once you get them in your worksheet. You can tell right away that a formula s gone haywire because
More informationProduct Catalog. AcaStat. Software
Product Catalog AcaStat Software AcaStat AcaStat is an inexpensive and easy-to-use data analysis tool. Easily create data files or import data from spreadsheets or delimited text files. Run crosstabulations,
More informationMultivariate Analysis Multivariate Calibration part 2
Multivariate Analysis Multivariate Calibration part 2 Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Linear Latent Variables An essential concept in multivariate data
More information