Lab 3: Sampling Distributions

Sampling from Ames, Iowa

In this lab, we will investigate the ways in which the estimates that we make based on a random sample of data can inform us about what the population might look like. We're interested in formulating a sampling distribution of our estimate in order to get a sense of how good an estimate it is.

The Data

The data set that we'll be considering comes from the town of Ames, Iowa. The Assessor's Office records information on all real estate sales, and the data set we're considering contains information on all residential home sales between 2006 and 2010. We will consider these data as our statistical population. In this lab we would like to learn as much as we can about these homes by taking smaller samples from the full population. Let's load the data.

ames <- read.delim("http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt")

We see that there are quite a few variables in the data set, but we'll focus on the number of rooms above ground (TotRms.AbvGrd) and sale price (SalePrice). Let's look at the distribution of the number of rooms in homes in Ames by calculating some summary statistics and making a histogram.

summary(ames$TotRms.AbvGrd)
hist(ames$TotRms.AbvGrd)

Exercise 1 How would you describe this population distribution?

The Unknown Sampling Distribution

In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or even impossible. Because of this, we often take a smaller sample survey of the population and use it to make educated guesses about the properties of the population.

If we were interested in estimating the mean number of rooms in homes in Ames based on a sample, we can use the following command to survey the population.

samp1 <- sample(ames$TotRms.AbvGrd, 75)

This command creates a new vector called samp1 that is a simple random sample of size 75 from the population vector ames$TotRms.AbvGrd. At a conceptual level, you can imagine randomly choosing 75 entries from the Ames phonebook, calling them up, and recording the number of rooms in their houses. You would be correct in objecting that the phonebook probably doesn't contain phone numbers for all homes, and that there will almost certainly be people who don't pick up the phone or refuse to give out this information. These are issues that can make gathering data very difficult and are a strong incentive to collect a high-quality sample.
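Before moving on to the next exercise, note that the same commands we used to describe the population also work on the sample you just took. Here is a minimal sketch, not part of the lab's required code, that puts the population and sample histograms side by side (the par() call is just one convenient way to arrange the two plots):

summary(samp1)
par(mfrow = c(1, 2))                          # two plots side by side
hist(ames$TotRms.AbvGrd, main = "Population")
hist(samp1, main = "Sample of 75")
par(mfrow = c(1, 1))                          # reset the plotting layout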

Exercise 2 How would you describe the distribution of this sample? How does it compare to the distribution of the population?

If we're interested in estimating the average number of rooms in homes in Ames, our best guess is going to be the sample mean from this simple random sample.

mean(samp1)

Exercise 3 How does your sample mean compare to your team mates' sample means? Are the sample means the same? Why or why not?

Depending on which 75 homes you selected, your estimate could be a bit above or a bit below the true population mean of 6.438 rooms. In general, though, the sample mean turns out to be a pretty good estimate of the average number of rooms, and we were able to get it by sampling less than 3% of the population.

Exercise 4 Take a second sample, also of size 75, and call it samp2. How does the mean of samp2 compare with the mean of samp1? If we took a third sample of size 150, would you intuitively expect its sample mean to be a better or worse estimate of the population mean?

Not surprisingly, every time we take another random sample, we get a different sample mean. It's useful to get a sense of just how much variability we should expect when estimating the population mean this way. This variability is what is captured by the sampling distribution. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps 5000 times.

samp_means <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(ames$TotRms.AbvGrd, 75)
  samp_means[i] <- mean(samp)
}
hist(samp_means, probability = TRUE)

Here we rely on the computational ability of R to quickly take 5000 samples of size 75 from the population, compute each of those sample means, and store them in a vector called samp_means. Note that since there are multiple lines of code in the for loop, it will make your life much easier to first write this code in an R script and then run it from there.
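One practical aside that is not part of the lab instructions: sample() draws new random numbers on every call, so re-running the loop gives a slightly different samp_means each time. If you want your results to be exactly reproducible, you can fix R's random seed first. A small sketch follows; the seed value and the vector names samp_a and samp_b are arbitrary choices for illustration.

set.seed(42)                              # any integer works; fixes R's random number stream
samp_a <- sample(ames$TotRms.AbvGrd, 75)
set.seed(42)                              # resetting the same seed...
samp_b <- sample(ames$TotRms.AbvGrd, 75)
identical(samp_a, samp_b)                 # ...reproduces exactly the same sample: TRUE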

Exercise 5 How many elements are there in samp_means? How would you describe this sampling distribution? On what value is it centered? Would you expect the distribution to change if we instead collected 50,000 sample means?

Interlude: The For Loop

Let's take a break from the statistics for a moment to let that last block of code sink in. You have just run your first for loop, a cornerstone of computer programming. The idea behind the for loop is iteration: it allows you to execute code as many times as you want without having to type out every iteration. In the case above, we wanted to iterate the two lines of code inside the curly braces, which take a random sample of size 75 from the population of number of rooms in homes (ames$TotRms.AbvGrd) and then save the mean of that sample into the samp_means vector. Without the for loop, this would be painful:

samp_means <- rep(0, 5000)
samp <- sample(ames$TotRms.AbvGrd, 75)
samp_means[1] <- mean(samp)
samp <- sample(ames$TotRms.AbvGrd, 75)
samp_means[2] <- mean(samp)
samp <- sample(ames$TotRms.AbvGrd, 75)
samp_means[3] <- mean(samp)
samp <- sample(ames$TotRms.AbvGrd, 75)
samp_means[4] <- mean(samp)

and so on... and so on... and so on.

With the for loop, these thousands of lines of code are compressed into just a few lines.

samp_means <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(ames$TotRms.AbvGrd, 75)
  samp_means[i] <- mean(samp)
  print(i)
}

Let's consider this code line by line to figure out what it does. In the first line we initialized a vector. In this case, we created a vector of 5000 zeros called samp_means. This vector will serve as an empty box waiting to be filled by whatever output we produce in the for loop.

The second line calls the for loop itself. The syntax can be loosely read as, "for every element i from 1 to 5000, run the following lines of code." The last component of the loop is whatever is inside the curly braces: the lines of code that we'll be iterating over. Here, on every pass through the loop, we take a random sample of size 75 from ames$TotRms.AbvGrd, take its mean, and store it as the i-th element of samp_means.

Notice that in addition to automatically iterating the code 5000 times, the for loop provides an object that is specific to each pass: the i. You can think of i as the counter that keeps track of which loop you're on. On the first pass, i takes the value of 1; on the second pass, it takes the value of 2; and so on, until the 5000th and final pass, where it takes the value of 5000. To show that this is really happening, we asked R to print i at each iteration. The line of code where we ask R to print i, print(i), is optional; it is only there to display what's going on while the for loop is running.

The for loop allows us not just to run the code 5000 times, but to neatly package the results, element by element, back into the empty vector that we initialized at the outset.

Exercise 6 To make sure you understand what you've done in this loop, try running a smaller version. Initialize a vector of 100 zeros called samp_means_small. Run a loop that takes a sample of size 75 from ames$TotRms.AbvGrd and stores the sample mean in samp_means_small, but only iterate from 1 to 50. Print the output to your screen (type samp_means_small into the console and press enter). How many elements are there in this object called samp_means_small? What does each element represent? Do you notice something odd going on here? How would you fix this?

Approximating the Sampling Distribution

Mechanics aside, let's return to the reason we used a for loop: to compute a sampling distribution, specifically, this one.

hist(samp_means, probability = TRUE)

The sampling distribution that we computed tells us everything that we would hope for about the average number of rooms in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average number of rooms in the population, and the spread of the distribution indicates how much variability is induced by sampling only 75 of the homes.
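Because we have the full population in this lab, you can check both of these claims directly. A quick sketch, not required by the lab (your exact numbers will differ slightly from run to run):

mean(ames$TotRms.AbvGrd)   # the true population mean number of rooms
mean(samp_means)           # center of the simulated sampling distribution; should be very close
sd(samp_means)             # spread of the simulated sampling distribution (its standard error)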

We computed the sampling distribution for the mean number of rooms by drawing 5000 samples from the population and calculating 5000 sample means. This was only possible because we had access to the population. In most cases you don't (if you did, there would be no need to estimate!). Therefore, you have only your single sample to rely upon... that, and the Central Limit Theorem. The Central Limit Theorem states that, under certain conditions, the sample mean follows a normal distribution. This allows us to make the inferential leap from our single sample to the full sampling distribution that describes every possible sample mean you might come across. But we need to look before we leap.

Exercise 7 Does samp1 meet the conditions for the sample mean to be nearly normal, as described in section 4.2.3?

If the conditions are met, then we can find the approximate sampling distribution by plugging in our best estimates for the population mean and standard error: x̄ and s/√n.

xbar <- mean(samp1)
se <- sd(samp1)/sqrt(75)

We can add a curve representing this approximation to our existing histogram using the lines command. The x-coordinates cover the range of the x-axis in the histogram, and the y-coordinates are the values that correspond to this particular normal distribution.

hist(samp_means, probability = TRUE)
lines(x = seq(3, 10, 0.01), y = dnorm(seq(3, 10, 0.01), mean = xbar, sd = se), col = "blue")

We can see that the line does a decent job of tracing the histogram that we derived from having access to the population. In this case, our approximation based on the CLT is a good one.
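If you would like a numerical companion to the plot, you can compare the single-sample CLT approximation with the simulated sampling distribution; this check is only possible here because we built samp_means from the full population. A brief sketch:

c(clt = xbar, simulated = mean(samp_means))   # centers: x-bar vs. mean of the 5000 simulated means
c(clt = se, simulated = sd(samp_means))       # spreads: s/sqrt(75) vs. simulated standard error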

On Your Own

So far we have focused only on estimating the mean number of rooms in the homes of Ames. Now we'll try to estimate the mean sale price.

1. Take a random sample of size 50 from ames$SalePrice. Using this sample, what is your best point estimate of the population mean? Include a histogram of this sample in your answer.

2. Check the conditions for the sampling distribution of x̄_SalePrice to be nearly normal.

3. Since you have access to the population, compute the sampling distribution for x̄_SalePrice by taking 5000 samples of size 5 from the population and computing 5000 sample means. Store these means in a vector called samp_means_5. Describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean sale price of homes in Ames to be? Include a histogram of the sampling distribution.

4. Change your sample size from 5 to 150, then compute the sampling distribution using the same method as above and store these means in a new vector called samp_means_150. Describe the shape of this sampling distribution (where n = 150) and compare it to the sampling distribution from earlier (where n = 5). Based on this sampling distribution, what would you guess the mean sale price of the homes in Ames to be? Include a histogram of the sampling distribution.

5. Based on their shape, which sampling distribution would you feel more comfortable approximating with the normal model?

6. Which sampling distribution has a smaller spread? If we're concerned with making estimates that are consistently close to the true value, is having a sampling distribution with a smaller spread more or less desirable?

7. What concepts from the textbook are covered in this lab? What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.