Lab #3: Probability, Simulations, Distributions:

Similar documents
DSCI 325: Handout 10 Summarizing Numerical and Categorical Data in SAS Spring 2017

Handling missing values in Analysis

The University of Cincinnati Space Management System (UCSMS) Updating Data Guide (authorized users only) January 2016

Lab #9: ANOVA and TUKEY tests

Using Proc Freq for Manageable Data Summarization

STA 570 Spring Lecture 5 Tuesday, Feb 1

3. Data Tables & Data Management

Biostatistics & SAS programming. Kevin Zhang

Level I: Getting comfortable with my data in SAS. Descriptive Statistics

Lab #3. Viewing Data in SAS. Tables in SAS. 171:161: Introduction to Biostatistics Breheny

ADOBE DREAMWEAVER CS4 BASICS

Data-Analysis Exercise Fitting and Extending the Discrete-Time Survival Analysis Model (ALDA, Chapters 11 & 12, pp )

STAT10010 Introductory Statistics Lab 2

ST Lab 1 - The basics of SAS

GIS LAB 1. Basic GIS Operations with ArcGIS. Calculating Stream Lengths and Watershed Areas.

Stat 302 Statistical Software and Its Applications SAS: Data I/O

Introduction to SAS. Cristina Murray-Krezan Research Assistant Professor of Internal Medicine Biostatistician, CTSC

Biostat Methods STAT 5820/6910 Handout #4: Chi-square, Fisher s, and McNemar s Tests

Module 1: Introduction RStudio

Running PeopleSoft Query Viewer and Running Query to Excel Basic Steps

Biostatistics & SAS programming. Kevin Zhang

DSCI 325: Handout 3 Creating and Redefining Variable in SAS Spring 2017

STAT:5400 Computing in Statistics

Club Leader Access to 4-H Online

3. Prepare all your graphs, illustrations and text by cutting them to size. For straight lines use a guillotine.

PHPM 672/677 Lab #2: Variables & Conditionals Due date: Submit by 11:59pm Monday 2/5 with Assignment 2

Assignment 0. Nothing here to hand in

Adobe Dreamweaver CC 17 Tutorial

Where Did My Files Go? How to find your files using Windows 10

DSCI 325: Handout 2 Getting Data into SAS Spring 2017

Epidemiologic Analysis Using R

Using MS Publisher. Launch MS Publisher: Start > All Programs > Microsoft Office > Publisher. Setting up Document Size and Orientation

CS 1110 SPRING 2016: GETTING STARTED (Jan 27-28) First Name: Last Name: NetID:

You will learn: The structure of the Stata interface How to open files in Stata How to modify variable and value labels How to manipulate variables

Applied Multivariate Analysis

Lastly, in case you don t already know this, and don t have Excel on your computers, you can get it for free through IT s website under software.

IPUMS Training and Development: Requesting Data

Using a Fillable PDF together with SAS for Questionnaire Data Donald Evans, US Department of the Treasury

Writing to and reading from files

DOWNLOAD PDF MICROSOFT EXCEL ALL FORMULAS LIST WITH EXAMPLES

EXST 7014, Lab 1: Review of R Programming Basics and Simple Linear Regression

MATLAB INTRODUCTION. Matlab can be used interactively as a super hand calculator, or, more powerfully, run using scripts (i.e., programs).

Programming assignment A

Exam Questions A00-281

Basic tasks in Excel 2013

Designing & Creating your GIS Poster

Designing & Creating your GIS Poster

Excel Shortcuts Increasing YOUR Productivity

Question: How do I move my mobile account from the Corporate to my Personal Account?

How do I use BatchProcess

Adobe Dreamweaver CS5 Tutorial

SharePoint SITE OWNER TRAINING

Document Imaging User Guide

Cell means coding and effect coding

22S:166. Checking Values of Numeric Variables

R in Linguistic Analysis. Week 2 Wassink Autumn 2012

UAccess ANALYTICS Next Steps: Working with Bins, Groups, and Calculated Items: Combining Data Your Way

Reference Guide. Adding a Generic File Store - Importing From a Local or Network ShipWorks Page 1 of 21

1 Getting started with Processing

Microsoft Word 2011 Tutorial

Top 15 Excel Tutorials

NC User Conference Tips and Tricks for SAS FM June 16, 2009

GETTING STARTED. A Step-by-Step Guide to Using MarketSight

To create a notebook on desktop version 1. Open One Note 2. File > New. o FILE. Options

POWER WORLD LAB MANUAL

Catering to Your Tastes: Using PROC OPTEX to Design Custom Experiments, with Applications in Food Science and Field Trials

TS04. Running OpenCDISC from SAS. Mark Crangle

- 1 - ABSTRACT. Paper TU02

STEPS FOR USING TURNING POINT:

Chapter 2 Entering Data. Chapter Table of Contents

STA9750 Lecture I OUTLINE 1. WELCOME TO 9750!

Excel 2016: Part 1. Updated January 2017 Copy cost: $1.50

Programming Assignment #4 Arrays and Pointers

Using IDLE for

CS1114: Matlab Introduction

Statistics 13, Lab 1. Getting Started. The Mac. Launching RStudio and loading data

Teaching Manual Math 2131

Chapter Four: Loops. Slides by Evan Gallagher. C++ for Everyone by Cay Horstmann Copyright 2012 by John Wiley & Sons. All rights reserved

Introduction to GIS & Mapping: ArcGIS Desktop

My Lakeland Student Guide

Majors & SSCH by Department Dashboard Guide

Laboratory 1: Eclipse and Karel the Robot

Lab 1. Introduction to R & SAS. R is free, open-source software. Get it here:

lecture notes September 2, How to sort?

Fractional. Design of Experiments. Overview. Scenario

RECURSION 7. 1 Recursion COMPUTER SCIENCE 61A. October 15, 2012

Intro to Programming. Unit 7. What is Programming? What is Programming? Intro to Programming

1) Log on to the computer using your PU net ID and password.

Introduction to SAS Procedures SAS Basics III. Susan J. Slaughter, Avocet Solutions

Intro To Excel Spreadsheet for use in Introductory Sciences

Introduction to using QGIS for Archaeology and History Workshop by the Empirical Reasoning Center

MISSING VALUES: Everything You Ever Wanted to Know

Introduction to Computation and Problem Solving

Working with Data in Windows and Descriptive Statistics

Exercise sheet 1 To be corrected in tutorials in the week from 23/10/2017 to 27/10/2017

Introduction to the Practice of Statistics Fifth Edition Moore, McCabe

LCD: 2 ( ) ( ) Interval Notation: (

DATA WAREHOUSE Extras

Part B. EpiData Analysis

Stat Wk 5. Random number generation. Special variables in data steps. Setting labels.

Transcription:

Lab #3: Probability, Simulations, Distributions: A. Objectives: 1. Reading from an external file 2. Create contingency table 3. Simulate a probability distribution 4. The Uniform Distribution Reading from external files: As we learned in the last lab the PROC IMPORT statement is the best way to enter external data sets. The CSV file we will be using is called Quest.csv. Download and save it in the desktop (easier to find) and mark the complete path to it. Then use the following code to import it, making sure you put the correct path on the DATAFILE argument. PROC IMPORT OUT= Quest1 DATAFILE= "put the complete path here \Quest1.csv" DBMS=CSV REPLACE; GETNAMES=YES; DATAROW=2; The IMPORT statement reads the dataset and stores it as the value designated by OUT in this case it will be saved as Quest1 in the library Work. Our dataset is now loaded in SAS. If we want to check it out a bit we need to click on the tab view and hit the explorer link. A folder-based SAS environment appears. Our data set is in the WORK folder and it is named Quest1. Click on it and the table will appear in an excel-like format. When you are done with manipulating the table click close from the file tab. If you don t do that the updates on the table will not be saved and SAS will stop working if you utilize that table in the next few lines of code. As usual it makes sense to sort our data, let s say by gender. Let s go ahead and do that using the following command sequence: PROC SORT data=quest1; BY gender; You may once again want to explore your dataset using the dropdown menu VIEW and navigating to EXPLORER and directory WORK. It is still saved as Quest1. Don t forget to click close from the file tab. If you don t do that the updates on the table will not be saved and SAS will stop working if you utilize that table in the next few lines of code.

Creating a contingency table : As we discussed in class an easy way to compare two categorical variables with a few possible outcomes is to use a contingency table. SAS is really good at creating those and the appropriate command is PROC FREQ. Try the following code: Title Contingency Table 1 ; PROC FREQ DATA=Quest1; TABLES local * gender/ norow nocol nocum nopercent; Title Contingency Table 2 ; PROC FREQ DATA=Quest1; TABLES past_stats * past_programming/ norow nocol nocum nopercent; This code created 2 contingency tables, one comparing the variable local with the variable gender and another comparing the variable past_stats with past_programming. The basic argument is TABLE which contains the variables you are comparing. The arguments norow, nocol, nocum, nopercent; suppress extra information, like the marginal probabilities and the relative frequencies. Remove them and re-run your code to see what the output looks like. Note also the number of missing values is given. The labels on the top left corner of the table explain what each of the numbers mean. For example Row Pct corresponds to the marginal probability with respect to rows and Col Pct the marginal probability with respect to columns. Inputting a discrete probability distribution: SAS has a great feature that allows the user to simulate any discrete probability distribution he or she desires. The idea is to provide SAS with an empirical probability and use that to build a dataset drawn from that probability with however many entries you want. Such a process is extremely useful when we require more data that look like the ones we have but we cannot collect more. Let s start with a simple example, of a bag containing 2 Yellow, 2 Green, 2 Red and 4 marbles. There are various ways to input that into SAS but we will just stick to the basic input as datalines as we learned in the very first Lab. Namely we will use the following code:

Title Random Bag ; DATA Bag; INPUT colors $; datalines; Green Green Yellow Yellow Red Red ; This should have a created a dataset named Bag, with the entries listed above. If you want to view that dataset again go to view, explorer, work, and click on Bag. If we want to see just the counts of the various outcomes we should use the PROC FREQ command as follows: PROC FREQ data=bag; TABLES colors/ norow nocol nocum; Once again, note the arguments norow, nocol, nocum, that suppress some information on the frequency table. Experiment with removing those. By reading the relative frequency table (percentages) we get the following: colors Random Bag The FREQ Procedure Frequency Percent 4 40.00 Green 2 20.00 Red 2 20.00 Yellow 2 20.00 Note that this is indeed a probability distribution since the percentages add up to 1. What we would like to do now is simulate 200 repetitions of drawing a marble from the bag, observing its color and putting it back in. To do that we will ask SAS to create a specific probability distribution, with the numbers provided above.

The command for that is again the generation of a dataset, only this time the input is not given by the user. Instead it is decided by SAS based on the parameters we define. The following commands will do the trick: Title Simulation of Bag ; DATA Simul(keep=Type); CALL streaminit(1233); ARRAY p[4] (0.2 0.2 0.2 0.4); DO i = 1 to 200; Type = RAND("Table", of p[*]); output; END; Again the argument DATA will give the name Simul to our dataset. And the variable of interest is denoted by the use of KEEP so for this example it is going to be Type. The argument streaminit(1233) is important. What it does is it generates the following table randomly, but saves the random process into a stream, so that if two people run this code with the same stream they will get the same answers, or if you re-run the code you will get the same answers. This allows for the TA to check your work, and for you to be able to replicate it exactly. The argument ARRAY p[4] basically gives the probabilities for each outcome in a string (vector or array). The 4 is obviously to denote how many different outcomes there are. In this part of the code the outcomes don t have a name yet. They are defaulted by SAS to 1,2,3,4. We will change that in a bit. So you insert the corresponding probabilities in any order, as long as you mark it down. The line DO i = 1 to 200; is a basic loop that instructs SAS to choose from the distribution above 100 different entries numbered 1 through 200. The next line say to define the Type variable (our simulated outcome) using a randomized selection. That is the command RAND which is a build in function is SAS that produces randomized results from the table of probabilities given (basically whatever is inside the parenthesis after RAND) We close the code loop that started with the DO i =1 to 100 with the argument END; this tells SAS that it can stop repeating things and move to the next part of the code. Let me remind you that you can always view the dataset you just created using the command PROC PRINT. The following code should do the trick: PROC PRINT data=simul; TITLE "Unsorted Simulated Data"; As we see the outcomes are not, Green, Red, and Yellow but instead SAS uses 1,2,3,4. To fix that we need to set the format by specifying certain associations within the value CALL. The following code does that for us:

PROC FORMAT; VALUE Call 1='Green' 2='Yellow' 3='Red' 4=''; We append the associations 1='Green' 2='Yellow' 3='Red' 4='' to a list named Call, which we will prompt SAS to use every time it sees 1,2,3,4 as values in our dataset. For example the commands: PROC PRINT data=simul; TITLE "Unsorted Simulated Data good labels"; FORMAT Type Call.; Will generate our 200 observations using the correct labels. To finish off this problem, we want to compute the new probabilities coming from the simulated dataset. These are basically the proportions of each outcome vs the total number (200). This is the frequentist type of probability that we discussed in class (p n). Once again, the limit of that as the number of observations (n) tends to infinity will match the probabilities we gave before. We again invoke the PROC FREQ command as follows: PROC FREQ data=simul; TITLE Simulated probabilities with good labels ; FORMAT Type Call.; TABLES Type / nocum; The argument nocum, asks SAS not to output the cumulative probability, but keep everything else. If you want more information about how to simulate data with SAS follow this link to an official SAS guide: https://support.sas.com/resources/papers/proceedings15/sas1387-2015.pdf Analyzing the uniform distribution: As we learned in the lecture, the simplest of the discrete probability distributions is the uniform one. For this distribution, all outcomes have the same probability, or in other words they are equally likely to occur. Let s create a uniform distribution in SAS and then simulate 30 draws from it. Let s assume we want to simulate an 8 sided die. Such a die can be created by gluing together two square bottomed pyramids, but simulating it in SAS is much easier. All we need to do is define a table of probabilities as we did before. Obviously since all outcomes are equally likely and we have 8 of them the probability for each is 12.5% or 1/8. The following code does the trick:

Title Simulation of 8 Sided Dice ; DATA Dice(keep=Number); CALL streaminit(1233); ARRAY p[8] (0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125); DO i = 1 to 30; Number = RAND("Table", of p[*]); output; END; To see again the simulated probabilities we use the PROC FREQ command as follows. (Note this time we don t even need to change the labels so there is no need to invoke any formatting) PROC FREQ data=dice; TITLE Simulated probabilities of dice throws ; TABLES Number / nocum;