Stat-340 Term Test Spring Term


Stat-340 Term Test 1
2015 Spring Term

Part 1 - Multiple Choice

Enter your answers to the multiple choice questions on the provided bubble sheets. Each multiple choice question is worth 1 mark; there is no correction for guessing. Be sure your student name and number are completed on the bubble sheets.

1. How many observations and variables are contained in the following dataset?

    data blah;
       infile datalines;
       length name $10 sex $1 partnername $10 partnersex $1;
       input name $ sex $ / partnername $ partnersex $;
       datalines;
    Carl M
    Lois F
    Matthew M
    Fred M
    Selina F
    David M
    Tim M
    Kim .
    ;;;;

(a) 8 observations, 2 variables.
(b) 4 observations, 4 variables.
(c) 8 observations, 4 variables.
(d) 3 observations, 4 variables.
(e) 4 observations, 2 variables.

Solution: (b)
Option A - 14% chose this in 2015. Notice the slash in the input statement, which makes SAS go to a new line for the last 2 variables.
Option B - 60% chose this in 2015.
Option C - 24% chose this in 2015. See the note for option (a).
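The effect of the slash in Question 1 can be checked by running the step and asking SAS for the dimensions of the result. This is a minimal sketch; it assumes the data lines hold one person per line (the transcription lost the original line breaks), and the Proc Contents and Proc Print steps are additions for checking, not part of the exam question.

```sas
/* Sketch: rerun the Question 1 data step and inspect its dimensions. */
/* Assumption: one person per data line.                              */
data blah;
   infile datalines;
   length name $10 sex $1 partnername $10 partnersex $1;
   /* The slash moves the input pointer to the next data line, */
   /* so each observation consumes two lines.                  */
   input name $ sex $ / partnername $ partnersex $;
   datalines;
Carl M
Lois F
Matthew M
Fred M
Selina F
David M
Tim M
Kim .
;;;;
run;

proc contents data=blah;   /* reports the number of observations and variables */
run;

proc print data=blah;      /* shows how the lines were paired */
run;
```

With eight data lines read two at a time, the report should agree with the stated solution of 4 observations and 4 variables.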

2. Which of the following is TRUE about BY group processing?

(a) A different analysis can be performed for each BY group.
(b) The BY variable must be a numeric or date variable.
(c) The data does not have to be grouped together by values of the BY variables.
(d) The BY groups can have different numbers of variables.
(e) BY group processing can be done for any procedure.

Solution: (e)
Option D - 11% chose this in 2015. All of the BY groups are subsets of the data and so have the same number of variables.
Option E - 88% chose this in 2015. BY variables can be any type.

3. Which of the following is correct about a standard error of a statistic?

(a) The se measures how much the sample size changes in a simulation study.
(b) The se measures the standard deviation of the population slope over bootstrap samples from the data.
(c) The se measures the standard deviation of the Gini-estimate of the standard deviation between different populations.
(d) The se measures the increase in the number of calories for each additional gram of fat.
(e) The se measures how much a statistic will vary when new samples are taken from a population.

Solution: (e)
Option A - The SE never measures variation in the sample size.
Option B - Population parameters (slopes) are fixed and do not vary.
Option C - This doesn't even make sense - there is only one population.
Option D - This is the definition of a slope and not a standard error.
Option E - 95% chose this in 2015.

4. Consider the following segment of code:

    data birthdays;
       infile datalines;
       length name $30;
       input name $ bdate:yymmdd10.;
       format bdate mmddyy8.;
       datalines;
    carl 63/02/01
    lois 48/14/02
    fred 58/06/03
    tim 52/07/04
    dave 63/12/31
    ;;;;
    proc print data=birthdays;

© 2015 Carl James Schwarz

Which of the following is correct?

(a) The birth day for Carl will be displayed as 02/01/63.
(b) The birth day for Lois will be displayed as 14/02/48.
(c) The birth day for Fred will be displayed as 1958-06-03.
(d) The birth day for Tim will be displayed as 04/07/1952.
(e) The birth day for Dave will be displayed as a missing value.

Solution: (a)
Option A - 73% chose this in 2015.
Option D - 14% chose this in 2015. The mmddyy8. output format has a width of only 8, so it shows 2-digit years.

5. Consider the following code:

    data blah;
       infile datalines;
       length name sex $10.;
       input name sex age weight;
       if age > 30 then delete;
       drop weight;
       datalines;
    A f 27 90
    B m 35 120
    C F 23 60
    D M 24 75
    E F . 43
    ;;;;

Which of the following is correct?

(a) The blah dataset has 5 observations and 4 variables.
(b) The blah dataset has 4 observations and 4 variables.
(c) The blah dataset has 3 observations and 3 variables.
(d) The blah dataset has 4 observations and 3 variables.
(e) The blah dataset has 5 observations and 3 variables.

Solution: (d)
Option B - 15% chose this in 2015. The Drop statement removes a variable.
Option D - 65% chose this in 2015.
Option E - 11% chose this in 2015. The If statement removes an observation.

6. Which of the following is correct?

(a) PROC GLM is used to test hypotheses about population mean proportions.
(b) PROC FREQ is used to test hypotheses about sample proportions.

(c) PROC REG is used to test hypotheses about population slopes.
(d) PROC GENMOD is used to test hypotheses about sample proportions.
(e) PROC TTEST is used to test hypotheses about paired sample means.

Solution: (c)
Option A - 14% chose this in 2015. There is no such thing as a MEAN proportion!
Option C - 61% chose this in 2015.
Option D - 10% chose this in 2015. Hypotheses are ALWAYS about POPULATION parameters, not sample statistics.
Option E - 15% chose this in 2015. Hypotheses are ALWAYS about POPULATION parameters, not sample statistics.

7. Consider the following SAS code:

    data blah;
       infile datalines dlm=',' dsd missover;
       input v1 v2 v3 v4 v5 v6;
       datalines;
    1,,2,3,4,5,6,7,8,9
    2,3,.,5,6,7,8,9,0
    3,4,5,.,6,7,8,9,0,1,2
    9,8,7,,6,5,4,3,2,1,0
    7,,6,,5,,9,,4,,3,,1,,
    ;;;;

Which of the following is correct?

(a) The value of v2 in the first observation is 2.
(b) The value of v3 in the second observation is 5.
(c) The value of v4 in the third observation is missing.
(d) The value of v6 in the fourth observation is 4.
(e) The value of v3 in the fifth observation is 5.

Solution: (c)
Option C - 95% chose this in 2015.

8. Consider the following SAS code:

    data blah;
       infile datalines;
       length surname $10 sex $1;
       input surname sex age;
       datalines;
    schwarz m 56

    schwarz f 53
    zhao f 48
    zhao m 52
    sun m 27
    chao f 23
    chao m 27
    ;;;;
    proc sort data=blah;
       by surname;
    proc transpose data=blah out=transblah;
       by surname;
       var age;
       id sex;

Which of the following is correct?

(a) The resulting transblah dataset has 3 observations.
(b) The value of the variable M for the first observation in the transblah dataset is 56.
(c) The observation for surname Sun will have the value of 27 for the ages of both sexes.
(d) The value of the variable F for the last observation in the transblah dataset is 23.
(e) The 4th observation in the transblah dataset will have 52 as the value for the M variable.

Solution: (e)
Option A - 16% chose. There are 4 distinct values for the Surname variable, so the resulting dataset will have 4 observations.
Option B - 28% chose. Don't forget that the data are sorted before transposing.
Option C - 11% chose. Because Sun does not have a complete set of values, the variables without a value will be set to missing.
Option E - 39% chose.

9. Consider the following SAS code:

    proc tabulate data=accidents missing;
       class month Accident_Severity;
       var fatality;
       table Accident_Severity ALL, month*fatality*mean*f=7.2;

Which of the following is correct?

(a) The Accident_Severity variable will be along the top of the table (the columns).
(b) The mean number of fatalities in each month and Accident_Severity will be found.
(c) Each row of the table will correspond to a different value of the Accident_Severity variable, with the final row a summary over all codes.

(d) The missing option on the Proc statement ensures that missing values are ignored during the tabulation.
(e) If the Accident_Severity variable had 3 levels, and if the month variable had 12 levels, the table would have 36 cells.

Solution: (c)
Option B - 40% chose this in 2015. Month is not used in the Table statement.
Option C - 35% chose this in 2015.
Option D - 12% chose this in 2015. The missing option also tabulates the missing values.
Option E - 12% chose this in 2015. The ALL option will generate a row at the end for all codes.

10. Consider the following piece of SAS code:

    data blah;
       infile datalines;
       length name $10 sex $1;
       input name sex YearOfBirth;
       Age = 2015 - YearOfBirth;
       datalines;
    Carl M 1956
    Lois . 1943
    Fred . .
    Matthew M 1926
    Marianne F -1
    David M 1922
    Julia F 2016
    ;;;;

Which of the following is correct?

(a) The computed value of Age for Carl is 59.
(b) The computed value of Age for Fred is 0.
(c) The computed value of Age for Marianne is missing.
(d) The computed value of Age for David is 1922.
(e) The computed value of Age for Julia is missing.

Solution: (a)
Option A - 96% chose this in 2015.
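The distractors in Question 10 turn on how SAS arithmetic treats missing and implausible values. The sketch below uses a cut-down version of the question's data (the sex column is dropped and the dataset name ages is made up for illustration) to show that a missing YearOfBirth propagates to a missing Age, while -1 and 2016 are ordinary numbers and give ordinary results.

```sas
/* Sketch: missing values propagate through arithmetic;  */
/* odd-looking but valid numbers do not become missing.  */
data ages;                      /* hypothetical dataset name */
   input name $ YearOfBirth;
   Age = 2015 - YearOfBirth;    /* missing in, missing out */
   datalines;
Carl 1956
Fred .
Marianne -1
Julia 2016
;;;;
run;

proc print data=ages;
run;
```

Carl's Age is 2015 - 1956 = 59, Fred's is missing, Marianne's is 2015 - (-1) = 2016, and Julia's is 2015 - 2016 = -1. Only Fred's value is missing, which is why options (b), (c), and (e) are wrong.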

Part II - Long Answer

Stat-340-2015 Spring Term - Term Test 1
Name:
Student Number:

Put your name and student number on the upper right of each of the following pages as well, in case the pages get separated. Answer the following questions in the space provided. Be sure that your answers are legible. The marks given to these questions are 5, 6, 3, 4, and 7 respectively.

1. Interpretation - 5 Marks: Consider the following output from an analysis of the cereal dataset:

Write a SHORT paragraph here summarizing the results.

Solution: The relationship between the calories/serving and the grams of fat/serving was investigated using linear regression (Figure 1). The fitted equation is

    Calories = 95 + 9.8(Fat)

There was strong evidence that the slope is different from 0 (p < .0001). For every additional gram of fat, the calories/serving is expected to increase by 9.8 (SE 2.2) calories/gram of fat.

Common problems in solutions from students include:

- Reporting too many decimal places. Seldom do you need to report more than two significant digits.
- The intercept is usually not of interest, so you don't usually spend any time discussing it.
- The whole point of regression is to estimate the slope, so the discussion needs to be about the slope. Many students discussed differences in means (which is not sensible), or differences in the mean among groups, which is again not sensible. These students were likely confusing regression with ANOVA.
- Don't just give the table values as facts; add some interpretation to the information in the table. For example, many students had sentences such as "The parameter estimate for Fat was 9.8. The standard error was 2.21. The t-value was 4.44 and the p-value was <.0001 so we rejected the null hypothesis." These types of sentences provide no useful information to the reader over and above the table.
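For context, output of the kind interpreted in this question is typically produced by a simple linear regression step. The following is only a sketch: the dataset name cereal and the variable names calories and fat are assumptions, since the code that produced the exam's output is not shown.

```sas
/* Sketch: simple linear regression of calories on fat.    */
/* Dataset and variable names are assumed for illustration. */
proc reg data=cereal;
   model calories = fat / clb;   /* clb adds confidence limits for the estimates */
run;
quit;
```

The slope row of the parameter estimates table is the one to interpret: its estimate, standard error, and p-value correspond to the 9.8, 2.2, and <.0001 quoted in the solution.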

2. Reading and Recodes - 6 Marks: The csv file named atus.csv contains the following fields on television viewing from the American Time of Use Study.

- ID Number
- Name (up to 30 characters)
- Sex (single letter code)
- Age at time of interview. For example, 26y3m indicates the subject was 26 years and 3 months old.
- Number of minutes of television watched.

The first few lines of the data file are as follows:

    ID, name, sex, age, tvmin
    123ABCDEF, Schwarz, m, 58y10m, 20
    LJD1234LJ, Lank, m, 61y2m, 40
    93234LLJJ, Swartz, F, 21y10m, 75
    LLKD2343K, Duncan, f, 87y2m, 150
    OUEROE, Smith, f, 8y2m, 236

Write SAS code to do the following:

- Read in the data from the csv file as noted above.
- Convert the year/month age data to a decimal year, e.g. 26y3m is converted to 26.25 years (3 months is 1/4 of a year).
- Recode the sex variable. Either f or F is recoded as female; either m or M is recoded as male; other values are recoded as illegal sex.
- Recode the decimal age to 3 age classes. Ages 16-25 (including 16 but excluding 25) are recoded to 16-24; ages 25-40 (including 25 but excluding 40) are recoded to 25-39; ages 40-70 (including 40 but excluding 70) are recoded to 40-69. Other ages are recoded to out of frame.
- Check your recodes for both sex and age using appropriate procedures.

Put your SAS code here and on the page overleaf (if needed).

One possible solution

    data atus;
       infile datalines dlm=',' dsd missover firstobs=2;   /* Need dsd, dlm and firstobs= */
       length id $10 name $20 sex $1 cage $10;
       length cagey cagem $10;            /* temporary character values */
       length newsex $10 ageclass $20;    /* recoded values need longer lengths */
       input id $ name $ sex $ cage $ minutes;

       /* convert input age to decimal age */
       wherey = index(cage, "y");                        /* where is the y */
       cagey  = substr(cage, 1, wherey-1);               /* extract the age in years */
       agey   = input(cagey, f30.0);                     /* convert the age in years to a number */
       wherem = index(cage, "m");                        /* where is the m */
       cagem  = substr(cage, wherey+1, wherem-wherey-1); /* extract the months */
       agem   = input(cagem, f30.0);                     /* convert the months to a number */
       age    = agey + agem/12;                          /* make decimal age */

       /* recode the sex */
       sex = upcase(sex);                 /* convert to upper case */
       newsex = 'illegal';
       if sex = "F" then newsex = 'female';
       if sex = "M" then newsex = 'male';

       /* recode the age classes */
       ageclass = 'out of frame';
       if 16 <= age < 25 then ageclass = '16-24';
       if 25 <= age < 40 then ageclass = '25-39';
       if 40 <= age < 70 then ageclass = '40-69';
       datalines;
    123ABCDEF, Schwarz, m, 58y10m, 20
    LJD1234LJ, Lank, m, 61y2m, 40
    93234LLJJ, Swartz, F, 21y10m, 75
    LLKD2343K, Duncan, f, 87y2m, 150
    OUEROE, Smith, f, 8y2m, 236
    ;;;;

    proc print data=atus;
       title2 'Data after coding';

    /* check the recodes */
    /* You need to use Proc Tabulate/SGplot and compare the OLD values to the NEW values */
    proc tabulate data=atus missing;

       title2 'check the recodes';
       class sex newsex age ageclass;
       table sex, newsex*n*f=5.0;      /* check sex coding */
       table age, ageclass*n*f=5.0;    /* possible but very long table */

    /* because age is a continuous variable, it is better to use sgplot to check the recodes */
    proc sgplot data=atus;
       title2 'check the recodes for age';
       scatter x=ageclass y=age;

Comments about student responses:

- I've used the datalines option, but you could replace it with the actual file named atus.csv. You could use Proc Import as well to read in the data using

      proc import file='atus.csv' out=atus replace;

- Many students didn't use, or forgot, a correction for upper/lower case of the gender values. Rather than

      if gender = 'f' then gender = 'F';
      if gender = 'm' then gender = 'M';

  use the upcase() function directly as shown above.

- Be careful of code such as

      data blah;
         length sex $1;
         input sex;
         if sex = 'f' then sex = 'female';

  Because sex is defined with length 1, the new value of female gets truncated to 1 character. So you either have to define sex with a longer length, or define a new variable (as I did above) with a longer length to hold the new values.

- Be careful of code such as

      data blah;
         ...
         if 16 <= age < 25 then age = '16-24';

  Here you are using the age variable as both character and numeric. This won't work. You likely want a separate character variable for the age class, as I did in my solution.

- Some students assumed that the month always started in the 4th position. It may not. See the solution above for a completely general solution. Always try to code in the most general fashion possible so that it works in all cases.

- Using Proc Print to check your recodes is not sufficient, as you will only be able to check if the recoding worked for the first few records. You need to use Proc Tabulate and Proc SGplot as shown above and as was done in our assignments.

- Notice that

      proc tabulate data=blah;
         class newsex;
         table newsex;

  doesn't provide enough information to see that the values of the old sex variable have been properly recoded to the newsex variable. See the solution above.

- Some students tried code along the lines of

      data blah;
         infile ...
         input ... age yyymmm;

  There is no informat in SAS to handle this case, and you need to use the methods shown above. The only useful informats needed are for dates, times, and datetime values.
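As a hedged alternative to the index/substr approach in the solution (and not the author's method), the scan() function can split a value such as 26y3m on the letters y and m without assuming fixed positions. The dataset name agecheck and the third test value are made up for illustration.

```sas
/* Sketch: parse "<years>y<months>m" with scan() instead of index/substr. */
data agecheck;                              /* hypothetical dataset name */
   length cage $10;
   input cage $;
   /* scan() treats each character in 'ym' as a delimiter */
   agey = input(scan(cage, 1, 'ym'), 8.);   /* text before the y */
   agem = input(scan(cage, 2, 'ym'), 8.);   /* text between the y and the m */
   age  = agey + agem/12;                   /* decimal age */
   datalines;
58y10m
8y2m
26y3m
;;;;
run;

proc print data=agecheck;
run;
```

For 26y3m this gives agey = 26, agem = 3, and age = 26.25, matching the example in the question. Like the solution, it works regardless of how many digits the years or months have.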

3. Trends in TV watching - 3 Marks: We are now interested in comparing the average TV watched between sexes and among age classes (see previous question), and examining if the trends over age classes are the same for both sexes. Here is some output from such an analysis:

    Source          DF   Type III SS   Mean Square   F Value   Pr > F
    sex              1        3893.5        3893.5    253.72   <.0001
    ageclass         2         484.8         242.4     15.82   <.0001
    sex*ageclass     2           7.0           3.5      0.32   0.7843

(a) Write a (very) short paragraph on your conclusions from the above analysis.

WRITE YOUR PARAGRAPH HERE

Solution: We performed an analysis of variance (ANOVA) to investigate if the changes in the mean number of minutes of TV viewing across the age classes were similar for the two sexes. There was no evidence that the change in the mean TV watched across the age classes varied between the sexes (p = 0.78), i.e. there was no evidence that the trends across age classes were not parallel for the two sexes. There was strong evidence that there were differences in the mean amount of TV watched between the sexes and among the age classes (both p < .0001).

Comments on student answers:

- We never say that there was evidence of parallelism, but rather we say that there was no evidence of non-parallelism. The reason for this is that with a large enough sample size, we can always find evidence that the trends are non-parallel, but the non-parallelism may be minuscule.

(b) Give the SAS code that would give the above results. Just the procedure code is needed - no data step is needed. You may assume that the dataset is called atus and contains variables sex, ageclass, and tvwatched for the number of minutes of TV watched by the respondent. Assume that the data were collected from an SRS, so it is NOT necessary to weight the analysis.

Put your SAS code here:

Solution

    proc glm data=atus;
       class sex ageclass;
       model tvwatched = sex ageclass sex*ageclass;

Comments on student answers:

- Many students used Proc Genmod. This procedure is usually only used for logistic and similar models and not for standard ANOVAs.
- You need terms for the main effects and the interactions to produce the above table.
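The two-factor model can also be written with the bar operator, and the same step can request the marginal means that a profile plot needs. The sketch below is an illustration rather than the exam's solution; it turns on ODS tracing so the log lists the names of the output tables, which avoids guessing a table name for a later ODS OUTPUT statement.

```sas
/* Sketch: same model using the bar operator, plus LSmeans with     */
/* confidence limits. ODS TRACE writes the name of each output      */
/* table to the log.                                                */
ods trace on;
proc glm data=atus;
   class sex ageclass;
   model tvwatched = sex | ageclass;   /* expands to sex ageclass sex*ageclass */
   lsmeans sex*ageclass / cl;          /* marginal means with confidence limits */
run;
quit;
ods trace off;
```

Once the table name is known from the log, an ODS OUTPUT statement inside the procedure can save the marginal means to a dataset for plotting.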

4. Profile Plot - 4 Marks: The output from the procedure to analyze the ATUS included estimates of the marginal means (the LSmeans) along with the upper and lower confidence limits on each marginal mean. Create a suitable profile plot comparing the changes in mean TV watched across the age classes for the two sexes. Be sure to label the axes properly.

You can assume that the analysis procedure created a data set (named mylsmeans) with the following variables:

- sex
- ageclass
- estimate, the estimate of the marginal mean TV watched (minutes)
- lcl, the lower confidence bound on the mean
- ucl, the upper confidence bound on the mean

Put your SAS code here:

One possible solution

    proc sgplot data=mylsmeans;
       title2 'profile plot of mean tv watched';
       scatter x=ageclass y=estimate / group=sex;
       series  x=ageclass y=estimate / group=sex;
       highlow x=ageclass lower=lcl upper=ucl / group=sex;
       xaxis label='Age class';
       yaxis label='Mean TV watched (minutes) with 95% confidence interval';

Comments on student solutions:

- Several students used a Proc Means to try to find some averages. I'm guessing that they just copied a solution that looked similar on past exams. Here the dataset is ready to be plotted and no further processing is needed before using Proc SGplot.

5. More analyses of the ATUS study - 7 Marks: There are two files for the ATUS study.

The first dataset (named tvwatch) records TV watching habits and has the following information:

- ID - the ID Number of the family
- MinTV - Number of minutes of television watched for the selected person from the household.

The second dataset (named demoinfo) contains demographic and other information about the respondent's household (including the respondent) with the following information:

- ID - the ID Number of the family
- name of household member
- sex - the sex of the household member coded as f or m.
- empstatus - the employment status (employed or unemployed, coded as em or un) of the household member at the time of interview

So for each subject in the tvwatch dataset, there can be 1 or more observations in the demoinfo dataset.

Write SAS code to accomplish the following tasks:

- Process the demoinfo data to count the number of household members, the number of males, and the number of employed members. Hint: remember how you counted the number of females in the vehicles dataset from the Accidents analysis.
- Combine the TV time dataset and the data set from the previous step.
- Remove any records where there are more than 4 people in the household.
- Compute the mean number of minutes watched for each combination of the number of males and the number of employed members, and save the results to a data set. [You can make up an ODS table name if needed.]

Put your SAS code here and overleaf (if needed).

    /* create variables for male/female and employment status */
    proc sort data=demoinfo;
       by id;
    data demoinfo;
       set demoinfo;
       ismale = 0;
       if sex = 'm' then ismale = 1;        /* code 1 or 0 for number of males */
       isemp = 0;
       if empstatus = 'em' then isemp = 1;

    proc means data=demoinfo noprint;       /* count number of males */
       by id;
       var ismale isemp;
       output out=sumdemo n=nmembers sum=nmale nemp;

    /* combine the two datasets */
    data both;
       merge tvwatch sumdemo;               /* use the household summary, not the raw demoinfo */
       by id;
       if nmembers > 4 then delete;         /* remove households with more than 4 members */

    /* get the mean tv watched */
    proc sort data=both;
       by nmale nemp;
    proc means data=both;
       by nmale nemp;
       var mintv;
       output out=meantv mean=mean_tv;

    /* or you could use proc glm and a lsmeans */
    proc glm data=both;
       class nmale nemp;
       model mintv = nmale nemp nmale*nemp;
       lsmeans nmale*nemp;
       ods output lsmeans=mylsmeans;

Comments about student solutions:

- Many students had difficulty with part 1 of the question. This was the hardest part of the question. You could also try variants of a Proc Tabulate, but that is likely to be more difficult to do.
- Most students had no problems with the merge and deletion steps.
- You could also use Proc Tabulate for the final step, but this is actually more difficult to implement in practise than the given solutions.
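The household counts in the first task can also be obtained in a single Proc SQL step. This is a sketch of an alternative, not the approach taught in the solution; it relies on the fact that a comparison such as sex = 'm' evaluates to 1 or 0 in SAS, so summing it counts the matching rows. The output name sumdemo is reused from the Proc Means step above.

```sas
/* Sketch: count members, males, and employed members per household. */
proc sql;
   create table sumdemo as
   select id,
          count(*)              as nmembers,   /* rows per household */
          sum(sex = 'm')        as nmale,      /* comparison is 1 when true, 0 otherwise */
          sum(empstatus = 'em') as nemp
   from demoinfo
   group by id;
quit;
```

The resulting sumdemo dataset has one row per household, so it can be merged with tvwatch by id in the same way as the Proc Means output (after sorting by id if needed).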

Statistics about the term test:

There is some evidence that grades on the assignments are related to the grades on the term tests, as seen in the pairwise plots below.
