Stat-340 Term Test Spring Term

Stat-340 Term Test 1
2015 Spring Term

Part 1 - Multiple Choice

Enter your answers to the multiple choice questions on the provided bubble sheets. Each multiple choice question is worth 1 mark; there is no correction for guessing. Be sure your student name and number are completed on the bubble sheets.

1. How many observations and variables are contained in the following dataset?

    data blah;
       infile datalines;
       length name $10 sex $1 partnername $10 partnersex $1;
       input name $ sex $ / partnername $ partnersex $;
       datalines;
    Carl M
    Lois F
    Matthew M
    Fred M
    Selina F
    David M
    Tim M
    Kim .
    ;;;;

(a) 8 observations, 2 variables.
(b) 4 observations, 4 variables.
(c) 8 observations, 4 variables.
(d) 3 observations, 4 variables.
(e) 4 observations, 2 variables.

Solution: (b)
Option A - 14% chose in 2015. Notice the slash in the input statement, which makes SAS go to a new line for the last 2 variables.
Option B - 60% chose in 2015.
Option C - 24% chose in 2015. See (a).

2. Which of the following is TRUE about BY group processing?
(a) A different analysis can be performed for each BY group.
(b) The BY variable must be a numeric or date variable.
(c) The data does not have to be grouped together by values of the BY variables.
(d) The BY groups can have different numbers of variables.
(e) BY group processing can be done for any procedure.

Solution: (e)
Option D - 11% chose in 2015. All of the BY groups are subsets of the data and so have the same number of variables.
Option E - 88% chose in 2015. BY variables can be any type.

3. Which of the following is correct about the standard error of a statistic?
(a) The SE measures how much the sample size changes in a simulation study.
(b) The SE measures the standard deviation of the population slope over bootstrap samples from the data.
(c) The SE measures the standard deviation of the Gini-estimate of the standard deviation between different populations.
(d) The SE measures the increase in the number of calories for each additional gram of fat.
(e) The SE measures how much a statistic will vary when new samples are taken from a population.

Solution: (e)
Option A - The SE never measures variation in the sample size.
Option B - Population parameters (slopes) are fixed and do not vary.
Option C - This doesn't even make sense - there is only one population.
Option D - This is the definition of a slope and not a standard error.
Option E - 95% chose in 2015.

4. Consider the following segment of code:

    data birthdays;
       infile datalines;
       length name $30;
       input name $ bdate:yymmdd10.;
       format bdate mmddyy8.;
       datalines;
    carl 63/02/01
    lois 48/14/02
    fred 58/06/03
    tim 52/07/04
    dave 63/12/31
    ;;;;
    proc print data=birthdays;

© 2015 Carl James Schwarz

Which of the following is correct?
(a) The birth day for Carl will be displayed as 02/01/63.
(b) The birth day for Lois will be displayed as 14/02/48.
(c) The birth day for Fred will be displayed as 1958-06-03.
(d) The birth day for Tim will be displayed as 04/07/1952.
(e) The birth day for Dave will be displayed as a missing value.

Solution: (a)
Option A - 73% chose in 2015.
Option D - 14% chose in 2015. The mmddyy8. output format has a width of only 8, so it shows 2-digit years.

5. Consider the following code:

    data blah;
       infile datalines;
       length name sex $10.;
       input name sex age weight;
       if age > 30 then delete;
       drop weight;
       datalines;
    A f 27 90
    B m 35 120
    C F 23 60
    D M 24 75
    E F . 43
    ;;;;

Which of the following is correct?
(a) The blah dataset has 5 observations and 4 variables.
(b) The blah dataset has 4 observations and 4 variables.
(c) The blah dataset has 3 observations and 3 variables.
(d) The blah dataset has 4 observations and 3 variables.
(e) The blah dataset has 5 observations and 3 variables.

Solution: (d)
Option B - 15% chose in 2015. The Drop statement removes a variable.
Option D - 65% chose in 2015.
Option E - 11% chose in 2015. The If statement removes an observation.

6. Which of the following is correct?
(a) PROC GLM is used to test hypotheses about population mean proportions.
(b) PROC FREQ is used to test hypotheses about sample proportions.

(c) PROC REG is used to test hypotheses about population slopes.
(d) PROC GENMOD is used to test hypotheses about sample proportions.
(e) PROC TTEST is used to test hypotheses about paired sample means.

Solution: (c)
Option A - 14% chose in 2015. There is no such thing as a MEAN proportion!
Option C - 61% chose in 2015.
Option D - 10% chose in 2015. Hypotheses are ALWAYS about POPULATION parameters, not sample statistics.
Option E - 15% chose in 2015. Hypotheses are ALWAYS about POPULATION parameters, not sample statistics.

7. Consider the following SAS code:

    data blah;
       infile datalines dlm=',' dsd missover;
       input v1 v2 v3 v4 v5 v6;
       datalines;
    1,,2,3,4,5,6,7,8,9
    2,3,.,5,6,7,8,9,0
    3,4,5,.,6,7,8,9,0,1,2
    9,8,7,,6,5,4,3,2,1,0
    7,,6,,5,,9,,4,,3,,1,,
    ;;;;

Which of the following is correct?
(a) The value of v2 in the first observation is 2.
(b) The value of v3 in the second observation is 5.
(c) The value of v4 in the third observation is missing.
(d) The value of v6 in the fourth observation is 4.
(e) The value of v3 in the fifth observation is 5.

Solution: (c)
Option C - 95% chose in 2015.

8. Consider the following SAS code:

    data blah;
       infile datalines;
       length surname $10 sex $1;
       input surname sex age;
       datalines;
    schwarz m 56

    schwarz f 53
    zhao f 48
    zhao m 52
    sun m 27
    chao f 23
    chao m 27
    ;;;;
    proc sort data=blah; by surname;
    proc transpose data=blah out=transblah;
       by surname;
       var age;
       id sex;

Which of the following is correct?
(a) The resulting transblah dataset has 3 observations.
(b) The value of the variable M for the first observation in the transblah dataset is 56.
(c) The observation for surname Sun will have the value of 27 for the ages of both sexes.
(d) The value of the variable F for the last observation in the transblah dataset is 23.
(e) The 4th observation in the transblah dataset will have 52 as the value for the M variable.

Solution: (e)
Option A - 16% chose. There are 4 distinct values for the Surname variable, so the resulting dataset will have 4 observations.
Option B - 28% chose. Don't forget that the data are sorted before transposing.
Option C - 11% chose. Because Sun does not have a complete set of variables, the missing variables will be set to missing.
Option E - 39% chose.

9. Consider the following SAS code:

    proc tabulate data=accidents missing;
       class month Accident_Severity;
       var fatality;
       table Accident_Severity ALL, month*fatality*mean*f=7.2;

Which of the following is correct?
(a) The Accident_Severity variable will be along the top of the table (the columns).
(b) The mean number of fatalities in each month and Accident_Severity will be found.
(c) Each row of the table will correspond to a different value of the Accident_Severity variable, with the final row a summary over all codes.

(d) The missing option on the Proc statement ensures that missing values are ignored during the tabulation.
(e) If the Accident_Severity variable had 3 levels, and if the month variable had 12 levels, the table would have 36 cells.

Solution: (c)
Option B - 40% chose in 2015. Month is not used in the Table statement.
Option C - 35% chose in 2015.
Option D - 12% chose in 2015. The missing option also tabulates the missing values.
Option E - 12% chose in 2015. The ALL option will generate a row at the end for all codes.

10. Consider the following piece of SAS code:

    data blah;
       infile datalines;
       length name $10 sex $1;
       input name sex YearOfBirth;
       Age = 2015 - YearOfBirth;
       datalines;
    Carl M 1956
    Lois . 1943
    Fred . .
    Matthew M 1926
    Marianne F -1
    David M 1922
    Julia F 2016
    ;;;;

Which of the following is correct?
(a) The computed value of Age for Carl is 59.
(b) The computed value of Age for Fred is 0.
(c) The computed value of Age for Marianne is missing.
(d) The computed value of Age for David is 1922.
(e) The computed value of Age for Julia is missing.

Solution: (a)
Option A - 96% chose in 2015.
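The arithmetic behind Question 10 can be checked with a small sketch. The dataset name agedemo and the two records are made up for illustration; the point is only that a missing YearOfBirth propagates to a missing Age rather than to 0.

```sas
/* Illustrative sketch only: agedemo and its two records are hypothetical */
data agedemo;
   length name $10;
   input name $ YearOfBirth;
   Age = 2015 - YearOfBirth;   /* a missing operand gives a missing result */
   datalines;
Carl 1956
Fred .
;;;;

proc print data=agedemo;
run;
```

Here Carl gets Age = 59, while Fred's Age is missing because arithmetic on a missing value returns a missing value (SAS also writes a note to the log).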

Part II - Long Answer

Stat-340-2015 Spring Term - Term Test 1

Name:
Student Number:

Put your name and student number on the upper right of each of the following pages as well, in case the pages get separated. Answer the following questions in the space provided. Be sure that your answers are legible. The marks given to these questions are 5, 6, 3, 4, and 7 respectively.

1. Interpretation - 5 Marks: Consider the following output from an analysis of the cereal dataset:

Write a SHORT paragraph here summarizing the results.

Solution: The relationship between the calories/serving and the grams of fat/serving was investigated using linear regression (Figure 1). The fitted equation is

    Calories = 95 + 9.8(Fat)

There was strong evidence that the slope is different from 0 (p < .0001). For every additional gram of fat, the calories/serving is expected to increase by 9.8 (SE 2.2) calories.

Common problems in solutions from students include:

- Reporting too many decimal places. Seldom do you need to report more than two significant digits.
- The intercept is usually not of interest, so you don't usually spend any time discussing it.
- The whole point of regression is to estimate the slope, so the discussion needs to be about the slope. Many students discussed differences in means (which is not sensible), or differences in the mean among groups, which is again not sensible. These students were likely confusing regression with ANOVA.
- Don't just give the table values as facts; add some interpretation to the information in the table. For example, many students had sentences such as: The parameter estimate for Fat was 9.8. The

standard error was 2.21. The t-value was 4.44 and the p-value was <.0001 so we rejected the null hypothesis. These types of sentences provide no useful information to the reader over and above the table.

2. Reading and Recodes - 6 Marks: The csv file named atus.csv contains the following fields on television viewing from the American Time of Use Study.

- ID Number
- Name (up to 30 characters)
- Sex (single letter code)
- Age at time of interview. For example, 26y3m indicates the subject was 26 years and 3 months old.
- Number of minutes of television watched.

The first few lines of the data file are as follows:

    ID, name, sex, age, tvmin
    123ABCDEF, Schwarz, m, 58y10m, 20
    LJD1234LJ, Lank, m, 61y2m, 40
    93234LLJJ, Swartz, F, 21y10m, 75
    LLKD2343K, Duncan, f, 87y2m, 150
    OUEROE, Smith, f, 8y2m, 236

Write SAS code to do the following:

- Read in the data from the csv file as noted above.
- Convert the year/month age data to a decimal year, e.g. 26y3m is converted to 26.25 years (3 months is 1/4 of a year).
- Recode the sex variable. Either f or F is recoded as female; either m or M is recoded as male; other values are recoded as illegal sex.
- Recode the decimal age to 3 age classes. Ages 16-25 (including 16 but excluding 25) are recoded to 16-24; ages 25-40 (including 25 but excluding 40) are recoded to 25-39; ages 40-70 (including 40 but excluding 70) are recoded to 40-69. Other ages are recoded to out of frame.
- Check your recodes for both sex and age using appropriate procedures.

Put your SAS code here and the page overleaf (if needed)

One possible solution

    data atus;
       infile datalines dlm=',' dsd missover firstobs=2;  /* Need dsd, dlm and firstobs= */
       length id $10 name $20 sex $1 cage $10;
       length cagey cagem $10;          /* temporary character values */
       length newsex $10 ageclass $20;  /* recoded values need longer lengths */
       input id $ name $ sex $ cage $ minutes;

       /* convert input age to decimal age */
       wherey = index(cage, "y");                         /* where is the y */
       cagey  = substr(cage, 1, wherey-1);                /* extract the age in years */
       agey   = input(cagey, f30.0);                      /* convert the age in years to a number */
       wherem = index(cage, "m");                         /* where is the m */
       cagem  = substr(cage, wherey+1, wherem-wherey-1);  /* extract the months */
       agem   = input(cagem, f30.0);                      /* convert the months to a number */
       age    = agey + agem/12;                           /* make decimal age */

       /* recode the sex */
       sex = upcase(sex);   /* convert to upper case */
       newsex = 'illegal';
       if sex = 'F' then newsex = 'female';
       if sex = 'M' then newsex = 'male';

       /* recode the age classes */
       ageclass = 'out of frame';
       if 16 <= age < 25 then ageclass = '16-24';
       if 25 <= age < 40 then ageclass = '25-39';
       if 40 <= age < 70 then ageclass = '40-69';
       datalines;
    ID, name, sex, age, tvmin
    123ABCDEF, Schwarz, m, 58y10m, 20
    LJD1234LJ, Lank, m, 61y2m, 40
    93234LLJJ, Swartz, F, 21y10m, 75
    LLKD2343K, Duncan, f, 87y2m, 150
    OUEROE, Smith, f, 8y2m, 236
    ;;;;

    proc print data=atus;
       title2 'Data after coding';

    /* check the recodes */
    /* You need to use Proc Tabulate/SGplot and compare the OLD values to the NEW values */
    proc tabulate data=atus missing;

       title2 'check the recodes';
       class sex newsex age ageclass;
       table sex, newsex*n*f=5.0;    /* check sex coding */
       table age, ageclass*n*f=5.0;  /* possible but very long table */

    /* because age is a continuous variable, it is better to use sgplot to check the recodes */
    proc sgplot data=atus;
       title2 'check the recodes for age';
       scatter x=ageclass y=age;

Comments about student responses:

- I've used the Datalines option, but you could replace it with the actual file named atus.csv. You could use Proc Import as well to read in the data using

    proc import file='atus.csv' out=atus replace;

- Many students didn't use/forgot to correct for upper/lower case of the gender values. Rather than

    if gender = 'f' then gender = 'F';
    if gender = 'm' then gender = 'M';

  use the upcase() function directly as shown above.

- Be careful of code such as

    data blah;
       length sex $1;
       input sex;
       if sex = 'f' then sex = 'female';

  Because sex is defined with length 1, the new value of female gets truncated to 1 character. So you either have to define sex with a longer length, or define a new variable (as I did above) with a longer length to hold the new values.

- Be careful of code such as

    data blah;
       ...
       if 16 <= age < 25 then age = '16-24';

  Here you are using the age variable as both character and numeric. This won't work. You likely want a separate character variable for the age class, as I did in my solution.

- Some students assumed that the month always started in the 4th position. It may not. See the solution above for a completely general solution. Always try to code things in the most general fashion possible so that they work in all cases.

- Using Proc Print to check your recodes is not sufficient, as you will only be able to check if the recoding worked for the first few records. You need to use Proc Tabulate and Proc SGplot as shown above and as was done in our assignments. Notice that

    proc tabulate data=blah;
       class newsex;
       table newsex;

  doesn't provide enough information to see that the old values of sex have been properly recoded to the newsex variable. See the solution above.

- Some students tried code along the lines of

    data blah;
       infile ...
       input ... age yyymmm;

  There is no informat in SAS to handle this case, and you need to use the methods shown above. The only useful informats needed are for dates, times, and datetime values.
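As a variant on the index()/substr() approach in the solution above, the age string can also be split with the scan() function, which likewise does not assume the month starts in a fixed position. This is only a sketch of an alternative, not the marked solution; the dataset name agecheck and the test values are chosen for illustration.

```sas
/* Sketch of an alternative: split values such as 26y3m with scan().    */
/* The dataset name agecheck and the test records are illustrative only. */
data agecheck;
   length cage $10;
   input cage $;
   /* scan() returns the n-th word, treating y and m as delimiters */
   agey = input(scan(cage, 1, 'ym'), 8.);   /* years part as a number  */
   agem = input(scan(cage, 2, 'ym'), 8.);   /* months part as a number */
   age  = agey + agem/12;                   /* decimal age             */
   datalines;
58y10m
8y2m
26y3m
;;;;

proc print data=agecheck;
run;
```

For the record 26y3m this gives 26.25, matching the example in the question. The delimiters are case sensitive, so uppercase Y or M would need to be added to the delimiter list if they can occur.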

3. Trends in TV watching - 3 Marks: We are now interested in comparing the average TV watched between sexes and among age classes (see previous question), and examining if the trends over age classes are the same for both sexes. Here is some output from such an analysis:

    Source          DF   Type III SS   Mean Square   F Value   Pr > F
    sex              1        3893.5        3893.5    253.72   <.0001
    ageclass         2         484.8         242.4     15.82   <.0001
    sex*ageclass     2           7.0           3.5      0.32   0.7843

(a) Write a (very) short paragraph on your conclusions from the above analysis.

WRITE YOUR PARAGRAPH HERE

Solution: We performed an analysis of variance (ANOVA) to investigate if the changes in the mean number of minutes of TV viewing across the age classes were similar for the two sexes. There was no evidence that the change in the mean TV watched across the age classes varied between the sexes (p = 0.78), i.e. there was no evidence that the trends across age classes were not parallel for the two sexes. There was strong evidence that there were differences in the mean amount of TV watched between the sexes and among the age classes (both p < .0001).

Comments on student answers:

- We never say that there was evidence of parallelism, but rather we say that there was no evidence of non-parallelism. The reason for this is that with a large enough sample size, we can always find evidence that the trends are non-parallel, but the non-parallelism may be minuscule.

(b) Give the SAS code that would give the above results. Just the procedure code is needed - no data step is needed. You may assume that the dataset is called atus and contains variables sex, ageclass, and tvwatched for the number of minutes of TV watched by the respondent. Assume that the data were collected from an SRS, so it is NOT necessary to weight the analysis.

Put your SAS code here:

Solution

    proc glm data=atus;
       class sex ageclass;
       model tvwatched = sex ageclass sex*ageclass;

Comments on student answers:

- Many students used Proc Genmod. This procedure is usually only used for logistic and similar models and not for standard ANOVAs.
- You need terms for the main effects and the interactions to produce the above table.

4. Profile Plot - 4 Marks: The output from the procedure to analyze the ATUS included estimates of the marginal means (the LSmeans) along with the upper and lower confidence limits on each marginal mean. Create a suitable profile plot comparing the changes in mean TV watched across the age classes for the two sexes. Be sure to label the axes properly. You can assume that the analysis procedure created a data set (named mylsmeans) with the following variables.

- sex
- ageclass
- estimate of the marginal mean TV watched (minutes)
- lcl, the lower confidence bound on the mean
- ucl, the upper confidence bound on the mean

Put your SAS code here:

One possible solution

    proc sgplot data=mylsmeans;
       title2 'profile plot of mean tv watched';
       scatter x=ageclass y=estimate / group=sex;
       series  x=ageclass y=estimate / group=sex;
       highlow x=ageclass low=lcl high=ucl / group=sex;
       xaxis label='Age class';
       yaxis label='Mean TV watched (minutes) with 95% confidence interval';

Comments on student solutions:

- Several students used a Proc Means to try to find some averages. I'm guessing that they just copied a solution that looked similar on past exams. Here the dataset is ready to be plotted and no further processing is needed before using Proc SGplot.

5. More analyses of the ATUS study - 7 Marks: There are two files for the ATUS study. The first dataset (named tvwatch) records TV watching habits and has the following information:

- ID - the ID Number of the family
- MinTV - Number of minutes of television watched for the selected person from the household.

The second dataset (named demoinfo) contains demographic and other information about the respondent's household (including the respondent) with the following information:

- ID - the ID Number of the family
- name of household member
- sex - the sex of the household member coded as f or m.
- empstatus - the employment status (employed or unemployed, coded as em or un) of the household member at the time of interview

So for each subject in the tvwatch dataset, there can be 1 or more observations in the demoinfo dataset.

Write SAS code to accomplish the following tasks:

- Processes the demoinfo data to count the number of household members, the number of males, and the number of employed members. Hint: remember how you counted the number of females in the vehicles dataset from the Accidents analysis.
- Combines the TV time dataset and the data set from the previous step.
- Removes any records where there are more than 4 people in the household.
- Computes the mean number of minutes watched for each combination of number of males and the number of employed members and saves the results to a data set. [You can make up an ODS table name if needed.]

Put your SAS code here and overleaf (if needed).

    /* create variables for male/female and employment status */
    proc sort data=demoinfo; by id;
    data demoinfo;
       set demoinfo;
       ismale = 0;
       if sex = 'm' then ismale = 1;   /* code 1 or 0 for number of males */
       isemp = 0;
       if empstatus = 'em' then isemp = 1;

    /* count the number of members, males, and employed members per household */
    proc means data=demoinfo noprint;
       by id;
       var ismale isemp;
       output out=sumdemo n=nmembers sum=nmale nemp;

    /* combine the two datasets (both must be sorted by id) */
    proc sort data=tvwatch; by id;
    data both;
       merge tvwatch sumdemo;
       by id;
       if nmembers > 4 then delete;   /* remove households with more than 4 members */

    /* get the mean tv watched */
    proc sort data=both; by nmale nemp;
    proc means data=both;
       by nmale nemp;
       var mintv;
       output out=meantv mean=mean_tv;

    /* or you could use proc glm and a lsmeans */
    proc glm data=both;
       class nmale nemp;
       model mintv = nmale nemp nmale*nemp;
       lsmeans nmale*nemp;
       ods output lsmeans=mylsmeans;

Comments about student solutions:

- Many students had difficulty with part 1 of the question. This was the hardest part of the question. You could also try variants of a Proc Tabulate, but that is likely to be more difficult to do.
- Most students had no problems with the merge and deletion steps.
- You could also use Proc Tabulate for the final step, but this is actually more difficult to implement in practice than the given solutions.
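For comparison, the household counting step (the part students found hardest) could also be sketched with PROC SQL instead of indicator variables and Proc Means. This is a different technique from the solution above, shown only as an illustration; the three demoinfo records are made up, and the output names follow the sumdemo, nmembers, nmale, and nemp names used in the solution.

```sas
/* Hypothetical toy data, for illustration only */
data demoinfo;
   length id $10 name $10 sex $1 empstatus $2;
   input id $ name $ sex $ empstatus $;
   datalines;
A1 Ann f em
A1 Bob m un
B2 Cy m em
;;;;

/* In PROC SQL a comparison evaluates to 1 (true) or 0 (false), */
/* so summing it counts the matching household members.         */
proc sql;
   create table sumdemo as
   select id,
          count(*)              as nmembers,
          sum(sex = 'm')        as nmale,
          sum(empstatus = 'em') as nemp
   from demoinfo
   group by id;
quit;

proc print data=sumdemo;
run;
```

With these toy records, household A1 has 2 members, 1 male, and 1 employed member, and household B2 has 1 of each. No prior sort is needed because GROUP BY does the grouping.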

Statistics about the term test:

There is some evidence that grades on the assignments are related to the grades on the term tests, as seen in the pairwise plots below.
