Solution to Tumor growth in mice

Exercise 1

1. Import the data to R

The data are in the file tumorvols.csv, which can be read with the read.csv2 function. For a successful import you need to tell R exactly where the file is stored on your computer. One way of doing this is to set the path of the working directory. I like doing so because it also lets me control where R saves my working history and output. It works like this (on my computer; you should of course set your own path):

    setwd("c:/users/nxr382/documents/teaching/basicstatistics/2016/data")
    # stringsAsFactors=TRUE makes R (version 4.0 and later) read treatment as a factor
    tumordat <- read.csv2('tumorvols.csv', header=TRUE, stringsAsFactors=TRUE)

Another possibility is:

    # tumordat <- read.csv2(file.choose(), stringsAsFactors=TRUE)

which opens a file browser so that you can click your way to the data file. In either case, the dataset tumordat should appear on the list in the Work Environment. If you click on the table icon following the description of the data you can see what is inside the spreadsheet.

As it appears, a record has been made for each mouse on each day of follow-up. We have information on tumor volume, treatment group, and sacrifice. Note that after a mouse has been sacrificed it is recorded as dead and the variable volume has a missing value, called "NA" in R.

2. Get a quick overview of the data

To find out how large the dataset is (i.e. how many records it contains) and which variables it contains I use the functions dim, names, and summary:

    dim(tumordat)
    ## [1] 442   6
    names(tumordat)
    ## [1] "day"        "mouseid"    "volume"     "treatment"  "dead"
    ## [6] "sacrificed"

From this we see that the dataset contains 442 records and six variables (day, mouseid, volume, treatment, dead, and sacrificed).
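If you want to see what read.csv2 expects without touching the real file, here is a minimal sketch that reads a few records from a text string. The records are invented for illustration and are not taken from tumorvols.csv; the point is only that read.csv2 assumes semicolons between fields and a comma as decimal mark, and that an empty numeric field is read as NA.

```r
# Invented records in the layout read.csv2 expects:
# ';' between fields and ',' as decimal mark. NOT the real mouse data.
csv_text <- "day;mouseid;volume;treatment;dead;sacrificed
1;21;45,3;contr;0;0
1;22;61,8;chemo;0;0
4;21;;contr;1;0"

toy <- read.csv2(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

dim(toy)                 # number of records and number of variables
names(toy)               # variable names
str(toy)                 # type of each variable; treatment is a factor
sum(is.na(toy$volume))   # the empty volume field was read as NA
```

The same calls (dim, names, str) work unchanged on the real data frame once it has been imported.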

The function summary gives a little more information about the contents of the data:

    summary(tumordat)
    ##       day        mouseid         volume       treatment       dead
    ##  Min.   : 1   Min.   :21.0   Min.   : 17.1   chemo:136   Min.   :
    ##  1st Qu.:11   1st Qu.:27.0   1st Qu.:        contr:170   1st Qu.:
    ##  Median :20   Median :35.5   Median :        radio:136   Median :
    ##  Mean   :20   Mean   :39.5   Mean   :                    Mean   :
    ##  3rd Qu.:29   3rd Qu.:54.0   3rd Qu.:                    3rd Qu.:
    ##  Max.   :39   Max.   :60.0   Max.   :                    Max.   :
    ##                              NA's   :226
    ##    sacrificed
    ##  Min.   :
    ##  1st Qu.:
    ##  Median :
    ##  Mean   :
    ##  3rd Qu.:
    ##  Max.   :

We can see that follow-up lasted for 39 days and that the mice are labeled with id numbers from 21 to 60. There are more records from the control group, reflecting that this group contained 10 mice while the active treatment groups had only 8 mice each. A little arithmetic (170/10 = 136/8 = 17) reveals that there must be 17 records for each mouse.

The variables dead and sacrificed are binary. It is clear to us that 0 and 1 should be interpreted as no and yes, but R takes all numbers literally and reports summary statistics that are appropriate for numerical variables. The only variable that is recognized as nominal is treatment, since this variable has text values.

Further note that in the summary the treatment groups are listed in alphabetical order. By default R always uses the group that comes first in alphabetical order as the reference in figures, tables, and statistical analyses. We can tell R to use another group as reference with the relevel function as follows:

    tumordat$treatment <- relevel(tumordat$treatment, ref='contr')
    summary(tumordat['treatment'])
    ##  treatment
    ##  contr:170
    ##  chemo:136
    ##  radio:136

Now the control group is the reference and appears first in the summary.

3. Extract the baseline data

In what follows we will only be concerned with the baseline data, that is, the records from the first day of the trial. We pick out the relevant records with the subset function and save them in a new data frame called day1:

    day1 <- subset(tumordat, day==1) # Don't forget the two "="s here!
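The relevel and subset steps can be tried out on a toy data frame first. The sketch below uses invented values (the name toy and its contents are assumptions for illustration, not part of the mouse data) and also repeats the records-per-mouse arithmetic.

```r
# Toy data frame with invented values and the same kind of columns as tumordat
toy <- data.frame(
  day       = rep(c(1, 4), times = 3),
  mouseid   = rep(c(21, 22, 23), each = 2),
  treatment = factor(rep(c("radio", "contr", "chemo"), each = 2))
)

levels(toy$treatment)    # alphabetical order: "chemo" "contr" "radio"
toy$treatment <- relevel(toy$treatment, ref = "contr")
levels(toy$treatment)    # "contr" now comes first

# '==' tests for equality; a single '=' would be read as naming an argument
toy_day1 <- subset(toy, day == 1)
nrow(toy_day1)           # one record per mouse on day 1

# Records per mouse implied by the group sizes in the text
c(contr = 170 / 10, active = 136 / 8)
```

Note that relevel only works on factors, which is why the treatment column must be a factor before it is called.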

We run a quick summary on the new dataset to confirm that it only contains records from the first day.

    summary(day1['day'])
    ##       day
    ##  Min.   :1
    ##  1st Qu.:1
    ##  Median :1
    ##  Mean   :1
    ##  3rd Qu.:1
    ##  Max.   :1

4. Tabulate the distribution on treatment groups.

We can use the table function to count the number of mice in each treatment group. Just for the exercise we also make a barplot to display the counts.

    treat.table <- table(day1$treatment)
    print(treat.table)
    ##
    ## contr chemo radio
    ##    10     8     8
    barplot(treat.table)

[Figure: bar plot of the number of mice in each treatment group (contr, chemo, radio)]

5. View the distribution of the tumor volumes.

To get an impression of the distribution of the tumor volumes we first make a histogram:

    hist(day1$volume, probability = TRUE)

[Figure: histogram of day1$volume; y-axis: Density]

The argument probability=TRUE standardizes the area of the histogram to one, which makes the histogram comparable to theoretical probability distributions. If omitted, R would show the number of observations in each interval on the y-axis instead. It is possible to specify more arguments to the hist function, e.g. if you want other breakpoints for the bars than R's defaults. Check out hist in the R help for other options that will change the appearance of your histogram.

As summary statistics we compute the mean and the standard deviation.

    mean(day1$volume)
    ## [1]
    sd(day1$volume)
    ## [1]

BUT: Are these appropriate summary statistics? The histogram is obviously skewed to the right (most volumes are small, with a long tail of large values). It doesn't look like a normal distribution. We compute the normal range (mean +/- 2*sd), recalling that in a normal distribution 95% of the distribution falls within this range while 2.5% should be below this range and 2.5% above.

    c(mean(day1$volume)-2*sd(day1$volume), mean(day1$volume)+2*sd(day1$volume))
    ## [1]

According to this, a tumor volume of -100 mm^3 falls within the normal range!

6. Do logarithmic transformation of skewed data

To obtain a normally distributed outcome we can try to transform the tumor volumes with the logarithm. I use the natural logarithm (log in R), but other logarithms such as log2 or log10 would serve just as well, since they are all proportional to each other. We add the variable logvol to the data using the transform function:

    day1 <- transform(day1, logvol=log(volume))

Let's have a look at the histogram:

    hist(day1$logvol, probability = TRUE)

[Figure: histogram of day1$logvol; y-axis: Density]

The histogram of the log-volumes is not skewed like that of the raw data. It doesn't look exactly like a normal curve either, but this looks more like small-sample variation than a systematic deviation.
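To get a feeling for why the normal range misbehaves on skewed data, here is a small sketch on simulated data. Everything in it is an assumption for illustration: the data are drawn from a lognormal distribution rather than taken from the mice, the helper normal_range is a made-up name, and back-transforming the limits with exp is an extra step that is not part of the solution above.

```r
set.seed(1)                                 # reproducible simulated data, NOT the mouse data
vol <- rlnorm(26, meanlog = 4, sdlog = 1)   # right-skewed, strictly positive values

# Hypothetical helper returning mean - 2*sd and mean + 2*sd
normal_range <- function(x) mean(x) + c(-2, 2) * sd(x)

normal_range(vol)             # on skewed data the lower limit can fall below zero
exp(normal_range(log(vol)))   # limits from the log scale, back-transformed: always positive

# Different logarithms differ only by a constant factor
all.equal(log2(vol),  log(vol) / log(2))
all.equal(log10(vol), log(vol) / log(10))
```

Because the logarithms are proportional, the shape of the histogram and the conclusions about normality do not depend on which one you choose.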

We compute the normal range:

    c(mean(day1$logvol)-2*sd(day1$logvol), mean(day1$logvol)+2*sd(day1$logvol))
    ## [1]

The limits of the estimated normal range are more reasonable (close to the range of the data).

7. Use QQ plots to check for normality.

To compare data with a normal distribution we could overlay the histograms with normal curves. However, this is not what I would recommend because there is an arbitrariness in the histogram due to the choice of breakpoints. QQ plots (which compare the ordered data points to the corresponding quantiles of the normal distribution) are a better option.

    par(mfrow=c(1,2)) # plot the figures side by side
    qqnorm(scale(day1$volume), xlim=c(-2.2,2.2), ylim=c(-2.2,2.2))
    abline(0,1)
    qqnorm(scale(day1$logvol), xlim=c(-2.2,2.2), ylim=c(-2.2,2.2))
    abline(0,1)

[Figure: two normal Q-Q plots side by side, raw volumes (left) and log-volumes (right); x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

I've chosen to standardize the data because then, if the data are normally distributed, the points in the QQ plot should lie on the straight line with intercept 0 and slope 1.
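If you are curious what qqnorm and scale actually compute, the following sketch spells it out on simulated data (again an assumption for illustration, not the mouse data). It uses plot.it = FALSE so that qqnorm returns the plotted pairs instead of drawing them.

```r
set.seed(2)                                # simulated data, NOT the mouse data
x <- rlnorm(26, meanlog = 4, sdlog = 1)    # skewed sample
z_log <- as.numeric(scale(log(x)))         # (value - mean) / sd of the log-values

# After standardization the mean is 0 and the standard deviation is 1
c(mean = mean(z_log), sd = sd(z_log))

# qqnorm pairs the ordered data with normal quantiles
qq <- qqnorm(z_log, plot.it = FALSE)
head(cbind(theoretical = sort(qq$x), sample = sort(qq$y)))

# The theoretical quantiles are normal quantiles of evenly spread probabilities
all.equal(sort(qq$x), qnorm(ppoints(length(z_log))))
```

Since both axes are then on the same standardized scale, the line drawn by abline(0,1) is the natural reference.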

We see that the QQ plot of the log-transformed data does not deviate systematically from the straight line. The QQ plot of the raw data is "smiling", which is an indication of skewness.

8. Compare the treatment groups in a boxplot.

To compare the distribution of the tumor volumes between the three treatment groups we make side-by-side boxplots.

    boxplot(day1$volume~day1$treatment) # The tilde "~" means "depending on" in R

[Figure: side-by-side boxplots of tumor volume for contr, chemo, and radio]

The boxes represent the interquartile range (25% quantile to 75% quantile) with the median shown as the thick line. In the control group whiskers are drawn at the minimum and maximum value. In the active treatment groups the maximum value is marked as an outlier and the upper whisker is drawn at the second largest tumor volume. This is due to the rule that a whisker cannot be longer than 1.5 times the length of the box. If data have a normal distribution, points that exceed this limit are rare. The reason for pointing out outliers is firstly that they might be registration errors which should be corrected, and secondly that they may have a disproportionately high influence on the results of the statistical analysis, so that sensitivity analyses or robust statistics should be considered.

Considering the comparison of the treatment groups: the median tumor volume appears to be larger in the control group. However, this has to be a spurious finding, since treatment was randomized and baseline volumes were measured just before treatment was initiated! What we see in the picture is pure random variation. We have three random samples from the exact same population.
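The whisker rule can be checked by hand. The sketch below uses a handful of invented numbers (not the mouse data) and the function boxplot.stats, which returns the numbers that boxplot draws.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 30)   # invented values with one extreme point

bs <- boxplot.stats(x, coef = 1.5)   # coef = 1.5 is the default whisker rule
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points drawn individually as outliers

# The same upper limit computed by hand from the hinges (edges of the box)
hinges <- fivenum(x)[c(2, 4)]
upper_fence <- hinges[2] + 1.5 * diff(hinges)
x[x > upper_fence]
```

Here the box runs from 3 to 7, so the upper whisker may reach at most 7 + 1.5*4 = 13; it is therefore drawn at 8 and the value 30 is shown as an outlier.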

9. Display small samples in a stripchart.

The sample sizes of the three groups are all rather small, so we can expect a good deal of random variation in the boxplot. Note that a quarter of the data in the active treatment groups amounts to only two observations! An alternative display for tiny datasets is the stripchart, which shows the individual data points in each group. In R stripcharts can be made with the stripchart function. Here I have added two optional arguments to the function: vertical=TRUE implies that the strips are displayed vertically with groups on the x-axis, and ylab='Tumor volume (mm3)' changes the label on the y-axis.

    stripchart(day1$volume~day1$treatment, vertical=TRUE, ylab='Tumor volume (mm3)')

[Figure: vertical stripchart of tumor volume for contr, chemo, and radio; y-axis: Tumor volume (mm3)]

Additional graphical arguments could be supplied if you want a nicer looking figure for presentation. Check out stripchart in the R help, or par if you want the full list of all of R's graphical parameters.
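As one example of such additional arguments, here is a sketch on invented data (not the mouse data). The choices of method="jitter", pch, and col are just my suggestions; tied values would otherwise be drawn on top of each other.

```r
# Invented volumes for three small groups, NOT the mouse data
toy <- data.frame(
  volume    = c(40, 55, 55, 70, 90, 35, 60, 60, 85, 50, 50, 75),
  treatment = factor(rep(c("contr", "chemo", "radio"), each = 4),
                     levels = c("contr", "chemo", "radio"))
)

# method = "jitter" spreads tied values sideways so that none are hidden;
# pch and col are ordinary graphical parameters (see ?par)
stripchart(volume ~ treatment, data = toy, vertical = TRUE,
           method = "jitter", jitter = 0.1, pch = 16, col = "steelblue",
           ylab = "Tumor volume (mm3)")
```

Using the data argument also lets you write the formula with plain column names instead of repeating the data frame name.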

Exercise 2

1. Survivors at end of follow-up

We can tabulate the day variable to see on which days of follow-up tumor volumes were (last) recorded:

    table(tumordat$day)
    ##
    ##
    ##

We see that the last day of follow-up was day 39. Let's pick out the data from this day to see how many mice survived throughout the trial:

    day39 <- subset(tumordat, day==39)
    table(day39$dead)
    ##
    ##  0  1
    ##  2 24

Only two mice survived throughout the trial. We can identify them as:

    subset(day39, dead==0)
    ##     day mouseid volume treatment dead sacrificed
    ##                            radio    0          1
    ##                            chemo    0          1

Mouse number 35 has a tumor volume close to 1000 mm^3, so it would most likely have been sacrificed on the next day had the trial continued. Mouse number 23, on the other hand, has a very small tumor. Let's pick out all the records on this mouse and plot its growth curve to see what has happened:

    m23 <- subset(tumordat, mouseid==23)
    plot(m23$day, m23$volume, type='b')

[Figure: growth curve of mouse 23; x-axis: m23$day, y-axis: m23$volume]

Either this mouse has a very slowly growing tumor, or maybe the tumor cells didn't grow in the first place so that all that is measured is the thickness of its skin.

2. Comparison of survival times between the groups

Next we pick out the data from when the mice were sacrificed. This will tell us how long the mice survived in the trial.

    sacrificed <- subset(tumordat, sacrificed==1)
    summary(sacrificed)
    ##       day          mouseid          volume        treatment      dead
    ##  Min.   : 4.0   Min.   :21.00   Min.   : 52.3   contr:10   Min.   :0
    ##  1st Qu.: 6.5   1st Qu.:        1st Qu.:        chemo: 8   1st Qu.:0
    ##  Median :19.0   Median :35.50   Median :        radio: 8   Median :0
    ##  Mean   :18.5   Mean   :39.50   Mean   :                   Mean   :0
    ##  3rd Qu.:27.0   3rd Qu.:        3rd Qu.:                   3rd Qu.:0
    ##  Max.   :39.0   Max.   :60.00   Max.   :                   Max.   :0
    ##    sacrificed
    ##  Min.   :1
    ##  1st Qu.:1
    ##  Median :1
    ##  Mean   :1
    ##  3rd Qu.:1
    ##  Max.   :1

It seems that mice 35 and 23 were sacrificed due to end of study and not because their tumor volumes exceeded the ethical limit of 1000 mm^3. Let's look at the data from all the mice that had smaller tumor volumes at the time of sacrifice.

    subset(sacrificed, volume<1000)
    ##     day mouseid volume treatment dead sacrificed
    ##                            radio    0          1
    ##                            radio    0          1
    ##                            radio    0          1
    ##                            radio    0          1
    ##                            chemo    0          1
    ##                            chemo    0          1
    ##                            chemo    0          1
    ##                            chemo    0          1
    ##                            chemo    0          1
    ##                            contr    0          1
    ##                            contr    0          1

There are quite a few mice that were sacrificed before their tumor reached the critical limit. We learn from the investigator that in practice mice were sacrificed already when the tumor volume reached a size of approximately 900 mm^3. This was considered most ethical since mice are only followed up on Mondays, Wednesdays, and Fridays. Lowering the limit to 875 mm^3 leaves us with the following mice:

    subset(sacrificed, volume<875)
    ##     day mouseid volume treatment dead sacrificed
    ##                            radio    0          1
    ##                            chemo    0          1
    ##                            contr    0          1

Besides mouse number 23, which was killed at end of study, we see that mouse number 34 and mouse number 60 were sacrificed before their tumors reached the critical size (both had wounds and were not thriving).

We are now confident that we have the correct survival data and that sacrifice is predominantly due to progression in tumor growth (we have one censoring at end of study and two deaths due to other causes). Strictly speaking the survival time is the day of sacrifice minus one, so we need to add this variable to the data before we do descriptive statistics. For completeness we also add a censoring variable:

    # cens is TRUE for the mouse that was still below the limit at end of study
    sacrificed <- transform(sacrificed, time=day-1, cens=(day==39)&(volume<875))
    summary(sacrificed[c('day','time','cens')])
    ##       day            time          cens
    ##  Min.   : 4.0   Min.   : 3.0   Mode :logical
    ##  1st Qu.: 6.5   1st Qu.: 5.5   FALSE:25
    ##  Median :19.0   Median :18.0   TRUE :1
    ##  Mean   :18.5   Mean   :17.5   NA's :0
    ##  3rd Qu.:27.0   3rd Qu.:26.0
    ##  Max.   :39.0   Max.   :38.0

At last we can compare survival in the three groups. This would usually be done in a Kaplan-Meier plot, but since follow-up is the same for all mice and there is only one censoring, making a boxplot of the time variable is ok.

    boxplot(sacrificed$time~sacrificed$treatment)

[Figure: side-by-side boxplots of survival time for contr, chemo, and radio]

As it appears, survival is better in the active treatment groups than in the control group. Also, the mice that got chemotherapy have survived somewhat longer than those that got radiotherapy. Whether these are significant findings we cannot say. We would have to conduct a formal statistical analysis.
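For readers who want to see what the Kaplan-Meier alternative could look like, here is a minimal sketch. It is not part of the solution above: it assumes the survival package (one of the recommended packages distributed with R) and uses invented survival times, not the mouse data. The column names time, cens, and treatment merely mimic those used above.

```r
library(survival)   # assumption: the recommended 'survival' package is installed

# Invented survival data, NOT the mouse data: time in days and a censoring flag
toy <- data.frame(
  time      = c(3, 5, 8, 10, 12, 17, 19, 24, 26, 31, 33, 38),
  cens      = c(rep(FALSE, 11), TRUE),
  treatment = factor(rep(c("contr", "chemo", "radio"), times = 4),
                     levels = c("contr", "chemo", "radio"))
)

# Surv() wants an event indicator, i.e. the opposite of the censoring flag
km <- survfit(Surv(time, !cens) ~ treatment, data = toy)
print(km)           # number of mice, number of events, and median survival per group

plot(km, lty = 1:3, xlab = "Days since baseline", ylab = "Proportion surviving")
legend("topright", legend = levels(toy$treatment), lty = 1:3)
```

With the real data one would replace toy by the sacrificed data frame; a formal comparison of the curves would still require a statistical test.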
