Biology 6317, Project 1
Data and illustrations courtesy of Professor Tony Frankino, Department of Biology/Biochemistry

1. Background

The data set www.math.uh.edu/~charles/wing_xy.dat contains measurements related to the wing shape of three fruit fly species: Drosophila mauritiana, D. sechellia, and D. simulans. It consists of 30 wing measurements made on each of 138 flies: 42 of D. mauritiana, 48 of D. sechellia, and 48 of D. simulans. The measurements are the two-dimensional coordinates of 15 landmarks defined by intersections of wing veins with each other or with the wing margin. See the figures below.
The purpose of this exercise is to determine how well these measurements can distinguish among the three species. You are going to construct a classification tree that assigns one of the species categories to each set of values of the numeric measurements. The tree's classifications will be compared to the actual species labels to estimate a classification error rate. A small error rate will show that the three species are easily distinguishable from the wing shape measurements.

A binary classification tree begins with all the cases (flies) gathered into one group, the root node of the tree. That group is then split into two subgroups (daughter nodes) based on a threshold value of one of the measurement variables. The variable and its threshold value are chosen to maximize the decrease in total deviance. The deviance at a node of the tree which has N flies in species proportions p_1, p_2, p_3 is

    D = -N * sum_{i=1}^{3} p_i log p_i
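The node deviance defined above is easy to compute directly in R. The sketch below is illustrative only; the function and variable names are my own, not part of the assignment.

```r
# Node deviance: D = -N * sum(p_i * log(p_i)),
# where p_i are the species proportions among the N flies at the node.
# counts: vector of species counts at the node, e.g. c(10, 3, 1).
node_deviance <- function(counts) {
  N <- sum(counts)
  p <- counts / N
  p <- p[p > 0]              # treat 0 * log(0) as 0
  -N * sum(p * log(p))
}

node_deviance(c(20, 0, 0))   # pure node: deviance 0
node_deviance(c(10, 10, 10)) # maximally mixed node: 30 * log(3)
```

A pure node (all one species) has deviance 0, and deviance is largest when the species are equally represented, so decreasing deviance pushes nodes toward purity.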
It can be shown that in splitting a node, the sum of the daughter deviances is less than the parent deviance. Daughter nodes are split in turn until they reach a minimum size or the reduction in deviance falls below a specified level. At the end of the construction, each terminal node (one that is not split) is labeled with the species that is most numerous at that node. Classification and regression trees were first developed by Breiman, Friedman, Olshen, and Stone [1]. A good textbook treatment is given by Venables and Ripley [2].

The figure below shows a classification tree made for another application. The criterion for each split is printed at the parent node. If the criterion is satisfied, the left-hand branch of the tree is followed; otherwise, the right-hand branch is followed.

[Figure: example classification tree with split criteria such as reg.irr=reg, meas>=0.04783, and stria.yn=n printed at the internal nodes, and class labels (serr, micro, sm) at the terminal nodes.]
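The claim that a split never increases total deviance can be checked numerically. The counts below are made up for illustration; they are not the fly data.

```r
# Node deviance: D = -N * sum(p_i * log(p_i))
node_deviance <- function(counts) {
  N <- sum(counts)
  p <- counts[counts > 0] / N
  -N * sum(p * log(p))
}

parent <- c(30, 20, 10)   # hypothetical species counts at a parent node
left   <- c(28, 2, 1)     # daughter counts after a (good) split
right  <- parent - left   # the other daughter gets the rest

parent_dev   <- node_deviance(parent)
daughter_dev <- node_deviance(left) + node_deviance(right)
daughter_dev < parent_dev # TRUE: the split decreases total deviance
```

Equality holds only when both daughters have the same species proportions as the parent, in which case the split is useless and the tree-growing algorithm would not make it.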
2. Constructing the Tree with R

Begin by importing the data into R with the read.table function or with RStudio. Name your data frame anything you like; I used flies. Since the data file does not have a header row, R will assign default names to the variables. The 30 variables are the wing measurements, and you can use the names R assigns to them. The next thing to do is to add a column to the data set containing the species of each of the 138 flies.

> species=factor(c(rep("maur",42), rep("sech",48), rep("simul",48)))
> flies=cbind(flies,species)
> summary(flies)

To construct the tree, you will need the tree library of R. Load it by

> library(tree)

The function that creates the tree is also called tree. Read the help file about it.

> help(tree)

Click on the index link at the bottom of the help page and read about some of the other functions in the library, particularly plot.tree, text.tree, and predict.tree.

You are going to divide the data into a training set and a test set. Your classification tree will be built with the training data and tested for accuracy on the test data. Randomly pick about 2/3 of the cases in flies for the training data.

> train=sample(138,92,replace=FALSE)                  #1

The training data will be in the data frame flies[train, ] and the test data in flies[-train, ]. The tree is built with the command

> flies.tree=tree(species~., data=flies[train, ])     #2
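The data-preparation and splitting steps above can be sketched end to end. Since the wing_xy.dat file is not bundled here, this sketch uses a simulated stand-in for the measurement matrix; everything else follows the assignment's commands.

```r
# Stand-in for the fly data: 138 rows of 30 numeric measurements.
# (Simulated here; in the assignment these come from read.table on wing_xy.dat.)
set.seed(1)
flies <- as.data.frame(matrix(rnorm(138 * 30), nrow = 138))

# Add the species column: 42 mauritiana, 48 sechellia, 48 simulans.
flies$species <- factor(c(rep("maur", 42), rep("sech", 48), rep("simul", 48)))

# Randomly pick 92 of the 138 cases (about 2/3) for training.
train <- sample(138, 92, replace = FALSE)

train_set <- flies[train, ]   # used to build the tree
test_set  <- flies[-train, ]  # held out to estimate the error rate

nrow(train_set)               # 92
nrow(test_set)                # 46
```

Wrapping species in factor() matters: tree() builds a classification tree only when the response is a factor, and recent versions of R no longer convert character columns to factors automatically.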
The formula species~. tells R that the classes to be separated are the levels of the variable species and that all the other variables are inputs to the classification procedure. Print and plot the results with

> flies.tree                                          #3
> plot(flies.tree,type="uniform")                     #4
> text(flies.tree)                                    #5

Finally, classify the test data and count the number of classification errors.

> flies.pred=predict(flies.tree, flies[-train, ], type="class")   #6
> sum(flies.pred != flies[-train, "species"])                     #7

This number divided by 46 (the number of test cases) is your estimate of the classification error rate. Repeat steps #1 through #7 several times and observe the results. In particular, note how many terminal nodes the trees have and which variables are involved in the splits. Turn in the results of #3, #5, and #7 for one of these simulations and comment on your observations.

3. Dimension Reduction with Principal Components

The numeric variables in this data set are highly correlated. Furthermore, the number of variables is a substantial fraction of the number of cases. This suggests that using fewer variables for classification might be feasible and might result in a more robust classification procedure. We can do this by expressing the 30-dimensional data vector for each fly as a linear combination of the 30 orthogonal unit eigenvectors of the variance-covariance matrix. These linear combinations are uncorrelated, and each eigenvalue is the variance of the data in the direction of the associated eigenvector. Thus, we may be able to keep only the first few components of the data relative to the principal eigenvectors and still capture most of the variation. Create a principal components object as follows.
> newflies=prcomp(flies[,1:30])

This finds the eigenvalues and eigenvectors of the variance-covariance matrix and the components of each fly relative to the basis of eigenvectors. The components are stored in a 138 x 30 matrix newflies$x. Have a look at the standard deviations (the square roots of the eigenvalues) in the eigendirections by

> newflies$sdev

and decide how many components you want to use for a classification tree. Suppose for the sake of argument that you want 6. Make a data frame with a column for the species by

> newflies=data.frame(newflies$x[, 1:6], species)

Now construct the tree with

> newflies.tree=tree(species~., data=newflies[train, ])

Repeat the steps you went through with the raw data above. Compare the results. Turn in your work.

References

[1] Breiman L, Friedman JH, Olshen RA, Stone CJ (1984). Classification and Regression Trees. Wadsworth.
[2] Venables WN, Ripley BD (2002). Modern Applied Statistics with S. 4th ed. Springer.
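As a supplement to Section 3: one common way to decide how many components to keep is to look at the cumulative proportion of variance explained, using the fact that prcomp's sdev values are the square roots of the eigenvalues. The data below are simulated stand-ins, and the 90% cutoff is just an example threshold, not one prescribed by the assignment.

```r
# Simulated stand-in for the 138 x 30 measurement matrix.
set.seed(1)
X <- matrix(rnorm(138 * 30), nrow = 138)
X[, 1] <- X[, 1] * 5          # give one direction extra variance

pc <- prcomp(X)

# Eigenvalues of the variance-covariance matrix are pc$sdev^2.
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components capturing, say, 90% of the variance.
k <- which(cum_var >= 0.90)[1]
k
```

With the real wing data, which is highly correlated, the cumulative proportion should climb much faster than for this uncorrelated simulation, which is exactly why a handful of components may suffice.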