Homework Assignment #5, Data Mining: SOLUTIONS

1. (a) Create a plot showing the location of each state, with longitude on the horizontal axis, latitude on the vertical axis, and the states' names or abbreviations in the appropriate positions. Include your code.
Answer: This is basically the same as plotting tumor types against principal components from HW 4. See Figure 1.

(b) Using the factanal command from R with the scores="regression" option, do a one-factor analysis of state.x77. Include the command you used and R's output.
Answer:
> state.fa <- factanal(state.x77, factors=1, scores="regression")
> state.fa

Call:
factanal(x = state.x77, factors = 1, scores = "regression")

The printed output then gives the uniquenesses and the one-factor loadings for the eight variables (Population, Income, Illiteracy, Life Exp, Murder, HS Grad, Frost, Area), followed by the sum of squared loadings and the
plot(state.center, type="n")
text(state.center, state.abb)

Figure 1: The states in their locations.
proportion of variance explained. The output ends with a test of the hypothesis that 1 factor is sufficient, reporting the chi-square statistic, its degrees of freedom, and a p-value which is essentially zero.

(c) Describe the factor you obtained in the previous part in terms of the observable features.
Answer: The factor has strong positive loadings on high school graduation, frost and life expectancy, and big negative loadings on illiteracy and homicide rates. So high-factor states tend to be well-educated, long-lived and peaceful, while life in low-factor states tends to be nasty, brutish and short. There is a weaker positive relationship between the factor and income, and much weaker ones to area and population. (No factor loading is printed for area because it's so small.)

(d) Plot the states by location, with the sizes of the states' labels being a linearly increasing function of their factor scores. You should control the minimum and maximum size of the labels. (Remember that many of the factor scores will be negative.) Include your code, and comment on the map it produces. Hint: The cex option to functions like text can be a vector. Alternately, use the scatterplot3d command, from the package of that name, to make a three-dimensional plot, with the z axis being the factor score. If you do this, make sure to orient the plot so it is legible, and the states are clearly distinguished.
Answer: The scaled-size plot is definitely easier to accomplish. The basic idea is
plot(state.center, type="n", xlab="longitude", ylab="latitude")
text(state.center, state.abb, cex=state.fa$scores[,1])
This however will not work very well, since some of the factor scores are negative (Figure 2). However, it's easy enough to fix, as in Code Example 1. The result (after a little tweaking of the minimum and maximum sizes to keep things legible) is Figure 3. To check that the linear rescaling is working, I also (Figure 4) plot the original factor scores against the rescaled factor scores.
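The rescaling used for the label sizes is just a min-max transformation onto a target range. As a self-contained sketch (this rescale helper is illustrative, not part of the assignment code):

```r
# Hypothetical helper: linearly map a numeric vector (possibly with
# negative entries) onto [min.size, max.size], suitable as cex values.
rescale <- function(x, min.size = 0.5, max.size = 1.5) {
  min.size + (max.size - min.size) * (x - min(x)) / (max(x) - min(x))
}

scores <- c(-2, -1, 0, 1, 2)    # fake factor scores, some negative
rescale(scores)                 # smallest score -> 0.5, largest -> 1.5
```

Because the map is affine, plotting raw against rescaled scores must give a straight line with positive slope, which is exactly the check made in Figure 4.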
The three-dimensional plot is a bit trickier, not in principle, but just in practice. The text function adds textual labels to an existing plot, but it only knows about two-dimensional coordinate systems; in fact your screen only knows about 2D coordinates! You could imagine that the 3D plotting library would have a command like text3d, but the problem is that it wouldn't know how to translate from three-dimensional coordinates to the two-dimensional graphics window. To get around this, the object returned by the scatterplot3d function
Figure 2: First attempt at producing a map with sizes proportional to factor scores. Doesn't work very well.
rescaled.scores = plot.states_scaled(state.fa$scores[,1], min.size=0.3,
  max.size=1.5, xlab="longitude", ylab="latitude")

Figure 3: States with labels proportional to factor scores.
plot(state.fa$scores[,1], rescaled.scores, type="n", xlab="raw factor score",
  ylab="rescaled for plotting")
text(state.fa$scores[,1], rescaled.scores, state.abb)

Figure 4: Check that the linear rescaling in Code Example 1 and Figure 3 is working properly: the raw and the rescaled scores should fall on a straight line with positive slope, and they do.
# Plot the state abbreviations in position, with scaled sizes
# Linearly scale the sizes from the given minimum to the maximum
# Inputs: vector of raw numbers, minimum size for plot,
#   maximum size
# Output: rescaled sizes (invisibly)
plot.states_scaled <- function(sizes, min.size=0.4, max.size=2, ...) {
  out.range = max.size - min.size
  in.range = max(sizes) - min(sizes)
  scaled.sizes = out.range*((sizes - min(sizes))/in.range)
  sizes = scaled.sizes + min.size
  plot(state.center, type="n", ...)
  text(state.center, state.abb, cex=sizes)
  invisible(sizes)
}

Code Example 1: Plot the states' abbreviations in position, with controllable sizes, scaled linearly from the minimum to the maximum. Returns the rescaled sizes, in order, invisibly (for testing/debugging).

actually does this translation, since one of its attributes is a function, xyz.convert. (See the help files for that function, and the examples it gives, including ones using text.) See Code Example 2 and Figure 5.

(e) Part of the output of the factanal command is the p-value of the likelihood ratio test for comparing the fitted factor model to the unrestricted multivariate Gaussian. Plot this p-value against q, the number of factors. Include your code.
Answer: The p-value is stored in the $PVAL attribute of the returned object. R won't let us fit more than four factors to eight features, so this is all we can do:
> pvalues = sapply(1:4, function(q){factanal(state.x77, factors=q)$PVAL})
> signif(pvalues, 2)
The four p-values are all below the 5% level, the value at q = 4 only just. (Exercise: what's going on inside sapply here?) Figure 6 shows the plot.

(f) Is it plausible that there is really only one factor? Explain, and justify your answer in terms of R's output, not your general knowledge of US geography.
Answer: It's astoundingly implausible. The p-value is as close to zero as you could hope to see.

2. (a) Do a PCA of zip.train, being sure to omit the first column. What command do you use? Why should you omit the first column?
Figure 5: Output of Code Example 2. Further tweaking with the plotting options there could have changed the perspective, which I would recommend for a serious presentation (e.g., right now Utah partially occludes Washington state, and Vermont New Hampshire), but that would be overkill for this assignment.
plot(1:4, pvalues, xlab="q (number of factors)", ylab="pvalue",
  log="y", ylim=c(1e-10, 0.4))
abline(h=0.05, lty=2)

Figure 6: Plot of p-values versus the number of factors for the state.x77 data. The y axis is on a logarithmic scale, to accommodate the wide range of p-values. The horizontal line near the top shows the 5% significance level, which is the conventional limit for publishability; the value at q = 4 is just below the limit.
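As a hint toward the exercise about sapply: sapply evaluates the anonymous function once for each element of its first argument and simplifies the resulting list into a vector. A minimal sketch, with squaring as a cheap stand-in for the factanal call:

```r
# sapply(X, FUN) applies FUN to each element of X, then simplifies
# the list of results to a vector when the pieces are all scalars.
qs <- 1:4
out <- sapply(qs, function(q) { q^2 })  # stand-in for factanal(...)$PVAL
out  # 1 4 9 16, one entry per candidate q
```

The real call works the same way: each q in 1:4 is fed to factanal, and the scalar $PVAL values are collected into the pvalues vector.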
require(scatterplot3d)
state.xyz <- cbind(state.center$x, state.center$y,
  state.fa$scores[,1])
colnames(state.xyz) = c("x", "y", "z")
state.3d <- scatterplot3d(state.xyz, type="h", xlab="longitude",
  ylab="latitude", zlab="factor score", cex.symbol=0.1, color="grey")
text(state.3d$xyz.convert(state.xyz), state.abb)

Code Example 2: Code to make the states' factor scores the z axis. The type="h" option draws lines (here set to grey) to connect states to the x-y (longitude-latitude) plane, for visual clarity. (The lines have to end in plotting symbols, but those are made invisibly small by the cex.symbol=0.1 option.) The output of scatterplot3d is an object which contains several functions to be used in further decorating the graph. One of them, xyz.convert, changes three-dimensional coordinates for the plot into two-dimensional coordinates for the graphics device; we call that inside the text function that adds the states' labels.

Answer: We omit the first column because it's really a discrete label (0, 1, ..., 9), and not a numerical feature at all! Actually doing the PCA:
require(ElemStatLearn)
data(zip.train)
zip.pca = prcomp(zip.train[,-1])
Running the last command inside system.time, it takes only a few seconds to do PCA on the complete data.

(b) Make plots of the projections of the data on to the first two and three principal components. (For the 3D plot, use the function scatterplot3d from the package of that name.) Include the commands you used as well as the plots. On both plots, indicate which points come from which digits, and make sure that this is legible in what you turn in. (E.g., if you use colors, make sure they look distinct on your printout. You might try pch=as.character(zip.train[,1]).) Comment on the results.
Answer: See Figures 7 and 8 for the plots and the commands used to make them. Mostly, these are big impenetrable blurs. There seems to be a compact cluster of 1s with low scores on both of the first two components, and a diffuse fan of zeroes with high scores (and a much broader range than the 1s).
4s tend to go near digits of similar shape. In the 3D plot, there is a second compact cluster, slightly to the right of the 4s. Mostly,

Some of these observations are clearer with color as well as symbol; add the option
however, there's a big, big mess in the middle, where lots of different digits are all intermingled.

(c) Use the code from lecture to do an LLE with q = 3. Include the commands you used.
Answer: At this point, running the procedures on the full data becomes impractical. So I'll take the first 500 rows of the data frame and just use them.
> source("~/teaching/5/lectures/4/lecture-4.r")
> zip.small <- zip.train[1:500,]
> system.time(zip.lle <- lle(zip.small[,-1], 3, 20))
> dim(zip.lle)
[1] 500   3
Notes: You don't need to include the source command, which in any case should point to where the file is on your system, not mine! Running lle inside system.time does the assignment for us, but also gives us the amount of time taken to execute it. Notice that lle with just 500 data points takes half as much time as PCA on the full data; lle on the full data is very slow indeed. Also, the last line isn't necessary, but it does check that we're getting the right sort of output (a matrix of three-dimensional coordinates).

(d) Make 2D and 3D plots of the data, as before, but with the LLE coordinates. Comment.
Answer: See Figures 9 and 10. The two-dimensional plot is hard to interpret (at least for me). The three-dimensional plot shows the points falling on a shape like a saddle, or a sail. At one corner (negative on all three coordinates) we have the digits whose shapes have a single vertical line on the right, 4s among them, which shade into one another moving right and up along this edge. At the upper left there are zeroes, which change into other round-bottomed digits as we move right but stay up (preserving a rounded stroke at the base of the numeral), but turn into round-topped digits as we move down (preserving a rounded stroke in the top part of the numeral). 8s sit towards the middle of the figure, around the intersection of round on top and round on the bottom.

(e) Run k-means with k = 10 on (i) the raw data, (ii) the 2D PCA projections and (iii) the 3D LLE.
Calculate the variation-of-information

color=(zip.train[,1]+1) to the 3d plotting command. (The +1 is because color 0 is the background color.)

At least with my very-far-from-optimized R implementation! People do use these procedures on large real-world data, but with better coding.

system.time is one of the few R commands which works through side-effects here, evaluating its argument, and modifying the workspace if need be.
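The variation.of.info function used below comes from the homework 4 solutions and is not reproduced here. As a self-contained sketch of the quantity itself, VI(X, Y) = H(X) + H(Y) - 2 I(X; Y), computed from the contingency table of the two labelings (this vi.sketch helper is my own illustration, not the course code):

```r
# Variation of information between two clusterings, given as label
# vectors: entropy of each clustering minus twice their mutual
# information, all estimated from the joint table of labels.
vi.sketch <- function(a, b) {
  p.ab <- table(a, b) / length(a)          # joint distribution
  p.a <- rowSums(p.ab); p.b <- colSums(p.ab)  # marginals
  H <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
  nz <- p.ab > 0
  mi <- sum(p.ab[nz] * log(p.ab[nz] / outer(p.a, p.b)[nz]))
  H(p.a) + H(p.b) - 2 * mi
}

vi.sketch(c(1, 1, 2, 2), c(1, 1, 2, 2))  # identical clusterings -> 0
```

VI is zero exactly when the two clusterings agree up to relabeling, and grows as they share less information, which is why smaller values below mean a clustering closer to the true digit classes.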
jpeg(file="zip-pca-2d.jpg")
plot(zip.pca$x[,1:2], pch=as.character(zip.train[,1]), cex=0.5)
dev.off()

Figure 7: Plotting the first two principal components of the zip.train data. The first line of the code tells R to redirect graphics commands to a jpeg file; the third line turns off the redirection. Because there are many thousands of data points, each fractionally different from each other, the PDF file R would normally produce would be many megabytes, while the jpeg (with the same visible detail) is far smaller.
jpeg(file="zip-pca-3d.jpg")
scatterplot3d(zip.pca$x[,1:3], pch=as.character(zip.train[,1]), cex.symbol=0.5)
dev.off()

Figure 8: Plotting the first three principal components of the zip.train data. The first and third lines write the image to a jpeg file, to keep file sizes under control; see Fig. 7.
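As a reminder of what zip.pca$x holds: prcomp returns the score of each observation on each principal component in its $x component, so projecting onto the first components is just column selection. A toy sketch on made-up data:

```r
# prcomp returns (among other things) $x, the coordinates of each
# data point in the principal-component basis; these scores equal
# the centered data multiplied by the rotation (loadings) matrix.
set.seed(1)
toy <- matrix(rnorm(100 * 4), ncol = 4)  # 100 points in 4 dimensions
toy.pca <- prcomp(toy)
dim(toy.pca$x)               # 100 4: one score per point per component
proj2 <- toy.pca$x[, 1:2]    # projection onto the first two components
```

This is exactly the structure exploited in the plotting commands above, where zip.pca$x[,1:2] and zip.pca$x[,1:3] pick out the low-dimensional projections.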
plot(zip.lle[,1:2], pch=as.character(zip.small[,1]))

Figure 9: LLE coordinates in two dimensions.
scatterplot3d(zip.lle, pch=as.character(zip.small[,1]), cex.symbol=0.5)

Figure 10: LLE coordinates in three dimensions.
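The clustering step in part (e) uses kmeans, whose return value is a list; the $cluster component holds the hard assignment of each row, which is what gets compared to the true classes. A toy sketch on easily separated data:

```r
# k-means on an easy two-cluster toy set; $cluster gives one integer
# label per row of the input matrix.
set.seed(2)
toy <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),   # 25 points near 0
             matrix(rnorm(50, mean = 5), ncol = 2))   # 25 points near 5
km <- kmeans(toy, centers = 2)
table(km$cluster)   # size of each cluster
```

Note that the cluster numbers themselves are arbitrary, which is why a label-permutation-invariant comparison like the variation of information is needed when scoring the clusterings against the digit classes.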
distance of all three clusterings from the true classes (as given by the first column of zip.train). Comment.
Answer: We need the variation-of-information code from homework 4, which in turn needs functions from lecture 5.
> source("~/teaching/5/lectures/5/lecture-5.r")
> source("~/teaching/5/hw/4/solutions-4.r")
> raw.cluster = kmeans(zip.train[,-1], centers=10)$cluster
> pca.cluster = kmeans(zip.pca$x[,1:2], centers=10)$cluster
> lle.cluster = kmeans(zip.lle, centers=10)$cluster
> variation.of.info(raw.cluster, zip.train[,1])
> variation.of.info(pca.cluster, zip.train[,1])
> variation.of.info(lle.cluster, zip.small[,1])
Even though the LLE coordinates are only three dimensional, clustering using them is almost as accurate as clustering using the complete data (with 256 dimensions); clustering using the first two principal components is not so accurate. This suggests that the LLE does a better job than PCA of retaining the information in the features which is relevant to the classes, as we might have guessed from the figures. Of course, whether any given classifier method can actually use that information is a different question.

3. (Extra credit) Download the diffusionMap package from CRAN. Prepare a 3D scatterplot of the data, as in problem 2, using diffuse. Repeat the clustering from the end of problem 2 with the diffusionKmeans function, and calculate the distance of this clustering from the true classes. Comment on these results.
Answer:
> require(diffusionMap)
> system.time(zip.diff <- diffuse(dist(zip.small), maxdim=3))
[1] "Performing eigendecomposition"
[1] "Computing Diffusion Coordinates"
> scatterplot3d(zip.diff$X, pch=as.character(zip.small[,1]), cex.symbol=0.5)
The diffuse function computes the actual diffusion map; it takes as its argument not the data set, but a distance matrix made from the data set. (This lets it work with arbitrary distance functions, including ones for qualitative data.)
It will compute a default number of dimensions to use, but here I insist on 3. There's a default plotting method for diffusion map
objects (plot.dmap), but it doesn't allow for much control of the results, so I called scatterplot3d directly. The shape of the data here is similar to that of the LLE, but the 1s form a more compact triangular cluster, and the zeros a broader flap or tongue going off from the main cluster. Parts of this look sensible: the close proximity of visually similar digits to the 4s, for example. The central core is, however, a bit messy.
> diff.cluster = diffusionKmeans(zip.diff, K=10)$part
> variation.of.info(diff.cluster, zip.small[,1])
This is a little bit worse than the LLE, as we suspected from the figure. However, the difference is quite small, and re-running k-means, it sometimes comes out the other way.
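Since diffuse consumes a distance matrix rather than the raw data, any dissimilarity can be substituted. A small base-R illustration of constructing such dist objects (diffusionMap itself is not needed for this sketch):

```r
# Build toy data and two different distance matrices; an object of
# either kind is what a distance-based method like diffuse() expects.
x <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)
d.euc <- dist(x)                        # Euclidean distances
d.man <- dist(x, method = "manhattan")  # an alternative metric
as.matrix(d.euc)[1, 2]   # 5: the 3-4-5 right triangle
as.matrix(d.man)[1, 2]   # 7: |3 - 0| + |4 - 0|
```

Swapping in a different metric changes the geometry the diffusion map sees without any change to the rest of the pipeline, which is the flexibility the parenthetical remark above is pointing at.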