Homework Assignment #5, Data Mining: SOLUTIONS


1. (a) Create a plot showing the location of each state, with longitude on the horizontal axis, latitude on the vertical axis, and the states' names or abbreviations in the appropriate positions. Include your code.

Answer: This is basically the same as plotting tumor types against principal components from HW 4. See Figure 1.

(b) Using the factanal command from R with the scores="regression" option, do a one-factor analysis of state.x77. Include the command you used and R's output.

Answer:

> state.fa <- factanal(state.x77, factors=1, scores="regression")
> state.fa

Call:
factanal(x = state.x77, factors = 1, scores = "regression")

Uniquenesses:
Population  Income  Illiteracy  Life Exp  Murder  HS Grad  Frost  Area
    .5          .        .5        .4        .       .4       .      .

Loadings:
            Factor1
Population   -.
Income        .45
Illiteracy   -.5
Life Exp      .5
Murder       -.
HS Grad       .
Frost         .
Area

                Factor1
SS loadings          .
Proportion Var       .

Test of the hypothesis that 1 factor is sufficient.
The chi square statistic is . on 20 degrees of freedom.
The p-value is .4e-
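One optional way to set up the interpretation asked for in part (c) is to look at the loadings sorted by size; a minimal sketch, reusing the state.fa object fitted above:

## Sort the variables by their loading on the single factor
round(sort(loadings(state.fa)[, 1]), 2)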

plot(state.center, type="n")
text(state.center, state.abb)

Figure 1: The states in their locations (longitude on the horizontal axis, latitude on the vertical axis).

(c) Describe the factor you obtained in the previous part in terms of the observable features.

Answer: The factor has strong positive loadings onto high-school graduation, frost and life expectancy, and big negative loadings onto illiteracy and homicide rates. So high-factor states tend to be well-educated, long-lived and peaceful, while life in low-factor states tends to be nasty, brutish and short. There is a weaker positive relationship between the factor and income, and much weaker ones to area and population. (No factor loading is printed for Area because it's so small.)

(d) Plot the states by location, with the sizes of the states' labels being a linearly increasing function of their factor scores. You should control the minimum and maximum size of the labels. (Remember that many of the factor scores will be negative.) Include your code, and comment on the map it produces. Hint: the cex option to functions like text can be a vector. Alternately, use the scatterplot3d command, from the package of that name, to make a three-dimensional plot, with the z axis being the factor score. If you do this, make sure to orient the plot so it is legible, and the states are clearly distinguished.

Answer: The scaled-size plot is definitely easier to accomplish. The basic idea is

plot(state.center, type="n", xlab="longitude", ylab="latitude")
text(state.center, state.abb, cex=state.fa$scores[,1])

This, however, will not work very well, since some of the factor scores are negative (Figure 2). It's easy enough to fix, as in Code Example 1. The result (after a little tweaking of the minimum and maximum sizes to keep things legible) is Figure 3. To check that the linear rescaling is working, I also plot the original factor scores against the rescaled factor scores (Figure 4).

The three-dimensional plot is a bit trickier, not in principle, but just in the mechanics. The text function adds textual labels to an existing plot, but it only knows about two-dimensional coordinate systems; in fact, your screen only knows about 2D coordinates! You could imagine that the 3D plotting library would have a command like text3d, but the problem is that it wouldn't know how to translate from three-dimensional coordinates to the two-dimensional graphics window. To get around this, the object returned by the scatterplot3d function actually does this translation, since one of its attributes is a function, xyz.convert. (See the help files for that function, and the examples it gives, including ones using text.) See Code Example 2 and Figure 5.

Figure 2: First attempt at producing a map with label sizes proportional to factor scores; it doesn't work very well.
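Since cex is a size multiplier, the negative and zero factor scores cannot be used as label sizes directly. The fix used in Code Example 1 below is to map the raw scores x affinely onto a chosen interval [min.size, max.size]:

    cex_i = min.size + (max.size - min.size) * (x_i - min(x)) / (max(x) - min(x))

which is exactly the scaled.sizes calculation inside that function.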

rescaled.scores = plot.states_scaled(state.fa$scores[,1], min.size=., max.size=.5,
                                     xlab="longitude", ylab="latitude")

Figure 3: States with labels proportional to factor scores.

plot(state.fa$scores[,1], rescaled.scores, type="n",
     xlab="raw factor score", ylab="rescaled for plotting")
text(state.fa$scores[,1], rescaled.scores, state.abb)

Figure 4: Check that the linear rescaling in Code Example 1 and Figure 3 is working properly: the raw and the rescaled scores should fall on a straight line with positive slope, and they do.

# Plot the state abbreviations in position, with scaled sizes
# Linearly scale the sizes from the given minimum to the maximum
# Inputs: vector of raw numbers, minimum size for plot, maximum size
# Outputs: rescaled sizes (returned invisibly)
plot.states_scaled <- function(sizes, min.size=.4, max.size=, ...) {
  out.range = max.size - min.size
  in.range = max(sizes) - min(sizes)
  scaled.sizes = out.range*((sizes - min(sizes))/in.range)
  sizes = scaled.sizes + min.size
  plot(state.center, type="n", ...)
  text(state.center, state.abb, cex=sizes)
  invisible(sizes)
}

Code Example 1: Plot the states' abbreviations in position, with controllable sizes, scaled linearly from the minimum to the maximum. Returns the rescaled sizes, in order, invisibly (for testing/debugging).

(e) Part of the output of the factanal command is the p-value of the likelihood ratio test for comparing the fitted factor model to the unrestricted multivariate Gaussian. Plot this p-value against q, the number of factors. Include your code.

Answer: The p-value is stored in the $PVAL attribute of the returned object. R won't let us fit more than four factors to eight features, so this is all we can do:

> pvalues = sapply(1:4, function(q){factanal(state.x77, factors=q)$PVAL})
> signif(pvalues,)
objective objective objective objective
      .e-      .e-5      4.e-      4.e-

(Exercise: what's going on inside sapply here?) Figure 6 shows the plot.

(f) Is it plausible that there is really only one factor? Explain, and justify your answer in terms of R's output, not your general knowledge of US geography.

Answer: It's astoundingly implausible. The p-value for the one-factor model (the first entry in the output above) is as close to zero as you could hope to see.
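For the exercise about sapply: the call above is just a compact way of looping over q. A sketch of the equivalent explicit loop, which produces the same numbers:

pvalues <- numeric(4)
for (q in 1:4) {
  fa.q <- factanal(state.x77, factors = q)  # fit a q-factor model
  pvalues[q] <- fa.q$PVAL                   # p-value of the likelihood ratio test
}
pvalues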

Figure 5: Output of Code Example 2: the states plotted by longitude and latitude, with the factor score as the z axis. Further tweaking of the plotting options there could have changed the perspective, which I would recommend for a serious presentation (e.g., right now Utah partially occludes Washington state, and Vermont occludes New Hampshire), but that would be overkill for this assignment.
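The perspective tweak mentioned in the caption can be done through scatterplot3d's angle argument, which sets the angle between the x and y axes (the default is 40 degrees). A sketch, reusing state.xyz from Code Example 2 below; the particular angle and symbol size here are arbitrary choices for illustration:

state.3d <- scatterplot3d(state.xyz, type = "h", angle = 60,
                          xlab = "longitude", ylab = "latitude",
                          zlab = "factor score",
                          cex.symbols = 0.1, color = "grey")
text(state.3d$xyz.convert(state.xyz), state.abb)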

plot(1:4, pvalues, xlab="q (number of factors)", ylab="pvalue", log="y", ylim=c(e-,.4))
abline(h=.05, lty=)

Figure 6: Plot of p-values versus the number of factors for the state.x77 data. The y axis is on a logarithmic scale, to accommodate the wide range of p-values. The horizontal line near the top shows the 5% significance level, which is the conventional limit for publishability; the largest of the plotted p-values is just below that limit, at about 4.4%.

require(scatterplot3d)
state.xyz <- cbind(state.center$x, state.center$y, state.fa$scores[,1])
colnames(state.xyz) = c("x","y","z")
state.3d <- scatterplot3d(state.xyz, type="h", xlab="longitude", ylab="latitude",
                          zlab="factor score", cex.symbol=., color="grey")
text(state.3d$xyz.convert(state.xyz), state.abb)

Code Example 2: Code to make the states' factor scores the z axis. The type="h" option draws lines (here set to grey) to connect states to the x-y (longitude-latitude) plane, for visual clarity. (The lines have to end in plotting symbols, but those are made invisibly small by the cex.symbol option.) The output of scatterplot3d is an object which contains several functions to be used in further decorating the graph. One of them, xyz.convert, changes three-dimensional coordinates for the plot into two-dimensional coordinates for the graphics device; we call that inside the text function that adds the states' labels.

2. (a) Do a PCA of zip.train, being sure to omit the first column. What command do you use? Why should you omit the first column?

Answer: We omit the first column because it's really a discrete label (0, 1, 2, ...), and not a numerical feature at all! Actually doing the PCA:

require(ElemStatLearn)
data(zip.train)
zip.pca = prcomp(zip.train[,-1])

Running the last command inside system.time, it takes me .5 seconds to do PCA on the complete data.

(b) Make plots of the projections of the data on to the first two and three principal components. (For the 3D plot, use the function scatterplot3d from that package.) Include the commands you used as well as the plots. On both plots, indicate which points come from which digits, and make sure that this is legible in what you turn in. (E.g., if you use colors, make sure they look distinct on your printout. You might try pch=as.character(zip.train[,1]).) Comment on the results.

Answer: See Figures 7 and 8 for the plots and the commands used to make them. Mostly, these are big impenetrable blurs. There seems to be a compact cluster of one digit class with low scores on both of the first two components, and a diffuse fan of zeroes with high scores (and a much broader range than the compact cluster). 4s tend to go near that compact cluster, and in the 3D plot it shows up again, slightly to the right of the 4s. Mostly, however, there is a big, big mess in the middle, where lots of different digits are all intermingled. (Some of these observations are clearer with color as well as symbol; add the option color=(zip.train[,1]+1) to the 3D plotting command. The +1 is because color 0 is the background color.)
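As a quick check on the claim in part (a) that the first column really is a class label rather than a pixel feature, one can tabulate it (a small sketch, using the zip.train object loaded above):

table(zip.train[, 1])   # counts of each digit label, 0 through 9
dim(zip.train[, -1])    # dimensions of the pixel features alone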

(c) Use the code from lecture to do an LLE with q = 3. Include the commands you used.

Answer: At this point, running the procedures on the full data becomes impractical, so I'll just take an initial block of rows of the data frame and use those.

> source("~/teaching/5/lectures/4/lecture-4.r")
> zip.small <- zip.train[:5,]
> system.time(zip.lle <- lle(zip.small[,-1],,))
   user  system elapsed
    .44      .5       .
> dim(zip.lle)
[1] 5 3

Notes: You don't need to include the source command, which in any case should point to where the file is on your system, not mine! Running lle inside system.time does the assignment for us, but also gives us the amount of time taken to execute it. (system.time is one of the few R commands which works through side effects: here, evaluating its argument, and so modifying the workspace if need be.) Notice that lle on just this subset takes half as much time as PCA on the full data; lle on the full data is very slow indeed, at least with my very-far-from-optimized R implementation. (People do use these procedures on large real-world data, but with better coding.) Also, the last line isn't necessary, but it does check that we're getting the right sort of output (a matrix of three-dimensional coordinates).

(d) Make 2D and 3D plots of the data, as before, but with the LLE coordinates. Comment.

Answer: See Figures 9 and 10. The two-dimensional plot is hard to interpret (at least for me). The three-dimensional plot shows the points falling on a shape like a saddle, or a sail. At one corner (negative on all three coordinates) we have a tight group of digits; moving right and up along this edge, these turn into 4s and the other shapes with a single vertical line on the right. At the upper left there are zeroes, which change into 5s and related shapes as we move right but stay up (preserving a rounded stroke at the base of the numeral), but turn into other digits, including more 5s, as we move down (preserving a rounded stroke in the top part of the numeral). Digits that are rounded both on top and on the bottom sit towards the middle of the figure, around the intersection of those two edges.

(e) Run k-means with k = 10 on (i) the raw data, (ii) the 3D PCA projections and (iii) the 3D LLE. Calculate the variation-of-information distance of all three clusterings from the true classes (as given by the first column of zip.train). Comment.

jpeg(file="zip-pca-d.jpg") plot(zip.pca$x[,:],pch=as.character(zip.train[,]),cex=.5) dev.off() Figure : Plotting the first two principal components of the zip.train data. The first line of the code tells R to redirect graphics commands to a jpeg file; the third line turns off the re-direction. Because there are many thousands of data points, each fractionally different from each other, the PDF file R would normally produce would be Mb, while the jpeg (with the same visible detail) is only 4 kb.

jpeg(file="zip-pca-d.jpg") scatterplotd(zip.pca$x[,:],pch=as.character(zip.train[,]),cex.symbol=.5) dev.off() Figure : Plotting the first three principal components of the zip.train data. The first and third lines write the image to a jpeg file, to keep file sizes under control; see Fig..

plot(zip.lle[,1:2], pch=as.character(zip.small[,1]))

Figure 9: LLE coordinates in two dimensions.

scatterplot3d(zip.lle, pch=as.character(zip.small[,1]), cex.symbol=.5)

Figure 10: LLE coordinates in three dimensions.

Answer: We need the variation-of-information code from homework 4, which in turn needs functions from lecture 5.

> source("~/teaching/5/lectures/5/lecture-5.r")
> source("~/teaching/5/hw/4/solutions-4.r")
> raw.cluster = kmeans(zip.train[,-1], centers=10)$cluster
> pca.cluster = kmeans(zip.pca$x[,1:3], centers=10)$cluster
> lle.cluster = kmeans(zip.lle, centers=10)$cluster
> variation.of.info(raw.cluster, zip.train[,1])
[1] .55
> variation.of.info(pca.cluster, zip.train[,1])
[1] .55
> variation.of.info(lle.cluster, zip.small[,1])
[1] .

Even though the LLE coordinates are only three dimensional, clustering using them is almost as accurate as clustering using the complete data (with 256 dimensions); clustering using the first three principal components is not so accurate. This suggests that the LLE does a better job than PCA of retaining the information in the features which is relevant to the classes, as we might have guessed from the figures. Of course, whether any given classifier method can actually use that information is a different question.

3. (Extra credit) Download the diffusionMap package from CRAN. Prepare a 3D scatterplot of the data, as in problem 2, using diffuse. Repeat the clustering from the end of problem 2 with the diffusionKmeans function, and calculate the distance of this clustering from the true classes. Comment on these results.

Answer:

> require(diffusionMap)
> system.time(zip.diff <- diffuse(dist(zip.small), maxdim=3))
[1] "Performing eigendecomposition"
[1] "Computing Diffusion Coordinates"
[1] "Used default value: dimensions"
   user  system elapsed
      .        .      .55
> scatterplot3d(zip.diff$X, pch=as.character(zip.small[,1]), cex.symbol=.5)

The diffuse function computes the actual diffusion map; it takes as its argument not the data set, but a distance matrix made from the data set. (This lets it work with arbitrary distance functions, including ones for qualitative data.) It will compute a default number of dimensions to use, but here I insist on 3.
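Because diffuse works from a distance matrix rather than from the raw data, swapping in a different metric is a one-line change. For instance, an illustrative sketch only (Manhattan distance is not part of the assignment), reusing zip.small from above:

zip.dist.l1 <- dist(zip.small, method = "manhattan")   # L1 distances instead of Euclidean
zip.diff.l1 <- diffuse(zip.dist.l1, maxdim = 3)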

There's a default plotting method for diffusion-map objects (plot.dmap), but it doesn't allow for much control of the results, so I called scatterplot3d directly. The shape of the data here is similar to that of the LLE, but the tight cluster of digits is even more compact and triangular, and the zeros form a broader flap or tongue going off from the main cluster. Parts of this look sensible, for example the close proximity of the 4s to visually similar digits, and of several of the rounded digits to each other. The central core is, however, a bit messy.

> diff.cluster = diffusionKmeans(zip.diff, 10)$part
> variation.of.info(diff.cluster, zip.small[,1])
[1] .

This is a little bit worse than the LLE, as we suspected from the figure. However, the difference is quite small, and on re-running k-means it sometimes comes out the other way.
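For reference, a minimal sketch of the variation-of-information calculation used throughout; the actual course code from HW 4 and lecture 5 may differ in details, but this version takes two label vectors and returns the distance in nats:

variation.of.info.sketch <- function(labels1, labels2) {
  joint <- table(labels1, labels2) / length(labels1)    # joint distribution of the two labellings
  H <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }  # entropy, in nats
  mutual.info <- H(rowSums(joint)) + H(colSums(joint)) - H(joint)
  H(rowSums(joint)) + H(colSums(joint)) - 2 * mutual.info   # VI = H(X) + H(Y) - 2 I(X;Y)
}

## e.g., variation.of.info.sketch(lle.cluster, zip.small[,1])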