Data Mining The first one will centre the data and ensure unit variance (i.e. sphere the data)..

Similar documents
Exploratory Projection Pursuit

The first one will centre the data and ensure unit variance (i.e. sphere the data):

SECTION 1-B. 1.2 Data Visualization with R Reading Data. Flea Beetles. Data Mining 2018

Statistics 251: Statistical Methods

Comparative Study of ROI Extraction of Palmprint

Tutorial 1 Engraved Brass Plate R

Page 1. Graphical and Numerical Statistics

INTRODUCTION TO MEDICAL IMAGING- 3D LOCALIZATION LAB MANUAL 1. Modifications for P551 Fall 2013 Medical Physics Laboratory

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Statistical Programming with R

Calypso Construction Features. Construction Features 1

Univariate Data - 2. Numeric Summaries

Solid Bodies and Disjointed bodies

Bar Graphs with One Grouping Variable 1

Lecture 9: Hough Transform and Thresholding base Segmentation

The Three Dimensional Coordinate System

Chapter 4. Clustering Core Atoms by Location

The Curse of Dimensionality

Functions of Several Variables

MATH11400 Statistics Homepage

Chapter 5 An Introduction to Basic Plotting Tools

Data Mining: Unsupervised Learning. Business Analytics Practice Winter Term 2015/16 Stefan Feuerriegel

Solid Bodies and Disjointed Bodies

INTRODUCTION TO R. Basic Graphics

Introduction to Minitab 1

form are graphed in Cartesian coordinates, and are graphed in Cartesian coordinates.

Math 7 Elementary Linear Algebra PLOTS and ROTATIONS

SAS Visual Analytics 8.2: Working with Report Content

Publication Number spse01695

Lesson: Static Stress Analysis of a Connecting Rod Assembly

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked

Stat 290: Lab 2. Introduction to R/S-Plus

Brief Guide on Using SPSS 10.0

Lecture 3 Sections 2.2, 4.4. Mon, Aug 31, 2009

PH36010 MathCAD worksheet Advanced Graphics and Animation

BRYCE 5 Mini Tutorial

Interactive Math Glossary Terms and Definitions

To graph the point (r, θ), simply go out r units along the initial ray, then rotate through the angle θ. The point (1, 5π 6

DentMILL. Basic Quick Guide. Copyright 2011.Roland DG Corporation R

Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018

Workshop 1: Basic Skills

Spiky Sphere. Finding the Sphere tool. Your first sphere

Answers to practice questions for Midterm 1

PRINT A 3D MODEL PLANE

Dimension reduction : PCA and Clustering

Variography Setting up the Parameters When a data file has been loaded, and the Variography tab selected, the following screen will be displayed.

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /6/ /13

Lecture 8 Object Descriptors

Publication Number spse01695

3D Design with 123D Design

To graph the point (r, θ), simply go out r units along the initial ray, then rotate through the angle θ. The point (1, 5π 6. ) is graphed below:

Static Stress Analysis

CATIA V5 Parametric Surface Modeling

Feature-Based Modeling and Optional Advanced Modeling. ENGR 1182 SolidWorks 05

Getting Started. What is SAS/SPECTRAVIEW Software? CHAPTER 1

Multivariate Analysis (slides 9)

Intro to R Graphics Center for Social Science Computation and Research, 2010 Stephanie Lee, Dept of Sociology, University of Washington

PC-MATLAB PRIMER. This is intended as a guided tour through PCMATLAB. Type as you go and watch what happens.

Blender Lesson Ceramic Bowl

Introduction to Geospatial Analysis

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]

COMPUTATIONAL STATISTICS UNSUPERVISED LEARNING

Exploratory Data Analysis EDA

ARCHITECTURE & GAMES. A is for Architect Simple Mass Modeling FORM & SPACE. Industry Careers Framework. Applied. Getting Started.

9.5 Polar Coordinates. Copyright Cengage Learning. All rights reserved.

SketchUp. SketchUp. Google SketchUp. Using SketchUp. The Tool Set

Insight: Measurement Tool. User Guide

Learning Photoshop Elements Step-by-step Lesson Two

FARO Scanning Plugin

Writing grid Code. Paul Murrell. December 14, 2009

Advanced Programming Features

Graphing Interface Overview

Bar Charts and Frequency Distributions

Gallios TM Quick Reference

CoE4TN4 Image Processing

OPERATION MANUAL. MV-410HS Layout Editor. Version higher. Command

GCC vinyl cutter, cutting plotter for sign making

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

Matrix algebra. Basics

DA-CAD User Manual Dürkopp Adler AG

AMTH142 Lecture 10. Scilab Graphs Floating Point Arithmetic

Working with PhotoVCarve files

Information Technology and Media Services. Office Excel. Charts

HOW TO SETUP AND RUN SHIVA...

Network Traffic Measurements and Analysis

Lab Practical - Limit Equilibrium Analysis of Engineered Slopes

MET71 COMPUTER AIDED DESIGN

INSTRUCTIONS FOR THE USE OF THE SUPER RULE TM

Rotation and Scaling Image Using PCA

SolidWorks Intro Part 1b

Section 33: Advanced Charts

Univariate Data - 2. Numeric Summaries

GRAD6/8104; INES 8090 Spatial Statistic Spring 2017

User Interface Guide

Introduction to AutoCAD 2012

Directional Derivatives. Directional Derivatives. Directional Derivatives. Directional Derivatives. Directional Derivatives. Directional Derivatives

DSCI 325: Handout 18 Introduction to Graphics in R

Introduction to MATLAB Practical 1

ENVI Classic Tutorial: Introduction to ENVI Classic 2

Downloaded from

Transcription:

Data Mining 2017 180 SECTION 5 B-PP Exploratory Projection Pursuit We are now going to look at an exploratory tool called projection pursuit (Jerome Friedman, PROJECTION PURSUIT METHODS FOR DATA ANALYSIS, June. 1980, SLAC PUB-2768) Read in required files: drive - D: code.dir - paste(drive, DATA/Data Mining R-Code, sep / ) data.dir - paste(drive, DATA/Data Mining Data, sep / ) source(paste(code.dir, 3DRotations.r, sep / )) source(paste(code.dir, EllipseCloud.r, sep / )) source(paste(code.dir, BorderHist.r, sep / )) source(paste(code.dir, CreateGaussians.r, sep / )) source(paste(code.dir, ReadFleas.r, sep / )) and set up a couple of required functions. The first one will centre the data and ensure unit variance (i.e. sphere the data).. Centre and sphere data Sphere.Data - function(data) { data - as.matrix(data) data - t(t(data) - apply(data, 2, mean)) data.svd - svd(var(data)) sphere.mat - t(data.svd$v %*% (t(data.svd$u) * (1/sqrt(data.svd$d)))) return(data %*% sphere.mat) (Recall that we had an earlier function that does a similar thing.) Standardize data f.data.std - function(data) { data - as.matrix(data) bar - apply(data, 2, mean) s - apply(data, 2, sd) t((t(data) - bar)/s) Projection Pursuit Mills 2017 18

Data Mining 2017 It might be worthwhile to look at how the two differ by considering an example. The following reads in an elliptic cloud: data.prp - read.table(filepaste(data.dir, PPellipse.tab, sep / )) and plots it plot(data.prp, xlimc(-30,30), ylimc(-30,30)) X2-30 -20-10 0 10 20 30 X2-30 -20-10 0 10 20 30 189 216 220 100-30 -20-10 0 10 20 30 X1-30 -20-10 0 10 20 30 X1 Figure 1. Elliptic cloud Figure 2. Elliptic cloud with ID and It is possible to determine which of the data points are extreme by identify(data.ppr) Go to the plot. Your mouse will now display a sign. Click on one of the extreme points and the observation corresponding to it will show on the graph. Once you are finished selecting the extreme points, to exit the identify function, right click in the Graphics window and select Stop from the menu that appears. Now colour these newly found points (Figure 2): points(data.prp[220,],pch8,col red ) points(data.prp[189,],pch7,col red ) points(data.prp[216,],pch8,col blue ) points(data.prp[100,],pch7,col blue ) We can now standardize (as required): data.prp.std - f.data.std(data.prp) apply(data.prp.std, 2, mean) X1 X2 5.392388e-17 2.597714e-16 apply(data.prp.std, 2, sd) X1 X2 1 1 Mills 2017 181 Projec

Data Mining 2017 182 and sphere (as required): data.prp.sph - Sphere.Data(data.prp) colnames(data.prp.sph) - list( X1, X2 ) apply(data.prp.sph, 2, mean) X1 X2 1.125332e-16-2.181861e-16 apply(data.prp.sph, 2, sd) X1 X2 1 1 Now plot them and show what happens to the extreme values: plot(data.prp.std, asp1, main Standardized,xlimc(-4,4), ylimc(-4,4)) points(matrix(data.prp.std[220,],ncol2),pch8,col red ) points(matrix(data.prp.std[189,],ncol2),pch7,col red ) points(matrix(data.prp.std[216,],ncol2),pch8,col blue ) points(matrix(data.prp.std[100,],ncol2),pch7,col blue ) x11() plot(data.prp.sph, asp1, main Sphered,xlimc(-4,4), ylimc(-4,4)) points(matrix(data.prp.sph[220,],ncol2),pch8,col red ) points(matrix(data.prp.sph[189,],ncol2),pch7,col red ) points(matrix(data.prp.sph[216,],ncol2),pch8,col blue ) points(matrix(data.prp.sph[100,],ncol2),pch7,col blue ) Standardized Sphered X2-4 -2 0 2 4 X2-4 -2 0 2 4-4 -2 0 2 4 X1-4 -2 0 2 4 X1 Figure 3. Figure 4. We can see that the standardized cloud has retained its elliptic shape and orientation while the sphered cloud is more circular and that lines joining the blue points and joining the red would appear to cross at right angles. The re-orientation comes from the use of the eigenvectors which, as we saw in the PCA, identifies the directions of decreasing variance and re-orients them to be orthogonal. Projection Pursuit Mills 2017 18

Data Mining 2017 We are going to look at a second new function gives an index that allows us to determine if the data has most of its information on either side of the centre (see later): Holes Index Holes.Index - function(data) { 1 - mean(exp(-data^2)) We will create two elliptic clouds by use of the functions in 3DRotations.r and CreateGaussians.r: 2D rotation R.2D - function(data, Ang) { T - diag(rep(1, 2)) 2 x 2 identity matrix c - cos(ang) s - sin(ang) T[1, 1] - c T[2, 2] - c cos sin T[2, 1] - -s -sin cos T[1, 2] - s data %*% T Create a Gaussian cloud mean (x,y) variance (varx,vary) which is rotated by the angle rot Make.Gaussian - function(n, x, y, varx1, vary1, rot) { X.1 - rnorm(n, x, varx) Y.1 - rnorm(n, y, vary) R.2D(cbind(X.1, Y.1), rot) The first cloud has 500 points - mean 0, 0, variance5, 3 - and is rotated by/4, then centred at 5,15 : sz.1-500 data.1 - Make.Gaussian(sz.1, 0, 0, varx5, vary3, pi/4) data.1 - t(t(data.1) c(5,15)) The second cloud has 500 points - mean 0, 0, variance5, 3 - and is rotated by/4, then centred at 15,5 : sz.2-500 data.2 - Make.Gaussian(sz.2, 0, 0, varx5, vary3, pi/4) data.2 - t(t(data.2) c(15, 5)) Mills 2017 183 Projec

Data Mining 2017 184 Combine the two clouds (rbind makes a 1000 point data set), and plot: G.1.data - rbind(data.1, data.2) plot(g.1.data, asp1, main Two elliptic clouds ) Figure 5. Viewing then in the 2-d plane is like looking at the clouds from above. Suppose that we can only see them as projections on the x and y-axes. Our way of viewing such data is as a histogram. The following function will display the data and histograms together. border.hist - function(x, y) { Create a graph with a histogram of the data on the border. oldpar - par(no.readonly TRUE) Read the current state xhist - hist(x, plotfalse, breaks 20) Create histograms yhist - hist(y, plotfalse, breaks 20) top - max(c(xhist$counts, yhist$counts)) Get max count xrange - range(x) min,max yrange - range(y) Create a plot canvas that has 2 rows and 2 columns on which the c(0,2,3,1) gives the order of plotting 0 no plot, and c(1,3), c(1,3) give the row and col size ratios. nf - layout(matrix(c(0,2,3,1),2,2,byrowtrue), c(1,3), c(1,3), TRUE) layout.show(nf) Set the margin sizes (bottom, left, top, right) par(marc(3,3,1,1)) plot(x, y, xlimxrange, ylimyrange, xlab, ylab ) par(marc(0,3,1,1)) barplot(xhist$counts, axesfalse, ylimc(0, top), space0, colheat.colors(length(xhist$breaks))) par(marc(3,0,1,1)) barplot(yhist$counts, axesfalse, xlimc(top, 0), space0, horiztrue, colheat.colors(length(yhist$breaks))) Projection Pursuit Mills 2017 18

Data Mining 2017 par(oldpar) reset to default border.hist(g.1.data[,1], G.1.data[,2]) Figure 6. Border Histogram for two unsphered clouds Figure 7. Border Histogram for two sphered clouds Figure 6 shows the border histograms for unsphered clouds; Figure 7 shows the same thing for the sphered data. Note that the number of bins is 15 rather than the 20 requested for the number of breaks in the call to hist. That number can be over-ridden by the function if it is not optimal. G.1.data.sph - Sphere.Data(G.1.data) cor(g.1.data.sph) [,1] [,2] [1,] 1.000000e00-2.942152e-16 [2,] -2.942152e-16 1.000000e00 cov(g.1.data.sph) Same result border.hist(g.1.data.sph[,1], G.1.data.sph[,2]) Mills 2017 185 Projec

Data Mining 2017 186 We now wish to rotate our viewpoint (i.e. project the data onto different axes). For this purpose we could put the centre of the data at 0, 0 by subtracting the mean. This would not have any effect on the nature of the distribution of data. We will also sphere the data in order to be able to quantify the appearance of the histograms. The process of sphering is basically a PCA with scaling. NOTE: This will change the spatial nature of the data. The problem is that it may distort the characteristics that we wish to discover (as shown below). Two small clouds Cloud 3 sz.3-10 data.3 - Make.Gaussian(sz.3, -8, 0, varx3, vary1, 0) Cloud 4 sz.4-10 data.4 - Make.Gaussian(sz.4, 8, 0, varx3, vary1, 0) G.2.data - rbind(data.3, data.4) plot(g.2.data, xlimc(-15,15), asp1, main Two small elliptic clouds ) plot(sphere.data(g.2.data), xlimc(-15,15), asp1, main Two small elliptic clouds after sphering ) Figure 8. Two small clusters Figure 9. Two small clusters after sphering We see that the clusters which seemed to be evident in Figure 3 have been pushed together in Figure 4. While this makes it difficult for us to visualize structure, it does not affect the Gaussian/non-Gaussian nature of the data so an index which looks for non-normality is not affected. Projection Pursuit Mills 2017 18

Data Mining 2017 With this in mind consider the following: G.1.data.sph - Sphere.Data(G.1.data) print(cor(g.1.data.sph)) print(cov(g.1.data.sph)) [,1] [,2] [1,] 1.000000e00-5.524345e-17 [2,] -5.524345e-17 1.000000e00 [,1] [,2] [1,] 1.000000e00-5.524345e-17 [2,] -5.524345e-17 1.000000e00 border.hist(g.1.data.sph[,1], G.1.data.sph[,2]) Figure 10. Border Histogram for two clouds (sphered) We notice that the data has what appears to be a normal distribution when projected on both the x and the y directions. We can see no structure in this data from this view. Now suppose we look at the projections on a sequence of lines set at angles to the x-axis: oldpar - par(mfrow c(3,3)) par(marc(2,2,2,2)0.1) for (i in 0:8) { Ax - cbind(c(cos(i*pi/6), sin(i*pi/6))) Set up a rotation G.1.proj - G.1.data.sph%*%Ax Rotate the data h.i - Holes.Index(G.1.proj) hist(g.1.proj, 50, mainpaste(i, *pi/6, - Holes Index,floor(h.i*1000)/1000,sep )) Mills 2017 187 Projec

Data Mining 2017 188 Figure 11. Histograms on different projections Now we see that we can determine structure (a distinct gap) in the projections. Can we systematically search for such structure? (It might be noted that there are different types of structure that we can look for but they all have a common characteristic - they are all non-gaussian.) In a situation such as this, we might use a Holes index (i.e. look for a space in the middle). The Holes index simply takes e x2 for every projected point, finds the average and subtracts from 1. Because the data has been centred, a Gaussian will have a small index, while a bimodal structure will have a larger one. If all the data is at the edges, the index will be even smaller. In another situation we might want to discover if the data is pushed to the middle or has a thick tail. The pursuit part of the name comes from the fact that a maximization routine can be used so that, from any random projection, a local maximum for an appropriate index can be found. As usual, there is no assurance that the local maximum is the best one so several restarts are needed. The process here becomes similar to the Grand Tour in Ggobi. In the Grand Tour the direction vectors for the projection planes are selected randomly, while in projection pursuit an optimization routine is used to select the next direction vector for the projection. Projection Pursuit Mills 2017 18

Data Mining 2017 Here we will look at flea beetle data in Ggobi; :project the data onto different axes: library(rggobi) g - ggobi(d.flea) Because Ggobi uses indices that require sphering for the projection pursuit, we select [Sphering] from the [Tools] menu. Figure 12. Ggobi sphering This brings up a dialog box to enable us to sphere our data. In this, we select all the variables, then click [Update scree plot] (Figure 14) In Figure 15, we have clicked [Apply sphering, add PCs to data] to do the sphering. Mills 2017 189 Projec

Data Mining 2017 190 Figure 13. Setup for sphering Figure 14. PCA Figure 15. Data sphered Projection Pursuit Mills 2017 19

Data Mining 2017 We close the dialog box and see that the console has 6 new variables. We select a [2D Tour] from the [View] menu (Figure 16) with the original variables. Figure 16. Select 2D Tour Figure 17. Original variables selected We now select the PC variables and deselect the original ones. Figure 18. PC variables selected Figure 19. Startup of tour on PC Mills 2017 191 Projec

Data Mining 2017 192 To select/deselect variables to be included, we click on the X beside the name. In Figure 18, the new variables are selected. Figure 19 shows the data as it might appear on start-up and indicates that variable 7-12 (sphered) are used. In order to show the numbers rather than the name, on the scatterplot window, deselect [Show Axes Labels]from the [Tour2D] menu. To start the pursuit, in the GGobi window, select [Projection pursuit]. This opens another window (Figure 20) which will display the selected index (numerically and graphically), allow us to change the index, and allow a choice of the tour proceeding as a random tour or of optimizing the index. Figure 21 shows what happens if we select optimize; here, Ggobi tries to optimize the Hole s index. Figure 20. Pursuit index window Figure 21. Pursuit index at optimum Figure 22. Clouds at optimum Holes Index Figure 22 shows the three clusters once the index indicates an optimum value. Projection Pursuit Mills 2017 19