
Biology 6317 Project 1

Data and illustrations courtesy of Professor Tony Frankino, Department of Biology/Biochemistry

1. Background

The data set www.math.uh.edu/~charles/wing_xy.dat has measurements related to the wing shape of three fruit fly species, Drosophila mauritiana, D. sechellia, and D. simulans. The data set consists of 30 wing measurements made on each of 138 flies: 42 of D. mauritiana, 48 of D. sechellia, and 48 of D. simulans. The measurements are the two-dimensional coordinates of 15 landmarks defined by intersections of veins with each other or with the wing margin. See the figures below.

The purpose of this exercise is to determine how well these measurements can distinguish among the three species. You are going to construct a classification tree that assigns one of the species categories to each set of values of the numeric measurements. The resulting classifications will be compared to the actual species labels to estimate a classification error rate. A small error rate will show that the three species are easily distinguishable from the wing shape measurements.

A binary classification tree begins with all the cases (flies) gathered into one group (the root node of the tree). That group is then split into two subgroups (daughter nodes) based on a threshold value of one of the measurement variables. The variable and its threshold value are chosen to maximize the decrease in total deviance. The deviance at a node of the tree that has $N$ flies in species proportions $p_1, p_2, p_3$ is

$$D = -N \sum_{i=1}^{3} p_i \log p_i .$$
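This formula is easy to compute directly. Below is a minimal R sketch (not part of the assignment); node_deviance is a hypothetical helper name, and empty classes are handled with the convention 0 log 0 = 0.

> node_deviance = function(counts) {
+   p = counts / sum(counts)          # species proportions at the node
+   p = p[p > 0]                      # drop empty classes: 0 * log(0) = 0
+   -sum(counts) * sum(p * log(p))    # D = -N * sum(p_i * log(p_i))
+ }
> node_deviance(c(42, 48, 48))        # root node: about 151.3, near 138*log(3)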

It can be shown that in splitting a node, the sum of the daughter deviances is less than the parent deviance. Daughter nodes are split in turn until they reach a minimum size or until the reduction in deviance falls below a specified level. At the end of the construction, each terminal node (one that is not split) is labeled with the species that is most numerous at that node. Classification and regression trees were first developed by Breiman, Friedman, Olshen, and Stone [1]. A good textbook treatment is given by Venables and Ripley [2].

The figure below shows a classification tree made for another application. The criterion for each split is printed at the parent node. If the criterion is satisfied, the left-hand branch of the tree is followed; otherwise, the right-hand branch is followed.

[Figure: an example classification tree; internal nodes carry split criteria such as reg.irr=reg and meas >= 0.04783, and terminal nodes carry the class labels serr, micro, and sm]
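As a concrete illustration of this fact, here is a hedged toy example (made-up counts, reusing the node_deviance helper from the sketch above): an evenly mixed parent node and one candidate split into purer daughters.

> parent = c(10, 10, 10)                        # 30 flies, evenly mixed
> left = c(10, 2, 0)                            # counts in the left daughter
> right = c(0, 8, 10)                           # counts in the right daughter
> node_deviance(parent)                         # about 32.96 (= 30*log(3))
> node_deviance(left) + node_deviance(right)    # about 17.77, a clear decrease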

2. Constructing the Tree with R

Begin by importing the data into R with the read.table function or with RStudio. Name your data frame anything you like; I used flies as the name. Since the data file does not have a header row, R will assign default names (V1 through V30) to the variables. The 30 variables are the wing measurements, and you can use the names R assigns to them.

The next thing to do is to add a column to the data set containing the species of each of the 138 flies. The species labels are made a factor so that tree() treats this as a classification problem.

> species = factor(c(rep("maur", 42), rep("sech", 48), rep("simul", 48)))
> flies = cbind(flies, species)
> summary(flies)

To construct the tree, you will need the tree library of R. Load it by

> library(tree)

The function that creates the tree is also called tree. Read its help file.

> help(tree)

Click on the index link at the bottom of the help page and read about some of the other functions in the library, particularly plot.tree, text.tree, and predict.tree.

You are going to divide the data into a training set and a test set. Your classification tree will be built with the training data and tested for accuracy on the test data. Randomly pick about 2/3 of the cases in flies for the training data.

> train = sample(138, 92, replace = FALSE)   #1

The training data will be in the data frame flies[train, ] and the test data in flies[-train, ]. The tree is built with the command

> flies.tree = tree(species ~ ., data = flies[train, ])   #2
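Two optional asides, offered as a sketch rather than as part of the assignment: calling set.seed() before step #1 makes your random split reproducible, and summary() gives a quick overview of the fitted tree.

> set.seed(1)           # run before step #1 to make the split reproducible
> summary(flies.tree)   # variables used, terminal nodes, training error rate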

The formula species ~ . tells R that the classes to be separated are the levels of the variable species and that all the other variables are inputs to the classification procedure. Print and plot the results with

> flies.tree                           #3
> plot(flies.tree, type = "uniform")   #4
> text(flies.tree)                     #5

Finally, classify the test data and count the number of classification errors.

> flies.pred = predict(flies.tree, flies[-train, ], type = "class")   #6
> sum(flies.pred != flies[-train, "species"])                         #7

This number divided by 46 (the size of the test set) is your estimate of the classification error rate. Repeat steps #1 through #7 several times and observe the results. In particular, note how many terminal nodes the trees have and which variables are involved in the splits. Turn in the results of #3, #5, and #7 for one of these simulations and comment on your observations.
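Beyond the raw error count in #7, a cross-tabulation shows which species get confused with which. A minimal sketch using base R's table():

> table(predicted = flies.pred, actual = flies[-train, "species"])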

3. Dimension Reduction with Principal Components

The numeric variables in this data set are highly correlated. Furthermore, the number of variables is a substantial fraction of the number of cases. This suggests that using fewer variables for classification might be feasible and might result in a more robust classification procedure. We can do this by expressing the 30-dimensional data vector for each fly as a linear combination of the 30 orthogonal unit eigenvectors of the variance-covariance matrix. These linear combinations are uncorrelated, and each eigenvalue is the variance of the data in the direction of the associated eigenvector. Thus, we may be able to take only the first few components of the data relative to the principal eigenvectors and still capture most of the variation. Create a principal components object as follows.

> newflies = prcomp(flies[, 1:30])

This finds the eigenvalues and eigenvectors of the variance-covariance matrix and the components of each fly relative to the basis of eigenvectors. The components are stored in a 138 x 30 matrix newflies$x. Have a look at the standard deviations (the square roots of the eigenvalues) in the eigendirections with

> newflies$sdev

and decide how many components you want to use for a classification tree. Suppose for the sake of argument that you want 6. Make a data frame with a column for the species by

> newflies = data.frame(newflies$x[, 1:6], species)

Now construct the tree with

> newflies.tree = tree(species ~ ., data = newflies[train, ])

Repeat the steps you went through with the raw data above and compare the results. Turn in your work.

References

[1] Breiman L, Friedman JH, Olshen RA, Stone CJ (1984). Classification and Regression Trees. Wadsworth.

[2] Venables WN, Ripley BD (2002). Modern Applied Statistics with S. 4th ed. Springer.