Classification with PAM and Random Forest


Markus Ruschhaupt. Practical Microarray Analysis 2007, Regensburg. May 7, 2007.

Two roads to classification. Given: patient profiles already diagnosed by an expert. Task: infer a general rule to diagnose new patients. Basically, there are two ways to solve this task: 1. model class probabilities (QDA, LDA, ...); 2. model class boundaries directly (optimal separating hyperplanes, Random Forest).

What's the problem? In classification you have to trade off underfitting versus overfitting, i.e. bias versus variance. Curse of dimensionality: in 12,000 dimensions even linear methods are very complex and have high variance. Simplify your models!

Discriminant analysis and gene selection

Comparing Gaussian likelihoods. Assumption: each group of patients is well described by a normal density. Training: estimate mean and covariance matrix for each group. Prediction: assign a new patient to the group with the higher likelihood. Constraints on the covariance structure lead to different forms of discriminant analysis.

Discriminant analysis in a nutshell. Characterize each class by its mean and covariance structure. 1. Quadratic D.A. (QDA): different covariance matrices per class. 2. Linear D.A. (LDA): requires the same covariance matrix for all classes. 3. Diagonal linear D.A. (DLDA): the same diagonal covariance matrix. 4. Nearest centroids: forces the covariance to $\sigma^2 I$.

Discriminant analysis in a nutshell. For LDA, choose the class k that maximizes the linear discriminant function
$$\delta_k(x) = x^t \Sigma^{-1} \mu_k - \tfrac{1}{2}\, \mu_k^t \Sigma^{-1} \mu_k + \log \pi_k,$$
where $x$ is the given data, $\Sigma$ is the covariance matrix (the same for all groups), $\mu_k$ is the group centroid (one for each group), and $\pi_k$ is the prior class probability. For DLDA, choose the k that minimizes
$$\delta_k(x) = \sum_g \frac{(x_g - \mu_{kg})^2}{\sigma_g^2} - 2 \log \pi_k,$$
with $\sigma_g^2$ the variance of gene g.
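
As a concrete illustration of the DLDA rule above, here is a minimal numpy sketch; the function names and the assumption that classes are given by the values of `y` are mine, not part of the original slides.

```python
import numpy as np

def dlda_fit(X, y):
    """Estimate class centroids, pooled per-gene variances and class priors."""
    classes = np.unique(y)
    centroids = np.array([X[y == k].mean(axis=0) for k in classes])
    # pooled within-class variance of each gene (same diagonal covariance for all classes)
    resid = np.concatenate([X[y == k] - centroids[i] for i, k in enumerate(classes)])
    var = (resid ** 2).sum(axis=0) / (len(y) - len(classes))
    priors = np.array([(y == k).mean() for k in classes])
    return classes, centroids, var, priors

def dlda_predict(X, classes, centroids, var, priors):
    """Assign each sample to the class minimizing sum_g (x_g - mu_kg)^2 / sigma_g^2 - 2 log pi_k."""
    scores = np.stack([((X - mu) ** 2 / var).sum(axis=1) - 2.0 * np.log(p)
                       for mu, p in zip(centroids, priors)])
    return classes[np.argmin(scores, axis=0)]
```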

Feature selection. Next simplification: base the classification only on a small number of genes. Feature selection: find the most discriminative genes. 1. Filter: rank genes according to their discriminative power by a t-statistic, Wilcoxon statistic, ...; use only the first k genes for classification. Discrete, hard thresholding. 2. Shrinkage: continuously shrink genes until only a few have influence on the classification. Example: Nearest Shrunken Centroids.
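
A sketch of the filter idea for two classes, ranking genes by the absolute two-sample t-statistic; the function name and the 0/1 class coding are illustrative assumptions, not part of the slides.

```python
import numpy as np
from scipy.stats import ttest_ind

def top_k_genes(X, y, k=50):
    """Indices of the k genes (columns of X) with the largest |t| between classes 0 and 1."""
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.argsort(-np.abs(t))[:k]
```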

Shrunken Centroids. [Figure: per-gene class centroids and overall centroids; for gene A the class centroids $\mu_{A1}$, $\mu_{A2}$ around the overall centroid $\mu_A$, and likewise $\mu_{B1}$, $\mu_{B2}$ around $\mu_B$ for gene B.]

Nearest Shrunken Centroids. The difference between the centroid of gene g in class k and the overall centroid can be written as
$$d_{gk} = \frac{\mu_{gk} - \mu_g}{m_k\,(s_g + s_0)}, \qquad m_k = \sqrt{1/n_k + 1/n},$$
where $s_g$ is the pooled within-class standard deviation of gene g and $s_0$ is an offset to guard against genes with low expression levels. Shrinkage: each $d_{gk}$ is reduced by $\Delta$ in absolute value, until it reaches zero. Genes with $d_{gk} = 0$ for all classes do not contribute to the classification. The shrunken centroids are recovered as
$$\mu'_{gk} = \mu_g + m_k\,(s_g + s_0)\, d'_{gk},$$
where $d'_{gk}$ is the shrunken difference. (Tibshirani et al., 2002)
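
A small numpy sketch of the shrinkage step, following the formulas above; the array names and the (genes x classes) shapes are assumptions made for the example.

```python
import numpy as np

def soft_threshold(d, delta):
    """Reduce each d_gk by delta in absolute value, truncating at zero."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

def shrunken_centroids(mu_gk, mu_g, m_k, s_g, s0, delta):
    """Return mu'_gk = mu_g + m_k * (s_g + s0) * d'_gk for a (genes x classes) centroid matrix."""
    scale = m_k[None, :] * (s_g[:, None] + s0)   # m_k * (s_g + s0), shape (genes, classes)
    d = (mu_gk - mu_g[:, None]) / scale          # standardized centroid differences d_gk
    return mu_g[:, None] + scale * soft_threshold(d, delta)
```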

Shrunken Centroids. [Figure: class centroids before and after shrinkage for genes A, B and C; for gene A the shrunken class centroids coincide with the overall centroid $\mu_A$, so gene A drops out, while genes B and C keep distinct shrunken centroids $\mu'_{B1}, \mu'_{B2}$ and $\mu'_{C1}, \mu'_{C2}$.] A new sample $x^*$ is assigned to the class k that minimizes
$$\delta_k(x^*) = \frac{(x^*_B - \mu'_{Bk})^2}{(s_B + s_0)^2} + \frac{(x^*_C - \mu'_{Ck})^2}{(s_C + s_0)^2} - 2 \log \pi_k.$$

Amount of shrinkage $\Delta$. Small $\Delta$: many genes, poor performance due to overfitting. High $\Delta$: few genes, poor performance due to lack of information (underfitting). The optimal $\Delta$ is somewhere in the middle.
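
One way to pick $\Delta$ in practice is cross-validation over a grid, as sketched below with scikit-learn's NearestCentroid, whose shrink_threshold parameter plays the role of $\Delta$; the grid and the 5-fold setup are arbitrary choices for the example, not recommendations from the slides.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import cross_val_score

def choose_delta(X, y, deltas=np.linspace(0.0, 5.0, 21)):
    """Return the shrinkage amount with the best 5-fold cross-validated accuracy."""
    scores = [cross_val_score(NearestCentroid(shrink_threshold=d if d > 0 else None),
                              X, y, cv=5).mean()
              for d in deltas]
    return deltas[int(np.argmax(scores))]
```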

Shortcomings of filter and shrinkage methods. Filter and shrinkage work only on single genes; they don't find interactions between groups of genes. Filter and shrinkage methods are only heuristics; an exhaustive search for the best subset is infeasible for more than about 30 genes.

Random Forest

Random Forest. Growing one tree. Growing many trees (a forest).

A binary tree: a classification example. [Figure: a decision tree on the data below with splits RT>117, Y>75, Y>83 and RT>139, whose leaves separate Spielberg films from the rest.]

Title               Year (Y)   Running time (RT)   Spielberg
Jaws                   75            124               X
Duel                   71             90               X
The Color Purple       85            154               X
Schindler's List       93            195               X
Jurassic Park          93            127               X
Back to the future     85            111
The Godfather          72            175
Das Boot               81            150
Life of Brian          79             94
Stand by me            86             89
Hook                   91            144               X
Harold and Maude       71             91

A binary tree for classification. The root node contains all samples; each node contains a fraction of the samples. Each rule splits the samples at a node into two groups. Every rule is of the form X > t for a continuous variable X, or X ∈ A for a categorical X. Only one variable is used per rule. Each leaf should be more or less pure (contain only samples of one type). A new sample is run through the tree, and one looks at the leaf it ends up in. [Figure: a tree with a root node, internal nodes split by Rules 1-3, and four leaves.]

A binary tree: a classification example. [Figure: the films plotted in the (running time, year) plane, with the splits RT>117, Y>75, Y>83 and RT>139 drawn as axis-parallel cuts.]

Title               Year (Y)   Running time (RT)   Spielberg
Back to the future     85            111
The Godfather          72            175
Das Boot               81            150
Life of Brian          79             94
Stand by me            86             89
Jaws                   75            124               X
Duel                   71             90               X
The Color Purple       85            154               X
Schindler's List       93            195               X
Jurassic Park          93            127               X

Not every partition can be achieved through a tree. [Figure: two example partitions of the (running time, year) plane into rectangles; one can be produced by recursive binary splits, the other cannot.]

Trees for microarray data. Samples are microarray chips (containing many genes) divided into two or more classes, e.g. ER status +/- in breast cancer. Rules are based on the expression of one gene. Examples: ERB2 is lower than 3.21; GAPDH is higher than 6.23. You cannot base rules on more than one gene or on two or more thresholds. Examples of rules that are not allowed: GAPDH is lower than ERB2; ERB2 is lower than 2.21 or higher than 5.32.

How to construct a tree? Choose a good node and a good gene/threshold pair $(x_i, t_i)$ for each split ("splitting rule"). Which gene should we choose? Which threshold should we use? Decide how many nodes your tree should have ("split-stopping rule", pruning). Assign a class to each leaf ("class assignment rule").

Splitting rule. An impurity function is a function $\Phi: [0,1]^k \to \mathbb{R}$ (k: the number of different classes); a split should produce nodes with low impurity. Properties: $\Phi$ is minimal for $(1,0,0,\dots,0)$ and all its permutations, maximal for $(1/k,\dots,1/k)$, and symmetric: $\Phi(p_1,\dots,p_k) = \Phi(p_{\varphi(1)},\dots,p_{\varphi(k)})$ for every permutation $\varphi$. There are various impurity functions: entropy, Gini index, ...

Gini index:
$$\Phi(p_1,\dots,p_k) = 1 - \sum_i p_i^2 = \sum_i p_i (1 - p_i) = \sum_{i \neq j} p_i p_j.$$
Minimum: 0. Maximum: $1 - 1/k$. The Gini index tries to separate the largest class from the rest. For two classes: $\Phi(p_1, p_2) = 2\, p_1 p_2$.
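
The Gini index in code, as a quick check of the formula above; a minimal sketch, with the function name chosen for the example.

```python
import numpy as np

def gini(p):
    """Gini impurity 1 - sum_i p_i^2 for a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# gini([1, 0]) == 0.0 (pure node); gini([0.5, 0.5]) == 0.5, the maximum 1 - 1/k for k = 2
```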

Using the impurity function. For a node v of the tree, define $i(v) = \Phi(P(1\,|\,v),\dots,P(k\,|\,v))$, where $P(j\,|\,v)$ is the probability of belonging to class j given that a sample is in node v. If the prior probabilities of all classes are equal, this becomes $i(v) = \Phi(n_1(v)/n(v),\dots,n_k(v)/n(v))$, where $n(v)$ is the number of samples in node v and $n_j(v)$ the number of samples of class j in node v. For two classes, the Gini index and equal priors:
$$i(v) = \frac{2\, n_1(v)\, n_2(v)}{n(v)^2}.$$

Goodness of a split: decrease of impurity. Determine the gene x and threshold t that maximize
$$G = i(v) - \big(p_u\, i(u) + p_w\, i(w)\big),$$
where u and w are the two children of node v, and $p_u$ and $p_w$ are the fractions of the samples in v that fall into u and w. Because there are only a finite number of genes and a finite number of observed values, only a finite number of (gene, threshold) pairs has to be tested. For two classes, the Gini index and equal priors:
$$G = \frac{2}{n(v)} \left( \frac{n_1(v)\, n_2(v)}{n(v)} - \frac{n_1(u)\, n_2(u)}{n(u)} - \frac{n_1(w)\, n_2(w)}{n(w)} \right).$$
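
A brute-force sketch of the split search for two classes with the Gini index and equal priors, matching the decrease-of-impurity criterion above; labels are assumed to be coded 0/1 and the function names are mine.

```python
import numpy as np

def node_impurity(y):
    """Gini impurity of a node, i(v) = 2 * p1 * p2, for 0/1 labels."""
    p1 = np.mean(y)
    return 2.0 * p1 * (1.0 - p1)

def best_split(X, y):
    """Search all (gene, threshold) pairs; return the one with maximal impurity decrease."""
    best_gene, best_t, best_gain = None, None, -np.inf
    parent, n = node_impurity(y), len(y)
    for g in range(X.shape[1]):
        for t in np.unique(X[:, g])[:-1]:        # splitting at the largest value is useless
            left, right = y[X[:, g] <= t], y[X[:, g] > t]
            gain = parent - (len(left) / n * node_impurity(left)
                             + len(right) / n * node_impurity(right))
            if gain > best_gain:
                best_gene, best_t, best_gain = g, t, gain
    return best_gene, best_t
```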

Stop-splitting rules: stop decision. Decide at each node whether the node should be split further or not, e.g. is there a high decrease in the Gini index? [Figure: a tree in which splitting is stopped below rule 1 and rule 2.]

Stop-splitting rules: pruning. First split until each node is pure or has a maximum number of elements, then prune the tree: remove the splits with a low decrease in the impurity function. [Figure: first step, a fully grown tree with rules 1-5; second step, the pruned tree in which splits with low impurity decrease (e.g. rule 4) have been removed.]

Advantages and disadvantages of trees. Advantages: at first sight, good interpretability; can cope with any data structure or type; uses conditional information effectively; invariant under monotone transformations of the gene expressions. Disadvantages: not robust; classification performance is not that good; genes can be masked by other genes.

From the tree to the forest. General idea: combine a collection of weak learners to construct a classification algorithm with better properties. Here: construct a collection of trees and combine the results of the individual trees. Because the tree-growing algorithm is deterministic, we have to introduce some kind of randomness. For each tree there are two different sources of randomness: a random training set (bootstrap) and random gene selection.

Bootstrap. Complete data: K samples/patients. Training data: draw K times with replacement from the complete data set. Out of bag (OOB): the samples that do not belong to the training set. The training data contains approximately 2/3 of the elements of the complete data set. [Table: example bootstrap draws from samples 1-10, each shown with its training indices (with repeats) and the resulting out-of-bag samples.]
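
A sketch of one bootstrap draw and its out-of-bag complement; K and the seed are placeholders. With K draws with replacement, each sample is left out with probability $(1 - 1/K)^K \approx e^{-1} \approx 0.37$, which is where the "approximately 2/3 in the training data" comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_split(K):
    """Draw K sample indices with replacement; OOB = the indices that were never drawn."""
    train = rng.integers(0, K, size=K)        # bootstrap training indices (with repeats)
    oob = np.setdiff1d(np.arange(K), train)   # roughly a third of the samples
    return train, oob

train_idx, oob_idx = bootstrap_split(10)
```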

Random Forest: random variable selection. Set the parameters m and s; these are the same for all nodes and all trees. Specify the number of trees you want to grow. For each tree: use the bootstrap to get a training set T of samples, and use only data from this set to build the tree. For each node, first choose m genes $x_1,\dots,x_m$ at random; among these genes, choose the best pair $(x_i, t_i)$. Go on until each node contains elements of only one class or consists of s elements.
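
A hedged scikit-learn sketch of the same kind of procedure; max_features corresponds to m, min_samples_leaf roughly to s, and the concrete numbers are placeholders rather than recommendations from the slides.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,      # number of trees to grow
    max_features=100,      # m: genes drawn at random at each node
    min_samples_leaf=1,    # grow until leaves are (nearly) pure, i.e. s = 1
    bootstrap=True,        # each tree gets its own bootstrap training set
    oob_score=True,        # keep the out-of-bag error estimate
    random_state=0,
)
# forest.fit(X, y) with a (samples x genes) matrix X and class labels y
```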

Random Forest classification rule: first classify a new sample with each tree; in the end, form a majority vote over the trees. [Figure: the complete data set and the bootstrap training sets from which the individual trees are grown.]
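
The majority vote itself is simple; below is a minimal sketch that assumes a hypothetical (trees x samples) array of per-tree class predictions coded as small integers.

```python
import numpy as np

def majority_vote(tree_predictions):
    """Most frequent class per column of a (trees x samples) array of integer labels."""
    return np.array([np.bincount(col).argmax() for col in tree_predictions.T])

# e.g. majority_vote(np.array([[0, 1], [0, 1], [1, 1]])) -> array([0, 1])
```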

Estimating the test error. For each sample x, predict the class using only the trees for which x belongs to the OOB set. This is a good estimate of the test error, because the information provided by x was not used for building these trees. [Figure: for each bootstrap training set, the out-of-bag samples form the test set of that tree.]
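
A sketch of the OOB estimate under the assumption that we kept, for every tree, its bootstrap index array and a predict method; the object names are illustrative, not a specific library API. (With scikit-learn, oob_score=True in the earlier sketch gives this kind of estimate directly.)

```python
import numpy as np

def oob_error(trees, boot_indices, X, y):
    """Fraction of samples misclassified by the majority vote of the trees that did not see them."""
    wrong, counted = 0, 0
    for i in range(len(y)):
        # use only the trees for whose bootstrap sample i is out of bag
        votes = [int(t.predict(X[i:i + 1])[0])
                 for t, b in zip(trees, boot_indices) if i not in b]
        if votes:
            wrong += int(np.bincount(votes).argmax() != y[i])
            counted += 1
    return wrong / counted
```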

Tradeoff: correlation versus prediction strength. The forest error rate depends on two things: the correlation between any two trees in the forest (increasing the correlation increases the forest error rate) and the strength of each individual tree (a tree with a low error rate is a strong classifier; increasing the strength of the individual trees decreases the forest error rate). A small m gives low correlation but weak individual classifiers; a large m gives strong individual classifiers but high correlation. Use the OOB error estimate to find a good parameter m.
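
A sketch of tuning m via the OOB error with scikit-learn; the candidate grid, tree count and variable names are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def tune_m(X, y, candidates=(10, 50, 100, 500, 1000)):
    """Return the candidate m (max_features) with the smallest OOB error."""
    oob_errors = []
    for m in candidates:
        rf = RandomForestClassifier(n_estimators=500,
                                    max_features=min(m, X.shape[1]),
                                    oob_score=True, random_state=0).fit(X, y)
        oob_errors.append(1.0 - rf.oob_score_)   # OOB error = 1 - OOB accuracy
    return candidates[int(np.argmin(oob_errors))]
```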

Literature
S. Dudoit, J. Fridlyand: Classification in microarray experiments.
L. Breiman: Random Forests - Random Features.
L. Breiman: Statistical Modeling: The Two Cultures.
J. Oh, M. Laubach, A. Luczak: Estimating Neuronal Variable Importance with Random Forest.
T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning. Springer, 2001.
R. Tibshirani, T. Hastie, B. Narasimhan, G. Chu: Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS 99(10), 6567-6572, 2002.

Acknowledgement: Florian Markowetz for all the slides (PAM). Thank you.

Variable importance. Idea: change the values of a gene x and check whether the OOB error changes dramatically. For each gene x do the following: for each tree t of the forest, permute the values of x for the samples that belong to the out-of-bag set of t; estimate the OOB error as before (the samples now have permuted values for gene x); compare the OOB error obtained with the original values to the OOB error obtained with the permuted values. This gives an importance measure for gene x.
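
Scikit-learn's permutation_importance implements the same permutation idea, though it permutes gene values on a user-supplied data set rather than per-tree OOB sets; the forest, X and y arguments are placeholders in the spirit of the earlier sketches.

```python
from sklearn.inspection import permutation_importance

def gene_importance(forest, X, y, n_repeats=10):
    """Rank genes by the mean drop in accuracy when their values are permuted."""
    result = permutation_importance(forest, X, y, n_repeats=n_repeats, random_state=0)
    return result.importances_mean.argsort()[::-1]   # most important genes first
```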