Decision Trees, Random Forests and Random Ferns. Peter Kovesi

1 Decision Trees, Random Forests and Random Ferns Peter Kovesi

2 What do I want to do? Take an image. Identify the distinct regions of stuff in the image. Mark the boundaries of these regions. Recognize and label the stuff in each region.

5 Recognizing Textures

6 Manual Classification

7 Textons: fundamental micro-structures in natural images. Apply a bank of filters to a set of sample images of a texture. Perform clustering on the filter outputs to find groupings of filter responses that tend to co-occur for that texture. These clusters form textons that are stored in a dictionary for future use. [Figure: filter bank.]
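
A minimal sketch of this clustering step, assuming a precomputed list of 2-D filter kernels and a set of grayscale training images, and using scikit-learn's KMeans; the function and variable names here are illustrative, not from the slides.

```python
import numpy as np
from scipy.ndimage import convolve
from sklearn.cluster import KMeans

def filter_responses(image, filters):
    """Apply every filter and turn each pixel into an n_filters feature vector."""
    responses = [convolve(image.astype(float), f, mode='reflect') for f in filters]
    return np.stack(responses, axis=-1).reshape(-1, len(filters))

def build_texton_dictionary(images, filters, k=32):
    """Cluster the pooled filter responses; the cluster centres are the textons."""
    pooled = np.vstack([filter_responses(img, filters) for img in images])
    return KMeans(n_clusters=k, n_init=10).fit(pooled).cluster_centers_
```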

8 Step 1: Build the texton dictionary. Varma and Zisserman 2005

9 Texton dictionary built from coral images

10 Step 2: Build models of the textures. A set of training images for each texture is filtered and the dictionary textons closest to the filter outputs are found. The histogram of textons found in the image forms the model corresponding to the training image.

11 Step 3: Texture recognition. The image of the unknown texture is filtered and the dictionary textons closest to the filter outputs are found. The histogram of textons found in the image is then compared against the histograms of the training texture images to find the closest match.
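
A sketch of Steps 2 and 3, reusing `filter_responses` and the texton dictionary from the snippet above; the chi-squared histogram distance is my assumption (one common choice), not something the slides specify.

```python
import numpy as np
from scipy.spatial.distance import cdist

def texton_histogram(image, filters, textons):
    """Label each pixel with its nearest texton and histogram the labels."""
    resp = filter_responses(image, filters)
    labels = cdist(resp, textons).argmin(axis=1)
    hist = np.bincount(labels, minlength=len(textons)).astype(float)
    return hist / hist.sum()

def classify_texture(image, filters, textons, models):
    """`models` maps texture name -> training histogram; return the closest match."""
    h = texton_histogram(image, filters, textons)
    chi2 = lambda a, b: 0.5 * np.sum((a - b) ** 2 / (a + b + 1e-12))
    return min(models, key=lambda name: chi2(h, models[name]))
```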

12 Problems: While the papers report good results, I am having trouble replicating them. Cluster centres seem to change dramatically on different training sets. How many clusters should I use? Clustering takes a long time. I don't know which filters produce the most useful data for separating different textures. I have lots of 8-megapixel images, each pixel with 36 features.

13 Machine Learning Algorithms
K-means clustering: An unsupervised algorithm that learns which things go together. The user has to specify K.
Bayes Classifier: Assumes features are Gaussian distributed and independent of each other. For each class, find the mean and variance of its attributes. Then, given some attributes, compute the probability that it is a member of each class and take the most probable one. Works surprisingly well and can handle large data sets.
Decision Trees: Find the data features and thresholds that best split the data into separate classes. This is repeated recursively until the data has been split into homogeneous (or mostly homogeneous) groups. Can immediately identify the features that are most important.
Boosting: A collection of weak classifiers (typically single-level decision trees). During training each classifier learns a weight for its vote from its accuracy on the data. The classifiers are trained one by one; data that is poorly represented by earlier classifiers is given a higher weighting, so that subsequent classifiers pay more attention to points where the errors are large.
Random Forests: An ensemble of decision trees. During learning, tree nodes are split using a random subset of data features. All trees vote to produce a final answer. Can be one of the most accurate techniques.

14 Machine Learning Algorithms
Expectation Maximization (EM) / Maximum Likelihood Estimation (MLE): Typically we assume the data is a mixture of Gaussians; in this case EM fits N multidimensional Gaussians to the data. The user has to specify N.
Neural Networks / Multilayer Perceptron: Slow to train but fast to run. The design is a bit of an art, but they can be the best performer on some problems.
Support Vector Machines: The algorithm finds hyperplanes that maximally separate the classes. Projecting the data into higher dimensions makes it more likely to be linearly separable. Works well when there is limited data.

15 Machine Learning Problems
Model Bias: The model assumptions are too strong, so it cannot fit the data well. Errors on both the training data and the test data will be large.
Model Variance: The model fits the training data too well and has included the noise, so it cannot generalize. Errors on the training data will be small, but errors on the test data will be large.

16 Decision tree for predicting Californian house prices from latitude and longitude. [Figure: a tree of repeated threshold splits on latitude and longitude.]

17 Recursive partitioning of the data

18

19 Deciding how to split nodes: a nice split. [Figure: histogram of classes at the node, divided by a condition into true and false branches.]

20 Deciding how to split nodes: a not so useful split. [Figure: histogram of classes at the node, divided by a condition into true and false branches.]

21 Deciding how to split nodes. Which attribute of the data at a node provides the highest information gain?
Entropy: H(X) = -Σ_i p_i log p_i. [Figure: low-entropy vs high-entropy distributions.]
Specific Conditional Entropy: H(X|Y=v) = the entropy of X among only those records in which Y = v.
Conditional Entropy: H(X|Y) = the average specific conditional entropy of X = Σ_j P(Y=v_j) H(X|Y=v_j).
Information Gain: IG(X|Y) = H(X) - H(X|Y).
H(X) indicates the randomness of X; H(X|Y) indicates the randomness of X assuming I know Y. The difference, H(X) - H(X|Y), indicates the reduction in randomness achieved by knowing Y.
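
These definitions translate directly into code; a small numerical sketch (base-2 logarithms and illustrative names assumed):

```python
import numpy as np

def entropy(labels):
    """H(X) = -sum_i p_i log2 p_i over the class frequencies in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(labels, attribute):
    """H(X|Y) = sum_v P(Y=v) H(X|Y=v)."""
    values, counts = np.unique(attribute, return_counts=True)
    weights = counts / counts.sum()
    return sum(w * entropy(labels[attribute == v]) for v, w in zip(values, weights))

def information_gain(labels, attribute):
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(labels) - conditional_entropy(labels, attribute)
```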

22 [Figure: the entropy H(X), the specific conditional entropies H(X|Y=v_1), H(X|Y=v_2), H(X|Y=v_3), and the conditional entropy H(X|Y) = Σ_j P(Y=v_j) H(X|Y=v_j).]

23 Information Gain from thresholding a real-valued attribute. Define IG(X|Y:t) as H(X) - H(X|Y:t), where H(X|Y:t) = H(X|Y < t) P(Y < t) + H(X|Y >= t) P(Y >= t). IG(X|Y:t) is the information gain for predicting X if all you know is whether Y is less than or greater than t.
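
A sketch of searching for the threshold t that maximizes IG(X|Y:t), reusing `entropy` from the snippet above; taking candidate thresholds as midpoints between sorted attribute values is an assumption, one common choice.

```python
import numpy as np

def best_threshold(labels, attribute):
    """Return the threshold t on `attribute` with the highest IG(X|Y:t), or None."""
    values = np.unique(attribute)
    candidates = (values[:-1] + values[1:]) / 2.0
    def gain(t):
        left, right = labels[attribute < t], labels[attribute >= t]
        p_left = len(left) / len(labels)
        h_cond = p_left * entropy(left) + (1 - p_left) * entropy(right)
        return entropy(labels) - h_cond
    return max(candidates, key=gain) if len(candidates) else None
```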

24 A Decision Tree represents a structured plan of a set of attributes to test in order to predict the output. To decide which attribute should be tested first, simply find the one with the highest information gain, then recurse. Stop when: all records at a node have the same output, or all records at a node have the same attributes; in this case make a classification based on the majority output. The tree directly provides an ordering of the importance of each attribute in making the classification. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, 1984. [Applet demo]
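
A minimal sketch of this recursive procedure, reusing `entropy` and `best_threshold` from the earlier snippets; it stops on exactly the two conditions listed above and is meant to illustrate the idea, not to be an efficient implementation.

```python
import numpy as np
from collections import Counter

def build_tree(X, y):
    # Stop: all records have the same output, or all records have the same attributes.
    if len(np.unique(y)) == 1 or np.all(X == X[0]):
        return {'leaf': Counter(y).most_common(1)[0][0]}   # majority output
    # Test the (attribute, threshold) pair with the highest information gain first.
    best = None
    for j in range(X.shape[1]):
        t = best_threshold(y, X[:, j])
        if t is None:
            continue
        mask = X[:, j] < t
        ig = entropy(y) - (mask.mean() * entropy(y[mask]) +
                           (1 - mask.mean()) * entropy(y[~mask]))
        if best is None or ig > best[0]:
            best = (ig, j, t, mask)
    _, j, t, mask = best
    return {'attribute': j, 'threshold': t,
            'true': build_tree(X[mask], y[mask]),
            'false': build_tree(X[~mask], y[~mask])}
```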

25 The tree maps each data point to a leaf. Each leaf stores the distribution of classes that end up there.

26 Overfitting: If we expand the tree as far as we can go, we are likely to end up with many leaf nodes that contain only one record. It is likely that we have fitted the noise in the data. This will result in the training-set error being very small but the test-set error being high. Pruning: Starting at the bottom of the tree, delete splits that do not add predictive power. Use a chi-squared test to decide whether the distributions generated by the split are significantly different. You have to provide a threshold which represents your willingness to fit noise.
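
A sketch of the chi-squared test for pruning, using SciPy's chi2_contingency; `alpha` plays the role of the threshold that represents your willingness to fit noise (the 0.05 default is my assumption).

```python
import numpy as np
from scipy.stats import chi2_contingency

def split_is_significant(left_class_counts, right_class_counts, alpha=0.05):
    """Keep a split only if its children's class distributions differ significantly."""
    table = np.array([left_class_counts, right_class_counts])
    table = table[:, table.sum(axis=0) > 0]      # drop classes absent from both children
    if table.shape[1] < 2 or table.sum(axis=1).min() == 0:
        return False                             # degenerate split: nothing to test
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha
```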

27 Random Forests: An ensemble of decision trees. During learning, tree nodes are split using a random subset of data features. All trees vote to produce a final answer. Why do this? It was found that optimal cut points can depend strongly on the training set used (high variance). This led to the idea of using multiple trees to vote for a result. For the use of multiple trees to be most effective, the trees should be as independent as possible; splitting using a random subset of features hopefully achieves this. Averaging the outputs of trees reduces overfitting to noise. Pruning is not needed. Leo Breiman, "Random Forests", Machine Learning, Vol. 45, No. 1, 2001.

28 Typically a large number of trees is used, though often only a few trees are needed. Results seem fairly insensitive to the number of random attributes that are tested for each split; a common default is to use the square root of the number of attributes. Trees are fast to generate because fewer attributes have to be tested for each split and no pruning is needed. The memory needed to store the trees can be large.
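
One way to experiment with these defaults is scikit-learn (a library choice of mine, not named in the slides); max_features='sqrt' tests roughly the square root of the number of attributes at each split.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt')
print(cross_val_score(forest, X, y, cv=5).mean())   # mean cross-validated accuracy
```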

29 Extremely Randomized Trees: Not only randomly select a subset of attributes to evaluate for each split but, in the case of numerical attributes, also randomly select the threshold to split the value on. They seem to work slightly better than Random Forests, though this may be a result of the slightly different information gain score used. Completely random trees can also work well: here a single attribute is selected at random for each split, so no evaluation of the attribute split is needed and the trees are trivial to generate. Pierre Geurts, D. Ernst, L. Wehenkel, "Extremely Randomized Trees", Machine Learning, Vol. 63, No. 1, 2006.
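
A sketch of the extra-trees style split described above, reusing `entropy` from the earlier snippet (all names illustrative): draw a random subset of attributes, draw one random threshold per attribute, and keep the candidate with the highest gain. A completely random tree would skip the scoring and keep the first candidate drawn.

```python
import numpy as np

def extra_trees_split(X, y, n_candidates, rng=np.random.default_rng()):
    """Return (gain, attribute index, threshold) for the best random candidate, or None."""
    best = None
    for j in rng.choice(X.shape[1], size=n_candidates, replace=False):
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:
            continue
        t = rng.uniform(lo, hi)                  # random threshold, not an optimised one
        mask = X[:, j] < t
        ig = entropy(y) - (mask.mean() * entropy(y[mask]) +
                           (1 - mask.mean()) * entropy(y[~mask]))
        if best is None or ig > best[0]:
            best = (ig, j, t)
    return best
```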

30 Some results taken from the Geurts paper. [Figure: comparison of a single tree, a Random Forest, and Extremely Randomized Trees.]

31 Importance of a Particular Feature Variable. Breiman's algorithm: 1. Train a classifier. 2. Perform validation to determine the accuracy of the classifier. 3. For each data point, randomly choose a new value for the feature variable from among the values that the feature has in the rest of the data set. (This ensures the distribution of the feature values remains unchanged but the meaning of the feature variable is destroyed.) 4. Evaluate the classifier on the altered data and measure its accuracy; if the accuracy is degraded badly then the feature is very important. 5. Restore the data and repeat the process for every other feature variable. The result is an ordering of the feature variables by their importance.
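
A sketch of this procedure, assuming a trained classifier with a scikit-learn-style .score(X, y) method; shuffling a column is used as a stand-in for redrawing each value from the rest of the data set, since both preserve the feature's distribution while destroying its meaning.

```python
import numpy as np

def permutation_importance(clf, X_val, y_val, rng=np.random.default_rng()):
    """Rank features by how much shuffling each one degrades validation accuracy."""
    baseline = clf.score(X_val, y_val)
    drops = []
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        rng.shuffle(X_perm[:, j])           # keep the distribution, destroy the meaning
        drops.append(baseline - clf.score(X_perm, y_val))
    return np.argsort(drops)[::-1]          # feature indices, most important first
```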

32 Regression Trees vs Classification Trees: A Regression Tree attempts to predict a continuous numerical value rather than a discrete classification. Evaluation of each node split has to be made on the variance of the split distributions rather than the information gain. [Figure: two candidate splits with equal entropy but unequal variance.] A large Random Forest or set of Extremely Randomized Trees acts as a linear interpolator.
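
A sketch of the variance-based split score used for regression trees (names illustrative): score a candidate split by the parent's variance minus the size-weighted variance of the two children.

```python
import numpy as np

def variance_reduction(y, mask):
    """Parent variance minus the size-weighted variance of the two child nodes."""
    left, right = y[mask], y[~mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    weighted = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted
```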

33 [Figure: 1 Extremely Random Tree vs. 100 Extremely Random Trees.]

34 Random Ferns: extending the randomness and simplicity even further. A fern can be thought of as a constrained tree where the same binary test is performed at each level of the tree. Özuysal, Fua and Lepetit, CVPR 2007; Özuysal, Calonder, Lepetit and Fua, PAMI 2009. (Diagrams taken from Özuysal's pages.)

35 Recognizing keypoints with Random Ferns. Keypoint features f_i are the sign of the intensity difference of two pixels at random locations in a patch about the keypoint. Each keypoint has N features, but each fern is constructed from a random subset of S features. [Figure: Fern 1, Fern 2, Fern 3.] The outputs of the feature tests can be concatenated to form a binary number; this corresponds to the index of the leaf node that we end up at in the equivalent constrained tree.
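
A sketch of one fern under the description above: S random pixel pairs inside a patch, each test contributing one bit, concatenated into a leaf index in [0, 2^S). All names are illustrative.

```python
import numpy as np

def make_fern(patch_size, S, rng=np.random.default_rng()):
    """A fern is just S random pixel pairs: shape (S, 2, 2) of (row, col) coordinates."""
    return rng.integers(0, patch_size, size=(S, 2, 2))

def fern_index(patch, fern):
    """Concatenate the S binary tests into the leaf index of the constrained tree."""
    index = 0
    for (r1, c1), (r2, c2) in fern:
        bit = int(patch[r1, c1] > patch[r2, c2])   # sign of the intensity difference
        index = (index << 1) | bit
    return index
```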

36 Training: Example views of each keypoint are passed through each fern. A histogram of the leaf indices that each keypoint class ends up at is built up. [Figure: Fern 1, Fern 2, Fern 3.]
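
A sketch of the training step, reusing `fern_index` from the snippet above; the +1 (Laplace-style) initialisation of the counts is my assumption, taken from the usual formulation rather than from the slide.

```python
import numpy as np

def train_ferns(examples_by_class, ferns, S):
    """Return, for each fern, a (n_classes, 2**S) table of leaf probabilities."""
    n_classes = len(examples_by_class)
    counts = [np.ones((n_classes, 2 ** S)) for _ in ferns]   # Laplace-smoothed histograms
    for c, patches in enumerate(examples_by_class):
        for patch in patches:
            for m, fern in enumerate(ferns):
                counts[m][c, fern_index(patch, fern)] += 1
    return [cnt / cnt.sum(axis=1, keepdims=True) for cnt in counts]
```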

37 Recognition: The output of the feature tests on the candidate keypoint places us at a leaf node in each fern. Each leaf gives a probability for each of the possible keypoints. These are combined assuming independence between the ferns' distributions.
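
A sketch of the recognition step under that independence assumption, reusing the snippets above: sum the log leaf probabilities across ferns (i.e. multiply the per-fern distributions) and take the most probable keypoint class; a uniform prior over classes is assumed.

```python
import numpy as np

def classify_keypoint(patch, ferns, leaf_probs):
    """`leaf_probs` is the output of train_ferns; returns the most probable class index."""
    log_posterior = np.zeros(leaf_probs[0].shape[0])          # uniform prior over classes
    for fern, probs in zip(ferns, leaf_probs):
        log_posterior += np.log(probs[:, fern_index(patch, fern)])
    return int(np.argmax(log_posterior))
```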

38 Random Ferns are Semi-Naïve Bayes Classifiers. Typical values used by Özuysal, Fua and Lepetit: number of features N = 450; number of ferns M = 30 to 50; number of features per fern S = 11. Consider the options: One large fern made up from all the features -> 2^N parameters (too large!). N single-feature ferns (a Naïve Bayes Classifier) -> N parameters; assuming each feature is independent is too simplistic, and keypoint pose variations are not handled well. M ferns, each consisting of S features -> M x 2^S parameters; this assumes that each group of S features is independent (a Semi-Naïve Bayes Classifier). Varying M and S allows tuning of complexity and performance.
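
For example, with the typical values above, M x 2^S = 30 x 2^11 = 61,440 leaf probabilities per keypoint class, compared with 2^450 for a single fern built from all N features.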

39 Recognizing Textures with Trees? Some are starting to do this: Marée, Geurts and Wehenkel 2009; Shotton, Johnson and Cipolla, CVPR 2008.

40 Tree/fern-based learning algorithms: Simple and can perform very well. Training is fast, and leaf histograms can be incrementally updated. Can require considerable memory to store. Stochastic force seems to outgun careful design!
