University of Cambridge Engineering Part IIB Paper 4F10: Statistical Pattern Processing Handout 10: Decision Trees


[Title-page figure: example decision tree for classifying fruit, with questions colour=green?, size>20cm?, colour=red?, size>5cm?, size>5cm?, colour=yellow? and leaves watermelon, apple, grape, apple, grape, lemon, fig]

Mark Gales mjfg@eng.cam.ac.uk Michaelmas 2015

Improved Detection of Prostate Cancer

The aim is to build a classifier for patients suspected of having prostate cancer. The data collected included:
- serum prostate-specific antigen (PSA) level
- presence of hypoechoic lesion (binary)
- PSA density (PSAD)
- prostate volume
- demographic information (e.g. age)

Using this information a CART decision tree was constructed.

specificity = #true negatives / (#true negatives + #false positives)
sensitivity = #true positives / (#true positives + #false negatives)
PPV = #true positives / (#true positives + #false positives)
NPV = #true negatives / (#true negatives + #false negatives)
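For reference, the four measures written out from raw confusion counts; a minimal sketch, where the counts in the example call are hypothetical and not figures from the prostate study:

def diagnostic_measures(tp, fp, tn, fn):
    """tp/fp/tn/fn: true/false positive and negative counts."""
    return {'specificity': tn / (tn + fp),
            'sensitivity': tp / (tp + fn),
            'PPV': tp / (tp + fp),
            'NPV': tn / (tn + fn)}

diagnostic_measures(tp=40, fp=10, tn=45, fn=5)
# {'specificity': 0.818..., 'sensitivity': 0.888..., 'PPV': 0.8, 'NPV': 0.9}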

[Figure: CART decision tree for prostate cancer detection, image from http://www.jco.org/content/vol23/issue19/images/large/zlj019052166000...]

Introduction

So far we have only considered situations where the feature vectors have been real valued, or discrete valued. In these cases we have been able to find a natural distance measure between the vectors. Consider the situation where we have nominal data in addition to the numerical data, for instance descriptions that are discrete and have no natural concept of ordering. The question is how such data can be used in a classification task and, most importantly, how we can use such data to train a classifier.

One approach to handling this problem is to use decision trees, an example of a non-metric classifier - they elegantly handle nominal data. Multiple levels of classifier are used to obtain a final decision. In contrast to multi-layer perceptrons (another multi-layer classifier) the training of these decision trees is relatively simple and the classification speed is fast.

We will look at the issues behind training and classifying with decision trees. As with the other classifiers, supervised training data will be used to train the decision tree.

Example Decision Tree

Consider a simple task to classify fruit according to its colour and size. A reasonable decision tree would be:

[Figure: decision tree with questions colour=green?, size>20cm?, colour=red?, size>5cm?, size>5cm?, colour=yellow? and leaves watermelon, apple, grape, apple, grape, lemon, fig]

The feature vector [red, 3.0cm] is observed; what is the fruit?
1. Does colour=green? NO
2. Does colour=red? YES
3. Is size>5cm? NO
It's a grape!
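Written as code, the tree is just nested conditionals. The sketch below is a direct transcription of the figure, assuming the two features are named colour and size (in cm):

def classify_fruit(colour, size):
    if colour == 'green':
        if size > 20:
            return 'watermelon'
        return 'apple' if size > 5 else 'grape'
    if colour == 'red':
        return 'apple' if size > 5 else 'grape'
    if colour == 'yellow':
        return 'lemon'
    return 'fig'

classify_fruit('red', 3.0)    # -> 'grape', as in the worked example above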

Example Tree Construction (1)

All training samples sit at the (as yet unsplit) root node:
mellon1 green 25.0cm
mellon2 green 21.1cm
grape1 green 2.5cm
grape2 green 4.5cm
grape3 red 2.5cm
lemon1 yellow 8.0cm
fig1 black 4.3cm
apple1 green 7.0cm
apple2 red 8.2cm

Example Tree Construction (2)

The root question colour=green? splits the training set into two subsets:
colour=green: mellon1 (25.0cm), mellon2 (21.1cm), grape1 (2.5cm), grape2 (4.5cm), apple1 (7.0cm)
otherwise: grape3 red (2.5cm), lemon1 yellow (8.0cm), fig1 black (4.3cm), apple2 red (8.2cm)

Example Tree Construction (3)

The tree now asks colour=green?, then size>20cm? (giving the watermelon leaf) on the green branch and colour=red? on the other branch. The data is partitioned as:
green, size>20cm (watermelon): mellon1 (25.0cm), mellon2 (21.1cm)
green, size<=20cm: grape1 (2.5cm), grape2 (4.5cm), apple1 (7.0cm)
colour=red: grape3 (2.5cm), apple2 (8.2cm)
otherwise: lemon1 yellow (8.0cm), fig1 black (4.3cm)

Example Tree Construction (4)

The complete tree (as in the example decision tree above) adds size>5cm? on each remaining colour branch and colour=yellow? for the rest, so every training sample now reaches a pure leaf:
watermelon: mellon1, mellon2
apple (green, >5cm): apple1
grape (green, <=5cm): grape1, grape2
apple (red, >5cm): apple2
grape (red, <=5cm): grape3
lemon (yellow): lemon1
fig (otherwise): fig1


Decision Tree Construction Issues

A decision tree comprises:
- root node: the top node/question in the tree (red)
- internal nodes: consist of questions (blue)
- terminal nodes (or leaves): consist of a class label (green)

When constructing decision trees a number of issues must be addressed:
1. how many splits should there be at each node (binary?)?
2. what question for root/internal nodes? what order should the questions be asked in?
3. when should a node be declared a leaf?
4. how should a leaf be labelled?

For more general decision trees there are other issues that may be addressed, for example handling missing data and pruning trees. These are not considered in this lecture.

Number of Splits

When the question associated with a node is applied the outcome is called a split - it splits the data into subsets. The root node splits the complete training set; subsequent nodes split subsets of the data.

How many splits to have at each node? Normally the number is determined by an expert. Note every decision, and hence every decision tree, can be generated just using binary decisions.

[Figure: (a) a binary tree using colour=green?, size>20cm? and size>5cm? with leaves watermelon, apple, grape; (b) an equivalent tertiary tree splitting directly on colour=green and size>20cm, colour=green and size>5cm, colour=green and size<=5cm]

Two forms of query exist:
monothetic: (colour=green?)
polythetic: (size>20cm?) AND (colour=green?)

Polythetic queries may be used but can result in a vast number of possible questions.


Question Selection

The basic training paradigm for designing a binary decision tree automatically from training data is:
1. generate a list of questions
2. initialise a root node containing all training data
3. sequence through all questions and for each:
(a) apply the question to the node training data
(b) partition the data and compute the purity
(c) record the question with the highest purity
4. attach the best question to the current node and create two offspring nodes
5. split the data between them: 'yes' data in the left node, 'no' data in the right node
6. recursively apply the above algorithm to each new node (a sketch is given below)

Note that this is a so-called greedy algorithm because it only computes a locally optimal partitioning of the data. The complete tree is not guaranteed to be an optimal classifier.
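A minimal sketch of this greedy procedure, not the handout's CART implementation: data is assumed to be a list of (features, label) pairs, questions a list of boolean predicates over the features, and impurity any of the purity measures defined later (e.g. entropy), applied to a list of per-class counts. The min_drop threshold anticipates the leaf-declaration rule discussed later.

from collections import Counter

def grow_tree(data, questions, impurity, min_drop=1e-3):
    counts = Counter(label for _, label in data)
    node_impurity = impurity(list(counts.values()))

    best = None  # (impurity drop, question, left subset, right subset)
    for q in questions:
        left = [(x, y) for x, y in data if q(x)]        # 'yes' answers
        right = [(x, y) for x, y in data if not q(x)]   # 'no' answers
        if not left or not right:
            continue                                    # question does not split this node
        n_l, n_r = len(left), len(right)
        drop = (node_impurity
                - n_l / (n_l + n_r) * impurity(list(Counter(y for _, y in left).values()))
                - n_r / (n_l + n_r) * impurity(list(Counter(y for _, y in right).values())))
        if best is None or drop > best[0]:
            best = (drop, q, left, right)

    # declare a leaf (labelled with the majority class) if no question
    # gives a useful drop in impurity
    if best is None or best[0] < min_drop:
        return {'leaf': counts.most_common(1)[0][0]}

    _, q, left, right = best
    return {'question': q,
            'yes': grow_tree(left, questions, impurity, min_drop),
            'no': grow_tree(right, questions, impurity, min_drop)}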

Purity Measures

The aim of a split is to yield a pure node (a node only containing one class), or get as close as possible. To assess the purity of the nodes a node impurity function at a particular node N, I(N), is defined:

I(N) = \phi(P(\omega_1|N), ..., P(\omega_K|N))

where there are K classification classes. The function \phi() should have the following properties:
- \phi() is a maximum when P(\omega_i) = 1/K for all i
- \phi() is at a minimum when P(\omega_i) = 1 and P(\omega_j) = 0, j \neq i
- it is a symmetric function (i.e. the order of the class probabilities doesn't matter)

The obvious way to use the purity when constructing the tree is to split the node according to the question that maximises the increase in purity (or minimises the impurity). The drop in impurity for a binary split is

\Delta I(N) = I(N) - \frac{n_L}{n_L + n_R} I(N_L) - \frac{n_R}{n_L + n_R} I(N_R)

where N_L and N_R are the left and right descendants, n_L is the number of samples in the left descendant and n_R is the number of samples in the right descendant.
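As a purely hypothetical numerical illustration (these values are not from the handout): suppose a node containing 10 samples has I(N) = 0.5, and a candidate question sends 6 samples to a left child with I(N_L) = 0.3 and 4 samples to a right child with I(N_R) = 0. Then

\Delta I(N) = 0.5 - \frac{6}{10}(0.3) - \frac{4}{10}(0) = 0.32

so this question removes roughly two thirds of the node's impurity.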


Purity Measures

There are a number of commonly used measures:

Entropy measure:
I(N) = -\sum_{i=1}^{K} P(\omega_i|N) \log P(\omega_i|N)

Variance measure: for a two-class problem a simple measure is
I(N) = P(\omega_1|N) P(\omega_2|N)

Gini measure: this is a generalisation of the variance measure to K classes
I(N) = \sum_{i \neq j} P(\omega_i|N) P(\omega_j|N) = 1 - \sum_{i=1}^{K} P(\omega_i|N)^2

Misclassification measure:
I(N) = 1 - \max_k P(\omega_k|N)

The purity measures for a two-class problem are shown below.

[Figure: variance, entropy and misclassification impurity plotted as a function of the class probability for a two-class problem]
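The measures above are straightforward to implement. A minimal sketch, assuming a node is summarised by a list of its per-class sample counts and taking the natural logarithm for the entropy (the handout does not fix a base):

import math

def probs(counts):
    total = sum(counts)
    return [c / total for c in counts]

def entropy(counts):
    return -sum(p * math.log(p) for p in probs(counts) if p > 0)

def gini(counts):
    return 1.0 - sum(p * p for p in probs(counts))

def misclassification(counts):
    return 1.0 - max(probs(counts))

def variance_two_class(counts):
    p1, p2 = probs(counts)          # only defined for a two-class problem
    return p1 * p2

# root node of the fruit example: 2 melons, 3 grapes, 2 apples, 1 lemon, 1 fig
entropy([2, 3, 2, 1, 1])            # about 1.52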


Non-Binary Purity Increase

We may also want to extend the binary split into a multi-way split. Here we can use the natural extension

\Delta I(N) = I(N) - \sum_{k=1}^{B} f_k I(N_k)

where f_k is the fraction of the samples in descendant k,

f_i = \frac{n_i}{\sum_{k=1}^{B} n_k}

The direct use of this expression to determine the number of splits B has a drawback: large B are inherently favoured over small B. To avoid this problem the split is commonly selected by

\Delta I_B(N) = \frac{\Delta I(N)}{-\sum_{k=1}^{B} f_k \log(f_k)}

This is the gain ratio impurity.

Using the splitting rules described before results in a greedy algorithm. This means that any split is only locally good. There is no guarantee that a globally optimal split will be found, nor that the smallest tree will be obtained.
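A sketch of the gain ratio for a B-way split, following the two expressions above; impurity is any of the measures already defined (applied to per-class counts) and child_counts holds one count list per descendant N_k:

import math

def gain_ratio(parent_counts, child_counts, impurity):
    n = sum(sum(c) for c in child_counts)
    fractions = [sum(c) / n for c in child_counts]            # f_k
    drop = impurity(parent_counts) - sum(
        f * impurity(c) for f, c in zip(fractions, child_counts))
    split_info = -sum(f * math.log(f) for f in fractions if f > 0)
    return drop / split_info                                   # Delta I_B(N)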


Question Set Generation

Only binary questions are considered, with n training samples. The question set generation can be split into two parts.

Nominal data: a set of questions associated with each specific nominal attribute needs to be specified, e.g. fruit colour: (colour=green?), (colour=red?). For many tasks, such as clustering the phonetic context in speech recognition, the exact set of questions is not found to be important, provided a sensible set of questions is used.

Naturally ordered data: only splits parallel to the axes are considered (combinations are possible). The choice of where to split for a particular dimension is determined by:
1. sorting the n samples according to the particular feature selected;
2. examining the impurity measure at each of the possible n-1 splits;
3. selecting the split with the lowest impurity.
Each of the possible d dimensions should be examined; a sketch of this search for a single dimension is given below.
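A minimal sketch of the threshold search for one naturally ordered feature, to be repeated over each of the d dimensions. Placing the candidate threshold midway between adjacent sorted values is an assumption of this sketch; impurity is any per-class-count measure from above:

from collections import Counter

def best_threshold(values, labels, impurity):
    """values: one continuous feature for n samples; labels: their classes."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])   # 1. sort the samples
    best = None                                         # (weighted impurity, threshold)
    for split in range(1, n):                           # 2. the n-1 possible splits
        left = Counter(labels[order[i]] for i in range(split))
        right = Counter(labels[order[i]] for i in range(split, n))
        weighted = ((split / n) * impurity(list(left.values()))
                    + ((n - split) / n) * impurity(list(right.values())))
        threshold = 0.5 * (values[order[split - 1]] + values[order[split]])
        if best is None or weighted < best[0]:          # 3. keep the lowest impurity
            best = (weighted, threshold)
    return best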

Multivariate Numerical Questions

In some situations the proper feature to use to form the decision boundary does not line up with the axes.

[Figure: (c) univariate, axis-parallel splits versus (d) a multivariate, oblique split in the (x1, x2) plane]

This has led to the use of multivariate decision trees. Here decision boundaries do not have to be parallel to the axes. The question is how to select the direction of the new split. This may be viewed as producing a linear hyperplane that correctly partitions the data, which is the same problem as generating a linear decision boundary. We can therefore use, for example:
- the perceptron criterion;
- logistic regression;
- support vector machines.

There is a problem with this form of rule as the training time may be slow, particularly near the root node.
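A hedged sketch of generating one multivariate question at a node: fit a linear classifier to the node's data and split on the sign of its decision function. The use of scikit-learn's LogisticRegression and a one-vs-rest target for the node's most common class are choices made here purely for illustration; the handout equally allows the perceptron criterion or a support vector machine.

import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

def multivariate_question(X, y):
    """X: (n, d) numeric features at the node; y: class labels (node assumed impure)."""
    majority = Counter(y).most_common(1)[0][0]
    target = np.array([1 if label == majority else 0 for label in y])
    clf = LogisticRegression().fit(X, target)
    # the question "w'x + b > 0 ?" is an oblique (non axis-parallel) split
    return lambda x: clf.decision_function(np.atleast_2d(x))[0] > 0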


Declaring a Leaf Node

If the splitting is allowed to carry on we will eventually reach a stage where there is only one sample in each node. Pure nodes can always be obtained! However these pure nodes are not useful: it is unlikely that the tree built will generalise to unseen data. The tree has effectively become a lookup table.

There are a number of techniques that may be used to stop the tree growth at an appropriate stage.

Threshold: the tree is grown until the reduction in the impurity measure fails to exceed some specified minimum. We stop splitting a node N when

\Delta I(N) \leq \beta

where \beta is the minimum threshold. This is a very simple scheme to operate and it allows the tree to stop splitting at different levels.

MDL: a criterion related to minimum description length (MDL) may also be used. Here we look for a global minimum of

\alpha \cdot \text{size} + \sum_{\text{leaf nodes}} I(N)

where size represents the number of nodes and \alpha is a positive constant. The problem is that a tunable parameter \alpha has been added.

Other generic techniques such as cross-validation and held-out data may also be used.
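A small sketch of the MDL-style cost, assuming a tree represented as nested dicts (as in the construction sketch earlier) extended so that each leaf also stores its impurity under the key 'impurity'; this representation is chosen here for illustration only.

def mdl_cost(tree, alpha):
    """alpha * (number of nodes) + sum of the leaf impurities."""
    if 'leaf' in tree:                       # a leaf contributes one node plus I(N)
        return alpha + tree['impurity']
    return (alpha                            # the internal node itself
            + mdl_cost(tree['yes'], alpha)
            + mdl_cost(tree['no'], alpha))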

Labelling Impure Nodes

If the generation of the tree has been very successful then the nodes of the tree will be pure, meaning that all samples at a terminal node have the same class label. In this case assignment of the terminal nodes to classes is trivial. However extremely pure trees are not necessarily good; this often indicates that the tree is over-tuned to the training data and will not generalise.

Impure nodes are assigned to the class with the greatest number of samples at that node. This is simply Bayes' decision rule (pick the class with the greatest posterior):

\hat{\omega} = \arg\max_i \{ P(\omega_i|N) \}, \quad P(\omega_i|N) \approx \frac{n_i}{\sum_{k=1}^{K} n_k}

where n_i is the number of samples assigned to node N that belong to class \omega_i.
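The labelling rule amounts to a majority vote over the samples reaching the node; a minimal sketch:

from collections import Counter

def label_node(labels):
    """labels: classes of all training samples assigned to node N."""
    return Counter(labels).most_common(1)[0][0]   # arg max of the node posterior

label_node(['grape', 'grape', 'apple'])           # -> 'grape'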

Classification Trees?

The classification tree approach is recommended when there are:
1. Complex datasets believed to have non-linear decision boundaries, which may be approximated by a sum of linear boundaries.
2. Datasets with nominal data.
3. A need to gain insight into the nature of the data structure. In multi-layer perceptrons it is hard to interpret the meaning of a particular node or layer, due to the high connectivity.
4. A need for ease of implementation. The generation of decision trees is relatively simple compared to, for example, multi-layer perceptrons.
5. A need for speed of classification. Classification can be achieved in cost O(L), where L is the number of levels in the tree, and the questions at each level are relatively simple.

Standard forms of classification tree implementations are CART, ID3 and C4.5.

Tree-Based Reduction

Given a tree structure, it is possible to use it to combine multiple binary classifiers into a multi-class classifier without the no-decision regions of simple voting.

[Figure: a binary tree of classifiers - Classifier 1 at the root feeding Classifier 2 and Classifier 3, over classes ω1, ω2, ω3, ω4]

The above Directed Acyclic Graph (DAG) is one approach:
- classification/training works from the root node down
- the top classifier classifies {ω1, ω2} vs {ω3, ω4}
- performance will depend on the ordering of the classes
- it is possible to use the purity measures previously described

Mistakes at the top level cannot be corrected by lower levels, and the top classifier's task is highly complicated.



Adaptive Directed Acyclic Graph

Adaptive DAG (ADAG) classifiers are an alternative process:
- classification works from the leaf nodes up
- classifiers decide between the results from lower classifiers
- a simple knockout competition

[Figure: Classifier 2 and Classifier 3 operate on ω1, ω2 and ω3, ω4 respectively; their results (ω1 and ω4) feed Classifier 1, which produces the final result]

In the diagram, the actions for a particular observation are:
- Classifier 2 applied to split ω1 vs ω2 - yields ω1
- Classifier 3 applied to split ω3 vs ω4 - yields ω4
- Classifier 1 applied to split ω1 vs ω4 to yield the final result
Classifier 1 changes depending on the class outputs!

DAG versus ADAG
System   #classifiers   #operations
DAG      K-1            log2(K)
ADAG     K(K-1)/2       K-1
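A minimal sketch of the ADAG knockout for K classes, assuming a dictionary pairwise that maps each unordered class pair to a trained binary classifier returning the winning class for an observation x; this pairing scheme is an illustration, not the handout's implementation:

def adag_classify(x, classes, pairwise):
    remaining = list(classes)
    while len(remaining) > 1:
        survivors = []
        for i in range(0, len(remaining) - 1, 2):      # one knockout round
            a, b = remaining[i], remaining[i + 1]
            survivors.append(pairwise[frozenset((a, b))](x))
        if len(remaining) % 2 == 1:                    # odd one out gets a bye
            survivors.append(remaining[-1])
        remaining = survivors
    # K-1 pairwise decisions in total, drawn from K(K-1)/2 trained classifiers
    return remaining[0]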