Lazy Decision Trees Ronny Kohavi

Lazy Decision Trees. Ronny Kohavi, Data Mining and Visualization Group, Silicon Graphics, Inc. Joint work with Jerry Friedman and Yeogirl Yun, Stanford University.

Motivation: Average Impurity vs. interesting impurity (figure).

Eager and Lazy Learning. Eager decision tree algorithms (e.g., C4.5, CART, ID3) create a single decision tree for classification; the inductive leap is attributed to the building of this decision tree. Lazy learning algorithms (e.g., nearest neighbors, and this paper) do not build a concise representation of the classifier; they wait for the test instance to be given. The inductive leap is attributed to the classification phase; little (if any) work is done during the training phase.

Problems with Eager Decision Trees. Replication and fragmentation: as a tree is built, the number of instances in every node decreases. If many features are relevant, we may not have enough data to make the number of splits needed. Unknown values: complex methods are usually employed. C4.5 penalizes attributes with unknown values during induction and does multi-way splits during classification; CART finds surrogate splits.

Lazy DTs: Basic Observation. In theory, we would like to select the best decision tree for each test instance, i.e., pick the best tree from all possible trees. Observation: only the path the test instance takes really matters. We don't need to search or build all possible trees, only the possible paths.

The LazyDT Algorithm (recursive).
Input: a training set T of labelled instances, and an instance I to classify.
Output: a label for instance I.
1. (Stopping criterion) If T is pure (all instances have the same label L), return label L.
2. (Stopping criterion) If all instances have the same feature values, return the majority class in T.
3. (Recursive step) Select a test X and let x be the value of the test on instance I. Assign the set of instances with X = x to T and apply the algorithm recursively to T.
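
The recursion above maps naturally to a short routine. Below is a minimal sketch in Python, assuming dict-based instances and a caller-supplied select_test that implements the splitting criterion discussed on the next slides; the parameter names and the majority-class fallback for empty branches are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def lazy_dt_classify(train, labels, instance, select_test):
    """Return a label for `instance`, growing only the path it would follow.

    train       : list of dicts, attribute name -> (discretized) value
    labels      : list of class labels, parallel to train
    instance    : dict, attribute name -> value (attributes may be unknown/missing)
    select_test : callable(train, labels, instance) -> attribute name, or None
    """
    # Step 1: if T is pure, return its single label.
    if len(set(labels)) == 1:
        return labels[0]

    # Step 2: if all remaining instances look identical, return the majority class.
    if all(t == train[0] for t in train):
        return Counter(labels).most_common(1)[0][0]

    # Step 3: pick a test X and keep only the instances that agree with
    # the test instance on X (the branch X = x).
    attr = select_test(train, labels, instance)
    if attr is None:  # e.g., every candidate attribute is unknown in `instance`
        return Counter(labels).most_common(1)[0][0]
    x = instance[attr]
    kept = [(t, l) for t, l in zip(train, labels) if t.get(attr) == x]
    if not kept:  # nothing matches; fall back to the majority class
        return Counter(labels).most_common(1)[0][0]
    sub_train, sub_labels = map(list, zip(*kept))
    return lazy_dt_classify(sub_train, sub_labels, instance, select_test)
```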

The Split Measure Isn't Obvious. The "obvious" measure, the difference in entropies between the parent and the child node (into which the test instance trickles), is not a good idea: the difference in entropies may be negative. In fact, if class A is dominant, for class B to become dominant we may need to increase the entropy first. Also, 80/20 and 20/80 have the same entropy, but they are very different.
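
A quick numeric check of both objections, sketched in Python (the 80/20 split is the slide's own example):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class-count vector."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# 80/20 and 20/80 have identical entropy, yet describe opposite class distributions.
print(entropy([80, 20]), entropy([20, 80]))   # both ~0.7219

# The plain parent-minus-child difference can be negative: a parent at 80/20
# whose matching child moves toward 50/50 shows an entropy increase.
print(entropy([80, 20]) - entropy([50, 50]))  # ~ -0.2781
```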

Our Choice of Splitting Criteria. We chose to reweight the instances at every node so that all classes have equal probability. The entropy for each child is computed using the weighted instances. This method ensures that the difference in entropy is always non-negative, and that changes from 80/20 to 20/80 are very significant.
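
A minimal sketch of this reweighted measure. Giving each instance of class c weight 1/(k·n_c), so the k classes carry equal total weight at the parent, is my reading of "equal probability"; the function names are assumptions, not the paper's code.

```python
from collections import Counter
from math import log2

def weighted_entropy(labels, weight):
    """Entropy of a node when each instance of class c carries weight[c]."""
    total = sum(weight[l] for l in labels)
    probs = [sum(weight[l] for l in labels if l == c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p)

def split_gain(parent_labels, child_labels):
    """Parent entropy minus child entropy after reweighting the parent so every
    class has equal total weight. The parent entropy becomes exactly log2(k),
    the maximum, so the difference can never be negative."""
    counts = Counter(parent_labels)
    k = len(counts)
    weight = {c: 1.0 / (k * n) for c, n in counts.items()}
    return log2(k) - weighted_entropy(child_labels, weight)

parent = ['A'] * 80 + ['B'] * 20
print(split_gain(parent, parent))                   # 0.0  -- no change, no gain
print(split_gain(parent, ['A'] * 20 + ['B'] * 80))  # ~0.68 -- 80/20 -> 20/80 is significant
```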

The LazyDT Implementation. As with standard decision trees, we chose to limit ourselves to univariate splits (a single attribute). We allow splits on single values to fine-tune the partitions and avoid fragmentation. To speed up classification, we discretize the data (global discretization) and cache the impurity measures as we compute them. Because only a few attributes get chosen at every node, the cache was very effective.
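
As a rough illustration of the two speed-ups, here is a sketch assuming equal-width global binning (the slide does not specify the discretization scheme) and a cache keyed on the branch taken so far plus the candidate attribute; both choices are assumptions made for the example.

```python
def discretize_column(values, n_bins=10):
    """Replace numeric values by bin indices computed once, over the whole data set."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

impurity_cache = {}  # (path_so_far, attribute) -> impurity measure

def cached_impurity(path_so_far, attribute, compute):
    """Look up the measure for this branch/attribute pair, computing it only once.
    `path_so_far` is a tuple of (attribute, value) tests already applied, so test
    instances that follow the same prefix reuse earlier computations."""
    key = (path_so_far, attribute)
    if key not in impurity_cache:
        impurity_cache[key] = compute()
    return impurity_cache[key]

print(discretize_column([0.1, 0.4, 2.5, 9.9], n_bins=4))  # [0, 0, 0, 3]
```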

Missing Values. LazyDT never considers a split on an attribute whose value is unknown in the test instance. Contrast this with: C4.5, which penalizes attributes with missing values based on the ratio of missing values. An attribute such as "tested for AIDS" may be missing from most instances and never be chosen by C4.5 because of that. However, if the test instance has a value, the attribute might be extremely useful, and LazyDT will use it. CART computes surrogate splits to use instead.

Experiments (figure): accuracy difference (LazyDT minus C4.5) across 28 datasets; the y-axis (Acc diff) ranges from -20 to +20.

Interesting Observation. For the Anneal dataset, ID3 outperformed both LazyDT and C4.5 (0% error versus 5.9% and 8.4%). Reason: unknown handling. Our ID3 considered unknowns as a separate value. Cross-reference: Schaffer's paper showing how nearest neighbor beat C4.5 (the encoding for NN treated unknown as a separate value). It's all in the representation: changing the "?" to "Unknown" reduced the C4.5 error from 8.4% to 1.3%.

Other Interesting Differences. C4.5 outperformed LazyDT on audiology. Reason: 69 features, 24 classes, 226 instances. LazyDT clearly overfits (a variance problem). Note that LazyDT as implemented does no pruning (it is not obvious how to do it). LazyDT significantly outperformed C4.5 on tic-tac-toe. The concept is whether X won in an end game. LazyDT can split on squares that have X's (or at least are non-blank), while decision trees need to pick the squares in advance.

Related Work. Lazy learning (special issue of AI Review, to appear). Friedman, Flexible metric nearest neighbor. Hastie and Tibshirani, Discriminant adaptive nearest neighbor classification. Holte, Acker & Porter: small disjuncts (could LazyDT help?); Quinlan, improved estimates for small disjuncts.

Future Work. LazyDT is far from perfect: There is no regularization (pruning); we proceed until the leaf is pure. Data is discretized in advance; that is very eager, and local interactions are lost (without discretization, caching won't work well and classification would be very slow). Compare dynamic complexity (Holte), i.e., the number of splits until a decision is made.

Summary. LazyDT creates a path in a tree that would be "best" for a given test instance. The small single-attribute splits, coupled with the choice of path, reduce fragmentation and allow handling problems with many relevant attributes. Missing values are naturally handled by avoiding splits on such values. Disadvantages: no pruning, pre-discretization.