A Fast Way to Produce Optimal Fixed-Depth Decision Trees

Alireza Farhangfar, Russell Greiner and Martin Zinkevich
Dept. of Computing Science, University of Alberta
Edmonton, Alberta T6G 2E8, Canada
{farhang, greiner, maz}@cs.ualberta.ca
Now at Yahoo Research, Santa Clara. Copyright 2007, authors listed above. All rights reserved.

Abstract

Decision trees play an essential role in many classification tasks. In some circumstances, we only want to consider fixed-depth trees. Unfortunately, finding the optimal depth-d decision tree can require time exponential in d. This paper presents a fast way to produce a fixed-depth decision tree that is optimal under the Naïve Bayes (NB) assumption. Here, we prove that the optimal depth-d feature essentially depends only on the posterior probability of the class label given the tests previously performed, but not on either the identity or the outcomes of these tests. We can therefore precompute, in a fast pre-processing step, which features to use at the final layer. This results in a speedup of O(n / log n), where n is the number of features. We apply this technique to learning fixed-depth decision trees from standard datasets from the UCI repository, and find that it improves the computational cost significantly. Surprisingly, this approach still yields relatively high classification accuracy, despite the NB assumption.

1 Introduction

Many machine learning tasks involve producing a classification label for an instance. There is sometimes a fixed cost that the classifier can spend per instance before returning a value; consider, for example, per-patient capitation for medical diagnosis. If the tests can be performed sequentially, then the classifier may want to follow a policy (Sutton & Barto 1998): e.g., first perform a blood test; then, if its outcome is positive, perform a liver test; otherwise perform an eye exam. This process continues until the funds are exhausted, at which point the classifier stops running tests and returns an outcome, either healthy or sick. Such policies correspond naturally to decision trees.

There are many algorithms for learning a decision tree from a data sample; most systems (Quinlan 1993; Breiman et al. 1984) use heuristics that greedily seek the decision tree that best fits the sample, before running a post-processor to reduce overfitting. This paper focuses on the challenge motivated above: finding the best fixed-cost policy, which corresponds to finding the fixed-depth decision tree that best matches the data sample. There are several algorithms for this task, which typically use dynamic programming to find the optimal depth-d decision tree, i.e., the tree that minimizes the 0/1-loss over the training data. These algorithms are invariably exponential in the depth d, as they spend almost all of their time determining which features to test at the final level, d (Auer, Holte, & Maass 1995).

If one is willing to accept the Naïve Bayes assumption (Duda & Hart 1973), that features are independent of each other given the class, then there is an efficient way to compute the final layer of depth-d features. In particular, under this assumption, we prove that the optimal depth-d feature essentially depends only on the posterior probability of the class label given the tests previously performed, but not on the outcomes of the individual tests. We can therefore use a fast pre-processing step to create a so-called opt-feature list, OFL, that identifies which feature to use as a function of the posterior distribution, then use this list to quickly determine the last level of the tree. This technique results in a speedup of O(n / log n), where n is the number of features, and effectively means we can compute the optimal depth-d tree in the time typically required to compute the optimal depth-(d-1) tree. Section 2 surveys the relevant literature.
Section 3 summarizes our OPTNBDT algorithm and proves the relevant theorems. Section 4 presents empirical results that validate our approach, by applying it to standard datasets from the UCI repository (Newman et al. 1998) and elsewhere. The webpage (Greiner 2007) provides other information about this system and about our experiments. We find that our approach significantly improves the computational cost of finding a fixed-depth tree, surprisingly at little or no loss in accuracy.

2 Literature Review

As noted above, there are many algorithms for learning decision trees. Many algorithms, including C4.5 (Quinlan 1993) and CART (Breiman et al. 1984), begin with a greedy method that incrementally identifies an appropriate feature to test at each point in the tree, based on some heuristic score. These algorithms then perform a post-processing step to reduce overfitting. In our context of shallow decision trees, overfitting is not as big a concern, which explains why many of the algorithms that seek fixed-depth decision trees simply return the tree that best fits the data (Holte 1993; Dobkin, Gunopoulos, & Kasif 1996; Auer, Holte, & Maass 1995). These algorithms can easily be extended to allow different tests to have different costs (Turney 2000; Greiner, Grove, & Roth 2002).

Figure 1: Decision Tree T1, as it is being built.

These algorithms in general require O(n^d) time to find the best depth-d decision tree over n variables. While this is a polynomial-time complexity for fixed d, in practice it is not effective except for small n and tiny d. The results in this paper show that one can achieve a more efficient process by imposing a constraint, here the Naïve Bayes assumption (Duda & Hart 1973), to speed up the process. There is, of course, a large body of work on building classifiers based on Naïve Bayes systems (Domingos & Pazzani 1997; Lewis 1998), and on analyzing and characterizing their classification performance (Chan & Darwiche 2003). Those results, however, differ significantly from our task, which explicitly involves learning a decision tree; we use only the Naïve Bayes assumption in modeling the underlying distribution over instances. (But see the empirical comparisons in Section 4.) Turney (2000) discusses the general challenge of learning the best classifier subject to some explicit cost constraint. Here, our depth requirement corresponds to a constraint on the cost that the classifier must pay to see the features at performance time.

3 OPTNBDT Algorithm

3.1 Foundations

Assume there are n features F = {F^(1), ..., F^(n)}, where each feature F^(j) ranges over the r_{F^(j)} values V_{F^(j)} = {f^(j)_1, ..., f^(j)_{r_{F^(j)}}}, and there are two classes C = {+, -}. (We use CAPITAL letters for variables and lowercase for values, and bold for sets of variables.) A decision tree T is a directed tree structure, where each internal node ν is labeled with a feature F(ν) ∈ F, each leaf l ∈ L(T) is labeled with one of the classes c(l) ∈ C, and each arc descending from a node labeled F is labeled with a value f ∈ V_F. We can evaluate such a tree T on a specific instance f = {F^(j) = f^(j)}_j to produce a value VAL(T, f) ∈ C as follows: if the tree is a leaf T = l, return its label c(l) ∈ C; that is, VAL(T, f) = c(l). Otherwise, the root of the tree is labeled with a feature F ∈ F. We then find the associated value within f (say F = f), follow the f-labeled edge from the root to a new subtree, and recur; the value of VAL(T, f) will be the value of that subtree. Hence, given the instance f = {A = +, B = -, ...} and the tree T1 in Figure 1, the value of VAL(T1, f) will be the value of the subtree rooted in the B-labeled node on this instance (as we followed the A = + arc from the root), which in turn will be the value of the sub-subtree rooted in the D-labeled node (corresponding to the subsequent B = - arc), etc.

We say a decision tree T is correct for a labeled instance ⟨f, c⟩ if VAL(T, f) = c. A labeled dataset is a set of labeled instances S = {⟨f^(j), c^(j)⟩}_j. Our goal is to find the depth-d decision tree with the maximum expected accuracy, given our posterior beliefs, which are constructed from a Naïve Bayes prior and the dataset S. To define expected accuracy, we must first define the notion of a path π(ν) to any node ν in the tree, which is the sequence of feature-value pairs leading from the root to that node; hence, the path to the "?" node in Figure 1 is π(?) = ⟨A, +⟩, ⟨B, -⟩, ⟨D, -⟩. We can use this notion to define the probability of reaching a node, which here is P(reaching "?") = P(π(?)) = P(A = +, B = -, D = -). (All probability values, as well as accuracy scores, are based on a posterior generated from a Naïve Bayes prior and the labeled data S. Note also that we will use various forms of A(·) for trees, paths, and features appended to paths; the meaning should be clear from context.) In general, the accuracy of any tree T is

  A(T) = Σ_{l ∈ L(T)} P(π(l)) · A(π(l))

over the set of leaf nodes L(T). Note that this accuracy has been factored into a sum of the accuracies associated with each leaf: it is the probability P(π(l)) of reaching each leaf node l ∈ L(T) times the (conditional) accuracy associated with that node, A_{π(l)}(c_l) = P(C = c(l) | π(l)).
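To make the VAL(T, f) evaluation procedure above concrete, here is a minimal Python sketch (ours, not the authors' implementation); the nested-tuple tree representation and the example tree are illustrative assumptions only:

```python
def val(tree, instance):
    """Evaluate a decision tree on an instance (a dict: feature -> value).
    A leaf is ('leaf', class_label); an internal node is
    ('node', feature, {value: subtree, ...}).  This mirrors VAL(T, f):
    descend along the arc labeled with the instance's value until a leaf."""
    if tree[0] == 'leaf':
        return tree[1]                      # VAL(T, f) = c(l) at a leaf
    _, feature, children = tree
    return val(children[instance[feature]], instance)

# A tree in the spirit of Figure 1 (the structure is illustrative only):
T1 = ('node', 'A', {
        '+': ('node', 'B', {'+': ('leaf', '+'),
                            '-': ('node', 'D', {'+': ('leaf', '+'),
                                                '-': ('leaf', '-')})}),
        '-': ('leaf', '-')})
print(val(T1, {'A': '+', 'B': '-', 'D': '-'}))   # follows A=+, B=-, D=-  ->  '-'
```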
This factoring of the tree's accuracy into per-leaf terms tells us immediately that we can decide on the appropriate class label for each leaf node, c(l | π(l)) = argmax_c P(c | π(l)), with an associated accuracy of

  A(π(l)) = max_c P(c | π(l))    (1)

based only on the single path; i.e., it is independent of the rest of the tree. We are searching for the best depth-d tree, argmax_{T ∈ DT(d)} A(T, S), where DT(d) is the set of decision trees of depth at most d, i.e., trees in which each path from the root to any leaf involves at most d variables. An earlier system (Kapoor & Greiner 2005) precomputed the accuracy associated with each possible sequence of d tests (requiring O((n choose d) · r^d) time), and then constructed the best tree given these values, which required O((nr)^d) time. Here we present a different technique that requires less time, by precomputing a data structure that quickly provides the optimal feature to use for each position in the bottom row.

To understand our approach, consider determining the feature F(ν) to use at the final internal node along a path π(ν), e.g., determining which feature to test at "?" in Figure 1. As an obvious extension of the above argument, this decision will depend only on the path π(ν) to this node.

Then, for the "?" node, it will depend on π(?) = ⟨A, +⟩, ⟨B, -⟩, ⟨D, -⟩. Given our NB assumption, for any feature F ∈ F (except the features in π, i.e., except for A, B, and D), the component of accuracy associated with the path that begins with π, then performs F (and then descends to the leaf nodes immediately under this F), is

  A(π ∘ F) = Σ_f P(π, F = f) · A(π ∘ ⟨F = f⟩)
           = Σ_f P(π, F = f) · max_c P(C = c | π, F = f)
           = Σ_f max_c P(C = c, π, F = f)
           = Σ_f max_c P(π) · P(C = c | π) · P(F = f | C = c)    (2)

where Equation 2 is based on the Naïve Bayes assumption. Note that the feature F that optimizes A(π ∘ F) will also optimize

  A_π(F) = (1 / P(π)) · A(π ∘ F) = Σ_f max_c P(C = c | π) · P(F = f | C = c).

While our analysis will work for classes that range over any (finite) number of values, this paper will focus on the binary case. Letting x_{π,+} = P(C = + | π) and abbreviating "F = f" as "f", we have

  A_π(F) = Σ_f max{ x_{π,+} · P(f | +), (1 - x_{π,+}) · P(f | -) },    (3)

i.e., each summand equals x_{π,+} · P(f | +) if x_{π,+} ≥ P(f | -) / (P(f | +) + P(f | -)), and (1 - x_{π,+}) · P(f | -) otherwise.

For fixed values of P(F = f | C = c), this A_π(F) value does not depend on the features in π nor on their values, but only on x_{π,+}; we can therefore express A_π(F) as A(F, x_{π,+}) to make this dependency explicit. For any value of x_{π,+} ∈ [0, 1], each summand in Equation 3, corresponding to a single F = f, is the maximum of two lines. Over the set of r_F values, this function is therefore a sequence of at most 1 + r_F linear segments; see Figure 2.

Figure 2: Computing A_π(F) (Equation 3).

Before beginning to build the actual decision tree, our OPTNBDT algorithm first computes these A(F^(j), x) functions for each F^(j). It then uses these functions to compute, for each x ∈ [0, 1], the optimal feature:

  F*_1(x) = argmax_F { A(F, x) }.    (4)

It also computes the 2nd-best feature F*_2(x) for each x value, as well as F*_i(x) for i = 3, 4, ..., d. We refer to this {F*_i(x) | x ∈ [0, 1], i = 1..d} dataset as the OFL ("opt-feature list"). The OPTNBDT algorithm then uses a dynamic program to build the tree, but only to depth d - 1. Whenever it reaches a depth d - 1 node, at the end of the path π = ⟨F_{π(1)}, f_{π(1)}⟩, ⟨F_{π(2)}, f_{π(2)}⟩, ..., ⟨F_{π(d-1)}, f_{π(d-1)}⟩, with associated conditional probability x_{π,+}, it indexes this x_{π,+} into the OFL, which returns an ordered list of d features, F*(x_{π,+}) = ⟨F*_1(x_{π,+}), ..., F*_d(x_{π,+})⟩. OPTNBDT then uses the first F*_i in this list that does not already appear in π. This algorithm is shown in Figure 3.

Figure 3: OPTNBDT algorithm.

  OPTNBDT(d: int; P(·): distribution over {C} ∪ F)
    Compute the opt-feature list OFL = {F*_i(x) | x ∈ [0, 1], i = 1..d}
    Build the optimal depth-(d-1) tree using dynamic programming
    At each length-(d-1) path π, with associated probability x_{π,+} = P(C = + | π):
      Let F* = ⟨F*_1(x_{π,+}), ..., F*_d(x_{π,+})⟩
      Let i* = min{ i : F*_i(x_{π,+}) ∉ π } index the first entry of F* not in π
      Use feature F*_{i*}(x_{π,+}) at level d, after π

To make this more concrete, imagine that in Figure 1 the list associated with π(?) = ⟨A, +⟩, ⟨B, -⟩, ⟨D, -⟩ and x_{π,+} = 0.29 was F*(0.29) = ⟨B, A, E, D⟩. While we would like to use the first value, F*_1(0.29) = B, as it appears to be the most accurate, we cannot use it, as this feature has already appeared on this π path and therefore Equation 3 does not apply. OPTNBDT would then consider the second feature, which here is F*_2(0.29) = A. Unfortunately, as that appears in π as well, OPTNBDT has to consider the third feature, F*_3(0.29) = E. Since that does not appear, OPTNBDT labels the "?" node with this E feature.
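As a concrete (non-optimized) illustration of Equation 3 and of what an OFL lookup returns, the following Python sketch evaluates A(F, x) directly at a query point and picks the best feature not already on the path. This is our own sketch, not the authors' code: it costs O(nr) per query, whereas the OFL machinery of Section 3.2 answers the same query in O(log(nr)) time after preprocessing. All names and the table layout are assumptions made for the example.

```python
def acc_of_feature(x_plus, p_plus, p_minus):
    """Equation 3: expected accuracy of testing F next when P(C=+ | path) = x_plus.
    p_plus[f] = P(F=f | C=+), p_minus[f] = P(F=f | C=-), over the same values f."""
    return sum(max(x_plus * p_plus[f], (1.0 - x_plus) * p_minus[f])
               for f in p_plus)

def best_unused_feature(x_plus, cond, path_features):
    """Brute-force analogue of an OFL lookup: among features not on the path,
    return the one maximizing A(F, x_plus).  cond[F] = (p_plus, p_minus)."""
    candidates = [F for F in cond if F not in path_features]
    return max(candidates, key=lambda F: acc_of_feature(x_plus, *cond[F]))

# Toy conditional probability tables for three binary features:
cond = {'A': ({'0': 0.8, '1': 0.2}, {'0': 0.3, '1': 0.7}),
        'B': ({'0': 0.5, '1': 0.5}, {'0': 0.5, '1': 0.5}),
        'E': ({'0': 0.9, '1': 0.1}, {'0': 0.1, '1': 0.9})}
print(best_unused_feature(0.29, cond, path_features={'A'}))   # -> 'E'
```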

3.2 Implementation

This section provides the details of our implementation, which requires recursively computing x = x_{π,+}, and computing and using the opt-feature list, OFL.

Computing x_{π,+}: It is easy to obtain x_{π,+} as we grow the path π in the decision tree. Here, we maintain two quantities, y^+_π = P(π, C = +) and y^-_π = P(π, C = -), and then, for any path π, set x_{π,+} = y^+_π / (y^+_π + y^-_π). For the empty path π_0 = {}, y^c_{π_0} = P(C = c) for c ∈ {+, -}. Now consider adding one more feature-value pair ⟨F, f⟩ to π_t, to form π_{t+1} = π_t ∘ ⟨F, f⟩. Then, thanks to our NB assumption, y^c_{π_{t+1}} = y^c_{π_t} · P(F = f | C = c).

Computing and using the OFL: As noted above, the OFL corresponds to a set of piecewise linear functions (see Figure 2), each of which is the union of a finite number of linear functions. Formally, a piecewise linear function f : [0, 1] → R (with k pieces) can be described by a sequence of real-number triples ⟨(a_1, m_1, b_1), ..., (a_k, m_k, b_k)⟩, where the a_i are endpoints such that 0 = a_0 < a_1 < ... < a_k = 1, and for all i ∈ {1, ..., k} we have f(x) = m_i·x + b_i for all x ∈ [a_{i-1}, a_i]. A linear function is a piecewise linear function with one piece. The sum of two piecewise linear functions with k_1 and k_2 pieces is a piecewise linear function with no more than k_1 + k_2 - 1 pieces. We can compute this sum in O(k_1 + k_2) time, as each component of the sum f = f_1 + f_2 is just the sum of the relevant m and b from both f_1 and f_2. Similarly, the maximum of two piecewise linear functions with k_1 and k_2 pieces is a piecewise linear function with no more than k_1 + k_2 pieces. This computation is slightly more involved, but can be done in a similar way.

For each feature F and each value x_π ∈ [0, 1], we need to compute the sum over all |V_F| = r values of F = f:

  A_π(F) = Σ_f A_π(F = f),  where A_π(F = f) equals x_{π,+} · P(f | +) if x_{π,+} ≥ P(f | -) / (P(f | +) + P(f | -)), and (1 - x_{π,+}) · P(f | -) otherwise.

We compute this total by adding the associated |V_F| = r piecewise linear functions, {A_π(F = f)}_{f ∈ V_F}. We can do this recursively: letting r = |V_F| and l_i = A_π(F = f_i) be the i-th function, we first define l_{12} = l_1 + l_2 as the sum of l_1 and l_2, l_{34} = l_3 + l_4, and so forth until l_{r-1,r} = l_{r-1} + l_r; we next let l_{14} = l_{12} + l_{34}, l_{58} = l_{56} + l_{78}, etc., continuing until computing l_{1,r}, which is the sum over all r functions. By recursion, one can prove that this resulting piecewise linear function has no more than r + 1 pieces, and that the time complexity is O(r log r), due to log r levels of recursion, each of which involves a total of O(2 · r/2) = O(r) time.

Note that Equation 4 is the maximum of all of the piecewise linear functions {A_π(F)}_{F ∈ F} that were constructed in the previous step. We again use a divide-and-conquer technique to reduce the problem of maximizing over n piecewise linear functions to a sequence of log n problems, each maximizing over two piecewise linear functions. An analysis similar to mergesort shows that the overall computational cost of the process is O(nr log(nr)).

When F*_1(·) (Equation 4) involves k = O(nr) linear pieces at arbitrary points, we can compute F*_1(x) in O(log k) = O(log(nr)) time, using a binary search to find the appropriate segment and a constant amount of time to compute the value given this segment. As we are computing this maximum value, we can also record which feature represents each segment. We can compute F*_i(x) for i = 2, 3, ..., d in a similar way. Consequently, when we compute the maximum of two functions, we also store the d highest functions at every point. Note that the amount of memory storage required here increases linearly in d.
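The piecewise-linear bookkeeping described above is the core of the OFL construction. The sketch below is our own illustration (not the authors' code, and with hypothetical names) of one way to store such functions and to form the sum and pointwise maximum used to assemble A(F, ·) from the per-value lines of Equation 3:

```python
from bisect import bisect_right
from functools import reduce

class PWL:
    """Piecewise linear function on [0, 1]: segs[i] = (a_i, m_i, b_i) means
    f(x) = m_i*x + b_i for x in [a_i, a_{i+1}), with a_0 = 0 and the last
    segment extending to 1."""
    def __init__(self, segs):
        self.segs = segs
        self.starts = [a for a, _, _ in segs]

    def seg_at(self, x):
        return self.segs[bisect_right(self.starts, x) - 1]

    def __call__(self, x):
        _, m, b = self.seg_at(x)
        return m * x + b

def pwl_add(f, g):
    """Sum: at most k1 + k2 - 1 pieces; add slopes and intercepts per interval."""
    starts = sorted(set(f.starts) | set(g.starts))
    return PWL([(a, f.seg_at(a)[1] + g.seg_at(a)[1],
                    f.seg_at(a)[2] + g.seg_at(a)[2]) for a in starts])

def pwl_max(f, g):
    """Pointwise max: split a piece wherever the two active lines cross."""
    xs = sorted(set(f.starts) | set(g.starts)) + [1.0]
    segs = []
    for lo, hi in zip(xs, xs[1:]):
        (_, m1, b1), (_, m2, b2) = f.seg_at(lo), g.seg_at(lo)
        if m1 != m2:
            c = (b2 - b1) / (m1 - m2)          # crossing point of the two lines
            if lo < c < hi:
                hi_first = (m1, b1) if m1 * lo + b1 > m2 * lo + b2 else (m2, b2)
                lo_first = (m2, b2) if hi_first == (m1, b1) else (m1, b1)
                segs.append((lo,) + hi_first)
                segs.append((c,) + lo_first)
                continue
        mid = (lo + hi) / 2.0                   # no interior crossing: one line wins
        segs.append((lo,) + ((m1, b1) if m1 * mid + b1 >= m2 * mid + b2 else (m2, b2)))
    return PWL(segs)

def acc_function(p_plus, p_minus):
    """A(F, x) of Equation 3: for each value f, take the max of the two lines
    x*P(f|+) and (1-x)*P(f|-); then sum these maxima over the values of F."""
    summands = [pwl_max(PWL([(0.0, p_plus[f], 0.0)]),
                        PWL([(0.0, -p_minus[f], p_minus[f])]))
                for f in p_plus]
    return reduce(pwl_add, summands)

A_E = acc_function({'0': 0.9, '1': 0.1}, {'0': 0.1, '1': 0.9})
print(A_E(0.29))   # expected accuracy of testing E when P(C=+ | path) = 0.29
```

A full OFL would additionally record which feature owns each segment of the maximum and keep the d best functions per interval, as described above; this sketch only shows the sum/max machinery itself.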
Hence, each lookup function can be constructed in O(nr log(nr)) time, requires O(nr) memory to store, and, most importantly, can be queried (to find the optimal value for a specific x) in O(log(nr)) time. Note that a naïve implementation that merely tests all the features in the last level of the tree to construct the leaves (called NBDT below) will require time linear in nr, whereas our technique has complexity logarithmic in nr. Collectively, these results show:

Theorem: Given any labeled dataset S over n r-ary variables and binary class labels, the OPTNBDT algorithm (Figure 3) will compute the depth-d decision tree that has the highest accuracy under the Naïve Bayes assumption. Moreover, it requires O(n·|S| + (nr)^{d-1} · d · log(nr)) time and O(d·(nr)^d) space. (The additional n·|S| term is the time required to compute the basic Naïve Bayes statistics over the dataset S.)

4 Empirical Results

We performed an extensive set of experiments to compare the performance of OPTNBDT to various learners, both state-of-the-art decision tree learners and Naïve Bayes systems. This section first describes the datasets and experimental setup, then the experimental results, and finally our analyses.

Experimental setup: The experiments are performed using nine datasets from the UCI Machine Learning Repository (Newman et al. 1998), as well as our own Breast Cancer and Prostate Cancer single nucleotide polymorphism datasets, obtained from our colleagues at the Cross Cancer Institute. All the selected datasets have binary class labels. Table 1 characterizes these datasets.

Table 1: Datasets used in the experiments

  Dataset            Num. features   Num. instances   Abbreviation
  Tic-tac-toe        10              985              Tic
  Flare              10              1066             Flr
  Hepatitis          13              80               Hep
  Crx                13              653              Crx
  Lymphography       18              145              Lym
  Vote               16              435              Vot
  Chess              36              3196             Chs
  Connect-4          42              5000             Con
  Prostate-cancer    47              81               Prs
  Promoters          58              106              Prm
  Breast-cancer      98              332              Snp

Figure 4: Average classification error (and standard deviation) for the 1-NBDT, 2-OPTNBDT, 3-OPTNBDT, 4-OPTNBDT, and 3-ID3 algorithms.

Figure 5: Comparing 3-OPTNBDT vs. 3-NBDT; f(n) and g(n) are the times that 3-NBDT and 3-OPTNBDT, respectively, require for n features.

The experiments are performed using 5-fold cross validation, repeated 20 times; we report the average error rate and standard deviation across these trials. We compare the classification performance and running time of OPTNBDT with different depths (here 2, 3, and 4) to the following fixed-depth decision trees: 3-ID3 (Quinlan 1986), a decision tree learner that uses an entropy-based measure as its splitting criterion but stops at depth 3; and 1-NBDT (Holte 1993), a decision stump that splits the data with only one feature, i.e., a depth-one decision tree.

Experimental results: Figure 4 shows the average classification error rate and standard deviation of each algorithm. We see that no algorithm is consistently superior over all the datasets. While deeper decision trees sometimes improve classification performance (classification accuracy increases with the depth of the tree for the Hep, Chs, Con, and Prm datasets), deeper trees can cause overfitting, which may explain why shallower trees can be better: decision stumps, which split the data based on only one feature, outperform the other trees for the Flr, Crx, and Vot datasets. Figure 4 also shows that the classification error of 3-OPTNBDT is often lower than the error of 3-ID3, which shows that optimal Naïve Bayes-based decision trees can perform better than heuristic, entropy-based trees. In fact, a paired Student's t-test over these eleven datasets shows that both 3-OPTNBDT and 2-OPTNBDT are statistically better (more accurate) than 3-ID3 (at p < 0.05).

Running Time: Now consider the running time of these algorithms, as well as 3-NBDT, which implements the dynamic programming algorithm that produces the fixed-depth decision tree that is optimal over the training data given the Naïve Bayes assumption; i.e., this d-NBDT algorithm explicitly constructs and tests all possible leaves up to depth d, whereas d-OPTNBDT uses a pre-computed opt-feature list (OFL) to efficiently construct the final (depth d) level of the decision tree. Of course, these two algorithms give exactly the same answers. The previous section showed that OPTNBDT's run-time is an O(log(n)/n) fraction of NBDT's, where n is the number of features. To explore this claim empirically, we extracted eighteen artificial datasets from the Promoters (Prm) dataset, using 3i features for i = 1, ..., 18. Figure 5 plots log(f(n) · n / log n) versus log(g(n)), where f(n) (resp., g(n)) is the run-time that 3-NBDT (resp., 3-OPTNBDT) requires for n features. This confirms that OPTNBDT is significantly more efficient than NBDT, especially when there are many features in the dataset.

Figure 6: Run-time (log of seconds) of the 4-OPTNBDT, 3-OPTNBDT, 2-OPTNBDT, 1-NBDT, 3-ID3, and 3-NBDT algorithms.

Figure 6 shows the running time of the various algorithms: each line corresponds to one of the algorithms, formed by joining a set of (x, y) points, whose x value is the total number of feature values Σ_F r_F = O(nr) of a particular dataset, and whose y value is the (log of the) run time of the algorithm on that dataset. As expected, the 3-ID3 points are fairly independent of this number of features. The other lines, however, appear fairly linear in this log plot, at least for larger values of Σ_F r_F; this is consistent with the Theorem. Finally, to understand why the 4-OPTNBDT timing numbers are virtually identical to the 3-NBDT numbers, recall that 4-OPTNBDT basically runs 3-NBDT to produce a depth-3 decision tree, then uses the OFL to compute the 4th level.
These numbers show that the time required to compute the OFL, and to use it to produce the final depth, is extremely small. (The actual timing numbers appear in (Greiner 2007).)

5 Conclusion

Extensions: (1) While this paper only discusses how to use the opt-feature list to fill in the features at the last level of the tree, our techniques are not specific to this final row. These ideas can be extended to pre-compute an optimal tree list, with the final depth-q subtree at the end of a path given the label posterior, rather than just the final internal node; i.e., so far, we have dealt only with q = 1. While this will significantly increase the cost of the precomputation stage, it should provide a significant computational gain when growing the actual tree. (2) A second extension is to develop further tricks that allow us to efficiently build and compare a set of related decision trees, perhaps trees that are based on slightly different distributions, obtained from slightly different training samples. This might be relevant in the context of some ensemble methods (Dietterich 2000), such as boosting or bagging, or in the context of budgeted learning of bounded classifiers (Kapoor & Greiner 2005). (3) Our approach uses a simple cost model, where every feature has unit cost. An obvious extension would allow different features to have different costs, where the goal is to produce a decision tree whose cost depth is bounded. (The cost depth of a leaf of a decision tree is the sum of the costs of the tests of the nodes connecting the root to this leaf, and the cost depth of a tree is the maximal cost depth over the leaves; cost depth equals depth if all features have unit cost.)

Contributions: There are many situations where we need to produce a fixed-depth decision tree, e.g., the bounded policy associated with per-patient capitation. This paper has presented OPTNBDT, an algorithm that efficiently produces the optimal such fixed-depth decision tree given the Naïve Bayes assumption, i.e., assuming that the features are independent given the class. We proved that this assumption implies that the optimal feature at the last level of the tree essentially depends only on x_{π,+} ∈ [0, 1], the posterior probability of the class label given the tests previously performed. We then described a way to efficiently pre-compute which feature is best, as a function of this probability value, and then, when building the tree itself, to use this information to quickly assign the final feature on each path. We proved theoretically that this results in a speedup of O(n / log n) compared to the naïve method of testing all the features in the last level, then provided empirical evidence, over a benchmark of eleven datasets, supporting this claim. We also found that OPTNBDT is not just efficient, but (surprisingly, given its NB assumption) is often more accurate than entropy-based decision tree learners like ID3.

6 Acknowledgements

All authors gratefully acknowledge support from the Alberta Ingenuity Centre for Machine Learning (AICML), NSERC, and CORE.

References

Auer, P.; Holte, R. C.; and Maass, W. 1995. Theory and applications of agnostic PAC-learning with small decision trees. In ICML.

Breiman, L.; Friedman, J.; Olshen, R.; and Stone, C. 1984. Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks.

Chan, H., and Darwiche, A. 2003. Reasoning about Bayesian network classifiers. In UAI, 107-116.

Dietterich, T. G. 2000. Ensemble methods in machine learning. In First International Workshop on Multiple Classifier Systems.

Dobkin, D.; Gunopoulos, D.; and Kasif, S. 1996. Computing optimal shallow decision trees. In International Symposium on Artificial Intelligence and Mathematics.

Domingos, P., and Pazzani, M. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29:103-130.

Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley.

Greiner, R.; Grove, A.; and Roth, D. 2002. Learning cost-sensitive active classifiers. Artificial Intelligence 139:137-174.

Greiner, R. 2007. http://www.cs.ualberta.ca/~greiner/research/dt-nb.

Holte, R. C. 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning 11:63-91.

Kapoor, A., and Greiner, R. 2005. Learning and classifying under hard budgets. In Proceedings of the Sixteenth European Conference on Machine Learning (ECML-05), 170-181. Springer.

Lewis, D. D. 1998. Naïve Bayes at forty: The independence assumption in information retrieval. In ECML.
Newman, D. J.; Hettich, S.; Blake, C. L.; and Merz, C. J. 1998. UCI Repository of Machine Learning Databases. University of California, Irvine, Dept. of Information and Computer Sciences: http://www.ics.uci.edu/~mlearn/MLRepository.html.

Quinlan, J. R. 1986. Induction of decision trees. Machine Learning 1:81-106.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning. MIT Press.

Turney, P. D. 2000. Types of cost in inductive concept learning. In Workshop on Cost-Sensitive Learning (ICML-2000).