Rutgers University PhD in Management Program
Expert Systems, Prof. Glenn Shafer, Fall 2001

Report on Brodley and Utgoff (1992), "Multivariate Versus Univariate Decision Trees"

Fatima Alali
December 13, 2001

Introduction

Classification trees are used to predict the membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. A decision tree is either a leaf node containing the name of a class or a decision node containing a test; for each possible outcome of the test there is a branch to another decision tree. To classify a new instance, one starts at the root of the tree and follows the branch indicated by the outcome of each test until a leaf node is reached. The classification is then the class name at the leaf. Univariate decision trees test one feature at each node and thus result in larger trees than if multiple features are tested at a node. In a multivariate decision tree, each test can be based on one or more input features.

The paper describes and evaluates different multivariate tree construction methods and presents experiments that support the theoretical analysis. Section 1 provides an overview of the paper. Section 2 provides an overview of the C4.5 and LMDT software used in the empirical applications described in the paper. Section 3 provides an overview of the results obtained in the paper. Section 4 assesses both C4.5 and LMDT on a different data set. Section 5 provides conclusions and future implications.

1. Paper Overview

The paper provides an empirical evaluation of the LMDT algorithm and demonstrates the need for multivariate tests and for LMDT's ability to uncover linear structure in the data.

1.1 The theory underlying tree models (classification models):

The basic strategy for a classification tree is to recursively split the cells of the space of input variables. A given cell is split by first searching over all variables and all possible thresholds to find the split that leads to the best improvement in a specified score function. The score is assessed on the training data and then cross-validated using the test data.

One disadvantage of classification trees is that they are hierarchical (sequential) in nature: the tests in a decision tree are performed sequentially by following the branches of the tree, so only those features that are needed to reach a decision are evaluated. Decision trees use this sequential decision procedure to determine the classification of a new data point.

Decision trees have many attractive attributes:
o They are easy to construct and to understand.
o They can handle mixed variables (discrete, continuous, numeric and symbolic), as described below.
o They can easily partition the space using binary tests (thresholds on real variables).
o They predict the class value for a new case quickly.

However, as noted above, decision trees are sequential in nature, which may result in overgrown trees that are difficult to understand and use and that may represent suboptimal partitions of the space of input variables.
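To make the distinction between the two kinds of node tests concrete, the following small Python sketch (my own illustration, not taken from the paper) contrasts a univariate test, which thresholds a single feature, with a multivariate linear test, which thresholds a weighted combination of features. The feature names, weights and cut-points are hypothetical.

# Minimal illustration of univariate vs. multivariate node tests
# (hypothetical feature names, weights and thresholds).

def univariate_test(x):
    """Univariate node: branch on a single feature against a threshold."""
    return "left" if x["income"] <= 40_000 else "right"

def multivariate_test(x):
    """Multivariate (linear) node: branch on a weighted sum of features."""
    score = 0.7 * x["income"] + 1200.0 * x["age"] - 55_000.0
    return "left" if score <= 0 else "right"

example = {"income": 38_000, "age": 41}
print(univariate_test(example), multivariate_test(example))

A univariate tree can only approximate the oblique boundary implied by the second test with a staircase of axis-parallel splits, which is the source of the larger trees mentioned above.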

1.2 Tree construction issues:

There are two stages in any univariate or multivariate tree construction. The first is building the tree from a set of training data, each instance described by its features and labeled with a class name. A top-down algorithm chooses the best test to split the training set according to some score function [1]. The chosen test is then used to partition the training instances, and a branch is created for each outcome of the test. The algorithm is applied recursively to each resulting partition. If the data in a partition all belong to a single class, a leaf node is created and assigned that class label. Different forms of partition merit criteria can be used to judge the goodness of a split; these criteria are commonly expressed in terms of impurity and entropy measures [2].

One concern in tree construction is to avoid overfitting the decision tree to the training data in domains that contain noisy instances [3]. Overfitting occurs when the training data contain noisy instances and the decision tree algorithm induces a classifier that classifies every instance in the training set correctly; when classifying a new instance, such a tree may perform poorly. To avoid overfitting, the second stage of building a tree is to prune it back to eliminate branches that are not statistically valid. At the pruning stage, a leaf node replaces a whole subtree. The replacement takes place if a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf [4].

Numeric vs. symbolic features

When instances contain features that are symbolic (unordered) as well as numeric (ordered), it is important to map each unordered feature to a numeric feature without imposing any order on the unordered values of the feature. For some instances, not all feature values may be available. Missing values can be filled in by the estimate of the sample mean after normalization [5].

Characteristics of Linear Machines and Linear Machine Decision Trees

Breiman et al. (1984) conducted extensive research on univariate trees. Univariate tests are based on one input variable and are thus restricted to splits of the instance space that are orthogonal to that variable's axis. This restriction can bias the classification, especially when the input variables are numerically related. LMDT tries to overcome this problem: it constructs each test in a decision tree by training a linear machine and then eliminating irrelevant and noisy variables in a controlled manner. LMDT achieves this through two properties. First, the method by which LMDT finds and eliminates noisy and irrelevant variables is computationally efficient. Second, the linear machine learning approach enables LMDT to find a good partition of the instance space, whether or not the instance space is linearly separable.

[1] The score function for a predictive model is a function of the difference between the predictions obtained from the model for each individual input and the targets (the response variables); for example, it can be the sum of squared errors between model predictions and actual target measurements. See Principles of Data Mining, Chapter 7, Score Functions for Data Mining Algorithms, and the chapter on Predictive Modeling for Classification.
[2] Brodley and Utgoff (1990), Multivariate decision trees.
[3] A noisy instance is one for which the class label is incorrect, some of the attribute values are incorrect, or both. Brodley and Utgoff (1990).
[4] Multivariate decision trees. Brodley and Utgoff (1990).
[5] Normalization: at each node in the tree, each encoded symbolic and numeric feature is normalized by mapping it to standard normal form N(0, 1).
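As a rough illustration of the preprocessing just described, the sketch below (my own, not the paper's code) encodes a symbolic feature numerically without imposing an order, standardizes each encoded column, and fills missing values with zero, the post-normalization sample mean. One-hot encoding is only one possible order-free mapping and is an assumption here, not necessarily the exact encoding LMDT uses.

import numpy as np

def encode_and_normalize(column, is_symbolic):
    """Hypothetical helper: order-free encoding, standardization, and
    zero-filling of missing values (zero = sample mean after normalization)."""
    if is_symbolic:
        values = sorted({v for v in column if v is not None})
        encoded = np.array([[1.0 if v == u else 0.0 for u in values]
                            if v is not None else [np.nan] * len(values)
                            for v in column])
    else:
        encoded = np.array([[float(v)] if v is not None else [np.nan]
                            for v in column])
    mean = np.nanmean(encoded, axis=0)
    std = np.nanstd(encoded, axis=0)
    std[std == 0] = 1.0
    normalized = (encoded - mean) / std          # approximately N(0, 1) per column
    return np.nan_to_num(normalized, nan=0.0)    # missing value -> sample mean (0)

print(encode_and_normalize(["red", None, "blue", "red"], is_symbolic=True))
print(encode_and_normalize([3.5, None, 7.0, 1.0], is_symbolic=False))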

LMDT can handle instances described by numeric and symbolic variables in which some of the values may be missing. LMDT encodes symbolic variables as numeric variables while ensuring that no order is placed on their values. The encoded variables are then normalized at each node, so that the relative importance of the encoded variables can be gauged from the magnitudes of the corresponding weights. LMDT sets missing values to zero, which corresponds to the sample mean of the corresponding encoded, normalized variable. Encoding information is computed at each node and is retained in the tree for later use during classification.

The authors describe how the elimination of noisy and irrelevant variables is done. When LMDT detects that a linear machine is near its final set of boundaries, it eliminates the variables that contribute least to discriminating the set of instances at that node and then continues training the linear machine. During this elimination process the most accurate linear machine found so far is saved, and when elimination ends, the saved linear machine becomes the test for the decision node. A linear machine based on fewer variables is preferred in two cases. First, if its accuracy is higher than the best accuracy observed thus far, or if the drop in accuracy is not significantly different from the best accuracy as measured by a t-test at the .01 level of significance, the smaller linear machine is saved; if its accuracy is higher, the system also updates its value for the best accuracy observed thus far. Second, to avoid underfitting the data, the algorithm will eliminate variables until the number of instances is greater than the capacity of a hyperplane: if the number of unique instances is not twice the dimensionality of each instance, then the linear machine with fewer variables is preferred.

The authors measure a variable's contribution to the ability to discriminate using a measure of the dispersion of its weights over the set of classes. Variables whose weights are widely dispersed serve two purposes. First, a large-magnitude weight causes the corresponding variable to make a large contribution to the value of the discriminant function (discriminability). Second, a variable whose weights are widely spaced makes different contributions to the value of the discriminant function of each class. LMDT computes the dispersion of each variable as the average squared distance between its weights for each pair of classes and then eliminates the variable with the smallest dispersion.

As for the variable selection criterion, LMDT performs a sequential backward selection (SBS) search for a good combination of features. SBS starts with all of the initial features and tries to remove the feature that will cause the smallest decrease in accuracy as measured by a criterion function [6]; in LMDT, this criterion removes the feature with the lowest weight dispersion. LMDT recalculates the weights after each elimination.

[6] A criterion function is a figure of merit reflecting the amount of classification information conveyed by a feature.
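The dispersion measure described above can be written down directly. The following Python sketch (my own illustration; the weight values are hypothetical) computes, for each variable, the average squared distance between its weights over every pair of classes and identifies the variable that would be the candidate for elimination.

import numpy as np
from itertools import combinations

def dispersion_per_variable(W):
    """W has shape (num_classes, num_variables): one weight vector per class.
    The dispersion of a variable is the average squared distance between the
    weights assigned to it by each pair of classes, as described above."""
    pairs = list(combinations(range(W.shape[0]), 2))
    return np.array([np.mean([(W[i, v] - W[j, v]) ** 2 for i, j in pairs])
                     for v in range(W.shape[1])])

# Hypothetical 3-class linear machine over 4 encoded variables.
W = np.array([[ 0.9, 0.1, -0.3, 0.02],
              [-0.7, 0.1,  0.4, 0.01],
              [ 0.1, 0.1, -0.1, 0.03]])
disp = dispersion_per_variable(W)
print("dispersion per variable:", disp)
print("variable to eliminate (smallest dispersion):", int(np.argmin(disp)))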

LMDT is able to overcome the "L" problem demonstrated in the paper. As the linear machine at the root begins to move toward one of the segments, misclassified instances from the other segment have a decreasing effect, allowing the linear machine to find one of the segments without being misled by the distant points. The L problem illustrates situations in which permitting multivariate splits enables a decision tree algorithm to induce a better generalization than using only univariate splits, so that the multivariate bias is the appropriate one. A decision tree that permits only univariate splits would require a large number of tests to classify the training instances correctly. With univariate algorithms, an increase in the number of instances clustered near a separating hyperplane increases the number of splits necessary to classify the data correctly; however, the increase in the number of splits does not ensure an increase in accuracy on previously unseen points.

2. C4.5 and LMDT descriptions

In this section I first describe the C4.5 software developed by Quinlan, which originated from the ID3 algorithm for inducing classification models from data. I then describe the LMDT algorithm for building multivariate decision trees.

2.1 C4.5 and the ID3 algorithm:

The basic idea behind ID3 is that, in a decision tree, each node corresponds to a non-categorical attribute, namely the most informative attribute among those not yet considered on the path from the root. The informativeness of an attribute is measured using entropy. Below is a brief description of the ID3 algorithm [7], which builds a decision tree given a set of non-categorical attributes C1, C2, ..., Cn, the categorical attribute C, and a training set T of records.

function ID3 (R: a set of non-categorical attributes,
              C: the categorical attribute,
              S: a training set) returns a decision tree;
begin
   If S is empty, return a single node with value Failure;
   If S consists of records all with the same value for the categorical
      attribute, return a single node with that value;
   If R is empty, then return a single node with as value the most frequent
      of the values of the categorical attribute found in records of S
      [note that then there will be errors, that is, records that will be
      improperly classified];
   Let D be the attribute with largest Gain(D, S) among attributes in R;
   Let {dj | j = 1, 2, ..., m} be the values of attribute D;
   Let {Sj | j = 1, 2, ..., m} be the subsets of S consisting respectively
      of records with value dj for attribute D;
   Return a tree with root labeled D and arcs labeled d1, d2, ..., dm going
      respectively to the trees
      ID3(R-{D}, C, S1), ID3(R-{D}, C, S2), ..., ID3(R-{D}, C, Sm);
end ID3;
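The Gain(D, S) used in the pseudocode is the information gain of attribute D on training set S: the entropy of S minus the weighted entropy of the subsets induced by the values of D. A minimal Python sketch (my own illustration, not Quinlan's code; records are represented as dictionaries here) is:

from collections import Counter
from math import log2

def entropy(records, class_attr):
    """Entropy of the class distribution in a set of records."""
    counts = Counter(r[class_attr] for r in records)
    total = len(records)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def gain(attribute, records, class_attr):
    """Information gain of `attribute`: entropy of S minus the weighted
    entropy of the subsets induced by each value of the attribute."""
    total = len(records)
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r for r in records if r[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, class_attr)
    return entropy(records, class_attr) - remainder

# Tiny hypothetical training set with one attribute and a class label.
S = [{"outlook": "sunny", "play": "no"}, {"outlook": "sunny", "play": "no"},
     {"outlook": "rain", "play": "yes"}, {"outlook": "overcast", "play": "yes"}]
print(gain("outlook", S, "play"))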

C4.5 is an extension of ID3 that handles attributes with missing values and continuous attributes and adds pruning procedures and rule derivation.

2.2 Linear Machine Decision Trees (as described in the paper):

The LMDT algorithm builds a multiclass, multivariate decision tree using the top-down approach described above. For each decision node, LMDT trains a linear machine, based on a subset of the input variables, which then serves as the multivariate test for that node. A linear machine (LM) is a multiclass linear discriminant that itself classifies an instance: it is a set of R linear discriminant functions used together to assign an instance to one of R classes. Let Y be an instance description (a pattern vector) consisting of a constant threshold value 1 followed by the numerically encoded features. Each discriminant function has the form g_i(Y) = W_i . Y, where W_i is a vector of adjustable coefficients known as weights. A linear machine infers that instance Y belongs to class i if and only if g_i(Y) > g_j(Y) for all j not equal to i.

One way to train a linear machine is the absolute error correction rule, which adjusts W_i and W_j, where i is the class to which the instance belongs and j is the class to which the linear machine incorrectly assigns the instance. The correction is W_i <- W_i + cY and W_j <- W_j - cY, where c is the smallest integer greater than (W_j - W_i) . Y / (2 Y . Y) such that the updated linear machine classifies the instance correctly. When the instances are linearly separable, cycling through the instances allows the linear machine to partition them into separate convex regions. When the instances are not linearly separable, error correction may never cease and the classification accuracy becomes unpredictable. To overcome this problem, the authors use a thermal perceptron [8], which they call a thermal linear machine. This approach handles the case where one large error comes from an instance far from the decision boundary by using c = beta / (beta + k), where k = (W_j - W_i) . Y / (2 Y . Y), and annealing beta during training. In addition, it handles the case where a misclassified instance lies very close to the decision boundary by also annealing c by beta, giving the correction coefficient c = beta^2 / (beta + k). The algorithm reduces beta geometrically by a rate a and arithmetically by a constant b; this lets the algorithm spend more time training with small values of beta, when it is refining the location of the decision boundary. Beta is reduced only when the magnitude of the linear machine [9] decreased for the current weight adjustment after having increased during the previous adjustment.

[7] Building Classification Models: ID3 and C4.5, at:
[8] The thermal perceptron, developed by Frean (1990), provides stable behavior when instances are not linearly separable; Frean's method addresses both of the problems above.
[9] The magnitude of the linear machine is defined as the sum of the magnitudes of its constituent weight vectors.
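A minimal Python sketch of this correction rule follows (my own illustration, not the LMDT source). It applies the thermal correction c = beta^2 / (beta + k) for a single misclassified instance; for brevity it anneals beta unconditionally between updates with hypothetical values of a and b, whereas the full algorithm reduces beta only under the magnitude condition described above.

import numpy as np

def thermal_update(W, y, true_class, beta):
    """One thermal-linear-machine correction for instance y, where y includes
    the constant threshold component 1, as described above."""
    scores = W @ y
    predicted = int(np.argmax(scores))
    if predicted == true_class:
        return W, False                        # correctly classified: no change
    i, j = true_class, predicted
    k = (W[j] - W[i]) @ y / (2.0 * (y @ y))    # size of the error
    c = beta * beta / (beta + k)               # thermal correction coefficient
    W[i] = W[i] + c * y
    W[j] = W[j] - c * y
    return W, True

# Hypothetical 3-class machine over two features plus the threshold term.
W = np.zeros((3, 3))
y = np.array([1.0, 0.5, -1.2])                 # [threshold, x1, x2]
beta, a, b = 2.0, 0.99, 0.0005                 # hypothetical annealing schedule
W, corrected = thermal_update(W, y, true_class=1, beta=beta)
beta = a * beta - b                            # anneal beta (simplified here)
print(W, corrected, beta)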

3. Overview of results obtained in Brodley and Utgoff's paper (1992):

The paper applies LMDT to a variety of data sets (symbolic, numeric and logical; binary-class and multiclass tasks) and compares the LMDT approach to the univariate C4.5 approach across these tasks in order to investigate the circumstances under which the bias of a multivariate tree (and thus LMDT's search bias for finding such a tree) is more appropriate.

3.1 Data description:

Six data sets are used to represent a mix of symbolic and/or numeric attributes, missing values, binary-class tasks and multiclass tasks [10]. The Cleveland data set consists of 303 patient diagnoses (presence or absence of heart disease). The Glass data set contains different glass samples taken from the scene of an accident. The Iris data set contains both a linearly separable and a non-linearly separable task. The Letter Recognition data set classifies black-and-white rectangular pixel displays as one of the 26 capital letters of the English alphabet. The Pixel Segmentation data set segments an image into seven classes. The Votes data set is used to classify each member of the U.S. Congress in 1984 as Republican or Democrat based on their votes on key issues.

[10] For more details on the data sets (domains), refer to Brodley and Utgoff (1992).

3.2 Terminologies:

Number of classes: the number of output classification classes.

Performance measures are reported for each of the tasks. Each reported measure is the average of ten runs. To estimate the true error rate for five of the domains, the authors performed a ten-fold cross-validation for each run. The data were split randomly for each run, with the same split used for both algorithms. (A small sketch of this evaluation protocol appears after the list of measures below.) The measures reported are:

o Unique attributes: the number of the original input attributes that ever need to be evaluated somewhere in the tree.
o Nodes: the number of test nodes in the tree (for LMDT, the number of linear machines).
o Leaves: the total number of leaves in the tree.
o Average variables per LM: the average number of encoded variables per linear machine.
o Epochs: the number of epochs needed to converge to a tree that classifies the training instances correctly. An epoch equals the number of instances in the training set.
o Bits: the number of bits needed to represent the classifier.
o Accuracy: the percentage of the test instances classified correctly. If the difference in test-set accuracy between the two algorithms is statistically significant, the paper highlights the higher accuracy in bold-face type. The test for significance is a t-test at the .01 level of significance.
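To illustrate this evaluation protocol, the following Python sketch (my own illustration) runs a ten-fold cross-validation with the same splits for both classifiers and applies a paired t-test at the .01 level. The two classifiers are convenient stand-ins from scikit-learn, not C4.5 or LMDT, and the Iris data set here is only an example.

import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
folds = KFold(n_splits=10, shuffle=True, random_state=0)   # same splits for both
acc_tree, acc_linear = [], []
for train_idx, test_idx in folds.split(X):
    for model, scores in ((DecisionTreeClassifier(random_state=0), acc_tree),
                          (LogisticRegression(max_iter=1000), acc_linear)):
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))

t_stat, p_value = ttest_rel(acc_tree, acc_linear)           # paired t-test
print("mean accuracies:", np.mean(acc_tree), np.mean(acc_linear))
print("difference significant at the .01 level:", p_value < 0.01)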

3.3 Results:

It is expected that the time required by LMDT is longer than the time required by C4.5, because the hypothesis space of multivariate decision trees is larger than the hypothesis space of univariate decision trees. To compare the difference, the authors report both the number of instances used to update the linear machine and the number of instances observed. For both algorithms, the authors count the number of times each training instance is examined. All training instances are examined at the root of the tree; at each subtree, however, the algorithm examines only a portion of the instances. The number reported in the paper is the sum of the number of instances observed at each node in the tree divided by the size of the training set. This count is fair because, although C4.5 may only examine part of each instance while searching for a test at a subtree, the same is true of LMDT.

It is not meaningful to compare the sizes of the resulting trees using measures such as the number of nodes or the number of leaves, because an LMDT node can be much more complex than a C4.5 node. To compare tree sizes, the authors use the Minimum Description Length Principle (MDLP), which states that the best hypothesis to induce from a data set is the one that minimizes the length of the hypothesis plus the length of the data when coded using the hypothesis to predict the data. Here the hypothesis is the decision tree and the data is the training set; the best hypothesis is the one that can represent the data with the fewest bits. To represent the data, both the tree and the error vector must be coded [11].

The results show that LMDT finds trees for the Cleveland and Letter Recognition tasks that are statistically significantly more accurate than those C4.5 finds, whereas C4.5 finds more accurate trees for the Glass and Votes tasks. The differences in accuracy for the Iris and Pixel Segmentation tasks are not significant. The sizes of the trees, as measured by the number of bits required to code them, are not consistent with the MDLP; the authors explain this by noting that their data codings are not provably optimal.

4. Empirical assessment of univariate and multivariate decision trees using C4.5 and LMDT on different data sets

This part of the report uses the C4.5 and LMDT software to provide empirical results on the performance of these programs, and to show how decision trees based on univariate tests and on multivariate tests differ.

Data limitations and software restrictions: In this part of the report, I use different data sets to compare the two algorithms. The original version of C4.5 has restricted access; therefore, I used the See5 demo [12]. The See5 demo is limited to small data sets (up to 400 cases for See5/C5.0), but it incorporates all the features of C4.5 and of C5.0, the updated version of C4.5. The data set used for LMDT is the one distributed with the software, so I have used two different data sets, one for each program. Because the training sets are of comparable size but have limited attributes, the results below are meaningful only with this limitation in mind. In addition, the LMDT software has been changed in three revisions [13]: the weight training algorithm was changed to run thermal training 10 times and pick the set of weights that maximizes the selection criterion (information gain or accuracy); a capability was added to have LMDT not discard features (the user can choose this option with the -k parameter); and a capability was added for the user to choose accuracy as the selection criterion (with the -y parameter).

[11] Appendix A of the paper provides the coding procedures. See Brodley and Utgoff (1992).
[12] See5 is downloadable at
[13] LMDT documentation provided by Carla E. Brodley, Version 2, 9/4/94.

Below are the outputs of See5 and LMDT, respectively.

See5:

See5 [Release 1.15]  Wed Dec 12 15:50:

** This demonstration version cannot process **
** more than 400 training or test cases.     **

Read 400 cases (35 attributes) from soybean.data

Decision tree:

int-discolor = brown: brown-stem-rot (31.4/1.4)
int-discolor = black: charcoal-rot (10.5/0.5)
int-discolor = none:
:...plant-growth = norm:
:...leafspot-size = N/A:
: :...canker-lesion = N/A: powdery-mildew (16.3/2.3)
: : canker-lesion = brown: anthracnose (4.2/0.2)
: : canker-lesion = dk-brown-blk: anthracnose (13.8/0.8)
: : canker-lesion = tan: purple-seed-stain (6.4/0.4)
: leafspot-size = lt-1/8:
: :...canker-lesion in brown,dk-brown-blk: bacterial-blight (0)
: : canker-lesion = tan: purple-seed-stain (7)
: : canker-lesion = N/A:
: : :...leafspots-marg = no-w-s-marg: bacterial-pustule (8.4/0.4)
: : leafspots-marg = w-s-marg:
: : :...seed-size = norm: bacterial-blight (11)
: : seed-size = lt-norm: bacterial-pustule (2.6/0.6)
: leafspot-size = gt-1/8:
: :...mold-growth = present:
: :...leaves = norm: diaporthe-pod-&-stem-blight (5.7)
: : leaves = abnorm: downy-mildew (11)
: mold-growth = absent:
: :...fruit-pods = few-present: brown-spot (0)
: fruit-pods = diseased: frog-eye-leaf-spot (28/1)
: fruit-pods = norm:
: :...fruiting-bodies = present: brown-spot (27)
: fruiting-bodies = absent:
: :...date = april: brown-spot (1)
: date = may: brown-spot (14/1)
: date = october: alternarialeaf-spot (18/1)
: date = june:
: :...precip = lt-norm: phyllosticta-leaf-spot (2)
: : precip = norm: phyllosticta-leaf-spot (1)
: : precip = gt-norm: brown-spot (11)
: date = july: [S1]

: date = august:
: :...severity = severe: alternarialeaf-spot (0)
: : severity = minor: frog-eye-leaf-spot (3)
: : severity = pot-severe: alternarialeaf-spot (11/3)
: date = september:
: :...stem = norm: alternarialeaf-spot (26/2)
: stem = abnorm: frog-eye-leaf-spot (2)
plant-growth = abnorm:
:...leaves = norm: rhizoctonia-root-rot (13)
leaves = abnorm:
:...stem = abnorm:
:...plant-stand = normal:
: :...seed = norm: diaporthe-stem-canker (13.1/0.1)
: : seed = abnorm: anthracnose (4)
: plant-stand = lt-normal:
: :...fruiting-bodies = absent:
: :...area-damaged = scattered: anthracnose (1.9/0.9)
: : area-damaged = low-areas: phytophthora-rot (47.5/0.1)
: : area-damaged = upper-areas: 2-4-d-injury (0.1)
: : area-damaged = whole-field: herbicide-injury (2.6/1.2)
: fruiting-bodies = present:
: :...roots = galls-cysts: phytophthora-rot (0)
: roots = norm: anthracnose (4)
: roots = rotted: phytophthora-rot (8.2/0.6)
stem = norm:
:...seed = abnorm: cyst-nematode (11/1.1)
seed = norm:
:...leafspot-size = N/A: 2-4-d-injury (0.1)
leafspot-size = lt-1/8: bacterial-blight (2)
leafspot-size = gt-1/8:
:...leaf-shread = present: phyllosticta-leaf-spot (3)
leaf-shread = absent:
:...date in june,september,
:   october: brown-spot (0)
date = april: brown-spot (2)
date = may: brown-spot (2)
date = july: frog-eye-leaf-spot (3/1)
date = august: frog-eye-leaf-spot (1)

SubTree [S1]

area-damaged = scattered: frog-eye-leaf-spot (3/1)
area-damaged = low-areas: brown-spot (2/1)
area-damaged = upper-areas: phyllosticta-leaf-spot (3)
area-damaged = whole-field: brown-spot (1)

Evaluation on training data (400 cases):

Decision Tree
  Size      Errors
    45      19( 4.8%)   <<

(a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) (l) (m) (n) (o) (p) (q) (r) (s)   <-classified as

        (a): class diaporthe-stem-canker
10      (b): class charcoal-rot
13      (c): class rhizoctonia-root-rot
55  1   (d): class phytophthora-rot
30      (e): class brown-stem-rot
14      (f): class powdery-mildew
11      (g): class downy-mildew
58  1   (h): class brown-spot
13      (i): class bacterial-blight
10  1   (j): class bacterial-pustule
13      (k): class purple-seed-stain
26      (l): class anthracnose
        (m): class phyllosticta-leaf-spot
        (n): class alternarialeaf-spot
6  37   (o): class frog-eye-leaf-spot
8       (p): class diaporthe-pod-&-stem-blight
11      (q): class cyst-nematode
4       (r): class 2-4-d-injury
2  1    (s): class herbicide-injury

Evaluation on test data (233 cases):

Decision Tree
  Size      Errors
    45      33(14.2%)   <<

(a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) (l) (m) (n) (o) (p) (q) (r) (s)   <-classified as

        (a): class diaporthe-stem-canker

9       (b): class charcoal-rot
1  5    (c): class rhizoctonia-root-rot
28      (d): class phytophthora-rot
14      (e): class brown-stem-rot
4       (f): class powdery-mildew
8       (g): class downy-mildew
        (h): class brown-spot
5       (i): class bacterial-blight
        (j): class bacterial-pustule
6       (k): class purple-seed-stain
11      (l): class anthracnose
4  3    (m): class phyllosticta-leaf-spot
27  5   (n): class alternarialeaf-spot
        (o): class frog-eye-leaf-spot
7       (p): class diaporthe-pod-&-stem-blight
2       (q): class cyst-nematode
4  8    (r): class 2-4-d-injury
1  3    (s): class herbicide-injury

Time: 0.1 secs

LMDT Output:

LM
LM
LM
LEAF
LEAF 2

LEAF
LM
LEAF
LEAF 2

Output Statistics:

Number Epochs :
Num Insts seen:
Num Insts trnd:
Number nodes  : 4
Number of LVs : 5
Unique vars   : 6
Ave. vars/lm  : 5.75
Train accuracy:
Test accuracy :
Train errors  : 66.0
Test errors   : 12.0
Time          : 1

It is noticeable that the multivariate decision tree generated by LMDT is much smaller (simpler) than the decision tree generated by See5, although the See5 demo does not provide detailed statistics on its output. Multivariate decision trees have more complex nodes than univariate decision tree nodes. In any case, the results may not be directly comparable because the two programs were run on different data sets of different sizes.

5. Conclusion

The objective of creating a multivariate decision tree algorithm is to overcome the limitation of univariate trees that branch splits must be orthogonal to the variables' axes. Nevertheless, the results demonstrate that for some data sets the bias of a univariate decision tree is more appropriate. This is because LMDT's bias toward finding a multivariate tree may be inappropriate for some tasks: it may fail to find a univariate test when it should. LMDT's variable elimination method is a greedy search procedure, which can get stuck on local maxima. Therefore, although the hypothesis space LMDT searches includes univariate decision trees, the heuristic nature of LMDT's search may result in selecting a test from an inappropriate part of the hypothesis space.

A solution suggested by the authors would be to determine the appropriate bias dynamically for each test in the tree. The perceptron tree algorithm is one example of a system that tries to determine the appropriate representational bias for the instances automatically. Specifically, the algorithm first tries to fit a linear threshold unit (LTU) to the space of the instances. If the space is not linearly separable, then the bias of an LTU [14] is inappropriate and the system searches for the best univariate test.

However, for some instance spaces, the best test may be based on a subset of the variables. A multivariate decision tree algorithm should therefore employ a dynamic control strategy for finding the appropriate representational bias for each test in the decision tree. Specifically, rather than searching the space of multivariate tests using a fixed bias (as LMDT does), such a system would have the capability to focus its search using heuristic measures of the learning process.

Future directions suggested by the authors: Whether a chosen algorithm can induce a good generalization depends on how well the hypothesis space underlying the learning algorithm, and the bias for searching that space, fit the given task. Since different algorithms search different hypothesis spaces, one algorithm can find a better hypothesis than another for some tasks but not for all tasks. For a task about which there is no a priori knowledge of what the appropriate hypothesis space should be, a learning algorithm should itself determine the appropriate bias.

[14] See Utgoff, P. E. and Brodley, C. E., Linear Machine Decision Trees. COINS Technical Report 91-10, January 1991, Department of Computer Science, University of Massachusetts, Amherst, MA.

References

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, CA.

Brodley, C. E. and Utgoff, P. E. (1992). Multivariate Versus Univariate Decision Trees. COINS Technical Report 92-8, January 1992, Department of Computer Science, University of Massachusetts, Amherst, MA.

Building Classification Models: ID3 and C4.5, at:

Hand, D. J., Mannila, H., and Smyth, P. Principles of Data Mining. MIT Press. Chapters 7 and 10, Score Functions for Data Mining Algorithms and Predictive Modeling for Classification.

See5 demo, at:

Utgoff, P. E. and Brodley, C. E. (1991). Linear Machine Decision Trees. COINS Technical Report 91-10, January 1991, Department of Computer Science, University of Massachusetts, Amherst, MA.


More information

Empirical Evaluation of Feature Subset Selection based on a Real-World Data Set

Empirical Evaluation of Feature Subset Selection based on a Real-World Data Set P. Perner and C. Apte, Empirical Evaluation of Feature Subset Selection Based on a Real World Data Set, In: D.A. Zighed, J. Komorowski, and J. Zytkow, Principles of Data Mining and Knowledge Discovery,

More information

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation. Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way

More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

Supervised Learning. Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...

Supervised Learning. Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression... Supervised Learning Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression... Supervised Learning y=f(x): true function (usually not known) D: training

More information

Data Mining Lecture 8: Decision Trees

Data Mining Lecture 8: Decision Trees Data Mining Lecture 8: Decision Trees Jo Houghton ECS Southampton March 8, 2019 1 / 30 Decision Trees - Introduction A decision tree is like a flow chart. E. g. I need to buy a new car Can I afford it?

More information

A Two Stage Zone Regression Method for Global Characterization of a Project Database

A Two Stage Zone Regression Method for Global Characterization of a Project Database A Two Stage Zone Regression Method for Global Characterization 1 Chapter I A Two Stage Zone Regression Method for Global Characterization of a Project Database J. J. Dolado, University of the Basque Country,

More information

Topics in Machine Learning

Topics in Machine Learning Topics in Machine Learning Gilad Lerman School of Mathematics University of Minnesota Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng Machine Learning - Motivation Arthur

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Lecture 5: Decision Trees (Part II)

Lecture 5: Decision Trees (Part II) Lecture 5: Decision Trees (Part II) Dealing with noise in the data Overfitting Pruning Dealing with missing attribute values Dealing with attributes with multiple values Integrating costs into node choice

More information

CS178: Machine Learning and Data Mining. Complexity & Nearest Neighbor Methods

CS178: Machine Learning and Data Mining. Complexity & Nearest Neighbor Methods + CS78: Machine Learning and Data Mining Complexity & Nearest Neighbor Methods Prof. Erik Sudderth Some materials courtesy Alex Ihler & Sameer Singh Machine Learning Complexity and Overfitting Nearest

More information

Logical Rhythm - Class 3. August 27, 2018

Logical Rhythm - Class 3. August 27, 2018 Logical Rhythm - Class 3 August 27, 2018 In this Class Neural Networks (Intro To Deep Learning) Decision Trees Ensemble Methods(Random Forest) Hyperparameter Optimisation and Bias Variance Tradeoff Biological

More information

University of Ghana Department of Computer Engineering School of Engineering Sciences College of Basic and Applied Sciences

University of Ghana Department of Computer Engineering School of Engineering Sciences College of Basic and Applied Sciences University of Ghana Department of Computer Engineering School of Engineering Sciences College of Basic and Applied Sciences CPEN 405: Artificial Intelligence Lab 7 November 15, 2017 Unsupervised Learning

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

Lazy Decision Trees Ronny Kohavi

Lazy Decision Trees Ronny Kohavi Lazy Decision Trees Ronny Kohavi Data Mining and Visualization Group Silicon Graphics, Inc. Joint work with Jerry Friedman and Yeogirl Yun Stanford University Motivation: Average Impurity = / interesting

More information

Algorithms: Decision Trees

Algorithms: Decision Trees Algorithms: Decision Trees A small dataset: Miles Per Gallon Suppose we want to predict MPG From the UCI repository A Decision Stump Recursion Step Records in which cylinders = 4 Records in which cylinders

More information