Feature Minimization within Decision Trees. Erin J. Bredensteiner. Kristin P. Bennett. R.P.I. Math Report No. 218, November 1995.
Feature Minimization within Decision Trees

Erin J. Bredensteiner    Kristin P. Bennett

R.P.I. Math Report No. 218    November 1995

Abstract

Decision trees for classification can be constructed using mathematical programming. Within decision tree algorithms, the feature minimization problem is to construct accurate decisions using as few features or attributes within each decision as possible. Feature minimization is an important aspect of data mining since it helps identify what attributes are important and helps produce accurate and interpretable decision trees. In feature minimization with bounded accuracy, we minimize the number of features using a given misclassification error tolerance. This problem can be formulated as a parametric bilinear program and is shown to be NP-complete. A parametric Frank-Wolfe method is used to solve the bilinear subproblems. The resulting minimization algorithm produces more compact, accurate, and interpretable trees. Computational results compare favorably with a popular greedy feature elimination method as well as with a linear programming method of tree construction.

Key Words: Data mining, machine learning, feature minimization, decision trees, bilinear programming.

1 Introduction

We consider the fundamental problem in machine learning of the discrimination between elements of two sets A and B in the n-dimensional real space R^n. Each dimension of the space represents a feature or attribute of the elements of the set. Commonly, the method of discrimination involves determining a linear function which consists of a linear combination of the attributes of the two given sets. In general it is not possible for a single linear function to completely separate these sets of points.

Knowledge Discovery and Data Mining Group, Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY. bredee@rpi.edu, bennek@rpi.edu. Telephone (518) FAX (518) This material is based on research supported by National Science Foundation Grant
Thus, some error criterion is minimized to determine the linear discriminant. To obtain a more accurate discrimination, many linear separators can be used as the decisions within a decision tree. In a decision tree, several linear discriminants are applied recursively to form a nonlinear separation of the space R^n into disjoint regions, each corresponding to set A or set B. The goal is to obtain a decision tree, with one or more decisions, which generalizes well, i.e. correctly classifies future points.

Feature minimization is an important aspect of multivariate decision tree construction. The goal of feature minimization is to construct good decisions using as few features as possible. By minimizing the number of features used at each decision, understandability of the resulting tree is increased and the number of data evaluations is decreased [6]. Feature minimization is not necessary in univariate decision tree algorithms, in which each decision in the tree is based on a single feature or attribute. Note that in this paper we use the terms feature and attribute interchangeably. For example, in a credit card approval application a univariate decision may be: "Is income > $50,000?" A multivariate decision uses a linear combination of features, for example: "Is 3*debt > income?" A tree with multivariate decisions can represent more complex relationships using fewer decisions than univariate trees. However, multivariate decisions with too many attributes can be difficult to interpret. Our goal is both to make a small number of decisions and to utilize only necessary attributes in each decision. Feature minimization is especially important in data mining applications where the resulting decision tree is used not only to classify future points, but also to understand the underlying characteristics of the sets being studied. An added benefit is increased generalization at each node, which may assist in better decision tree construction by avoiding overfitting.
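The two styles of decision above can be made concrete with a toy snippet; the field names income and debt are our own illustrative choices, not drawn from any data set in this paper.

```python
# Toy illustration (not from the paper) of univariate vs. multivariate
# decisions at a tree node, using made-up applicant fields.
def univariate_decision(applicant):
    # Tests a single attribute: "Is income > $50,000?"
    return applicant["income"] > 50_000

def multivariate_decision(applicant):
    # Tests a linear combination of attributes: "Is 3*debt > income?"
    return 3 * applicant["debt"] > applicant["income"]

applicant = {"income": 60_000, "debt": 25_000}
u = univariate_decision(applicant)    # True: income exceeds $50,000
m = multivariate_decision(applicant)  # True: 75,000 > 60,000
```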
There is a trade-off between the complexity of each decision and the number of decisions required in the tree. Multivariate decision trees typically have many fewer decisions than univariate decision trees constructed using one attribute per decision. Univariate decision trees have the advantages that single-attribute decisions help avoid over-parameterization and the resulting trees are more readily interpretable, provided the number of decisions is not excessive. Reducing the number of features at each decision allows the inclusion of all of the benefits of multivariate decisions while maintaining the simplicity of univariate decisions.

The goal of this paper is to obtain a compact, accurate decision tree with decisions that include as few features as possible while maintaining a specific level of accuracy. We will be using mathematical programming methods to construct the decisions. In contrast, other common approaches to feature minimization are based on heuristics. Sequential Backward Elimination (SBE) and Sequential Forward Elimination (SFE) [6] involve searching the feature space for features that do not contribute (SBE) or contribute (SFE) to the quality of the decision. In SBE an initial discriminant function is constructed using all of the features, and then features are removed sequentially from the problem until some stopping criterion is satisfied. In SFE, a discriminant is constructed using a single feature and then features are added one at a time. At each iteration in both methods, the best feature to add or remove is determined by finding the best discriminant for each possible attribute. We will compare our method to a simplified version of SBE called Greedy Sequential Backward Elimination (GSBE) [6] that was found to compare very favorably with the more costly SBE and SFE methods. Using the mathematical programming approach we can evaluate the effectiveness of this heuristic.
In Section 2, we discuss the background and formulation of our feature minimization method using parametric bilinear programming. We then prove in Section 3 that our feature minimization problem is NP-complete. In Section 4, we propose an algorithm based on the Frank-Wolfe method discussed in [4] for solving the parametric bilinear programming problem. Section 5 contains a computational comparison of our feature minimization method to GSBE and a linear programming approach to decision tree construction. Results on a number of practical problems are given.

The following notation is used. Let A and B be two sets of points in the n-dimensional real space R^n with cardinality m and k respectively. Let A be an m × n matrix whose rows are the points in A. Let B be a k × n matrix whose rows are the points in B. The i-th point in A and the i-th row of A are both denoted A_i. Likewise, B_j is the j-th point in B and the j-th row of B. For two vectors in R^n, xy denotes the dot product. Let e denote a vector of ones of the appropriate dimension. The set of minimizers of f(x) on the set S is denoted by arg min_{x ∈ S} f(x). For a vector x in R^n, x_+ will denote the vector in R^n with components (x_+)_i := max{x_i, 0}, i = 1, ..., n. The step function x_* will denote the vector in [0, 1]^n with components (x_*)_i := 0 if x_i ≤ 0 and (x_*)_i := 1 if x_i > 0, i = 1, ..., n.

2 Feature Minimization

At each decision we are interested in finding a linear function that separates the two sets. Mathematically, this corresponds to finding the plane

    xw = γ    (1)

such that

    Aw > eγ,    eγ > Bw    (2)

where w ∈ R^n is the normal to the separating plane and γ determines the distance of the plane from the origin. Upon normalization, this becomes

    Aw − eγ − e ≥ 0,    −Bw + eγ − e ≥ 0    (3)

Equation (3) is feasible if and only if the two sets are linearly separable. In the event that the sets are not linearly separable, we must choose a plane that minimizes some error function.
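The plus and step functions defined above, and the decomposition w = w+ − w− used throughout the paper, can be sketched in a few lines of numpy:

```python
import numpy as np

# Sketch of the notation above: the plus function x_+ has components
# max(x_i, 0); the step function x_* has components 1 if x_i > 0, else 0.
def plus(x):
    return np.maximum(x, 0.0)

def step(x):
    return (x > 0).astype(float)

w = np.array([-1.5, 0.0, 2.0])
w_plus, w_minus = plus(w), plus(-w)       # w = w_plus - w_minus
nonzeros = step(w_plus + w_minus).sum()   # e·(w+ + w-)_* counts nonzeros of w
```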
Our formulations of the feature minimization problem can be applied to many different error functions. In this paper, we use an error function that minimizes the average magnitude of the misclassified points within each class. The underlying problem without feature minimization is a linear program. This robust linear program (RLP) [3] has been used for decision tree construction [1]. RLP combined with the greedy sequential backward elimination method for feature minimization forms the basis of a breast cancer diagnosis system [17, 16]. Our feature minimization method could also be applied to algorithms that minimize the number of points misclassified, such as [2, 11], or to other successful linear programming approaches [10, 15], but we leave these extensions for future work.
The following robust linear programming problem, RLP [3], minimizes a weighted average of the sum of the distances from the misclassified points to the separating plane:

    min_{w,γ,u,v}  (1/m) eu + (1/k) ev
    subject to     u + Aw − eγ − e ≥ 0
                   v − Bw + eγ − e ≥ 0
                   u ≥ 0, v ≥ 0    (4)

We are interested in minimizing the number of features at each decision in an effort to balance the amount of separation achieved versus the number of features used. The step function x_* will be used to count the number of nonzero elements in the vector w. We replace w with (w+) − (w−) where w+, w− ≥ 0. At optimality, w+ = (w)_+ and w− = (−w)_+. Thus the number of nonzero elements in the vector w is e(w+ + w−)_*. Adding this term to the objective function with weight λ yields the following multiobjective optimization problem:

    min_{w+,w−,γ,u,v}  (1/m) eu + (1/k) ev + λ e(w+ + w−)_*
    subject to         u + A(w+ − w−) − eγ − e ≥ 0
                       v − B(w+ − w−) + eγ − e ≥ 0
                       u ≥ 0, v ≥ 0, w+ ≥ 0, w− ≥ 0    (5)

The first issue we will confront in the above problem is the elimination of the step function. The step function is removed from problem (5) using properties found in [11] and [12]. The details are contained in the appendix. The resulting linear program (6) with equilibrium constraints is equivalent to the original problem (5):

    min_{w+,w−,γ,u,v,r}  (1/m) eu + (1/k) ev + λ er
    subject to           u + A(w+ − w−) − eγ − e ≥ 0
                         v − B(w+ − w−) + eγ − e ≥ 0
                         (w+ + w−)(e − r) = 0
                         0 ≤ r ≤ e
                         u ≥ 0, v ≥ 0, w+ ≥ 0, w− ≥ 0    (6)

Note that at optimality r = (w+ + w−)_*, thus er counts the number of features used. The second issue to confront is how to choose the parameter λ. The solution of problem (6) yields optimal decisions dependent on the value of λ. The choice of λ is not intuitively obvious. We propose two variants of the problem that eliminate the parameter λ, move the complementarity constraints to the objective function, and allow the problem to be solved using bilinear programming. One possible approach is to minimize the number of features while satisfying a specific misclassification error bound.
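As a concrete reading of formulation (4), the sketch below assembles the RLP data in the standard inequality form A_ub z ≤ b_ub over the stacked variable z = (w, γ, u, v). The array layout and names are our own choices, not the paper's code; any LP solver (the paper uses CPLEX) could consume these arrays.

```python
import numpy as np

# Hedged sketch: build the robust LP (4) as c, A_ub, b_ub, bounds over
# z = (w, gamma, u, v), ready for a generic LP solver.
def build_rlp(A, B):
    m, n = A.shape
    k = B.shape[0]
    nz = n + 1 + m + k                       # sizes of w, gamma, u, v
    c = np.zeros(nz)
    c[n + 1:n + 1 + m] = 1.0 / m             # (1/m) e.u
    c[n + 1 + m:] = 1.0 / k                  # (1/k) e.v
    # u + A w - e gamma - e >= 0  <=>  -A w + e gamma - u <= -e
    top = np.hstack([-A, np.ones((m, 1)), -np.eye(m), np.zeros((m, k))])
    # v - B w + e gamma - e >= 0  <=>   B w - e gamma - v <= -e
    bot = np.hstack([B, -np.ones((k, 1)), np.zeros((k, m)), -np.eye(k)])
    A_ub = np.vstack([top, bot])
    b_ub = -np.ones(m + k)
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + k)  # u, v >= 0
    return c, A_ub, b_ub, bounds

A = np.array([[2.0, 1.0], [3.0, 2.0]])
B = np.array([[-1.0, -2.0], [-2.0, -1.0], [-3.0, -3.0]])
c, A_ub, b_ub, bounds = build_rlp(A, B)
```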
In our effort to achieve this goal, we propose removing λ from the problem by bounding the error function in a constraint. Problem (7) removes features while maintaining accuracy within some tolerance ε. A similar concept was used by [6] and [16] in their feature elimination methods. In [6], feature elimination was allowed to continue as long as a specific error tolerance was maintained. Street [16] computed planes for all feature counts and then used a tuning set to determine the best plane. We call this problem feature minimization with bounded accuracy.
Find the positive integer μ* such that

    μ* = min{μ > 0 | f(μ) = 0}    (7)

where

    f(μ) = min_{w+,w−,γ,u,v,r}  (w+ + w−)(e − r)
           subject to           (1/m) eu + (1/k) ev ≤ ε
                                u + A(w+ − w−) − eγ − e ≥ 0
                                v − B(w+ − w−) + eγ − e ≥ 0
                                0 ≤ r ≤ e,  er ≤ μ,  μ ∈ [1, n]
                                u ≥ 0, v ≥ 0, w+ ≥ 0, w− ≥ 0    (8)

For each fixed value of μ, problem (8) finds a linear separator within a specific error rate. If for any given μ, f(μ) ≠ 0, then no linear discriminant exists within the error tolerance ε that uses at most μ features. Theorem 3.2 proves this problem is NP-complete for the error function that counts the number of points misclassified.

An alternate approach is to rephrase the problem as follows: What is the best decision that can be made using at most μ variables? This limited feature minimization problem then becomes:

    min_{w+,w−,γ,u,v,r}  (1/m) eu + (1/k) ev + λ (w+ + w−)(e − r)
    subject to           u + A(w+ − w−) − eγ − e ≥ 0
                         v − B(w+ − w−) + eγ − e ≥ 0
                         0 ≤ r ≤ e,  er ≤ μ,  μ ∈ [1, n]
                         u ≥ 0, v ≥ 0, w+ ≥ 0, w− ≥ 0    (9)

Here λ must be chosen sufficiently large in order to force the complementarity constraints of (6), (w+ + w−)(e − r) = 0, to be satisfied at optimality. Limiting the maximum number of features is appealing in practice because if the number of features is small the interpretability of the tree may be greatly enhanced. For example, if the number of features per decision is limited to three, then each decision may be viewed graphically as a three-dimensional plot. The extreme case of univariate trees, those limited to a single feature, has been demonstrated to work very well on numerous applications. However, such trees may require an excessive number of decisions.

3 Computational Complexity

For this paper we will concentrate on the feature minimization with bounded accuracy problem. In this section, we will show that this problem is NP-complete. We begin by giving a formal definition of a problem titled "bounded accuracy with limited features". We then prove this problem to be NP-complete.
The feature minimization with bounded accuracy problem is then defined, and the bounded accuracy with limited features problem is used to prove it is NP-complete.

Definition 3.1 (Bounded Accuracy with Limited Features) Let X be a finite subset of vectors in R^{n+1}. Let each vector x ∈ X have integer valued entries. Is there a vector y ∈ R^{n+1} such that at most μ (0 < μ ≤ n) entries y_i, i = 1, ..., n, are nonzero and such that xy > 0 for at least K vectors x?

Specifically, X contains vectors of the form [A_i, −1] and [−B_i, 1]. Also, y_i = (w+ − w−)_i for i = 1, ..., n and y_{n+1} = γ.

Theorem 3.1 The Bounded Accuracy with Limited Features Problem is NP-complete.

Proof. It is easy to show that this problem is in NP. We need only choose a vector y ∈ R^{n+1} and check in polynomial time whether xy > 0 for at least K vectors x ∈ X and whether at most μ elements y_i, i = 1, ..., n, are nonzero. To show that the above problem is NP-complete, the Open Hemisphere problem of [9] can be easily transformed into a single instance of the bounded accuracy with limited features problem. The Open Hemisphere problem is the problem of determining if there is a vector y such that xy > 0 for at least K vectors x ∈ X. Thus, solving the Open Hemisphere problem is exactly solving the instance of our problem when μ = n.

The feature minimization with bounded accuracy problem is precisely the problem of bounded accuracy with limited features with the added condition that the number of features be minimized. The formal definition of this problem is as follows.

Definition 3.2 (Feature Minimization with Bounded Accuracy) Let X be a finite subset of vectors in R^{n+1}. Let each vector x ∈ X have integer valued entries. Find a vector y ∈ R^{n+1} such that the number of nonzero elements y_i, i = 1, ..., n, is minimized and such that xy > 0 for at least K vectors x.

Theorem 3.2 The Feature Minimization with Bounded Accuracy Problem is NP-complete.

Proof. We can show that this problem is in NP by relating it to at most two instances of the bounded accuracy with limited features problem, which is in NP by Theorem 3.1.
There exists a solution for the feature minimization with bounded accuracy problem with exactly p nonzero elements y_i, i = 1, ..., n, if and only if there exists a solution for the bounded accuracy with limited features problem for μ = p but no solution exists for μ = p − 1. Thus, given a vector y with p nonzero elements y_i, i = 1, ..., n, we can check whether or not y is a solution to the feature minimization with bounded accuracy problem by solving the bounded accuracy with limited features problem for μ = p − 1 and verifying that y satisfies xy > 0 for at least K vectors x ∈ X. Therefore, the feature minimization with bounded accuracy problem is in NP. We will now show that this problem is NP-complete by reducing the bounded accuracy with limited features problem to the feature minimization with bounded accuracy problem. We solve the feature minimization with bounded accuracy problem and obtain the solution y, which contains exactly p nonzero elements y_i, i = 1, ..., n. A solution exists to the bounded accuracy with limited features problem if and only if p ≤ μ.

In the next two sections, we describe a practical algorithm for solving the feature minimization with bounded accuracy problem and provide computational results.
4 Feature Minimization Algorithm

In this section we provide the algorithm used in solving our feature minimization problem (7). The first step in solving problem (7) is to determine values for the parameters ε and μ. To determine ε, we solve the linear program (4) and allow for a 10% error on the value of the objective function. This value of ε will stay constant for the remainder of the solution of this program. The parameter μ is allowed to change in the process of solving this parametric bilinear program. Subsection 4.2 contains a complete description of how μ is chosen. For fixed values of ε and μ, several approaches are available to find a solution of program (8). Some possibilities are to apply branch and bound techniques, cutting plane methods, or the Frank-Wolfe method. The approach implemented in this paper uses a Frank-Wolfe type algorithm used successfully to solve bilinear programs in [4, 2]. This algorithm reduces the original bilinear program to two linear programs. One of these linear programs has a closed form solution, as shown in [2]. A complete description of our algorithm is given in the following two subsections.

4.1 Bilinear Subproblems

The parametric bilinear programming formulation (8) is an uncoupled bilinear program. It has been shown that a Frank-Wolfe algorithm [8] applied to an uncoupled bilinear program will converge to a global solution or a stationary point [4]. Applying this Frank-Wolfe algorithm to problem (8) we obtain the following algorithm:

Algorithm 4.1 (Frank-Wolfe algorithm for uncoupled bilinear programs) For fixed μ,

Step 1: (w+^{i+1}, w−^{i+1}, γ^{i+1}, u^{i+1}, v^{i+1}) ∈ arg min_{w+,w−,γ,u,v}  (w+ + w−)(e − r^i)
        subject to  (1/m) eu + (1/k) ev ≤ ε
                    u + A(w+ − w−) − eγ − e ≥ 0
                    v − B(w+ − w−) + eγ − e ≥ 0
                    u ≥ 0, v ≥ 0, w+ ≥ 0, w− ≥ 0

Step 2: r^{i+1} ∈ arg min_r  (w+^{i+1} + w−^{i+1})(e − r)
        subject to  0 ≤ r ≤ e,  er ≤ μ

Step 3: Repeat until no improvement in objective.
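The alternation in Algorithm 4.1 can be illustrated with a loose numpy sketch, with two simplifications that are ours, not the paper's: the Step-1 LP is replaced by a subgradient fit of the RLP-style error that clamps features with r_j = 0 to zero (an "infinite penalty" reading of the bilinear term), and Step 2 uses the closed-form r update discussed in the next subsection.

```python
import numpy as np

# Hedged sketch of the Step 1 / Step 2 alternation; not the paper's LP
# implementation (which solves exact linear programs with CPLEX).
def fit_restricted(A, B, r, iters=400, lr=0.1):
    m, n = A.shape
    k = B.shape[0]
    w, g = np.zeros(n), 0.0
    for _ in range(iters):
        viol_a = (A @ w - g) < 1.0             # A points inside the margin
        viol_b = (B @ w - g) > -1.0            # B points inside the margin
        gw = -A[viol_a].sum(axis=0) / m + B[viol_b].sum(axis=0) / k
        gg = viol_a.sum() / m - viol_b.sum() / k
        w, g = w - lr * gw, g - lr * gg
        w *= r                                  # keep only allowed features
    return w, g

def r_from_w(w, mu):
    # Step 2 closed form: r_j = 1 on the mu largest |w_j|, else 0.
    r = np.zeros_like(w)
    r[np.argsort(-np.abs(w))[:mu]] = 1.0
    return r

rng = np.random.default_rng(0)
# Feature 0 separates the classes; feature 1 is pure noise.
A = np.column_stack([rng.normal(3, 0.3, 30), rng.normal(0, 0.5, 30)])
B = np.column_stack([rng.normal(-3, 0.3, 30), rng.normal(0, 0.5, 30)])
mu, r = 1, np.ones(2)
for _ in range(5):                              # alternate Steps 1 and 2
    w, g = fit_restricted(A, B, r)
    r = r_from_w(w, mu)
```

On this toy problem the alternation settles on the single informative feature while keeping the two clusters separated.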
It can easily be shown that the subproblem contained in Step 2 has a closed form integer solution: namely, r_j = 1 for the μ largest components of |w^{i+1}| = (w+^{i+1} + w−^{i+1}), and r_j = 0 otherwise.

4.2 The Feature Minimization Bilinear Program

The parametric bilinear program (7) searches for the minimum number of features μ such that a specific error criterion is met. There are various methods available for choosing which values of μ should be explicitly solved. For each μ a series of linear programs must be solved, thus it is computationally valuable to solve for as few values of μ as possible. We have used a modification of the secant method, similar to that used in [2], in the following algorithm:

Algorithm 4.2 (Feature Minimization with Bounded Accuracy) Let μ_max denote the smallest number of features found so far such that the error tolerance is satisfied. Let μ_min denote the largest number of features attempted so far in Algorithm 4.1 such that the error tolerance is violated. All calculations for μ and p are rounded to the nearest integer.

Step 0: Solve the robust LP (4) to find the best linear discriminant using all of the features. Let lp_error = (1/m) eu + (1/k) ev. Let ε = 1.1 (lp_error). Let μ_max = n. Let μ_min = 1.

Step 1: Solve bilinear subproblem (8) using Algorithm 4.1 for μ = 1. If f(μ) = 0 then return μ = 1, else let f(μ_min) = f(μ) and μ = n/2.

Step 2: Solve bilinear subproblem (8) using Algorithm 4.1.

Step 3: If f(μ) = 0 then let μ_max = μ and μ = (μ_min + μ_max)/2. Else calculate the secant method update

    p = μ − f(μ) (μ − μ_min) / (f(μ) − f(μ_min)).

Let μ_min = μ and f(μ_min) = f(μ). If p ∈ (μ_min, μ_max) then let μ = p, else let μ = (μ_min + μ_max)/2.

Step 4: If μ_max > μ_min + 1 go to Step 2, else return μ_max.

In the remaining two sections, we refer to our implementation of the feature minimization with bounded accuracy program as Feature Minimization.

5 Computational Method

To evaluate the effectiveness of our mathematical programming method, Feature Minimization, we have implemented a heuristic feature reduction method, GSBE, and the LP problem (4) for comparison. Several experiments on real world data sets are reported. Section 5.1 describes our experimental method, the GSBE algorithm, and the data sets used. Computational results on single linear separators and decision trees are contained in Section 5.2.
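The bookkeeping in Algorithm 4.2 can be exercised with a mock f: in the real algorithm, f(μ) is the optimal value of bilinear subproblem (8) computed by Algorithm 4.1; here we substitute a monotone surrogate with a known answer (our fabrication, purely to illustrate the secant/bisection search).

```python
# Sketch of the Algorithm 4.2 search over mu. The subproblem value f(mu)
# is mocked with a known minimum feature count MU_STAR = 4 (our choice);
# the real f(mu) would come from solving subproblem (8).
MU_STAR = 4

def f(mu):
    return 0.0 if mu >= MU_STAR else float(MU_STAR - mu)

def feature_minimization(n):
    mu_max, mu_min = n, 1                     # Step 0 bookkeeping
    if f(1) == 0.0:                           # Step 1
        return 1
    f_min, mu = f(1), round(n / 2)
    while True:
        val = f(mu)                           # Step 2
        if val == 0.0:                        # Step 3: tolerance met
            mu_max = mu
            mu = round((mu_min + mu_max) / 2)
        else:                                 # secant method update
            p = round(mu - val * (mu - mu_min) / (val - f_min))
            mu_min, f_min = mu, val
            mu = p if mu_min < p < mu_max else round((mu_min + mu_max) / 2)
        if not mu_max > mu_min + 1:           # Step 4
            return mu_max

best_mu = feature_minimization(20)
```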
5.1 Experimental Method

Feature Minimization results are compared to the LP (4) and our implementation of GSBE as described below. Each method utilizes the CPLEX 3.0 [7] solver to optimize the linear subproblems. To estimate generalization, or accuracy on future data, 10-fold cross validation was used to evaluate the testing set accuracies. The original data set is split into ten equal parts. Nine of these are used for training and the remaining one is saved for testing. This process is repeated ten times, allowing each part to be the testing set. Paired t-tests were performed to determine any significant differences between the three methods. The Feature Minimization and GSBE algorithms both require that the training set be normalized. Thus, for all three methods, at each decision we normalize the training data and use the normalization information to transform (w, γ) for testing.

Two experiments were performed. In the first, we compared the three results when obtaining a single linear separator. In the second, we compared the use of the three algorithms in a greedy decision tree procedure. This procedure finds the best decision using a given algorithm. This splits the data into two regions, and the process is repeated recursively for points falling in the two regions. This process can be used to construct a decision tree to recognize any two disjoint sets with 100% accuracy. However, the resulting tree frequently overfits the data, causing poor generalization. Thus we must simplify or prune the tree. The pruning method chosen is Error Complexity pruning [5], in which 10% of the training set is reserved specifically for pruning. Decision tree growth is terminated when each leaf contains either a single class or both classes where one class has less than 1% of the total points represented. Also, to avoid overfitting, at a given decision of 2r points, if r or more features are chosen then the decision becomes a leaf and growth is terminated.
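A minimal sketch of the GSBE procedure described below (Algorithm 5.1). As a stand-in for the Robust LP trainer we use least squares on ±1 targets (our simplification, not the paper's solver); the averaged hinge-magnitude error and the 110% stopping rule follow the description in the text, and the data are synthetic.

```python
import numpy as np

# Hedged sketch of greedy sequential backward elimination: drop the
# feature with the smallest |w_j|, refit, and stop once the error
# exceeds 110% of the all-feature error.
def train(A, B, active):
    X = np.vstack([A, B])[:, active]
    y = np.concatenate([np.ones(len(A)), -np.ones(len(B))])
    coef, *_ = np.linalg.lstsq(np.column_stack([X, np.ones(len(y))]),
                               y, rcond=None)
    w = np.zeros(A.shape[1])
    w[active] = coef[:-1]
    return w, -coef[-1]                        # plane w.x = gamma

def error(A, B, w, g):
    return (np.maximum(0.0, 1 - (A @ w - g)).mean()
            + np.maximum(0.0, 1 + (B @ w - g)).mean())

def gsbe(A, B):
    active = list(range(A.shape[1]))
    best = train(A, B, active)                 # Steps 1-2
    tol = 1.1 * error(A, B, *best)             # 110% of the n-feature error
    while len(active) > 1:
        j = min(active, key=lambda i: abs(best[0][i]))   # Step 3
        trial = [i for i in active if i != j]
        w, g = train(A, B, trial)              # Step 4
        if error(A, B, w, g) > tol:            # Step 5: stopping criterion
            break
        active, best = trial, (w, g)           # Step 6
    return best, active

rng = np.random.default_rng(1)
A = np.column_stack([rng.normal(3, 0.3, 40), rng.normal(0, 1, 40)])
B = np.column_stack([rng.normal(-3, 0.3, 40), rng.normal(0, 1, 40)])
(w, g), active = gsbe(A, B)
```

The returned plane always stays within the 110% error budget, and the informative feature survives elimination.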
The Greedy Sequential Backward Elimination algorithm works as follows:

Algorithm 5.1 (Greedy Sequential Backward Elimination - GSBE)

Step 1: Find a linear separator, (w, γ)_n, using all n features.
Step 2: Let (w, γ)_best = (w, γ)_n and i = n.
Step 3: Eliminate the feature with the smallest magnitude and let i = i − 1.
Step 4: Solve for (w, γ)_i.
Step 5: If a stopping criterion is met, then stop and return (w, γ)_best.
Step 6: Otherwise, let (w, γ)_best = (w, γ)_i and go to Step 3.

The Robust LP was used to determine the linear separators in Steps 1 and 4. We let GSBE continue as long as the magnitude of misclassification error ((1/m) eu + (1/k) ev) obtained for i variables is less than 110% of the error for n variables. This is consistent with the choice of ε in problem (8).

The data sets used in the computational experiments are listed below. All of these data sets except the Star/Galaxy Database and the Plastics data sets are available via anonymous file transfer protocol (ftp) from the University of California Irvine UCI Repository of Machine Learning Databases [13].

Cleveland Heart Disease Database: The Cleveland Heart Disease Database has 297 patients listed with 13 numeric attributes. Each patient is classified as to whether there is presence or absence of heart disease. There are 137 patients who have a presence of heart disease.

Wisconsin Breast Cancer Database: This data set is used to classify 682 patients with breast cancer. Each patient is represented by nine integral attributes ranging in value from 1 to 10. The two classes represented are benign and malignant: 442 of the patients are benign while 240 are malignant.

Star/Galaxy Database: The Star/Galaxy Database consists of two data sets: dim and bright. The dim data set has 4192 examples and the bright data set has 2462 examples. Each example represents a star or a galaxy and is described by 14 numeric attributes. The bright data set is nearly linearly separable. These two data sets are generated from a large set of star and galaxy images collected by Odewahn [14] at the University of Minnesota.

BUPA liver disorders: The BUPA data set contains 345 single male patients with 6 numeric attributes. Five of these attributes are blood tests which are thought to be relevant to liver disorders. The sixth attribute corresponds to the number of alcoholic beverages drunk per day.

Johns Hopkins University Ionosphere database: The Ionosphere data set is used to distinguish between good and bad radar returns. A good return is one indicating evidence of some type of structure in the ionosphere. A bad return simply passes through the ionosphere. There are 351 instances with 34 continuous attributes.

Pima Indians Diabetes Database: The Pima Diabetes data set consists of 768 female patients who are at least 21 years of age and are of Pima Indian heritage. The 8 numeric attributes describe physical features of each patient.

United States Congressional Voting Records Database: This data set includes votes for each of the 435 U.S. House of Representatives Congressmen. There are 267 Democrats and 168 Republicans. The chosen attributes represent 16 key votes. Possible values for the attributes are y, n, and ?. A value of ? indicates that the person did not make a position known. Our program requires numeric valued attributes, thus we let y, n, and ? be 2, −2, and 0 respectively.

Plastic Part Injection Molding: This proprietary plastic part manufacturing data set consists of two data sets: Plastics I has 935 examples and Plastics II has 1118 examples. Each example consists of 47 numeric measurements taken during manufacturing of a plastic part and the resulting quality of the part. (Many thanks to Dr. Ozden Gur Ali and GE Corporate Research and Development for providing this data.)

5.2 Computational Results

Tables 1 and 2 are the results of constructing a single linear separator on each training set. Tables 3 and 4 contain results obtained from full decision tree construction and Error Complexity pruning. T-tests were computed, for both the decision trees and the single planes, on each pair of methods. We are using a p value of less than 10% to indicate a significant difference between the two methods. In every case except two that are noted, a 15% level of significance yields the same results. First, we will discuss the results for a single plane. Then the decision tree results will be investigated.

[Table 1: Single Plane Testing Set Errors (%). Columns: Feature Minimization, GSBE, RLP. Rows: Heart, Cancer, Bupa, Ionosphere, Pima, House Votes, Bright, Dim, Plastics I, Plastics II. Numeric entries were not preserved in this transcription.]

Table 1 shows that Feature Minimization had a smaller testing set error than GSBE on 6 of the 10 problems. On the three problems where Feature Minimization performed worse than GSBE, the differences were not statistically significant. On the two data sets Plastics I and House Votes, Feature Minimization tested significantly better than GSBE, with p values of 13% and 14% respectively. In general, RLP performed better in terms of testing set accuracy than both GSBE and Feature Minimization. However, the t-tests indicate that RLP is significantly better than both Feature Minimization and GSBE on the Pima, Dim, and Plastics II data sets. T-tests also indicate that Feature Minimization was significantly better than RLP on the Ionosphere data set. As shown in Table 2, RLP uses many more features than both of the other methods. The average numbers of features used by Feature Minimization and GSBE are very close. Note that GSBE actually used slightly fewer features on the Heart, Ionosphere, Dim, and Plastics II data sets. This indicates that the Frank-Wolfe method (a local optimization method) is failing to find the global minimum in some of the cases, so it is possible the results could be improved by using a global optimization technique. Our results indicate that the GSBE heuristic does perform well, especially when computational times are considered. On average, Feature Minimization takes 15.7 times as long as RLP in training time, and GSBE is 3.2 times slower than RLP. Our results do show that if a small number of features is favored, both GSBE and Feature Minimization provide good results.

For the decision tree problem a similar pattern was observed. Table 3 shows that Feature Minimization performed better than GSBE on 8 of the 10 data sets. The only significant differences reported by the t-test correspond to the House Votes and Dim results. Feature Minimization performed better on House Votes, but GSBE has a smaller error on the Dim data set.
RLP performed better than Feature Minimization on only five data sets, with a statistically significant improvement only on the Pima and Dim data sets. RLP performed better than GSBE on 7 of the 10 data sets, with a statistically significant improvement only on the Pima data set. Table 4 lists the average number of features used per decision along with the average number of leaves in each decision tree. Comparing RLP and Feature Minimization in Table 4, Feature Minimization utilized significantly fewer features than RLP while maintaining a similar small tree size. Thus, Feature Minimization provides a substantial improvement over the RLP program with regards to constructing much simpler but still very accurate decision trees. However, as expected, Feature Minimization is computationally more expensive than RLP. On average Feature Minimization takes 10 times as long to solve as RLP. GSBE also uses significantly fewer features per decision tree than RLP. On average GSBE takes 3 times as long as RLP to solve. Feature Minimization used fewer features on average than GSBE on 7 of the 10 data sets.

[Table 2: Average Number of Features in Each Linear Separator. Columns: Feature Minimization, GSBE, RLP. Rows: Heart, Cancer, Bupa, Ionosphere, Pima, House Votes, Bright, Dim, Plastics I, Plastics II. Numeric entries were not preserved in this transcription.]

[Table 3: Decision Tree Testing Set Errors (%). Columns: Feature Minimization, GSBE, RLP. Rows: Heart, Cancer, Bupa, Ionosphere, Pima, House Votes, Bright, Dim, Plastics I, Plastics II. Numeric entries were not preserved in this transcription.]

If accuracy is the only priority when choosing a method, then RLP seems preferable. RLP achieved excellent results with the least computational effort. If the number of features is an important criterion, then both Feature Minimization and GSBE obtained excellent results that varied with the data sets. When computational time is considered, the heuristic GSBE has proven to be a very effective method. For high dimensional problems such as the Plastics Manufacturing problems, Feature Minimization performed the best in terms of testing set errors and minimum number of features.
[Table 4: Average Number of Features (F) and Leaves (L) in Decision Tree. Columns: Feature Minimization, GSBE, RLP, each with subcolumns F and L. Rows: Heart, Cancer, Bupa, Ionosphere, Pima, House Votes, Bright, Dim, Plastics I, Plastics II. Numeric entries were not preserved in this transcription.]

6 Conclusions

We have proposed two parametric bilinear programming methods for feature minimization. The first method, feature minimization with bounded accuracy, is the problem of finding a linear separator within a specific accuracy using as few features as possible. The second method, limited feature minimization, finds the best linear discriminant using at most μ features. The feature minimization with bounded accuracy problem was shown to be NP-complete. These feature minimization approaches can be used to produce accurate decision trees using the minimum number of features. Feature minimization is an important aspect of data mining because we are interested in both the accuracy of the trees and the interpretability of the trees. A Frank-Wolfe algorithm was used to transform the bilinear program into a series of linear programs, half of which have closed form solutions. Computational results indicate that the Feature Minimization method performed as accurately as the robust linear programming method, RLP, and the heuristic GSBE. Feature Minimization and GSBE find planes with substantially fewer features than RLP. Therefore, Feature Minimization provides an improvement over the RLP method at an additional computational time expense. The computationally efficient GSBE heuristic performed surprisingly well. Thus, GSBE proves to be a valuable method. As expected, the results are data set dependent and no single method always performs best. However, Feature Minimization did perform best on the high dimensional plastics data sets. Overall, Feature Minimization is a very promising approach. Further work is needed to explore the application of Feature Minimization to other types of discriminant functions and misclassification error metrics.
A Removal of the Step Function

The following equivalence will be used to remove the step function from program (5):
Proposition A.1 (Characterization of the Step Function) [11], [12]

    r = (a)_*  and  u = (a)_+

if and only if

    (r, u) ∈ arg min_{r,u} { e^T r  subject to  r = (r − u + a)_+ ,  u = (r + u − e)_+ }.

In addition, the elementary relationship in Proposition A.2 will be useful in the removal of the plus function:

Proposition A.2 [11]

    c = (d)_+   if and only if   c ≥ 0,  c − d ≥ 0,  c^T (c − d) = 0.

In order to apply these properties to problem (5) we let r and u be as follows:

    r = (w^+ + w^-)_*                              (10)
    u = (w^+ + w^-)_+ = w^+ + w^- = a              (11)

Thus from Proposition A.1:

    r = (r − u + a)_+ = (r)_+                      (12)
    u = (r + u − e)_+                              (13)

Equation (13) is reduced to the following three conditions using Proposition A.2:

    u ≥ 0                                          (14)
    e − r ≥ 0                                      (15)
    u^T (e − r) = 0                                (16)

From equations (11), (12), (15), and (16) we obtain the new constraints:

    r ≥ 0                                          (17)
    e − r ≥ 0                                      (18)
    (w^+ + w^-)^T (e − r) = 0                      (19)

References

[1] K. P. Bennett. Decision tree construction via linear programming. In M. Evans, editor, Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society Conference, pages 97-101, Utica, Illinois.

[2] K. P. Bennett and E. J. Bredensteiner. A parametric optimization method for machine learning. Math Report 217, Rensselaer Polytechnic Institute, Troy, New York. Submitted to ORSA Journal on Computing.

[3] K. P. Bennett and O. L. Mangasarian. Neural network training via linear programming. In P. M. Pardalos, editor, Advances in Optimization and Parallel Computing, pages 56-67, North-Holland, Amsterdam.
[4] K. P. Bennett and O. L. Mangasarian. Bilinear separation of two sets in n-space. Computational Optimization and Applications, 2:207-227.

[5] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International, California.

[6] C. E. Brodley and P. E. Utgoff. Multivariate decision trees. Machine Learning, 19(1):45-77.

[7] CPLEX Optimization Incorporated, Incline Village, Nevada. Using the CPLEX Callable Library.

[8] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95-110.

[9] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, San Francisco.

[10] F. Glover. Improved linear programming models for discriminant analysis. Decision Sciences, 21:771-785.

[11] O. L. Mangasarian. Misclassification minimization. Journal of Global Optimization, 5:309-332.

[12] O. L. Mangasarian. Mathematical programming in machine learning. Technical Report 95-06, University of Wisconsin, Madison, Wisconsin. Submitted to Proceedings of Nonlinear Optimization and Applications Workshop, June 1995, Plenum Press.

[13] P. M. Murphy and D. W. Aha. UCI repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, California.

[14] S. Odewahn, E. Stockwell, R. Pennington, R. Humphreys, and W. Zumach. Automated star/galaxy discrimination with neural networks. Astronomical Journal, 103(1):318-331.

[15] A. Roy, L. S. Kim, and S. Mukhopadhyay. A polynomial time algorithm for the construction and training of a class of multilayer perceptrons. Neural Networks, 6:535-545.

[16] W. N. Street. Cancer diagnosis and prognosis via linear-programming-based machine learning. Technical Report 94-14, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, August 1994. Ph.D. thesis.

[17] W. H. Wolberg, W. N. Street, and O. L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Quantitative Cytology and Histology, 17(2):77-87.
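As an aside (an illustration, not part of the original report), the step-function characterization of Appendix A can be sanity-checked numerically in the scalar case, where e = 1, (a)_* equals 1 for a > 0 and 0 otherwise, and (a)_+ = max(a, 0). Enumerating candidate pairs (r, u) on a grid, keeping those that satisfy the two plus-function equations, and taking the feasible pair with minimal r recovers exactly (step of a, plus of a); the function name below is a hypothetical one chosen for this sketch.

```python
# Brute-force check of the scalar step-function characterization:
# r = (a)_*, u = (a)_+ coincides with the minimal-r pair (r, u) satisfying
# r = (r - u + a)_+ and u = (r + u - 1)_+   (e = 1 in the scalar case).

def plus(z):
    return max(z, 0.0)

def step(z):
    return 1.0 if z > 0 else 0.0

def argmin_characterization(a, grid_steps=200, tol=1e-9):
    """Enumerate (r, u) on a grid, keep the pairs feasible for the two
    plus-function equations, and return the one with minimal r."""
    best = None
    for i in range(grid_steps + 1):
        r = i / grid_steps            # r in [0, 1]
        for j in range(grid_steps + 1):
            u = 2.0 * j / grid_steps  # u in [0, 2], enough for |a| <= 2
            feasible = (abs(r - plus(r - u + a)) < tol
                        and abs(u - plus(r + u - 1.0)) < tol)
            if feasible and (best is None or r < best[0]):
                best = (r, u)
    return best
```

For a = 0.5 the unique feasible pair is (1.0, 0.5); for a = -0.5 the minimal-r feasible pair is (0.0, 0.0), matching step and plus of a in both cases.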
Some Heuristics and Test Problems for Nonconvex Quadratic Programming over a Simplex Ivo Nowak September 3, 1998 Keywords:global optimization, nonconvex quadratic programming, heuristics, Bezier methods,
More information