Feature Selection Based on Relative Attribute Dependency: An Experimental Study

Jianchao Han, Ricardo Sanchez, Xiaohua Hu, T.Y. Lin

Department of Computer Science, California State University Dominguez Hills, 1000 E. Victoria Street, Carson, CA 90747
College of Information Science and Technology, Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104
Department of Electrical Engineering and Computer Science, University of California, Berkeley, California 94720

Glossary

Rough set: A rough set is defined by the lower and upper approximations of a concept. The lower approximation contains all elements that necessarily belong to the concept, while the upper approximation contains those that possibly belong to it. In rough set theory, a concept is represented as a classical set.

Reduct: The task of rough set attribute reduction is to find a subset of the condition attributes set that functions as the original condition attributes set without loss of classification capability. Such a subset of the condition attributes set is called a reduct.

Attribute dependency: The degree of attribute dependency measures how much one attribute subset depends on another attribute subset.

Relative attribute dependency: The relative attribute dependency degree can be calculated by counting the distinct rows of the projection of the data set onto an attribute subset, instead of generating discernibility functions or positive regions.

Abstract: Most existing rough set-based feature selection algorithms suffer from intensive computation of either discernibility functions or positive regions to find an attribute reduct. In this paper, we develop a new computation model based on relative attribute dependency, defined as the ratio of the projection of the decision table on a condition attribute subset to the projection of the decision table on the union of that subset and the decision attributes set. To find an optimal reduct, we use the information entropy conveyed by the attributes as the heuristic. A novel algorithm for finding optimal reducts of condition attributes based on relative attribute dependency is implemented in Java and evaluated on 10 data sets from the UCI Machine Learning Repository. We compare data classification using C4.5 on the original data sets and on their reducts. The experimental results demonstrate the usefulness of our algorithm.

1 Introduction

Many factors affect the performance of data analysis, and one prominent factor is the size of the data set. In the information era, the availability of huge amounts of computerized data that many organizations possess about their business and/or scientific research attracts researchers from communities such as statistics, bioinformatics, databases, machine learning, and data mining. Most data sets collected from real-world applications contain noisy data, which may distract the analyst and lead to nonsensical conclusions. The original data therefore need to be cleaned, both to reduce the size of the data set and to remove noise. This cleaning is usually done by data reduction.

Feature selection has long been an active research topic within statistics, pattern recognition, machine learning, and data mining, and many researchers have focused on designing new methods and improving the performance of their algorithms. These methods can be divided into two types: exhaustive and heuristic search. Exhaustive search probes all possible subsets of the original features, which is prohibitive when the number of original features is large. In practice, heuristic search avoids this exponential computation and in general uses background information to approximately estimate the relevance of features. Although heuristic search works reasonably well, some features with high-order correlation may be missed.

Rough set theory has been used to develop feature selection algorithms that find condition attribute reducts. Most existing rough set-based feature selection algorithms suffer from intensive computation of either discernibility functions or positive regions. To improve efficiency, in this paper we develop a new computation model based on relative attribute dependency. With this model, a novel algorithm for finding optimal reducts of condition attributes is proposed and implemented. The implemented algorithm is evaluated on 10 data sets from the UCI Machine Learning Repository.

The experimental results demonstrate its usefulness and are analyzed for further research.

2 Rough Set Approach

Rough set theory was developed by Pawlak [13] in the early 1980s and has been used in data analysis, pattern recognition, data mining, and knowledge discovery [8, 14]. Recently, rough set theory has also been employed to select feature subsets [4, 10, 11, 15, 17]. In the rough set community, feature selection algorithms are attribute-reduct oriented, that is, they find an optimal reduct of the condition attributes of a given data set. Two main approaches to finding attribute reducts are recognized: discernibility function-based and attribute dependency-based [3, 11]. These algorithms, however, suffer from intensive computation of either discernibility functions (for the former) or positive regions (for the latter), although some efficiency improvements have been made in recent developments.

In rough set theory, the data are collected in a table, called a decision table. Rows of the decision table correspond to instances, and columns correspond to features (attributes). All attributes are partitioned into two groups: the condition attributes set C as input and the decision attributes set D as output. Assume P ⊆ C ∪ D and Q ⊆ C ∪ D. The positive region of Q with respect to P, denoted POS_P(Q), is defined as

POS_P(Q) = ⋃_{X ∈ U/IND(Q)} P̲X,

where P̲X is the lower approximation of X with respect to P and U/IND(Q) is the equivalence partition induced by Q. The positive region of Q with respect to P contains all objects in U that can be classified using the information contained in P. With this definition, the degree of dependency of Q on P, denoted γ_P(Q), is defined as

γ_P(Q) = |POS_P(Q)| / |U|,

where |X| denotes the cardinality of the set X.

The degree of attribute dependency measures how much one attribute subset depends on another. γ_P(Q) = 1 means that Q totally depends on P, γ_P(Q) = 0 indicates that Q is totally independent of P, and 0 < γ_P(Q) < 1 denotes a partial dependency of Q on P. In particular, for P ⊆ C, γ_P(D) can be used to measure the dependency of the decision attributes on a condition attribute subset.

The task of rough set attribute reduction is to find a subset of the condition attributes set that functions as the original condition attributes set without loss of classification capability. Such a subset is called a reduct and is defined as follows [14]: R ⊆ C is a reduct of C if POS_R(D) = POS_C(D), or equivalently, γ_R(D) = γ_C(D). A reduct R of C is a minimum reduct of C if no proper subset Q ⊂ R is a reduct of C. A reduct R of C has the same expressiveness of instances as C with respect to D. A decision table may have more than one reduct, and any one of them can be used to replace the original condition attributes set.

Finding all the reducts of a decision table, however, is NP-hard. Thus a natural question is which reduct is the best. Without domain knowledge, the only source of information for selecting a reduct is the content of the decision table. For example, the number of attributes can be used as the criterion, in which case the best reduct is the one with the smallest number of attributes. Unfortunately, finding the reduct with the smallest number of attributes is also NP-hard. Some heuristic approaches to finding a good enough reduct have therefore been proposed.

A recent algorithm, called QuickReduct, was developed by Shen and Chouchoulas [18] in 2002. QuickReduct is a filter approach to feature selection and a forward-searching hill climber. It initializes the candidate reduct R as an empty set and adds attributes to R incrementally using the following heuristic: the next attribute added to R is the one with the highest significance to R with respect to the decision attributes. R grows until it becomes a reduct. The basic idea behind this algorithm is that the degree of attribute dependency is monotonically increasing. There are two problems with this algorithm, however. First, it is not guaranteed to yield the best reduct with the smallest number of attributes. Second, to calculate the significance of attributes, the discernibility function and positive regions must be computed, which is inefficient and time-consuming. A variant of QuickReduct, called QuickReduct II, is also a filter algorithm but performs backward elimination using the same heuristic [18].

3 Relative Attribute Dependency Based on Rough Set Theory

In order to improve the efficiency of algorithms for finding optimal reducts of condition attributes, we proposed a new definition of attribute dependency, called relative attribute dependency, with which we showed a sufficient and necessary condition for an optimal reduct of condition attributes [4]. The relative attribute dependency degree can be calculated by counting the distinct rows of the projection of the data set onto an attribute subset, instead of generating discernibility functions or positive regions. Thus the efficiency of finding minimum reducts is greatly improved.

Most existing rough set-based attribute reduction algorithms suffer from intensive computation of either discernibility functions or positive regions. In the family of QuickReduct algorithms, in order to choose the next attribute to add to the candidate reduct, one must compute the degree of dependency of the decision attributes on every enlarged candidate. This means that the positive regions POS_{R ∪ {p}}(D), for every p ∈ C − R, must be computed. To improve the efficiency of attribute reduction algorithms, we define a new concept, the degree of relative attribute dependency. For this purpose, we assume that the decision table is consistent, that is, for all t, s ∈ U, if f_D(t) ≠ f_D(s), then there exists q ∈ C such that f_q(t) ≠ f_q(s). This assumption is not realistic in most real-life applications. Fortunately, any decision table can be uniquely decomposed into two decision tables, one being consistent and the other the boundary area, and our method can be performed on the consistent one.
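Before turning to the new measure, the following minimal Java sketch illustrates the classical computation just mentioned, which QuickReduct-style algorithms repeat at every step: building the partition U/IND(P), collecting the positive region POS_P(D), and taking the dependency degree γ_P(D). The class and method names are ours and purely illustrative; this is not the paper's implementation.

```java
import java.util.*;

public class ClassicalDependency {

    // Partition the row indices of 'table' into the equivalence classes of IND(attrs):
    // two rows fall into the same class when they agree on every attribute in 'attrs'.
    static Collection<List<Integer>> partition(String[][] table, int[] attrs) {
        Map<String, List<Integer>> classes = new LinkedHashMap<>();
        for (int row = 0; row < table.length; row++) {
            StringBuilder key = new StringBuilder();
            for (int a : attrs) key.append(table[row][a]).append('\u0001');
            classes.computeIfAbsent(key.toString(), k -> new ArrayList<>()).add(row);
        }
        return classes.values();
    }

    // gamma_P(D) = |POS_P(D)| / |U|: the fraction of rows whose P-class lies
    // entirely inside a single D-class (i.e. inside some lower approximation).
    static double gamma(String[][] table, int[] condAttrs, int[] decAttrs) {
        Map<Integer, Integer> decClassOf = new HashMap<>();
        int id = 0;
        for (List<Integer> dClass : partition(table, decAttrs)) {
            for (int row : dClass) decClassOf.put(row, id);
            id++;
        }
        int positive = 0;
        for (List<Integer> pClass : partition(table, condAttrs)) {
            boolean consistent = pClass.stream().map(decClassOf::get).distinct().count() == 1;
            if (consistent) positive += pClass.size();  // this P-class is in POS_P(D)
        }
        return (double) positive / table.length;
    }
}
```

Each such call scans the whole table, and a QuickReduct-style search issues one call per remaining attribute per iteration; this repeated cost is what the relative dependency measure defined next is designed to avoid.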

We first define the concept of projection and then define the relative attribute dependency.

Let P ⊆ C ∪ D. The projection of U on P, denoted Π_P(U), is a sub-table of U constructed as follows: 1) eliminate the attributes in (C ∪ D) − P; and 2) merge all indiscernible tuples (rows).

Let Q ⊆ C. The degree of relative dependency of D on Q over U, denoted K_Q(D), is defined as

K_Q(D) = |Π_Q(U)| / |Π_{Q ∪ D}(U)|,

where |Π_X(U)| is the number of equivalence classes in U/IND(X).

The relative attribute dependency is thus the ratio of the size of the projection of the decision table on a condition attribute subset to the size of the projection of the decision table on the union of that subset and the decision attributes set. By contrast, the regular attribute dependency is the ratio of the positive region of one attribute subset with respect to another to the size of the decision table. With the relative attribute dependency measure, we propose a new computation model for finding a minimum reduct of the condition attributes in a consistent decision table, described as follows.

The Computation Model Based on Relative Attribute Dependency (RAD):
Input: a consistent decision table U, condition attributes set C, and decision attributes set D.
Output: a minimum reduct R of the condition attributes set C with respect to D in U.
Computation: find a subset R of C such that K_R(D) = 1 and, for every proper subset Q ⊂ R, K_Q(D) < 1.

The following theorem shows that our proposed computation model is equivalent to the traditional one: a subset of condition attributes is a minimum reduct in the traditional model if and only if it is a minimum reduct in our new model.

Theorem [4]. Assume U is consistent. R ⊆ C is a reduct of C with respect to D if and only if 1) K_R(D) = K_C(D) = 1; and 2) for every proper subset Q ⊂ R, K_Q(D) < K_C(D).

The degree of relative attribute dependency provides a mechanism for finding a minimum reduct of the condition attributes set of a decision table, and it can be calculated more efficiently than the traditional positive-region computation.
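To make the definition concrete, here is a minimal Java sketch, assuming categorical attributes, that computes K_Q(D) by counting distinct projected rows exactly as defined above; no positive region or discernibility function is involved. The class and method names are illustrative and are not the authors' code.

```java
import java.util.*;
import java.util.stream.IntStream;

public class RelativeDependency {

    // |Pi_attrs(U)|: the number of distinct rows after projecting 'table' onto 'attrs',
    // i.e. the number of equivalence classes in U/IND(attrs).
    static int projectionSize(String[][] table, int[] attrs) {
        Set<String> distinctRows = new HashSet<>();
        for (String[] row : table) {
            StringBuilder key = new StringBuilder();
            for (int a : attrs) key.append(row[a]).append('\u0001');
            distinctRows.add(key.toString());
        }
        return distinctRows.size();
    }

    // K_Q(D) = |Pi_Q(U)| / |Pi_{Q union D}(U)|: only two counts of distinct rows are needed.
    static double relativeDependency(String[][] table, int[] q, int[] d) {
        int[] qUnionD = IntStream.concat(Arrays.stream(q), Arrays.stream(d))
                                 .distinct().toArray();
        return (double) projectionSize(table, q) / projectionSize(table, qUnionD);
    }
}
```

In this sketch a hash set does the counting; the paper's implementation obtains the same counts by sorting the projected table with radix sort, which underlies the running-time bound stated in the next section.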

4 A Heuristic Algorithm for Finding Optimal Reducts

Some authors propose algorithms for constructing the best reduct, but what "best" means depends on the chosen criterion, such as the number of attributes in the reduct. In the absence of external criteria, the only source of information for selecting a reduct is the content of the data table, and a common measure of data content is the information entropy of the data items. In this section, we develop a heuristic algorithm that implements the proposed model based on relative attribute dependency. The algorithm performs heuristic backward elimination in terms of the information entropy conveyed by the condition attributes: it calculates the information entropy conveyed by each attribute and selects the attribute with the maximum information gain for elimination. The goal of the algorithm is to find a subset R of the condition attributes set C such that R has the same classification power as C with respect to the given decision table. As our model suggests, such an R is a minimum reduct of C with total relative dependency on the decision attributes set D. To find such an R, we initialize R to contain all condition attributes in C and then eliminate redundant attributes one by one.

Given the partition U/IND(D) of U induced by D, the entropy, or expected information, based on the partition U/q induced by an attribute q ∈ C is given by

E(q) = Σ_{Y ∈ U/q} (|Y| / |U|) I(q_Y), where I(q_Y) = − Σ_{X ∈ U/IND(D)} (|X ∩ Y| / |Y|) log₂ (|X ∩ Y| / |Y|).

Thus, the entropy E(q) can be rewritten as

E(q) = − (1 / |U|) Σ_{X ∈ U/IND(D)} Σ_{Y ∈ U/q} |X ∩ Y| log₂ (|X ∩ Y| / |Y|).

Algorithm A - Attribute information entropy based backward elimination
Input: a consistent decision table U, condition attributes set C, decision attributes set D.
Output: R, a minimum reduct of the condition attributes set C with respect to D in U.
Procedure:
1. R ← C, Q ← ∅
2. For each attribute q ∈ C do
3.   Compute the entropy E(q) of q
4.   Q ← Q ∪ {⟨q, E(q)⟩}
5. While Q ≠ ∅ do
6.   q ← arg max { E(p) | ⟨p, E(p)⟩ ∈ Q }   // select the attribute with maximum entropy
7.   Q ← Q − {⟨q, E(q)⟩}
8.   If K_{R − {q}}(D) = 1 Then   // if the relative dependency is still 1
9.     R ← R − {q}   // remove q
10. Return R

The following theorem establishes the correctness of Algorithm A.

Theorem [4]. The output of Algorithm A is a minimum reduct of C with respect to D in U.

Algorithm A has been implemented using the computer programming language Java. To calculate the information entropy of the condition attributes and the relative dependency, the original data set is sorted using the Radix-Sort technique.
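Below is a compact Java sketch of Algorithm A as we read the pseudocode above: entropies are computed once for every condition attribute, attributes are then visited in order of decreasing entropy, and an attribute is dropped whenever the relative dependency of the remaining set stays equal to 1. It reuses the hypothetical RelativeDependency helper from the previous sketch and uses hash maps rather than the radix sort of the actual implementation; it is an illustration, not the authors' code.

```java
import java.util.*;

public class AlgorithmA {

    // E(q) = -(1/|U|) * sum over decision classes X and values Y of q of
    //        |X ∩ Y| * log2(|X ∩ Y| / |Y|), as derived above.
    static double entropy(String[][] table, int q, int[] decAttrs) {
        Map<String, Map<String, Integer>> xyCounts = new HashMap<>(); // value of q -> (decision key -> |X ∩ Y|)
        Map<String, Integer> ySizes = new HashMap<>();                // value of q -> |Y|
        for (String[] row : table) {
            String y = row[q];
            StringBuilder d = new StringBuilder();
            for (int a : decAttrs) d.append(row[a]).append('\u0001');
            xyCounts.computeIfAbsent(y, k -> new HashMap<>()).merge(d.toString(), 1, Integer::sum);
            ySizes.merge(y, 1, Integer::sum);
        }
        double e = 0.0;
        for (Map.Entry<String, Map<String, Integer>> entry : xyCounts.entrySet()) {
            int ySize = ySizes.get(entry.getKey());
            for (int xy : entry.getValue().values())
                e -= xy * Math.log((double) xy / ySize) / Math.log(2);
        }
        return e / table.length;
    }

    // Backward elimination (steps 5-9): drop the highest-entropy attribute whenever
    // the relative dependency of the remaining attributes stays equal to 1.
    static List<Integer> reduct(String[][] table, int[] condAttrs, int[] decAttrs) {
        List<Integer> r = new ArrayList<>();
        for (int a : condAttrs) r.add(a);

        List<Integer> order = new ArrayList<>(r);
        order.sort(Comparator.comparingDouble((Integer a) -> entropy(table, a, decAttrs)).reversed());

        for (int q : order) {
            List<Integer> candidate = new ArrayList<>(r);
            candidate.remove(Integer.valueOf(q));
            int[] cand = candidate.stream().mapToInt(Integer::intValue).toArray();
            if (RelativeDependency.relativeDependency(table, cand, decAttrs) == 1.0)
                r = candidate;  // q is redundant with respect to D, remove it
        }
        return r;
    }
}
```

As a hypothetical usage example, reduct(table, new int[]{0, 1, 2, 3}, new int[]{4}) would return the column indices of a minimum reduct of the first four attributes with respect to the fifth.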

One can easily see that the time complexity of Algorithm A is O(|C| · |U| · log₂|U|), where |C| is the number of condition attributes and |U| is the number of tuples in the decision table.

5 Experiments

We selected 10 data sets from the UCI Machine Learning Repository [2] to evaluate the implemented algorithm; they are listed in Table 1. These data sets were carefully chosen to avoid numerical attributes and to reflect diverse sizes. Since the current version of our approach considers only categorical attributes, numerical attributes would need to be partitioned into non-intersecting intervals. To verify our approach, we chose data sets with a small number of tuples and a small number of attributes, a small number of tuples and a large number of attributes, a large number of tuples and a small number of attributes, as well as a large number of tuples and a large number of attributes. Table 1 describes each data set with its number of condition attributes and number of rows. All data sets have a single decision attribute. Since some data sets, such as breast-cancer-wisconsin, dermatology, zoo, and audiology, contain tuple identifiers as a column, which provide no information for data analysis, we removed these id columns.

Table 1 also shows our experimental results using Algorithm A, where the column Number of Rows under Algorithm A gives the number of distinct tuples in the reduced data set obtained by projecting the original data set onto the reduct found by the algorithm, and the column Reduct Size shows the number of condition attributes contained in the reduct.

Table 1: The 10 data sets excerpted from the UCI Machine Learning Repository

The output of Algorithm A is compared with the original data sets in Figure 1. Figure 1 a) compares the number of columns in the original data sets with that of the reducts found by our implemented algorithm. Figure 1 b) compares the number of discernible rows in the original data sets with that of the reducts. From the figure, one can see that in the cases where the reduct was smaller than the set of condition attributes of the original data set, the number of distinct rows was also reduced considerably.

To verify the effectiveness of the reducts discovered by Algorithm A, we ran C4.5 [16] on both the original data sets and the reducts. The experimental results are listed in Table 2 and plotted in Figure 2. From Table 2 and Figure 2, one can see that, with C4.5, using the reducts discovered by Algorithm A we obtain slightly better classifiers than using the original data sets in most situations. In fact, for only 2 of the 10 data sets, namely BCW and YS, does C4.5 find more accurate classifiers from the original data sets than from the reducts induced by our algorithm, and the difference is very small (92.5% vs. 92.3% for BCW, 78.6% vs. 78.4% for YS). These experimental results show that Algorithm A is very useful and can be used to find optimal reducts that replace the original data sets for most applications.

6 Related Work

Many feature subset selection algorithms have been proposed, and many approaches and algorithms for finding classifiers based on rough set theory have been developed in the past decades. Two approaches are particularly close to our algorithm.

Grzymala-Busse [5, 6] developed a learning system, LERS, that applies two algorithms, LEM1 and LEM2, based on rough sets to deal with non-numerical and numerical attributes, respectively. LERS finds a minimal description of a concept, which is a set of rules. The rough measure of the rules describing a concept X is defined as |X ∩ Y| / |Y|, where Y is the set of all examples described by the rule. This definition is very similar to our relative attribute dependency, which is defined as K_Q(D) = |Π_Q(U)| / |Π_{Q ∪ D}(U)|.

Nguyen and Nguyen [11] developed an approach that first constructs the discernibility relation by sorting the data tuples in the data table, then uses the discernibility relation to build the lower and upper approximations, and finally applies the approximations to find a semi-minimal reduct. Our algorithm takes advantage of the Radix-Sort technique and has the same running efficiency as theirs, but it does not need to maintain the discernibility relation or the lower and upper approximations.

Almuallim and Dietterich [1] proposed an exhaustive search algorithm, FOCUS, in 1994. The algorithm starts with an empty feature set and carries out exhaustive search until it finds a minimal combination of features that is sufficient for the data analysis task. It works on binary, noise-free data and runs in time of O(N M). They also proposed three heuristic algorithms to speed up the search.

Kira and Rendell [7] developed a heuristic algorithm, RELIEF, for data classification. RELIEF assigns a relevance weight to each feature, which is meant to denote the relevance of the feature to the task. RELIEF samples instances randomly from the given data set and updates the relevance values based on the differences between the selected instance and the two nearest instances of the same and opposite classes. It assumes two-class classification problems and does not handle redundant features: if most of the given features are relevant to the task, it will select most of them even though only a fraction is necessary for the classification.

Another heuristic feature selection algorithm, PRESET, was developed by Modrzejewski [10] in 1993; it heuristically ranks the features and assumes a noise-free binary domain. Chi2 is also a heuristic algorithm, proposed by Liu and Setiono [9] in 1995, which automatically removes irrelevant continuous features based on the χ² statistic and the inconsistency found in the data. Other algorithms have been employed in data classification methods, for example, Quinlan's C4.5 [16] and Pagallo and Haussler's FRINGE [12].

7 Summary and Future Work

We proposed a novel definition of relative attribute dependency, with which we developed a computational model for finding optimal reducts of condition attributes. The relative attribute dependency degree can be calculated by counting the distinct rows of the projection of the data set onto an attribute subset, instead of generating discernibility functions or positive regions; thus the efficiency of finding minimum reducts is greatly improved. We implemented an algorithm based on backward elimination using the object-oriented programming language Java and evaluated it with 10 data sets from the UCI Machine Learning Repository. These data sets were carefully excerpted to cover various situations with different numbers of features and tuples. Our experimental results show that the algorithm significantly reduces the size of the original data sets and improves the prediction accuracy of the classifiers discovered by C4.5.

Our future work will focus on the following aspects: 1) Apply more existing classification algorithms besides C4.5 to the results of our algorithm to see whether the classifiers can be improved; we expect a classifier discovered in a reduct to be more accurate than one discovered in the original data set. 2) Extend the algorithm to process other types of data, such as numerical data. 3) Attempt to develop novel classification algorithms based on our definition of relative attribute dependency.

References

1. Almuallim, H. and Dietterich, T., Learning Boolean concepts in the presence of many irrelevant features, Artificial Intelligence, Vol. 69(1-2), pp. 279-305, 1994.
2. Blake, C. L. and Merz, C. J., UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Irvine, CA: University of California, Department of Information and Computer Science, 1998.
3. Han, J., Hu, X., and Lin, T. Y., A New Computation Model for Rough Set Theory Based on Database Systems, 5th International Conference on Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science 2737, pp. 381-390, 2003.
4. Han, J., Hu, X., and Lin, T. Y., Feature Subset Selection Based on Relative Dependency Between Attributes, 4th International Conference on Rough Sets and Current Trends in Computing, Lecture Notes in Computer Science 3066, pp. 176-185, Springer, 2004.
5. Grzymala-Busse, J. W., LERS - A system for learning from examples based on rough sets, in Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, ed. R. Slowinski, Kluwer Academic Publishers, pp. 3-18, 1992.
6. Grzymala-Busse, J. W., A Comparison of Three Strategies to Rule Induction, Proc. of the International Workshop on Rough Sets in Knowledge Discovery, Warsaw, Poland, April 5-13, pp. 132-140, 2003.

7. Kira, K. and Rendell, L. A., The Feature Selection Problem: Traditional Methods and a New Algorithm, 9th National Conference on Artificial Intelligence (AAAI), pp. 129-134, 1992.
8. Lin, T. Y. and Cercone, N., Applications of Rough Sets Theory and Data Mining, Kluwer Academic Publishers, 1997.
9. Liu, H. and Setiono, R., Chi2: Feature Selection and Discretization of Numeric Attributes, 7th IEEE International Conference on Tools with Artificial Intelligence, 1995.
10. Modrzejewski, M., Feature Selection Using Rough Sets Theory, European Conference on Machine Learning, pp. 213-226, 1993.
11. Nguyen, H. and Nguyen, S., Some efficient algorithms for rough set methods, IPMU, pp. 1451-1456, 1996.
12. Pagallo, G. and Haussler, D., Boolean Feature Discovery in Empirical Learning, Machine Learning, Vol. 5, pp. 71-99, 1990.
13. Pawlak, Z., Rough Sets, International Journal of Computer and Information Sciences, 11(5), pp. 341-356, 1982.
14. Pawlak, Z., Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishers, 1991.
15. Quafafou, M. and Boussouf, M., Generalized Rough Sets Based Feature Selection, Intelligent Data Analysis, Vol. 4, pp. 3-17, 2000.
16. Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
17. Sever, H., Raghavan, V., and Johnsten, D. T., The Status of Research on Rough Sets for Knowledge Discovery in Databases, 2nd International Conference on Nonlinear Problems in Aviation and Aerospace, Vol. 2, pp. 673-680, 1998.
18. Shen, Q. and Chouchoulas, A., A Rough-Fuzzy Approach for Generating Classification Rules, Pattern Recognition, Vol. 35, pp. 2425-2438, 2002.