Consistency Based Attribute Reduction

Consistency Based Attribute Reduction

Qinghua Hu, Hui Zhao, Zongxia Xie, and Daren Yu

Harbin Institute of Technology, Harbin, P.R. China
huqinghua@hcms.hit.edu.cn

Z.-H. Zhou, H. Li, and Q. Yang (Eds.): PAKDD 2007, LNAI 4426, 2007. Springer-Verlag Berlin Heidelberg 2007

Abstract. Rough sets are widely used in feature subset selection and attribute reduction. In most of the existing algorithms, the dependency function is employed to evaluate the quality of a feature subset. The disadvantages of using dependency are discussed in this paper, and the problem it causes for forward greedy search algorithms is presented. We introduce the consistency measure to deal with these problems. The relationship between dependency and consistency is analyzed. It is shown that the consistency measure reflects not only the size of the decision positive region, as dependency does, but also the sample distribution in the boundary region; therefore it describes the distinguishing power of an attribute set more finely. Based on consistency, we redefine redundancy and the reduct of a decision system, and we construct a forward greedy search algorithm to find reducts. Moreover, we employ cross validation to test the selected features and to remove overfitting features from a reduct. The experimental results with UCI data show that the proposed algorithm is effective and efficient.

1 Introduction

As the capability of gathering and storing data increases, there are many candidate features in some pattern recognition and machine learning tasks. Applications show that excessive features not only significantly slow down the learning process, but also decrease the generalization power of the learned classifiers. Attribute reduction, also called feature subset selection, is usually employed as a preprocessing step to select part of the features and focus the learning algorithm on the relevant information [1, 3, 4, 5, 7, 8].

In recent years, rough set theory has been widely discussed and used in attribute reduction and feature selection [6, 7, 8, 14, 16, 17]. Reduct is a proper term in rough set methodology: it means a minimal attribute subset with the same approximating power as the whole attribute set [14]. This definition shows that a reduct should contain the least redundant information while not losing the classification ability of the raw data. Thus the attributes in a reduct should not only be strongly relevant to the learning task, but also be non-redundant with each other. This property of reducts exactly accords with the objective of feature selection. Thereby, the process of searching for reducts, called attribute reduction, is a feature subset selection process.

So far, a series of approaches to searching for reducts have been published. Discernibility matrices [11, 14] were introduced to store the features which distinguish each pair of objects, and Boolean operations were then conducted on the matrices to find all of the reducts. The main problem of this method is its space and time cost.

We need a $10^4 \times 10^4$ matrix if there are $10^4$ samples, and it is also time-consuming to search for reducts from such a matrix with Boolean operations. With the dependency function, a heuristic search algorithm was constructed [1, 6, 7, 8, 16].

There are some problems in dependency based attribute reduction. The dependency function in rough set approaches is the ratio of the size of the positive region over the sample space. The positive region is the set of samples which can be undoubtedly classified into a certain class according to the existing attributes. From the definition of the dependency function, we can find that it ignores the influence of boundary samples, which may belong to more than one class. However, in classification learning, the boundary samples also exert an influence on the learned results. For example, in learning decision trees with CART or C4.5, the samples in leaf nodes sometimes belong to more than one class [2, 10]; in this case, the nodes are labeled with the class of the majority of samples. The dependency function does not take this kind of samples into account.

Moreover, there is another risk in using the dependency function in greedy feature subset search algorithms. In a forward greedy search, we usually start with an empty set of attributes and then add the selected features into the reduct one by one. In the first round, we need to compute the dependency of each single attribute and select the attribute with the greatest dependency value. We find that the greatest dependency of a single attribute is zero in some applications, because no single candidate feature can classify any of the samples beyond dispute. Therefore, according to the criterion that the dependency function should be greater than zero, none of the attributes can be selected, and the feature selection algorithm finds nothing. However, some combinations of the attributes are able to distinguish all of the samples, although no single attribute can distinguish any of them. As far as we know, no research has reported on this issue so far. These issues essentially result from the same problem: the dependency function completely neglects the boundary samples.

In this paper, we introduce a function proposed by Dash and Liu [3], called consistency, to evaluate the significance of attributes. We discuss the relationship between dependency and consistency, and employ the consistency function to construct a greedy search attribute reduction algorithm. The main difference between the two functions lies in how the boundary samples are considered. Consistency counts not only the positive region, but also the samples of the majority class in each boundary region. Therefore, even if the positive region is empty, we can still compare the distinguishing power of the features according to the sample distribution in the boundary regions. Consistency is the ratio of consistent samples; hence it is linear in the number of consistent samples, and it is easy to specify a stopping criterion in a consistency-based algorithm. With numerical experiments, we will show that such a specification is necessary for real-world applications.

In the next section, we review the basic concepts of rough sets. We then present the definition and properties of the consistency function, compare the dependency function with consistency, and construct consistency based attribute reduction in Section 3. We present the results of experiments in Section 4. Finally, the conclusions are presented in Section 5.
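To make the failure mode of the dependency function concrete, the following small hypothetical example (not from the paper; the toy table and the helper names `dependency` and `consistency` are ours) builds a four-sample XOR-style decision table in Python. Each single attribute has an empty positive region, so its dependency is zero and a dependency-driven forward search selects nothing, while the two attributes together distinguish every sample.

```python
from collections import Counter, defaultdict

# A toy XOR-style decision table: no single attribute separates the classes,
# but the pair {a1, a2} separates them perfectly.
U = [
    {"a1": 0, "a2": 0, "d": 0},
    {"a1": 0, "a2": 1, "d": 1},
    {"a1": 1, "a2": 0, "d": 1},
    {"a1": 1, "a2": 1, "d": 0},
]

def partition(samples, attrs):
    """Group samples into B-equivalence classes (identical values on attrs)."""
    blocks = defaultdict(list)
    for s in samples:
        blocks[tuple(s[a] for a in attrs)].append(s)
    return blocks.values()

def dependency(samples, attrs, d="d"):
    """gamma_B(D): fraction of samples whose equivalence class is pure in d."""
    pos = sum(len(b) for b in partition(samples, attrs)
              if len({s[d] for s in b}) == 1)
    return pos / len(samples)

def consistency(samples, attrs, d="d"):
    """delta_B(D): fraction of samples in the majority class of their block."""
    kept = sum(max(Counter(s[d] for s in b).values())
               for b in partition(samples, attrs))
    return kept / len(samples)

for B in (["a1"], ["a2"], ["a1", "a2"]):
    print(B, "dependency =", dependency(U, B), "consistency =", consistency(U, B))
# ['a1']         dependency = 0.0  consistency = 0.5
# ['a2']         dependency = 0.0  consistency = 0.5
# ['a1', 'a2']   dependency = 1.0  consistency = 1.0
```

A dependency-driven greedy search stops immediately here because every single-attribute significance is zero, whereas the nonzero consistency values let a consistency-driven search add an attribute and then reach the fully discerning pair.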

2 Basic Concepts on Rough Sets

Rough set theory, which was introduced to deal with imperfect and vague concepts, has attracted a lot of attention from both theoretical and applied research areas. Data sets are usually given in the form of tables; we call such a data table an information system, formulated as $S = <U, A, V, f>$, where $U = \{x_1, x_2, \ldots, x_n\}$ is a finite and nonempty set of objects, called the universe, $A$ is the set of attributes characterizing the objects, $V$ is the domain of attribute values, and $f: U \times A \to V$ is the information function. If the attribute set is divided into a condition attribute set $C$ and a decision attribute set $D$, the information system is also called a decision table.

With an arbitrary attribute subset $B \subseteq A$, there is an indiscernibility relation $IND(B)$:

$IND(B) = \{<x, y> \in U \times U \mid \forall a \in B,\ a(x) = a(y)\}$.

$<x, y> \in IND(B)$ means that objects $x$ and $y$ are indiscernible with respect to the attribute set $B$. Obviously, the indiscernibility relation is an equivalence relation, which satisfies reflexivity, symmetry and transitivity. The equivalence class induced by the attributes $B$ is denoted by $[x_i]_B = \{x \in U \mid <x_i, x> \in IND(B)\}$. The equivalence classes generated by $B$ are also called $B$-elemental granules or $B$-information granules. The set of elemental granules forms a concept system, which is used to characterize the imperfect concepts in the information system. Given an arbitrary concept $X$ in the information system, two unions of elemental granules are associated with it:

$\underline{B}X = \cup\{[x]_B \mid [x]_B \subseteq X, x \in U\}$, $\overline{B}X = \cup\{[x]_B \mid [x]_B \cap X \neq \emptyset, x \in U\}$.

The concept $X$ is approximated with these two sets of elemental granules. $\underline{B}X$ and $\overline{B}X$ are called the lower and upper approximations of $X$ in terms of the attributes $B$; $\underline{B}X$ is also called the positive region of $X$. $X$ is definable if $\underline{B}X = \overline{B}X$, which means the concept $X$ can be perfectly characterized with the knowledge $B$; otherwise, $X$ is indefinable. An indefinable set is called a rough set. $BN_B(X) = \overline{B}X - \underline{B}X$ is called the boundary of the approximations; for a definable set, the boundary is empty.

Given a decision table $<U, C \cup D, V, f>$, $C$ and $D$ generate two partitions of the universe. Machine learning is usually concerned with using the condition knowledge to approximate the decision, that is, with finding the mapping from the conditions to the decisions. Approximating $U/D$ with $U/C$, the positive and boundary regions are defined as

$POS_C(D) = \cup_{X \in U/D} \underline{C}X$, $BN_C(D) = \cup_{X \in U/D} \overline{C}X - \cup_{X \in U/D} \underline{C}X$.

The boundary region is the set of elemental granules which cannot be perfectly described by the knowledge $C$, while the positive region is the set of $C$-elemental granules which completely belong to one of the decision concepts. The size of the positive or boundary region reflects the approximation power of the condition attributes.
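As a concrete illustration of these constructions (not part of the original paper; the toy table and all function names are ours), the following Python sketch computes the $B$-equivalence classes, the lower and upper approximations of a concept, and the positive and boundary regions of a small decision table.

```python
from collections import defaultdict

# Each object is a dict of condition attribute values plus a decision "d".

def blocks(U, B):
    """U/IND(B): equivalence classes of objects agreeing on every attribute in B."""
    part = defaultdict(list)
    for i, x in enumerate(U):
        part[tuple(x[a] for a in B)].append(i)
    return list(part.values())

def lower_approx(U, B, X):
    """B-lower approximation of a concept X (a set of object indices)."""
    return {i for blk in blocks(U, B) if set(blk) <= X for i in blk}

def upper_approx(U, B, X):
    """B-upper approximation: union of blocks that intersect X."""
    return {i for blk in blocks(U, B) if set(blk) & X for i in blk}

def positive_region(U, C, d="d"):
    """POS_C(D): union of the C-lower approximations of all decision classes."""
    pos = set()
    for X in blocks(U, [d]):          # decision classes U/D
        pos |= lower_approx(U, C, set(X))
    return pos

def boundary_region(U, C, d="d"):
    """BN_C(D): objects whose C-equivalence class mixes several decision labels."""
    return set(range(len(U))) - positive_region(U, C, d)

# Tiny example table: objects 0 and 1 share condition values but differ in d.
U = [{"a": 0, "b": 0, "d": 0}, {"a": 0, "b": 0, "d": 1},
     {"a": 0, "b": 1, "d": 1}, {"a": 1, "b": 1, "d": 1}]
print(positive_region(U, ["a", "b"]))   # {2, 3}; objects 0 and 1 form the boundary
```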

Given a decision table, for any $B \subseteq C$, it is said that the decision attribute set $D$ depends on the condition attributes $B$ with degree $k$, denoted by $B \Rightarrow_k D$, where

$k = \gamma_B(D) = |POS_B(D)| / |U|$.

The dependency function $k$ measures the approximation power of a condition attribute set with respect to the decision $D$. In data mining, and especially in feature selection, it is important to find the dependence relations between attribute sets and to find a concise and efficient representation of the data.

Given a decision table $T = <U, C \cup D, V, f>$, if $P \subseteq Q \subseteq C$, we have $\gamma_P(D) \leq \gamma_Q(D)$.

Given a decision table $T = <U, C \cup D, V, f>$, $B \subseteq C$ and $a \in B$, we say that the condition attribute $a$ is indispensable in $B$ if $\gamma_{B-\{a\}}(D) < \gamma_B(D)$; otherwise we say $a$ is redundant. We say $B \subseteq C$ is independent if every $a$ in $B$ is indispensable.

Attribute subset $B$ is a reduct of the decision table if

1) $\gamma_B(D) = \gamma_C(D)$;
2) $\forall a \in B: \gamma_B(D) > \gamma_{B-\{a\}}(D)$.

A reduct of a decision table is thus an attribute subset which keeps the approximating capability of all the condition attributes and, at the same time, contains no redundant attribute. The term reduct presents a concise and complete way to define the objective of feature selection and attribute reduction.

3 Consistency Based Attribute Reduction

A binary classification problem in a discrete space is shown in Fig. 1, where the samples are divided into a finite set of equivalence classes $\{E_1, E_2, \ldots, E_K\}$ based on their feature values: the samples with the same feature values are grouped into one equivalence class. We find that some of the equivalence classes are pure, as their samples belong to only one of the decision classes, but there are also inconsistent equivalence classes, such as $E_3$ and $E_4$ in Fig. 1. According to rough set theory, the latter form the decision boundary region, and the set of consistent equivalence classes forms the decision positive region. The objective of feature selection is to find a feature subset which minimizes the inconsistent region, in either discrete or numerical cases, and accordingly minimizes the Bayesian decision error. It is therefore desirable to have a measure which reflects the size of the inconsistent region in discrete and numerical spaces.

Dependency reflects the ratio of consistent samples over the whole set of samples; therefore it does not take the boundary samples into account when computing the significance of attributes. Once there are inconsistent samples in an equivalence class, that equivalence class is simply ignored. However, the inconsistent samples can be divided into two groups: a subset of samples under the majority class and a subset under the minority classes. According to the Bayes rule, only the samples under the minority classes are misclassified.

Fig. 1. Classification complexity in a discrete feature space. Two cases, (1) and (2), are compared; each shows the class probabilities of $\omega_1$ and $\omega_2$ within the equivalence classes $E_1, \ldots, E_6$, where $E_3$ and $E_4$ are inconsistent.

For example, the samples in $E_3$ and $E_4$ are inconsistent in Fig. 1, but only the $\omega_2$ samples in $E_3$ and the $\omega_1$ samples in $E_4$ are misclassified. The classification power in this case can be given by

$f = 1 - [P(\omega_2 \mid E_3)P(E_3) + P(\omega_1 \mid E_4)P(E_4)]$.

Dependency cannot reflect the true classification complexity. In the discrete case, we can see from the comparison of cases (1) and (2) in Fig. 1 that, although the probabilities of inconsistent samples are identical, the probabilities of misclassification are different. The dependency function in rough sets cannot reflect this difference. In [3], Dash and Liu introduced the consistency function, which can measure the difference. Now we present the basic definitions of consistency. The consistency measure is defined via the inconsistency rate, computed as follows.

Definition 1. A pattern is considered to be inconsistent if there are at least two objects that match on the whole condition attribute set but carry different decision labels.

Definition 2. The inconsistency count $\xi_i$ for a pattern $p_i$ of a feature subset $B$ is the number of times the pattern appears in the data minus the largest number of these appearances sharing the same class label.

Definition 3. The inconsistency rate of a feature subset $B$ is the sum $\sum_i \xi_i$ of the inconsistency counts over all patterns of the feature subset that appear in the data, divided by $|U|$, the number of samples: $\sum_i \xi_i / |U|$. Correspondingly, consistency is computed as

$\delta_B(D) = (|U| - \sum_i \xi_i) / |U|$.

Based on the above analysis, we can understand that dependency is the ratio of samples undoubtedly correctly classified, while consistency is the ratio of samples probably correctly classified. There are two kinds of samples in $POS_B(D) \cup M$: $POS_B(D)$ is the set of consistent samples, while $M$ is the set of samples of the largest class within each inconsistent pattern of the boundary region. In this paper, we will call $M$ the pseudo-consistent samples.
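Definitions 1-3 translate directly into code. The following Python sketch is ours (not the authors' implementation) and assumes each sample is a dict holding the condition attribute values together with a decision key d, as in the earlier sketches.

```python
from collections import Counter, defaultdict

def inconsistency_counts(U, B, d="d"):
    """One count per pattern of B (Definition 2): pattern frequency minus the
    largest class frequency within that pattern."""
    patterns = defaultdict(Counter)
    for x in U:
        patterns[tuple(x[a] for a in B)][x[d]] += 1
    return [sum(c.values()) - max(c.values()) for c in patterns.values()]

def inconsistency_rate(U, B, d="d"):
    """Definition 3: sum of the inconsistency counts divided by |U|."""
    return sum(inconsistency_counts(U, B, d)) / len(U)

def consistency(U, B, d="d"):
    """delta_B(D) = (|U| - sum_i xi_i) / |U|: consistent plus pseudo-consistent samples."""
    return 1.0 - inconsistency_rate(U, B, d)
```

On the XOR-style table used in Section 1, this gives $\delta_{\{a_1\}}(D) = 0.5$ and $\delta_{\{a_1, a_2\}}(D) = 1$, while $\gamma$ is zero for each single attribute.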

Property 1. Given a decision table $<U, C \cup D, V, f>$ and $B \subseteq C$, we have $0 \leq \delta_B(D) \leq 1$ and $\gamma_B(D) \leq \delta_B(D)$.

Property 2 (monotonicity). Given a decision table $<U, C \cup D, V, f>$, if $B_1 \subseteq B_2 \subseteq C$, we have $\delta_{B_1}(D) \leq \delta_{B_2}(D)$.

Property 3. Given a decision table $<U, C \cup D, V, f>$, we have $\delta_C(D) = \gamma_C(D) = 1$ if and only if $U/C \preceq U/D$, namely, the table is consistent.

Definition 4. Given a decision table $T = <U, C \cup D, V, f>$, $B \subseteq C$ and $a \in B$, we say that the condition attribute $a$ is indispensable in $B$ if $\delta_{B-\{a\}}(D) < \delta_B(D)$; otherwise, we say $a$ is redundant. We say $B \subseteq C$ is independent if every attribute $a$ in $B$ is indispensable.

$\delta_B(D)$ reflects not only the size of the positive region, but also the distribution of the boundary samples. An attribute is said to be redundant if the consistency does not decrease when we delete it. Here the term redundant has two meanings: the first is relevant but redundant, the same as in the general literature [6, 7, 8, 14, 16, 17]; the second is irrelevant. So consistency can detect both kinds of superfluous attributes [3].

Definition 5. Attribute subset $B$ is a consistency-based reduct of the decision table if

(1) $\delta_B(D) = \delta_C(D)$;
(2) $\forall a \in B: \delta_B(D) > \delta_{B-\{a\}}(D)$.

In this definition, the first condition guarantees that the reduct has the same distinguishing ability as the whole set of features; the second guarantees that all of the attributes in the reduct are indispensable. Therefore, there is no superfluous attribute in the reduct.

Finding the optimal subset of features is an NP-hard problem: we would have to evaluate $2^N - 1$ combinations of features to find the optimal subset if there are $N$ features in the decision table. Considering the computational complexity, we construct a forward greedy search algorithm based on the consistency function. We start with an empty set of attributes and add one attribute to the reduct in each round; the selected attribute should maximize the increment of consistency. Given an attribute subset $B$, we evaluate the significance of an attribute $a$ as

$SIG(a, B, D) = \delta_{B \cup \{a\}}(D) - \delta_B(D)$.

$SIG(a, B, D)$ is the increment of consistency obtained by introducing the new attribute $a$ in the condition of $B$. The measure is linear in the number of new consistent and pseudo-consistent samples. Formally, the forward greedy reduction algorithm based on consistency can be formulated as follows.

Algorithm: Greedy Reduction Algorithm based on Consistency
Input: decision table $<U, C \cup D, V, f>$
Output: one reduct red
Step 1: red <- {}  // red is the pool to contain the selected attributes
Step 2: for each $a_i \in C - red$
            compute $SIG(a_i, red, D) = \delta_{red \cup \{a_i\}}(D) - \delta_{red}(D)$
        end
Step 3: select the attribute $a_k$ which satisfies $SIG(a_k, red, D) = \max_i SIG(a_i, red, D)$
Step 4: if $SIG(a_k, red, D) > 0$, then red <- red $\cup \{a_k\}$ and go to Step 2; else return red
Step 5: end

In the first round, we start with an empty set and specify $\delta_{\emptyset}(D) = 0$. In this algorithm, we generate attribute subsets with a semi-exhaustive search: we evaluate all of the remaining attributes in each round with the consistency function and select the feature producing the maximal significance. The algorithm stops when adding any of the remaining attributes no longer increases the consistency value. In real-world applications, we can stop the algorithm when the increment of consistency is less than a given threshold, in order to avoid the over-fitting problem; we discuss this in detail in Section 4. The output of the algorithm is a reduced decision table, from which the irrelevant attributes and the relevant but redundant attributes have been deleted. The output will be validated with two popular learning algorithms, CART and SVM, in Section 4.

By employing a hashing mechanism, we can compute the inconsistency rate approximately with a time complexity of $O(|U|)$ [3]. In the worst case, the whole computational complexity of the algorithm is

$|U||C| + |U|(|C| - 1) + \cdots + |U| = (|C| + 1)|C||U|/2$.
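Under the same sample representation as the earlier sketches (each sample a dict with a decision key d), the greedy procedure above can be written in Python as follows. This is our illustration, not the authors' implementation; the parameter eps corresponds to the optional stopping threshold mentioned above, and the dict-based pattern grouping plays the role of the hashing mechanism.

```python
from collections import Counter, defaultdict

def consistency(U, B, d="d"):
    """delta_B(D); the consistency of the empty attribute set is taken as 0."""
    if not B:
        return 0.0
    patterns = defaultdict(Counter)
    for x in U:
        patterns[tuple(x[a] for a in B)][x[d]] += 1
    kept = sum(max(c.values()) for c in patterns.values())
    return kept / len(U)

def greedy_consistency_reduct(U, C, d="d", eps=0.0):
    """Forward greedy reduction driven by consistency.

    In each round the attribute with the largest significance
    SIG(a, red, D) = delta_{red U {a}}(D) - delta_{red}(D) is added;
    the loop stops when the best increment is not larger than eps
    (eps > 0 gives the pre-pruning variant mentioned in the text)."""
    red, delta_red = [], 0.0
    while True:
        candidates = [a for a in C if a not in red]
        if not candidates:
            return red
        sig, best = max(
            (consistency(U, red + [a], d) - delta_red, a) for a in candidates
        )
        if sig <= eps:
            return red
        red.append(best)
        delta_red += sig

# Example: on the XOR-style table from Section 1 both attributes are selected
# (the tie in the first round is broken arbitrarily).
U = [{"a1": 0, "a2": 0, "d": 0}, {"a1": 0, "a2": 1, "d": 1},
     {"a1": 1, "a2": 0, "d": 1}, {"a1": 1, "a2": 1, "d": 0}]
print(greedy_consistency_reduct(U, ["a1", "a2"]))
```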

4 Experimental Analysis

There are two main objectives in conducting the experiments. First, we compare the proposed method with the dependency based algorithm. Second, we study the classification performance of the attributes selected with the proposed algorithm, in particular how the classification accuracy varies as new features are added; this tells us where the algorithm should be stopped.

We downloaded the data sets from the UCI Repository of machine learning databases; they are described in Table 1. There are some numerical attributes in the data sets, so we employ four discretization techniques to transform the numerical data into categorical data: equal-width, equal-frequency, FCM and entropy. Then we run the dependency based algorithm [8] and the proposed one on the discretized data sets. The numbers of the selected features are presented in Table 2, where P stands for the dependency based algorithm and C stands for the consistency based algorithm.

Table 1. Data description
Data set                               Abbreviation   Samples   Features   Classes
Australian Credit Approval             Crd
Ecoli                                  Ecoli
Heart disease                          Heart
Ionosphere                             Iono
Sonar, Mines vs. Rocks                 Sonar
Wisconsin Diagnostic Breast Cancer     WDBC
Wisconsin Prognostic Breast Cancer     WPBC
Wine recognition                       Wine

Table 2. The numbers of selected features with different methods (P: dependency based, C: consistency based); columns: raw data and the equal-width, equal-frequency, entropy and FCM discretizations; rows: the eight data sets and their average.

From Table 2, we can find a serious problem with the dependency based algorithm: for some data sets it selects too few features for classification learning. For the data discretized with the equal-width method, the dependency based algorithm selects only one attribute, while the consistency based one selects 7 attributes. For the equal-frequency method, the dependency based algorithm selects nothing for the data sets Heart, Sonar and WPBC, and similar cases occur with the entropy and FCM based discretization methods. Obviously, the results are unacceptable if a feature selection algorithm cannot find anything. By contrast, the consistency based attribute reduction algorithm finds feature subsets of moderate size for all of the data sets. Moreover, the sizes of the feature subsets selected by the two algorithms are comparable when the dependency based algorithm works well.

Why does the dependency based algorithm find nothing for some data sets? As we know, dependency just reflects the ratio of the positive region. The forward greedy algorithm starts with an empty set and adds, in turn, one attribute at a time, namely the attribute that produces the greatest increase in the dependency function, until this function reaches its maximum possible value for the data set. In the first round, we need to evaluate each single attribute. For some data sets, the dependency is zero for every single attribute; therefore no attribute can be added into the pool in the first round, and the algorithm stops there. Sometimes the algorithm can also stop in the second or the third round.

In such cases, however, the selected features are not enough for classification learning. Consistency overcomes this problem because it reflects the change in the distribution of the boundary samples.

Now we use the selected data to train classifiers with the CART and SVM learning algorithms, and we test the classification power of the selected data with 10-fold cross validation. The average classification accuracies with CART and SVM are presented in Tables 3 and 4, respectively.

Table 3. Classification accuracy with 10-fold cross validation (CART); columns: raw data and the equal-width, equal-frequency, entropy and FCM discretizations, each for the dependency based (P) and consistency based (C) algorithms; rows: the eight data sets and their average.

From Table 3, we can find that most of the reduced data sets keep, or even improve, the classification power when the numbers of selected attributes are appropriate, although most of the candidate features are deleted from the data. This shows that most of the features in these data sets are irrelevant or redundant for training decision trees and should therefore be deleted. However, the classification performance decreases greatly if the data are excessively reduced, such as Iono in the equal-width case and Ecoli in the entropy and FCM cases.

Table 4. Classification accuracy with 10-fold cross validation (SVM); same layout as Table 3.

We can also find from Table 4 that most of the classification accuracies of the reduced data decrease a little compared with the original data. Correspondingly, the average classification accuracies for all four discretization algorithms are a little lower than with the original data. This shows that both the dependency and the consistency based feature selection algorithms are not well suited to SVM learning, because both dependency and consistency compute the distinguishing power in discrete spaces.
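The 10-fold cross validation used to obtain Tables 3 and 4 can be reproduced, for instance, with scikit-learn. The following sketch is our illustration (not the authors' code); it assumes the reduced data are available as a numeric feature matrix X_red restricted to the selected attributes and a label vector y, with DecisionTreeClassifier standing in for CART and SVC for the SVM.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

def evaluate_reduct(X_red, y, folds=10):
    """Mean and standard deviation of 10-fold CV accuracy on the reduced data."""
    scores = {
        "CART": cross_val_score(DecisionTreeClassifier(), X_red, y, cv=folds),
        "SVM": cross_val_score(SVC(), X_red, y, cv=folds),
    }
    return {name: (np.mean(s), np.std(s)) for name, s in scores.items()}
```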

Table 5 shows the features selected by the consistency based algorithm, and the rounds in which they were selected, for part of the data sets, using the FCM discretized data. The trends of consistency and of the classification accuracies with CART and SVM are shown in Fig. 4.

Table 5. The selected features with method FCM + consistency; columns: 1st to 10th selected feature; rows: Heart, Iono, Sonar, WDBC, WPBC.

Fig. 4. Trends of consistency and of the classification accuracies with CART and SVM against the number of selected features, for (1) Heart, (2) Iono, (3) Sonar, (4) WDBC and (5) WPBC.

In all five plots, the consistency monotonically increases with the number of selected attributes. The maximal value of consistency is 1, which shows that the corresponding decision table is consistent: with the selected attributes, all of the samples can be distinguished. Moreover, it is noticeable that the consistency rises rapidly at the beginning and then increases slowly, until it stops at 1.

This means that the majority of samples can be distinguished with a few features, while the remaining selected features are introduced only to discern a few samples. This may lead to the over-fitting problem; therefore the algorithm should be stopped earlier, or a pruning step is needed to delete the over-fitting features. The classification accuracy curves also show this problem: in Fig. 4, the accuracies with CART and SVM rise at first, reach a peak, and then stay unchanged or even decrease. In terms of classification learning, this shows that the features selected after the peak are useless, and they sometimes even deteriorate the learning performance.

Here we can take two measures to overcome the problem. The first is to stop the algorithm when the increment of consistency is less than a given threshold. The second is to employ some learning algorithm to validate the selected features and delete the features after the accuracy peak. However, the first measure, called pre-pruning, is sometimes not feasible because we usually cannot exactly predict where the algorithm should stop. The latter, called post-pruning, is widely employed. In this work, cross validation is introduced to test the selected features; a sketch of this post-pruning step is given at the end of this section.

Table 6 shows the numbers of selected features and the corresponding classification accuracies after post-pruning. We can find that the classification performance improves in most of the cases, while at the same time the features selected with consistency are further reduced. Especially for the data sets Heart and Iono, the improvement reaches 10% and 18% with the CART algorithm.

Table 6. Comparison of the numbers of selected features and the classification performance with post-pruning (CART and SVM); rows: Heart, Iono, Sonar, WDBC, WPBC.
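The post-pruning step can be sketched as follows (our illustration, not the authors' implementation): given the attributes in the order in which the greedy algorithm selected them, the cross-validated accuracy of each prefix is computed and the reduct is cut at the accuracy peak. Here X is assumed to be a numeric feature matrix, y the labels, and selected the list of selected column indices in selection order.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def post_prune(X, y, selected, estimator=None, folds=10):
    """Keep only the leading features of `selected`, up to the accuracy peak.

    `selected` lists column indices of X in the order the greedy consistency
    algorithm added them; the prefix with the highest cross-validated accuracy
    is returned together with that accuracy."""
    estimator = estimator or DecisionTreeClassifier()
    best_k, best_acc = 1, -np.inf
    for k in range(1, len(selected) + 1):
        acc = cross_val_score(estimator, X[:, selected[:k]], y, cv=folds).mean()
        if acc > best_acc:
            best_k, best_acc = k, acc
    return selected[:best_k], best_acc
```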

5 Conclusions

In this paper, we introduce the consistency function to overcome the problems in dependency based algorithms. We discuss the relationship between dependency and consistency and analyze the properties of consistency. With this measure, redundancy and the reduct are redefined, and a forward greedy attribute reduction algorithm based on consistency is constructed. The numerical experiments show that the proposed method is effective. The main conclusions are as follows.

Compared with dependency, consistency reflects not only the size of the decision positive region, but also the sample distribution in the boundary region; therefore, the consistency measure is able to describe the distinguishing power of an attribute set more finely than the dependency function. Consistency is monotonic: its value increases or remains unchanged when a new attribute is added to the attribute set. Moreover, some attributes are introduced into the reduct just to distinguish a few samples; if we keep these attributes in the final result, they may overfit the data, so a pruning technique is required. We use 10-fold cross validation to test the results in the experiments and thereby find more effective and efficient feature subsets.

References

1. Bhatt R.B., Gopal M.: On fuzzy-rough sets approach to feature selection. Pattern Recognition Letters 26 (2005)
2. Breiman L., Friedman J., Olshen R., Stone C.: Classification and Regression Trees. Wadsworth International, California
3. Dash M., Liu H.: Consistency-based search in feature selection. Artificial Intelligence 151 (2003)
4. Guyon I., Weston J., Barnhill S., et al.: Gene selection for cancer classification using support vector machines. Machine Learning 46 (2002)
5. Guyon I., Elisseeff A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3 (2003)
6. Hu Q.H., Li X., Yu D.R.: Analysis on classification performance of rough set based reducts. In: Yang Q., Webb G. (eds.): PRICAI 2006, LNAI 4099. Springer-Verlag, Berlin Heidelberg (2006)
7. Hu Q.H., Yu D.R., Xie Z.X.: Information-preserving hybrid data reduction based on fuzzy-rough techniques. Pattern Recognition Letters 27 (2006)
8. Jensen R., Shen Q.: Semantics-preserving dimensionality reduction: rough and fuzzy-rough-based approaches. IEEE Transactions on Knowledge and Data Engineering 16 (2004)
9. Liu H., Yu L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17 (2005)
10. Quinlan J.R.: Induction of decision trees. Machine Learning 1 (1986)
11. Skowron A., Rauszer C.: The discernibility matrices and functions in information systems. In: Slowinski R. (ed.): Intelligent Decision Support - Handbook of Applications and Advances of the Rough Sets Theory (1991)
12. Slezak D.: Approximate decision reducts. Ph.D. Thesis, Warsaw University (2001)
13. Ślezak D.: Approximate entropy reducts. Fundamenta Informaticae 53 (2002)
14. Swiniarski R.W., Skowron A.: Rough set methods in feature selection and recognition. Pattern Recognition Letters 24 (2003)
15. Xie Z.X., Hu Q.H., Yu D.R.: Improved feature selection algorithm based on SVM and correlation. Lecture Notes in Computer Science 3971 (2006)
16. Zhong N., Dong J., Ohsuga S.: Using rough sets with heuristics for feature selection. Journal of Intelligent Information Systems 16 (2001)
17. Ziarko W.: Variable precision rough sets model. Journal of Computer and System Sciences 46 (1993) 39-59


More information

COMP61011 Foundations of Machine Learning. Feature Selection

COMP61011 Foundations of Machine Learning. Feature Selection OMP61011 Foundations of Machine Learning Feature Selection Pattern Recognition: The Early Days Only 200 papers in the world! I wish! Pattern Recognition: The Early Days Using eight very simple measurements

More information

Information Granulation and Approximation in a Decision-theoretic Model of Rough Sets

Information Granulation and Approximation in a Decision-theoretic Model of Rough Sets Information Granulation and Approximation in a Decision-theoretic Model of Rough Sets Y.Y. Yao Department of Computer Science University of Regina Regina, Saskatchewan Canada S4S 0A2 E-mail: yyao@cs.uregina.ca

More information

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Lutfi Fanani 1 and Nurizal Dwi Priandani 2 1 Department of Computer Science, Brawijaya University, Malang, Indonesia. 2 Department

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Variable Selection 6.783, Biomedical Decision Support

Variable Selection 6.783, Biomedical Decision Support 6.783, Biomedical Decision Support (lrosasco@mit.edu) Department of Brain and Cognitive Science- MIT November 2, 2009 About this class Why selecting variables Approaches to variable selection Sparsity-based

More information

Graph Matching: Fast Candidate Elimination Using Machine Learning Techniques

Graph Matching: Fast Candidate Elimination Using Machine Learning Techniques Graph Matching: Fast Candidate Elimination Using Machine Learning Techniques M. Lazarescu 1,2, H. Bunke 1, and S. Venkatesh 2 1 Computer Science Department, University of Bern, Switzerland 2 School of

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

Part I. Instructor: Wei Ding

Part I. Instructor: Wei Ding Classification Part I Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Classification: Definition Given a collection of records (training set ) Each record contains a set

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information