2009 International Conference on Machine Learning and Computing
IPCSIT vol. 3 (2011), IACSIT Press, Singapore

Feature Selection as an Improving Step for Decision Tree Construction

Mahdi Esmaeili 1, Fazekas Gabor 2
1 Department of Computer Science, Islamic Azad University (Kashan Branch), Iran
2 Faculty of Informatics, University of Debrecen, Hungary

Abstract. The removal of irrelevant or redundant attributes can help us make decisions and analyze data efficiently. Feature selection is one of the most important and frequently used techniques in data preprocessing for data mining. In this paper, special attention is paid to feature selection for classification with labeled data. An algorithm is used that ranks attributes by their importance using two independent criteria. The ranked attributes then serve as input to a simple and powerful algorithm for constructing a decision tree (an oblivious tree). Results indicate that the decision tree built from features selected by the proposed algorithm outperforms the decision tree built without feature selection. The experimental results show that this method generates smaller trees with acceptable accuracy.

Keywords: Decision Tree, Feature Selection, Classification Rules, Oblivious Tree

1. Introduction

Feature selection plays an important role in data mining tasks. Methods generally perform better on lower-dimensional data than on higher-dimensional data. Irrelevant or redundant attributes act as useless information that interferes with useful attributes. In the classification task, the main aim of feature selection is to reduce the number of attributes used in classification while maintaining acceptable classification accuracy. For optimal feature selection, all possible feature combinations would have to be searched. This search space is exponentially prohibitive for exhaustive search even with a moderate number of attributes, and the high computational cost remains an unsolved problem. Under these circumstances, suboptimal feature selection algorithms are an alternative.
Though suboptimal feature selection algorithms do not guarantee the optimal solution, the selected feature subset usually leads to higher performance in the induction system (such as a classifier). The search may also be started with a randomly selected subset in order to avoid being trapped in a local optimum [1]. Each feature selection algorithm needs to be evaluated using a certain criterion, and a subset that is optimal under one criterion may not be optimal under another. Evaluation criteria can be broadly categorized into two groups based on their dependency on the mining algorithm that will finally be applied to the selected feature subset [2]. An independent criterion, as the name suggests, evaluates a feature subset by characteristics of the training data without involving any mining algorithm. Some popular independent criteria are distance measures, information measures, dependency measures, and consistency measures [3][4][5]. A dependent criterion, in contrast, requires a predetermined mining algorithm and uses the performance of that algorithm on the selected subset to determine which features are selected.

(Tel.: +98 361 5550055; fax: +98 361 5550056. E-mail addresses: M.Esmaeili@iaukashan.ac.ir, Fazekas.Gabor@crc.unideb.hu.)

There are two main techniques for feature subset selection, i.e. the filter and wrapper methods. All filter methods use heuristics based on general characteristics of the data rather than a learning algorithm to
evaluate the merit of feature subsets. Wrapper methods for feature selection use an induction algorithm to estimate the merit of feature subsets. Filter methods are in general much faster than wrapper methods and more practical for use on high-dimensional data. Feature wrappers often achieve better results than filters because they are tuned to the specific interaction between an induction algorithm and its training data [6]. Early research efforts mainly focused on feature selection for classification with labeled data, where class information is available [1][2][7][8]. Divide-and-conquer algorithms such as ID3 choose an attribute to maximize the information gain; the proposed algorithm, which we will describe, chooses an attribute to maximize the probability of the desired classification. Experiments with a decision tree learner (C4.5) have shown that adding to standard datasets a random binary attribute generated by tossing an unbiased coin causes classification performance to deteriorate (typically by 5% to 10% in the situations tested). This happens because at some point in the learned trees the irrelevant attribute is invariably chosen to branch on, causing random errors when test data is processed [9]. There is no single machine learning method appropriate for all possible learning problems; the universal learner is an idealistic fantasy. In this paper, an algorithm is used that ranks attributes by importance using two independent criteria. The ranked attributes are then used as input for constructing a decision tree. Our goal is to examine the influence of data preprocessing (feature selection) on classification. This paper is organized as follows. Section 2 is a detailed description of the proposed method. Section 3 describes the data sets, results, and discussion. Finally, Section 4 concludes the research.

2. Proposed Method

The proposed method is described in this section. A schematic diagram of the method is shown in Figure 1.

[Fig. 1: Schematic diagram of the proposed method — Phase 1: Attribute Ranking Algorithm (ARA); Phase 2: Top-Down Induction of Decision Trees (TDIDT)]

2.1. The First Phase

In the first phase, the attribute ranking algorithm (ARA) is applied before rule generation. In particular, we want to help the inducer optimize the model through feature selection. The ARA algorithm uses a measure of the kind proposed in [10] for determining the importance of the original attributes. The ranked attributes obtained by this algorithm are then fed as inputs to the second phase. As mentioned previously, distance measures and dependency measures are two popular independent criteria. With distance measures we try to find the feature that separates the classes as far as possible. Dependency measures are also known as correlation measures or similarity measures; they measure the ability to predict the value of one variable from the value of another. In feature selection for classification, we look at how strongly a feature is associated with the class [2].

The ARA includes two parts: a class distance ratio and an attribute-class correlation measure. The class distance ratio is computed from two quantities, both calculated with the kth attribute omitted from each instance. Equations (1) and (2) show how to do this:

Distance1 = \sum_{i=1}^{c} P_i \frac{1}{n_i} \sum_{X \in \omega_i} \left[ (X^{(k)} - m_i^{(k)})^T (X^{(k)} - m_i^{(k)}) \right]^{1/2}   (1)

Distance2 = \sum_{i=1}^{c} P_i \left[ (m_i^{(k)} - m^{(k)})^T (m_i^{(k)} - m^{(k)}) \right]^{1/2}   (2)

where \omega_i is the set of instances of the ith class, and X^{(k)}, m_i^{(k)}, and m^{(k)} denote an instance, the ith class mean, and the global mean with the kth attribute omitted.
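The two distances in Equations (1) and (2) can be sketched numerically as follows. This is a minimal illustration under our reading of the formulas: Distance1 is the class-prior-weighted mean within-class Euclidean distance and Distance2 the prior-weighted distance of each class mean from the global mean, both with the kth attribute omitted. The function name and toy data are ours, not from the paper.

```python
import numpy as np

def class_distances(X, y, k):
    """Distance1 (within-class) and Distance2 (between-class),
    computed with the k-th attribute omitted, per Eqs. (1)-(2)."""
    Xk = np.delete(X, k, axis=1)              # omit the k-th attribute
    N = len(y)
    m = Xk.mean(axis=0)                       # global mean vector m
    d1 = d2 = 0.0
    for c in np.unique(y):
        Xi = Xk[y == c]
        P_i = len(Xi) / N                     # class prior P_i = n_i / N
        m_i = Xi.mean(axis=0)                 # class mean vector m_i
        d1 += P_i * np.mean(np.linalg.norm(Xi - m_i, axis=1))  # Eq. (1)
        d2 += P_i * np.linalg.norm(m_i - m)                    # Eq. (2)
    return d1, d2

# Toy data: attribute 0 separates the classes, attribute 1 is noise.
X = np.array([[0.0, 5.0], [1.0, -5.0], [10.0, 5.0], [11.0, -5.0]])
y = np.array([0, 0, 1, 1])
d1_drop0, d2_drop0 = class_distances(X, y, 0)  # informative attribute removed
d1_drop1, d2_drop1 = class_distances(X, y, 1)  # noise attribute removed
```

Removing the informative attribute collapses the between-class distance (d2_drop0 is zero), while removing the noise attribute leaves the classes well separated, which is exactly the signal the ranking exploits.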
c is the number of classes in the data set and P_i is the probability of the ith class. m_i and m are the mean vector of the ith class and the mean of all instances in the data set, respectively. n_i is the number of instances in the ith class, and N is the total number of instances in the data set, i.e., N = n_1 + n_2 + ... + n_c.

On the other side, the attribute-class correlation measure is used to evaluate how strongly each attribute affects the class label of each instance. The larger the correlation factor, the more important the attribute is for determining the class labels of the instances. A large magnitude of attribute-class correlation indicates a close correlation between class labels and the attribute, and hence the great importance of this attribute in classifying the instances, and vice versa. Equation (3) defines the attribute-class correlation; it is calculated over pairs of instances that do not belong to the same class:

Attribute-class correlation = \sum_{i \neq j} |X_{ik} - X_{jk}|   (3)

where the sum runs over instance pairs i, j with different class labels.

2.2. The Second Phase

In the second phase, a simple but very powerful algorithm called Top-Down Induction of Decision Trees (TDIDT) is used for generating rules. TDIDT has been known since the mid-1960s and has formed the basis for many classification systems, two of the best known being ID3 and C4.5, as well as being used in many commercial data mining packages [11]. Figure 2 shows this algorithm.

IF all the instances in the training set belong to the same class
THEN return the value of the class
ELSE
  (a) Select an attribute A from the ranked list
  (b) Sort the instances in the training set into subsets, one for each value of attribute A
  (c) Return a tree with one branch for each non-empty subset, each branch having a descendant subtree or a class value produced by applying the algorithm recursively

Fig. 2: Top-Down Induction of Decision Trees (TDIDT)

3. Results and Discussion

The effectiveness of the newly proposed method has to be evaluated in practical experiments. For this reason we selected four data sets from the UCI repository [12]. Table 1 shows the training datasets and their characteristics.

Table 1: Description of the test datasets

Dataset               | Number of Attributes | Number of Instances | Number of Classes
Iris                  | 4                    | 150                 | 3
Monk's Problems       | 7                    | 432                 | 2
Glass Identification  | 10                   | 214                 | 6
Ionosphere            | 34                   | 351                 | 2

The datasets above are used as input to the ARA algorithm, the first phase of the proposed algorithm. A ranked attribute list is obtained from this phase. Table 2 and Table 3 show the output of ARA and the attribute ordering, respectively.

Table 2: Output of the ARA algorithm

Iris: 1662, 3471, 3727, 27
Monk's Problems: 21388, 22750, 20231, 22198, 22725, 524
Glass Identification: 2452, 6280, 3210, 2554, 1413, 2206, 2127, 2875, 59
Ionosphere: 1796, 10766, 9392, 11534, 9338, 10705, 10385, 10217, 9408, 10496, 9777, 11543, 9947, 12124, 9096, 11158, 9973, 11337, 10297, 11666, 10074, 11562, 10454, 11081, 9995, 9458, 11398, 11370, 9837, 11535, 10115, 10441, 8927, 1800
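The attribute-class correlation of Equation (3), and a ranking derived from it, can be sketched as follows. This is a hedged reading: the score for attribute k sums |X_ik - X_jk| over instance pairs from different classes; since the paper does not spell out how this term is combined with the class distance ratio to produce the scores in Table 2, the sketch ranks by the correlation term alone. Names and toy data are illustrative.

```python
import numpy as np
from itertools import combinations

def attribute_class_correlation(X, y, k):
    """Eq. (3): sum of |X_ik - X_jk| over instance pairs i, j
    whose class labels differ."""
    return sum(abs(X[i, k] - X[j, k])
               for i, j in combinations(range(len(y)), 2)
               if y[i] != y[j])

def rank_attributes(X, y):
    """Attribute indices ordered by decreasing correlation score
    (most important first), as fed to the second phase."""
    scores = [attribute_class_correlation(X, y, k) for k in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda k: -scores[k])

# Toy data: attribute 0 varies with the class, attribute 1 does not.
X = np.array([[0.0, 5.0], [1.0, -5.0], [10.0, 5.0], [11.0, -5.0]])
y = np.array([0, 0, 1, 1])
```

On this toy data the class-bearing attribute accumulates the larger pairwise differences across classes and is ranked first.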
Table 3: Importance ranking results obtained by the first phase of the proposed algorithm

Iris: 3, 2, 1, 4
Monk's Problems: 2, 5, 4, 1, 3, 6
Glass Identification: 2, 3, 8, 4, 1, 6, 7, 5, 9
Ionosphere: 14, 20, 22, 12, 30, 4, 27, 28, 18, 16, 24, 2, 6, 10, 23, 32, 7, 19, 8, 31, 21, 25, 17, 13, 29, 11, 26, 9, 3, 5, 15, 33, 34, 1

On the basis of the attribute ordering in Table 3, the attributes are passed to the second phase, which constructs a decision tree. As mentioned before, a simple and very strong algorithm is used in this phase. Figure 3 shows the output of this algorithm for the Iris dataset. It is obvious from Figure 3 that the rule ordering is the same as the attribute ranking, so the most important attribute appears in the first term of each rule, and all rules include this attribute.

Field3 <= 1.7 : setosa (48)
Field3 > 1.7 AND Field2 <= 2.2 : versicolor (4/1)
Field3 > 1.7 AND Field2 > 2.2 AND Field1 <= 4.9 : versicolor (3/2)
Field3 > 1.7 AND Field2 > 2.2 AND Field1 > 4.9 AND Field4 <= 1.4 : versicolor (34/2)
Field3 > 1.7 AND Field2 > 2.2 AND Field1 > 4.9 AND Field4 > 1.4 : virginica (61/14)

Fig. 3: Rules generated by the second phase of the algorithm

After tree construction, the confusion matrix is built and evaluation parameters such as Recall, F-measure, Precision, and Accuracy are calculated. This step is done for all data sets (Table 4).

Table 4: Detailed accuracy by class (TP Rate, FP Rate, Recall, Precision, and F-measure per class)
Classes: Setosa, Versicolour, Virginica (Iris); Class 0, Class 1 (Monk's Problems); Building_w_f_p, Building_w_nf_p, Vehicle_w_f_p, Containers, Tableware, Headlamps (Glass Identification); Bad, Good (Ionosphere)
TP Rate: 0.96, 0.10, 0.96, 0.06, 0.54, 0.86, 0.70
FP Rate: 0, 0.05, 0.14, 0.03, 0.58, 0.02, 0.10, 0.17
Recall: 0.90, 0.07, 0.84, 0.07, 0.08, 0.22, 0.70
Precision: 1.00, 0.88, 0.77, 0.64, 0.48, 0.34, 0.64, 0.93, 0.87, 0.85
F-measure: 0.84, 0.80, 0.83, 0.13, 0.61, 0.02, 0.13, 0.14, 0.36, 0.89

As explained, we need other algorithms to compare with the proposed algorithm. One of the most widely used tools is Weka. The methods we use in this tool are J48, BFTree, REPTree, and NBTree. Weka uses 10-fold cross-validation for accuracy estimation: the standard way of predicting the error rate of a learning technique given a single, fixed sample of data is to use stratified 10-fold cross-validation.
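Stratified k-fold cross-validation, as Weka applies it, can be sketched in plain NumPy. The fold-assignment scheme here (dealing the members of each class round-robin across folds) is one common way to preserve class proportions; it is an illustration of the idea, not Weka's exact implementation, and the function names are ours.

```python
import numpy as np

def stratified_folds(y, k=10, seed=0):
    """Assign each instance to one of k folds so that every fold
    keeps roughly the overall class proportions."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.where(y == cls)[0])
        fold_of[idx] = np.arange(len(idx)) % k   # deal members round-robin
    return fold_of

def cv_error(X, y, fit, predict, k=10):
    """Average test-fold error rate over the k train/test splits."""
    fold_of = stratified_folds(y, k)
    errs = []
    for f in range(k):
        train, test = fold_of != f, fold_of == f
        model = fit(X[train], y[train])
        errs.append(np.mean(predict(model, X[test]) != y[test]))
    return float(np.mean(errs))

# Usage with a trivial majority-class baseline classifier:
X = np.zeros((60, 1))
y = np.array([0] * 40 + [1] * 20)
err = cv_error(X, y,
               fit=lambda Xtr, ytr: int(np.bincount(ytr).argmax()),
               predict=lambda m, Xte: np.full(len(Xte), m))
```

Because each fold keeps the 2:1 class ratio of the full sample, the majority-class baseline misclassifies exactly the minority instances of each test fold, so the estimated error equals the minority-class proportion.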
The size of the induced decision trees is one of the evaluation criteria. Finally, we complete our overview with a comparison between the proposed algorithm and the output of the Weka algorithms. The result of this comparison is summarized in Table 5 and Table 6.

Table 5: Number of leaves / size of tree for all datasets

Method          | Iris | Monk's Problems | Glass Identification | Ionosphere
J48             | 5/9  | 2/3             | 30/59                | 18/35
BFTree          | 6/11 | 2/3             | 16/31                | 11/21
REPTree         | 3/5  | 8/15            | 12/23                | 5/9
NBTree          | 4/7  | 1/1             | 9/17                 | 8/15
Proposed Method | 5/8  | 4/6             | 14/30                | 13/24
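The TDIDT procedure of Figure 2, driven by the ranked attribute list, can be sketched as follows. This is our reading of the oblivious construction: the attribute tested at depth d is simply the d-th ranked one, so every node at one level tests the same attribute. Leaf and node counts like those reported in Table 5 can then be read directly off the tree. Function names and the toy data are illustrative, not from the paper.

```python
from collections import Counter

def tdidt(rows, ranked, depth=0):
    """Rows are tuples whose last element is the class label.
    Returns a class label (leaf) or a dict keyed by (attribute, value)."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or depth == len(ranked):
        return Counter(labels).most_common(1)[0][0]   # pure or exhausted: leaf
    a = ranked[depth]                                 # (a) next ranked attribute
    subsets = {}
    for r in rows:                                    # (b) split on values of a
        subsets.setdefault(r[a], []).append(r)
    return {(a, v): tdidt(sub, ranked, depth + 1)     # (c) recurse per branch
            for v, sub in subsets.items()}

def leaves_and_size(tree):
    """Number of leaves and total number of nodes, as in Table 5."""
    if not isinstance(tree, dict):
        return 1, 1
    l = n = 0
    for sub in tree.values():
        dl, dn = leaves_and_size(sub)
        l, n = l + dl, n + dn
    return l, n + 1

rows = [("a", "x", 0), ("a", "y", 0), ("b", "x", 1), ("b", "y", 0)]
tree = tdidt(rows, ranked=[0, 1])
```

Since the split attribute depends only on the depth, never on the branch, the result is an oblivious tree; the majority-class leaf for mixed subsets mirrors the impurity counts such as "(4/1)" in Figure 3.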
Table 6: Comparison of error rates (J48, BFTree, REPTree, NBTree, Proposed Method)

Iris: 0.04, 0.06, 0.06, 0.06, 0.12
Monk's Problems: 0.25, 0.25, 0.15, 0.25, 0.34
Glass Identification: 0.38, 0.30, 0.45
Ionosphere: 0.09, 0.10, 0.11, 0.10, 0.18

The resulting tree is an oblivious tree: every node at a given level tests the same attribute. For this reason, the error rate of the proposed method is higher than that of the other algorithms.

4. Conclusion and Recommendations

In this paper, feature selection for decision tree construction is presented. Feature selection, as one form of data preprocessing, can affect all steps of data mining algorithms. The attribute importance ranking is obtained by running the ARA algorithm, the first phase of the proposed algorithm. In the next phase, a simple algorithm is used for generating rules. Finally, evaluation parameters such as size of tree, number of leaves, error rate, recall, and precision are computed. The comparison shows that the average number of leaves and the size of the decision tree generated by the proposed method are better than those of the other algorithms. As with other data mining algorithms, the results of the proposed algorithm depend on the characteristics of the dataset. However, this method generated smaller trees when compared with other algorithms such as J48 or BFTree, and the error rate was found to be acceptable. To improve accuracy, the two phases of the algorithm could be repeated instead of applying the TDIDT method once; this yields an algorithm with higher time complexity but better accuracy.

5. Acknowledgments

The authors wish to thank Mansour Tarafdar. His programming and constructive comments and suggestions helped us significantly improve this work.

6. References

[1] J. Doak. An Evaluation of Feature Selection Methods and Their Application to Computer Security, technical report, University of California at Davis, Department of Computer Science, 1992.
[2] Huan Liu, Lei Yu. Toward Integrating Feature Selection Algorithms for Classification and Clustering, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 4, pp. 491-502, April 2005.
[3] H. Almuallim, T.G. Dietterich.
Learning Boolean Concepts in the Presence of Many Irrelevant Features, Artificial Intelligence, Vol. 69, pp. 279-305, 1994.
[4] M.A. Hall. Correlation-Based Feature Selection for Discrete and Numeric Class Machine Learning, Proc. 17th Int'l Conf. Machine Learning, pp. 359-366, 2000.
[5] H. Liu, H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Boston: Kluwer Academic, 1998.
[6] Oded Maimon, Lior Rokach. The Data Mining and Knowledge Discovery Handbook, Springer, pp. 93-111, pp. 149-164, 2005.
[7] M. Dash, H. Liu. Feature Selection for Classification, Intelligent Data Analysis: An Int'l J., Vol. 1, No. 3, pp. 131-156, 1997.
[8] W. Siedlecki, J. Sklansky. On Automatic Feature Selection, Int'l J. Pattern Recognition and Artificial Intelligence, Vol. 2, pp. 197-220, 1988.
[9] Ian H. Witten, Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann, pp. 288-296, 2005.
[10] Lipo Wang, Xiuju Fu. Data Mining with Computational Intelligence, Springer, pp. 117-123, 2005.
[11] Max Bramer. Principles of Data Mining, Springer, pp. 47-48, 2007.
[12] Blake, C.L. and Merz, C.J. UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science. [http://www.ics.uci.edu/~mlearn/MLRepository.html]