An improvement direction for filter selection techniques using information theory measures and quadratic optimization


Waad Bouaguel
LARODEC, ISG, University of Tunis
41, rue de la Liberté, 2000 Le Bardo, Tunisia

Ghazi Bel Mufti
ESSEC, University of Tunis
4, rue Abou Zakaria El Hafsi, 1089 Montfleury, Tunisia

Abstract: Filter selection techniques are known for their simplicity and efficiency. However, this kind of method does not take the inter-redundancy of the features into consideration. Consequently, the redundant features that are not removed remain in the final classification model, giving lower generalization performance. In this paper we propose to use a mathematical optimization method that reduces inter-feature redundancy and maximizes the relevance between each feature and the target variable.

Keywords: Feature selection; mRMR; quadratic mutual information; filter.

I. INTRODUCTION

In many classification problems we deal with huge datasets, which are likely to contain not only many observations, but also a large number of variables. Some variables may be redundant or irrelevant to the classification task. As the number of variables increases, the dimensions of the data amplify, yielding worse classification performance. In fact, with so many irrelevant and redundant features, most classification algorithms suffer from extensive computation time, a possible decrease in model accuracy and an increased risk of overfitting [7, 12]. As a result, it is necessary to perform dimensionality reduction on the original data by removing the irrelevant features.

Two famous special forms of dimensionality reduction exist. The first one is feature extraction, in which the input data is transformed into a reduced representation set of features, so that new attributes are generated from the initial ones. The second category is feature selection, in which a subset of the existing features is selected for the classification task without any transformation. Generally, feature selection is chosen over feature extraction because it conserves all the information about the importance of each single feature, while in feature extraction the obtained variables are usually not interpretable. We therefore study feature selection, but choosing the most effective feature selection method is not an easy task. Many empirical studies show that manipulating few variables leads to reliable and better understandable models without irrelevant, redundant and noisy data [2, 20].

Feature selection algorithms can be roughly categorized into three types, each with different evaluation criteria [7]: the filter model, the wrapper model and the embedded model. According to [8, 3, 9], a filter method is a preselection process in which a subset of features is first selected independently of the classifier applied afterwards. The wrapper method, on the other hand, uses search techniques to rank the discriminative power of all the possible feature subsets and evaluates each subset based on classification accuracy [6], using the classifier that was incorporated in the feature selection process [15, 14]. The wrapper model generally performs well, but has a high computational cost. The embedded method [20] incorporates the feature selection process in the objective function or algorithm of the classifier. As a result, the embedded approach is considered a natural ability of a classification algorithm, meaning that feature selection takes place naturally as a part of the classification algorithm. Since the embedded approach is algorithm-specific, it is not adequate for our requirement. Wrappers, on the other hand, have many merits that lie in the interaction between the feature selection and the classifier.
Furthermore, in this method, the biases of the feature selection algorithm and of the learning algorithm are equal, as the latter is used to assess the goodness of the subsets considered. However, the main drawback of these methods is their computational weight. As the number of features grows, the number of subsets to be evaluated grows exponentially, so the learning algorithm needs to be called too many times; performing a wrapper method therefore becomes computationally very expensive. According to [2, 17], filter methods are often preferable to other selection methods because of their usability with alternative classifiers and their simplicity. However, filter algorithms often score variables separately from each other without considering the inter-feature redundancy; as a result they do not always achieve the goal of finding the combination of variables that gives the best classification performance [3]. Therefore, one common step up for filter methods is to consider the dependencies and relevance among variables. mRMR [8] (Minimal-Redundancy-Maximum-Relevance) is an effective approach based on studying the mutual information among the features and the target variable while taking the inter-feature dependency into account [19]. This approach selects the features that have the highest relevance to the target class with the minimum inter-feature redundancy, and it selects them greedily. The new approach proposed in this paper aims to show how using mathematical methods improves current results. We use quadratic programming [1] in this paper; the studied

objective function represents the inter-feature redundancy through a quadratic term, and the relationship between each feature and the class label through a linear term.

This work has the following sections. In Section II we review studies related to filter methods and study the mRMR feature selection approach. In Section III we propose an advanced approach using mathematical programming and the mRMR algorithm background. In Section IV we introduce the similarity measure used. Section V is dedicated to empirical results.

II. FILTER METHODS

The processing of filter methods can in most cases be described as follows. First, the relevance of the features is evaluated by looking at the intrinsic properties of the data. Then, a relevance score is computed for each attribute, and the attributes with low scores are removed. Finally, the set of kept features forms the input of the classification algorithm. In spite of the numerous advantages of filters, scoring variables separately from each other is a serious limit of this kind of technique. In fact, when variables are scored individually, filters do not always achieve the objective of finding the feature combination that leads to the optimal classification performance [3]: filter methods fail to consider the inter-feature redundancy.

In general, filter methods select the top-ranked features, the number of retained features being set by the user through experiments. The limit of this ranking approach is that the features could be correlated among themselves. Many studies showed that combining a highly ranked feature with another highly ranked feature for the same task often does not give a great feature set for classification. The reason behind this limit is redundancy in the selected feature set, caused by high correlation between features. The main issue with redundancy is that with many redundant features the final result will not be easy to interpret by business managers, because of the complex representation of the characteristics of the target variable. With numerous mutually highly correlated features, the truly representative features end up being much fewer. According to [8], because features are selected according to their discriminative powers, they do not fully represent the original space covered by the entire dataset. The feature set may correspond to several dominant characteristics of the target variable, but these could still be fine regions of the relevant space, which may cause a lack of generalization ability of the feature set.

A. mRMR Algorithm

A step up for filter methods is to consider dissimilarity among features in order to minimize feature redundancy: the set of selected features should be maximally different from each other. Let S denote the subset of features that we are looking for. The minimum redundancy condition is

    \min P_1, \quad P_1 = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} M(x_i, x_j),    (1)

where M(x_i, x_j) represents the similarity between features x_i and x_j, and |S| is the number of features in S. In general, minimizing only the redundancy is not sufficient to obtain a great performance, so the minimum redundancy criterion should be supplemented by maximizing the relevance between the target variable and the explicative variables. To measure the discriminant power of features when they are differentially expressed for different target classes, a similarity measure M(y, x_i) between the target class y = {0, 1} and the feature expression x_i is used again. This measure quantifies the relevance of x_i for the classification task. Thus the maximum relevance condition is to maximize the total relevance of all features in S:

    \max P_2, \quad P_2 = \frac{1}{|S|} \sum_{x_i \in S} M(y, x_i).    (2)
Combining criteria such as maximal relevance with the target variable and minimum redundancy between features is called the minimum redundancy-maximum relevance (mRMR) approach. The mRMR feature set is obtained by optimizing the problems P_1 and P_2 of Eq. (1) and Eq. (2) simultaneously, which requires combining them into a single criterion function:

    \min \{ P_1 - P_2 \}.    (3)

The mRMR approach has advantages over other filter techniques. With this approach we can get a feature set that is more representative of the target variable, which increases the generalization capacity of the chosen feature set. Consistently, the mRMR approach gives a smaller feature set which effectively covers the same space as a larger conventional feature set. The mRMR criterion is also another version of MaxDep [19], which chooses a subset of features with both minimum redundancy and maximum relevance. In spite of the numerous advantages of the mRMR approach, given the prohibitive cost of considering all possible subsets of features, the mRMR algorithm selects features greedily, minimizing their redundancy with the features chosen in previous steps and maximizing their relevance to the class. A greedy algorithm follows the problem-solving heuristic of making the locally optimal choice at each stage in the hope of finding a global optimum; the problem with this kind of algorithm is that a greedy strategy does not always produce an optimal solution, although it may yield locally optimal solutions that approximate a global one (a concrete sketch of this greedy scheme is given at the end of this section). Moreover, this approach treats the two conditions as equally important, while, depending on the learning problem, the two conditions can have different relative purposes in the objective function; a coefficient balancing the relevance and redundancy criteria should therefore be added to the mRMR objective function. To improve on the mRMR approach, in the next section we use mathematical programming to modify and balance the mRMR objective function and solve it with quadratic programming.
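To make the greedy selection scheme concrete, here is a minimal sketch in R, the language used for the experiments of Section V. The plug-in estimator mi approximates the mutual information measure introduced in Section IV for discretized variables; the function names and the data layout (a data frame X of discrete features and a class vector y) are illustrative assumptions, not the exact code behind our results.

# Plug-in mutual information estimate for two discrete variables
# (see Eq. (7) in Section IV).
mi <- function(a, b) {
  p  <- table(a, b) / length(a)         # joint distribution p(a, b)
  px <- rowSums(p)                      # marginal p(a)
  py <- colSums(p)                      # marginal p(b)
  nz <- p > 0                           # skip zero cells to avoid log(0)
  sum(p[nz] * log(p[nz] / outer(px, py)[nz]))
}

# Greedy mRMR: start from the most relevant feature, then repeatedly add
# the candidate maximizing relevance minus mean redundancy, in the
# spirit of Eq. (3).
mrmr_greedy <- function(X, y, k) {
  relevance <- sapply(X, mi, b = y)               # M(y, x_i) for every feature
  selected  <- names(which.max(relevance))
  while (length(selected) < k) {
    candidates <- setdiff(names(X), selected)
    score <- sapply(candidates, function(f) {
      redundancy <- mean(sapply(selected, function(s) mi(X[[f]], X[[s]])))
      relevance[[f]] - redundancy
    })
    selected <- c(selected, candidates[which.max(score)])
  }
  selected
}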

III. QUADRATIC PROGRAMMING FOR FEATURE SELECTION

A. Problem Statement

The problem of feature selection has been addressed by statistics and machine learning, as well as through other mathematical formulations. Mathematical programming based approaches have proven to be excellent in terms of classification accuracy for a wide range of applications [5, 6]. The mathematical method used here is a new quadratic programming formulation. A quadratic optimization process uses an objective function with quadratic and linear terms. Here, the quadratic term represents the similarity among each pair of variables, whereas the linear term captures the correlation between each feature and the target variable. Assume the classifier learning problem involves N training samples and m variables [20]. A quadratic programming problem aims to minimize a multivariate quadratic function subject to linear constraints, as follows:

    \min_x f(x) = x^T Q x - F^T x
    \text{subject to } x_i \ge 0, \; i = 1, \dots, m, \quad \sum_{i=1}^{m} x_i = 1,    (4)

where F is an m-dimensional row vector with non-negative entries, describing the coefficients of the linear terms in the objective function; F measures how correlated each feature is with the target class (relevance). Q is an (m x m) symmetric positive semi-definite matrix describing the coefficients of the quadratic terms; it represents the similarity among variables (redundancy). The feature-weight decision variables are denoted by the m-dimensional column vector x. We assume that a feasible solution exists and that the constraint region is bounded. When the objective function f(x) is strictly convex for all feasible points, the problem has a unique local minimum, which is also the global minimum. The conditions for solving quadratic programs, including the Lagrangian function and the Karush-Kuhn-Tucker conditions, are explained in detail in [1]. Once the quadratic programming optimization problem has been solved, the features with higher weights are the better variables to use for subsequent classifier training.

B. Conditions balancing

Depending on the learning problem, the two conditions can have different relative purposes in the objective function. Therefore, we introduce a scalar parameter \alpha as follows:

    \min_x f(x) = (1 - \alpha) \, x^T Q x - \alpha \, F^T x,    (5)

where Q and F are defined as before and \alpha \in [0, 1]. If \alpha = 1, only relevance is considered. At the opposite, if \alpha = 0, only independence between features is considered; that is, features with higher weights are those with the lower similarity coefficients to the rest of the features. Every data set has its own best choice of the scalar \alpha; a reasonable choice of \alpha must balance the relation between relevance and redundancy, so a good estimate of \alpha has to be calculated. The relevance and redundancy terms of Eq. (5) are balanced when (1 - \alpha) \bar{Q} = \alpha \bar{F}, where \bar{Q} is the estimate of the mean value of the entries of the matrix Q and \bar{F} is the estimate of the mean value of the elements of the vector F. A practical estimate of \alpha is therefore

    \hat{\alpha} = \frac{\bar{Q}}{\bar{Q} + \bar{F}}.    (6)
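To make this step concrete, here is a minimal sketch of Eqs. (4) to (6) using the quadprog package named in Section V. solve.QP minimizes -d^T x + (1/2) x^T D x, so setting D = 2(1 - alpha) Q and d = alpha F recovers Eq. (5); the small ridge added to D and the function name are illustrative assumptions rather than part of the method.

library(quadprog)

# Q: m x m feature-similarity (redundancy) matrix; rel: m-vector of
# feature-class relevances (the vector F of Eq. (4)), both assumed to be
# precomputed from the mutual information measure of Section IV.
qp_feature_weights <- function(Q, rel) {
  m     <- length(rel)
  alpha <- mean(Q) / (mean(Q) + mean(rel))   # practical estimate, Eq. (6)
  # solve.QP requires a strictly positive definite quadratic term, hence
  # the tiny ridge (an implementation detail, not part of the paper).
  Dmat <- 2 * (1 - alpha) * Q + diag(1e-8, m)
  dvec <- alpha * rel
  Amat <- cbind(rep(1, m), diag(m))          # first column: sum(x) = 1; rest: x_i >= 0
  bvec <- c(1, rep(0, m))
  sol  <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
  sol$solution                               # higher weight = more useful feature
}

The constraint matrix follows the column convention of solve.QP, and the single equality (meq = 1) encodes the simplex constraint of Eq. (4).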
IV. INFORMATION THEORY BASED SIMILARITY MEASURE

The information theory approach has proved effective in solving many problems. One of these problems is feature selection, where information theory notions can be exploited as metrics or as optimization criteria. Such is the case in this paper, where we exploit the mean value of the mutual information between each pair of variables in the subset as a metric to approximate the similarity among features. Formally, the mutual information of two discrete random variables x_i and x_j is defined as

    I(x_i, x_j) = \sum_{x_i} \sum_{x_j} p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i) \, p(x_j)},    (7)

and that of two continuous random variables as

    I(x_i, x_j) = \int \int p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i) \, p(x_j)} \, dx_i \, dx_j.    (8)

V. EMPIRICAL STUDY

In general, the computation of mutual information requires estimating density functions for continuous variables. For simplicity, each variable is discretized using the Weka 3.7 software [4]. We implemented our approach in R using the quadprog package [10, 11]. The studied approach should give good results with any classifier learning algorithm; for simplicity, the logistic regression provided by R is the underlying classifier in all experiments (a sketch of this evaluation protocol is given below).
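As an illustration of the protocol, here is a minimal sketch, assuming data frames train and test that contain the selected features together with a binary class column y; the helper name and the 0.5 cutoff are illustrative assumptions, not a prescription taken from the experiments.

# Fit a logistic regression on the selected features and report the
# overall, type I and type II error rates on the test set; here y = 1 is
# taken to be the "bad risk" class (an assumption of this sketch).
evaluate_selection <- function(train, test, features) {
  f     <- reformulate(features, response = "y")
  model <- glm(f, family = binomial, data = train)
  prob  <- predict(model, newdata = test, type = "response")
  pred  <- as.integer(prob > 0.5)
  c(error = mean(pred != test$y),             # overall error rate
    type1 = mean(pred[test$y == 0] == 1),     # good applicants predicted bad
    type2 = mean(pred[test$y == 1] == 0))     # bad applicants predicted good
}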

The generality of the feature selection problem makes it applicable to a very wide range of domains. In this paper we chose to test the new approach on two real-world credit scoring datasets from the UCI Machine Learning Repository. The first one, the German credit dataset, consists of a set of loans given to a total of 1000 applicants: 700 examples of creditworthy applicants and 300 examples where credit should not be extended. For each applicant, 20 variables describe credit history, account balances, loan purpose, loan amount, employment status, and personal information; each sample contains 13 categorical, 3 continuous and 4 binary features, plus 1 binary class feature. The second one is the Australian credit dataset, composed of 690 instances, of which 307 are creditworthy and 383 are not. All attribute names and values have been changed to meaningless symbols for confidentiality reasons. The Australian dataset presents an interesting mixture of continuous features, nominal features with small numbers of values, and nominal features with larger numbers of values; there are also a few missing values.

The aim of this section is to compare the classification accuracy achieved with the quadratic approach against other filter techniques. Table I and Table II show the average classification error rates for the two datasets as a function of the number of features. Accuracy results are obtained with \alpha = 0.5 for the German dataset and \alpha = 0.489 for the Australian dataset, which means that a roughly equal tradeoff between relevance and redundancy is best for both datasets. From Table I and Table II it is clear that using the quadratic approach for variable selection leads to the lowest error rate.

TABLE I. RESULTS SUMMARY FOR THE GERMAN DATASET, WITH 7 SELECTED FEATURES

  Test                         error   Type I error   Type II error
  Quadratic                    0.3     0.             0.
  Relief                       0.4     0.33           0.87
  Information Gain             0.5     0.38           0.3
  CFS Feature Set Evaluation   0.54    0.34           0.344
  mRMR                         0.66    0.5            0.355
  MaxRel                       0.5     0.38           0.3

TABLE II. RESULTS SUMMARY FOR THE AUSTRALIAN DATASET, WITH 6 SELECTED FEATURES

  Test                         error   Type I error   Type II error
  Quadratic                    0.6     0.55           0.09
  Relief                       0.30    0.64           0.099
  Information Gain             0.7     0.63           0.094
  CFS Feature Set Evaluation   0.6     0.65           0.098
  mRMR                         0.30    0.64           0.099
  MaxRel                       0.39    0.79           0.0

VI. CONCLUSION

This paper has studied a new feature selection method based on mathematical programming. The method rests on the optimization of a quadratic function that uses the mutual information measure in order to capture the similarity and the nonlinear dependencies among the data.

ACKNOWLEDGMENT

The authors would like to thank Prof. Mohamed Limam, who provided valuable advice, support and guidance; this research would not have been possible without his help.

REFERENCES

[1] M. Bazaraa, H. Sherali, C. Shetty, Nonlinear Programming: Theory and Algorithms, John Wiley, New York, 1993.
[2] R. Bekkerman, R. El-Yaniv, N. Tishby, Y. Winter, "Distributional word clusters vs. words for text categorization", J. Mach. Learn. Res., vol. 3, 2003, pp. 1183-1208.
[3] A. L. Blum, P. Langley, "Selection of relevant features and examples in machine learning", Artificial Intelligence, vol. 97, 1997, pp. 245-271.
[4] R. R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P. Reutemann, A. Seewald, D. Scuse, "Weka manual (3.7.1)", 2009.
[5] P. S. Bradley, O. L. Mangasarian, W. N. Street, "Feature Selection via Mathematical Programming", INFORMS J. on Computing, vol. 10, 1998, pp. 209-217.
[6] P. S. Bradley, U. M. Fayyad, O. L. Mangasarian, "Mathematical Programming for Data Mining: Formulations and Challenges", INFORMS J. on Computing, vol. 11, num. 3, 1999, pp. 217-238.
[7] Y. S. Chen, "Classifying credit ratings for Asian banks using integrating feature selection and the CPDA-based rough sets approach", Knowledge-Based Systems, 2012.
[8] C. Ding, H. Peng, "Minimum Redundancy Feature Selection from Microarray Gene Expression Data", J. Bioinformatics and Computational Biology, vol. 3, num. 2, 2005, pp. 185-205.
[9] G. Forman, "BNS feature scaling: an improved representation over tf-idf for SVM text classification", CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, New York, NY, USA, 2008, ACM, pp. 263-270.
[10] D. Goldfarb, A. Idnani, "Dual and Primal-Dual Methods for Solving Strictly Convex Quadratic Programs", in J. P. Hennart (ed.), Numerical Analysis, Springer-Verlag, 1982, pp. 226-239.
[11] D. Goldfarb, A. Idnani, "A numerically stable dual method for solving strictly convex quadratic programs", Mathematical Programming, vol. 27, 1983, pp. 1-33.
[12] T. Howley, M. G. Madden, M. L. O'Connell, A. G. Ryder, "The effect of principal component analysis on machine learning accuracy with high-dimensional spectral data", Knowl.-Based Syst., vol. 19, num. 5, 2006, pp. 363-370.
[13] A. K. Jain, R. P. W. Duin, J. Mao, "Statistical pattern recognition: a review", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, num. 1, 2000, pp. 4-37.
[14] R. Kohavi, G. H. John, "Wrappers for Feature Subset Selection", Artificial Intelligence, vol. 97, num. 1-2, 1997, pp. 273-324.
[15] D. Koller, M. Sahami, "Toward Optimal Feature Selection", International Conference on Machine Learning, 1996, pp. 284-292.
[16] S. Y. Kung, "Feature selection for pairwise scoring kernels with applications to protein subcellular localization", IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2007, pp. 569-572.
[17] Y. Liu, M. Schumann, "Data mining feature selection for credit scoring models", Journal of the Operational Research Society, vol. 56, num. 9, 2005, pp. 1099-1108.
[18] L. C. Molina, L. Belanche, A. Nebot, "Feature Selection Algorithms: A Survey and Experimental Evaluation", Proc. IEEE International Conference on Data Mining, 2002, pp. 306-313, IEEE Computer Society.
[19] H. Peng, F. Long, C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, 2005, pp. 1226-1238.
[20] I. Rodriguez-Lujan, R. Huerta, C. Elkan, C. Santa Cruz, "Quadratic Programming Feature Selection", Journal of Machine Learning Research, vol. 11, 2010, pp. 1491-1516.
[21] C. M. Wang, W. F. Huang, "Evolutionary-based feature selection approaches with new criteria for data mining: A case study of credit approval data", Expert Syst. Appl., vol. 36, num. 3, 2009, pp. 5900-5908.
