Using Ambiguity Measure Feature Selection Algorithm for Support Vector Machine Classifier


Saket S.R. Mengle, Information Retrieval Lab, Computer Science Department, Illinois Institute of Technology, Chicago, Illinois, U.S.A.
Nazli Goharian, Information Retrieval Lab, Computer Science Department, Illinois Institute of Technology, Chicago, Illinois, U.S.A.

ABSTRACT
With the ever-increasing number of documents on the web, in digital libraries, news sources, etc., the need for a text classifier that can classify massive amounts of data is becoming more critical and difficult to meet. The major problem in text classification is the high dimensionality of the feature space. The Support Vector Machine (SVM) classifier has been shown to perform consistently better than other text classification algorithms. However, the time taken to train an SVM model is greater than for other algorithms. We explore the use of the Ambiguity Measure (AM) feature selection method, which uses only the most unambiguous keywords to predict the category of a document. Our analysis shows that AM reduces the training time by more than 50% compared to when no feature selection is used, while maintaining the accuracy of the text classifier equivalent to or better than using the whole feature set. We empirically show the effectiveness of our approach in outperforming seven different feature selection methods on two standard benchmark datasets.

Categories and Subject Descriptors
H.3.3 [Information Systems and Retrieval]: Information filtering, Information Search and Retrieval - search process

General Terms
Algorithms, Performance, Experimentation

Keywords
Feature selection, Text classification, SVM

1. INTRODUCTION
Text classification involves scanning through text documents and assigning categories to them to reflect their content. A supervised learning algorithm induces decision rules that are used to categorize documents into different categories by learning from a set of training examples. One of the problems in text classification is the high dimensionality of the feature space.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC'08, March 16-20, 2008, Fortaleza, Ceará, Brazil. Copyright 2008 ACM /08/0003 $5.00.

Some features are commonly used terms that are not specific to any category. These features may hurt the accuracy of the classifier. Moreover, the time required for induction increases as the number of features increases; that is, irrelevant features lead to an increase in training time. Feature selection methods are used to achieve two objectives: to reduce the size of the feature set to optimize classification efficiency, and to reduce the noise in the data to optimize classification effectiveness [11]. Feature selection methods are used as a preprocessing step in the learning process, and the selected features from the training set are then used to classify new incoming documents. Among the well-known feature selection methods are information gain, expected cross entropy, the weight of evidence of text, odds ratio, term frequency, mutual information and CHI. The Ambiguity Measure (AM) feature selection method has been shown to perform better than state-of-the-art feature selection algorithms on statistical classifiers [9]. The Ambiguity Measure algorithm selects the most unambiguous features, where unambiguous features are those whose presence in a document indicates a high degree of confidence that the document belongs to one specific category. One of the widely used text classification algorithms is Support Vector Machines (SVM) [3][4][5][16]. Prior work [5] indicates that SVM performs consistently better than Naïve Bayes, kNN, C4.5 and Rocchio text classifiers. However, one of the limitations of SVM is its time complexity.
[16] shows that SVM has a higher time complexity for training a model than other text classification algorithms. To overcome this limitation of SVM, feature selection methods are used as a preprocessing step before training SVM [12][13][14]. Many well-known feature selection algorithms have been used with SVM to improve its accuracy and efficiency. We explore the effects of the AM feature selection method when applied to SVM and evaluate its performance in comparison to the published state-of-the-art feature selection algorithms on SVM. We use the AM feature selection method as a pre-processing step for the Support Vector Machine classifier. The features whose AM values are below a given threshold, i.e., the more ambiguous terms, are purged, while the features whose AM values are above the threshold are used for the SVM learning phase. We compare AM with the other feature selection algorithms on two different standard benchmark datasets and show that AM performs statistically significantly better than seven published state-of-the-art feature selection methods, reported in [13][14], with 99% confidence. We also empirically show that we can reduce the
training time by more than 50% compared to when no feature selection is used, while maintaining the accuracy of the classifier.

2. PRIOR WORK
To show the effectiveness of our feature selection algorithm, we compare our approach with the existing feature selection methods listed in Table 1. The description of these feature selection methods is given in [2][13][15][17]; thus we forgo their mathematical justification and provide a brief explanation of the differences. Feature selection methods like odds ratio, information gain and CHI use knowledge about the presence of terms in the relevant categories (c_i) as well as in the non-relevant categories (~c_i). In our approach, the AM feature selection method only uses knowledge about the presence of terms in the relevant categories to calculate how confidently a keyword points to a given category. Our objective is to choose only the features that confidently point to only one category. In the Improved Gini Index and cross entropy methods, the probabilities of a term with respect to all categories are considered. Thus, if the term t appears many times in the documents of category c_i, or if it appears in every document of category c_i, it is assigned a higher weight. In a situation where t appears in both categories c_1 and c_2 an equal number of times, and moreover appears in every document of both categories, it is assigned a lower weight. In this case t is ambiguous, as it does not point to a single category. Our proposed AM feature selection avoids such situations and assigns a lower weight to such features. For the tfidf method, tf refers to term frequency with respect to a given category and df indicates the ratio of documents in the collection that contain a given term. In the tfcf method, cf indicates the ratio of categories that contain a given term. Some terms may appear only in one category a small number of times. Although these terms appear in only a single category or document, they are purged during the feature selection process if they have a low term frequency. Furthermore, some terms frequently appear in a few categories or documents (i.e., a high cf or df) with a similar distribution of occurrence across those categories. Such terms are ambiguous, as they do not point strongly to only a single category; however, as their term frequency is high, they may still be selected as good features. The AM feature selection method avoids such situations by only considering the ratio of the number of occurrences of a term in a given category to the total number of occurrences of the term in the training set. Thus, both of these situations are avoided.

Table 1. Different feature selection algorithms (~t denotes the absence of term t; ~c denotes the complement of category c)
Odds Ratio [17]: OR(t_k, c_i) = [P(t_k | c_i) * (1 - P(t_k | ~c_i))] / [(1 - P(t_k | c_i)) * P(t_k | ~c_i)]
tfcf [2]: tfcf(t, c) = TF(t, c) * log(|C| / cf(t))
tfidf [2]: tfidf(t, d) = TF(t, d) * log(|D| / df(t))
Improved Gini Index [13]: Gini(t) = sum_i P(t | c_i)^2 * P(c_i | t)^2
Information Gain [15]: IG(t_k) = sum over c in {c_i, ~c_i} and t in {t_k, ~t_k} of P(t, c) * log(P(t, c) / (P(t) * P(c)))
Cross Entropy [13]: CE(t) = P(t) * sum_i P(c_i | t) * log(P(c_i | t) / P(c_i))
CHI [15]: CHI(t, c) = N * [P(t, c) * P(~t, ~c) - P(t, ~c) * P(~t, c)]^2 / (P(t) * P(~t) * P(c) * P(~c))

3. METHODOLOGY
Initially, we describe the intuitive motivation behind our approach and then provide a formal definition of our method. We consider the human perception of identifying the topic of a document by glancing at it and capturing its keywords. Normally, one bases a decision about the topic of a document on the most unambiguous words that the eye captures. We explain this using a hypothetical example. Consider the short paragraph below, extracted from [6]:

"Metallica is a Grammy Award-winning American heavy metal/thrash metal band formed in 1981 and has become one of the most commercially successful musical acts of recent decades. They are considered one of the 'Big Four' pioneers of thrash metal, along with Anthrax, Slayer, and Megadeth. Metallica has sold more than 90 million records worldwide, including 57 million albums in the United States alone."

The paragraph seems to be about Music. Our human perception is based on our knowledge of the domain or what we hear daily on various subjects.
Thus, if one is familiar with the famous rock metal band Metallica, then without reading the text, one can confidently claim that the text belongs to Music rather than Medicine or Sports. Accordingly, if a feature points to only one category, we assign a higher ambiguity measure to that feature, and if a feature is vague and does not point to any given category in particular, we assign a lower ambiguity measure to it. Formally, the Ambiguity Measure (AM) is defined as the probability that a term falls into a particular category. The closer the AM value is to 1, the less ambiguous the term is; conversely, if the AM value is closer to 0, the term is more ambiguous with respect to a given category. The formula for calculating AM is as follows:

AM(t_k, C_i) = tf(t_k, C_i) / tf(t_k)

AM(t_k) = max_i AM(t_k, C_i)

where tf(t_k, C_i) is the term frequency of term t_k in category C_i and tf(t_k) is the term frequency of term t_k in the entire collection. The result of the Ambiguity Measure calculation for the feature Metallica is given in Table 2, indicating the Music category for the term. The AM value for the feature Metallica is 0.99, which indicates that Metallica is an unambiguous feature and should be kept and not filtered. The feature Anthrax is related to the Medicine category, but with a lower AM value: Anthrax is also the name of a famous music band of the 1980s, and hence it also appears in the category Music. Thus, the ambiguity measure of Anthrax is less than that of Metallica. In some cases the
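A minimal sketch of the AM computation and threshold filtering described above (toy counts chosen to mirror the Metallica/Records example; the data layout and function names are ours, not the paper's implementation):

```python
def ambiguity_measure(term_cat_tf):
    """AM(t, C) = tf(t, C) / tf(t); AM(t) is the maximum over all categories.

    term_cat_tf maps term -> {category: term frequency}.
    Returns term -> (category with the highest AM, AM value).
    """
    scores = {}
    for term, cat_tf in term_cat_tf.items():
        total = sum(cat_tf.values())        # tf(t) over the whole collection
        best = max(cat_tf, key=cat_tf.get)  # category maximizing AM(t, C)
        scores[term] = (best, cat_tf[best] / total)
    return scores

def select_features(term_cat_tf, threshold):
    """Keep only the unambiguous terms: those with AM above the threshold.
    The kept AM value can double as the feature's weight in the SVM input."""
    return {t: am for t, (_, am) in ambiguity_measure(term_cat_tf).items()
            if am > threshold}

# Toy counts: "metallica" is concentrated in one category, "records" is spread.
counts = {
    "metallica": {"music": 99, "medicine": 1},
    "records": {"music": 5, "medicine": 5, "sports": 5},
}
scores = ambiguity_measure(counts)
kept = select_features(counts, threshold=0.5)
```

With a threshold of 0.5, only `metallica` survives; its AM value (0.99) would then serve as the feature weight passed to the SVM.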
ambiguity measure of some features is low because they appear consistently in different categories. An example of such is the term Records, which may appear in all different categories. Thus, the AM value of such a term is low (0.33), and it is desirable to filter out such features. This reduction in the dimensionality of the feature set increases accuracy by avoiding the terms that have lower AM values. We empirically determine a threshold and filter out all features whose AM value is below that threshold.

[Table 2. Ambiguity Measure (AM) example: per-category counts and AM values for the terms Metallica, Anthrax, and Records across the categories Medicine, Music, Sports, and Politics]

Furthermore, we also use the AM value of a feature as its weight. In the SVM classifier, a weight of importance is assigned to each feature. Thus, if the AM value of a feature is higher, then the feature has more weight, and if the AM value is lower, the feature has less weight.

4. EXPERIMENTAL SETUP
In all our experiments, we use a single computer with an AMD Athlon 2.16GHz processor and 1 GB of RAM. We use the linear SVM kernel in our experiments, as the non-linear versions gain very little in terms of performance [11]. For training and testing the SVM model, we use LibSVM 2.84 [1], a software package commonly used for classifying documents into binary or multi-labeled categories.

4.1 Datasets
To demonstrate the effectiveness of the AM feature selection algorithm, we perform experiments on two standard benchmark datasets: 20 Newsgroups and Reuters 21578.

20 Newsgroups
The 20 Newsgroups (20NG) dataset [7] consists of a total of 19,997 documents that are categorized into twenty different news groups. Each category contains one thousand documents. Some of the categories are very closely related to each other (e.g., comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), while others are highly unrelated (e.g., misc.forsale and soc.religion.christian). This characteristic contributes to the difficulty of categorizing documents that belong to very similar categories. We use a 9-1 train-test split for the 20 Newsgroups dataset.
Thus we have 18,000 documents for training and 1,997 documents for testing. The total number of unique features in the 20 Newsgroups dataset is 62,061.

Reuters 21578
The Reuters 21578 corpus [8] contains Reuters news articles from 1987. The documents range from being multi-labeled, single-labeled, or not labeled. The Reuters dataset consists of a total of 135 categories (labels). However, ten of these categories have significantly more documents than the rest. Thus, commonly the top 10 categories are used for experimentation and to compare the accuracy of classification results. The top 10 categories of Reuters 21578 are earn, acq, money-fx, grain, trade, crude, interest, wheat, corn and ship. We use the Mod-Apte train-test split for the Reuters 21578 dataset. There are 7,053 documents in the training set and 2,726 documents in the testing set. The total number of unique features in the Reuters 21578 dataset is 19,428.

4.2 Evaluation Measures
To evaluate the accuracy of our approach and compare AM to the results of the state-of-the-art feature selection methods, we use the micro-F1 measure. The F1 measure is a common measure in text classification that combines recall and precision into a single score, giving equal importance to each, according to the formula:

F1 = (2 * P * R) / (P + R)

where P is precision and R is recall.

5. RESULTS & ANALYSIS
We organize the results into two subsections. In Section 5.1, the effectiveness of our approach on two standard benchmark datasets is presented. We compare our results with the published state-of-the-art results and show that AM performs statistically significantly better than the seven existing feature selection algorithms that are summarized and published in [13][14]. To our knowledge, the classification results for the SVM algorithm using odds ratio, tfidf and tfcf have not been reported in any prior work on the Reuters 21578 and 20 Newsgroups datasets; thus, we implemented these feature selection methods on SVM and report the results in Figure 1 and Figure 2. In Section 5.2, we demonstrate how AM feature selection reduces the training time while optimizing the F1 measure.
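The micro-averaged F1 used in our evaluation pools true positives, false positives, and false negatives across all categories before applying the F1 formula. A minimal sketch (toy counts; the tuple layout is ours):

```python
def micro_f1(per_category_counts):
    """per_category_counts: iterable of (tp, fp, fn) tuples, one per category.
    Micro-averaging sums the counts globally, then applies F1 = 2PR / (P + R)."""
    tp = sum(c[0] for c in per_category_counts)
    fp = sum(c[1] for c in per_category_counts)
    fn = sum(c[2] for c in per_category_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two toy categories with (tp, fp, fn) counts.
score = micro_f1([(80, 10, 20), (40, 10, 10)])
```

Because the counts are pooled before averaging, micro-F1 weights frequent categories more heavily than a macro average would.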
We also explain the effects of the threshold value on the classification results.

5.1 Accuracy Comparison
The comparison of the classification performance of the AM feature selection method with the various feature selection methods reported in [13] on the Reuters 21578 dataset is summarized in Figure 1. [13] proposed an improved version of the Gini index that performs better than the other reported feature selection algorithms. Our proposed AM feature selection method statistically significantly outperforms the Improved Gini index and the other feature selection methods depicted in Figure 1, with a confidence level of 99% on Reuters 21578 using a two-tailed paired t-test. Similarly, the classification performance on the 20 Newsgroups dataset is summarized in Figure 2. We compare our results to the orthogonal centroid feature selection (OCFS) method reported in [14]. To keep our presentation of results consistent with [14], we, too, report the micro-F1 measures of OCFS after applying a ceiling function to the results, rounding to the next highest integer. As shown, the AM feature selection method clearly outperforms the OCFS method on the 20 Newsgroups dataset with a significant improvement. Moreover, AM also statistically significantly outperforms the accuracy of the information gain, CHI, odds ratio, tfidf and tfcf feature selection methods. As depicted in Figure 1 and Figure 2, the F1 measure on the Reuters dataset (89.14%) is significantly higher than the F1 measure on the 20 Newsgroups dataset (78.74%). The difference between the F1
results of the Reuters 21578 and 20NG datasets is due to the percentage of positive and negative examples in the training sets of each. That is, we only consider the top 10 categories for the Reuters 21578 dataset. The training set consists of 10% of every category on average. As SVM is a binary classifier and we use the one-against-rest approach for multi-labeled datasets, the number of positive examples (actual category) in the training set is 10% and the number of negative examples is 90%. In the 20NG dataset, we have 20 categories with 5% of the documents of each category in the training set. Thus, during classification, we have 5% positive examples and 95% negative examples. Hence, there are fewer positive examples to learn from in the 20NG dataset as compared to the Reuters dataset, resulting in better accuracy for the Reuters 21578 dataset.

[Figure 1: Comparison of AM with other feature selection methods in terms of F1 measure on the Reuters 21578 dataset]
[Figure 2: Comparison of AM with other feature selection methods in terms of F1 measure on the 20 Newsgroups dataset]
[Figure 3: Correlation between AM thresholds and training/testing time, and between the AM threshold and micro-F1, using the SVM classifier on the Reuters 21578 dataset]
[Figure 4: Correlation between AM thresholds and training/testing time, and between the AM threshold and micro-F1, using the SVM classifier on the 20 Newsgroups dataset]

5.2 Tradeoff of accuracy and time with respect to threshold values
In this section, we report the effects of the AM thresholds in the process of feature selection on the values of the F1 measure and the corresponding time taken to train the model and classify the documents using the SVM classifier. Figure 3 and Figure 4 show the results for the Reuters 21578 and 20 Newsgroups datasets, respectively. The x-axis represents different threshold values and the y-axis represents the micro-F1 measure and time. The threshold value indicates that all features whose weights are above that value are selected and the remaining features are filtered.
The "% of keywords" value (Figures 3 and 4) indicates the corresponding percentage of keywords selected when the threshold is set to a given value. As shown in Figure 3, when we apply the AM feature selection method, the micro-F1 measure increases as we filter out the features with lower AM values. We obtain the best micro-F1 value when the threshold is set to 0.3. Only 70.16% of the features are retained when the threshold is 0.3. As the threshold is increased further, the micro-F1 measure starts dropping. This indicates that when the threshold is less than 0.3, most of the features that are filtered are ambiguous, and filtering them leads to a higher classifier accuracy. When the threshold is above 0.3, most of the features that are filtered contain information relevant to text classification; thus, when these features are filtered, the accuracy of the classifier decreases. The training time includes the feature selection time and the time taken to train the SVM model using LibSVM. The testing time is the time taken by LibSVM to classify the testing data. Figure 3 demonstrates that when no feature selection is applied, i.e., when the threshold is equal to zero, the time taken for training is 33 seconds. When we reduce the dimensionality of the feature set by setting the threshold to 0.3, the training time reduces to 12 seconds. This demonstrates the effect of feature selection in reducing the training time for SVM while optimizing the results.
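The threshold sweep behind Figures 3 and 4 can be sketched as follows (hypothetical AM values; the actual experiments train LibSVM on the retained features at each threshold, which we omit here):

```python
def retained_fraction(am_scores, threshold):
    """Fraction of the vocabulary whose AM value exceeds the threshold."""
    kept = sum(1 for am in am_scores.values() if am > threshold)
    return kept / len(am_scores)

# Hypothetical AM values for a tiny vocabulary.
am_scores = {"metallica": 0.99, "anthrax": 0.6, "records": 0.33, "the": 0.2}

# Raising the threshold filters more ambiguous features, shrinking the
# SVM training problem; micro-F1 typically peaks at an intermediate value.
sweep = {t: retained_fraction(am_scores, t) for t in (0.0, 0.2, 0.3, 0.5)}
```

Plotting the retained fraction (and the resulting micro-F1) against the threshold reproduces the shape of the tradeoff curves discussed above.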
As shown in Figure 4, the behavior of the micro-F1 measure on the 20 Newsgroups dataset is similar to the results on the Reuters dataset. The results consistently improve when the threshold is below 0.2. Only 41% of the features are retained when the threshold is set to 0.2. As the threshold increases, more features are filtered; thus, past a certain point the accuracy of the classifier consistently degrades as the threshold further increases. When no feature selection is applied, the time taken for training is 387 seconds. However, when we reduce the dimensionality of the feature set by setting the threshold to 0.2, the training time reduces to 185 seconds. We also obtain the best F1 measure value when the threshold is set to 0.2. This shows that even though the learning time is reduced by more than 50%, we still obtain comparable or better results than when we do not apply any feature selection. The 20 Newsgroups dataset has more training documents (18,000) than the Reuters 21578 dataset (7,053). The number of features (62,061) and the average document length (78) for the 20 Newsgroups dataset are also greater than for the Reuters 21578 dataset (number of features: 19,428; average document length: 53). Thus, the training time for 20 Newsgroups is greater than the training time for the Reuters 21578 dataset. One of the limitations of using feature selection algorithms is finding a proper threshold for a given dataset. We found the threshold for the Reuters 21578 dataset to be 0.3 and for the 20 Newsgroups dataset to be 0.2. Additionally, we experimented using stratified 10-fold cross validation and confirmed the same thresholds as we reported for the Reuters Mod-Apte split and the 20 Newsgroups 9-1 split. To further investigate this problem, we experimented on two additional standard datasets from the statlog collection [10]: the DNA dataset (3 categories; 2,000 training documents; 1,186 testing documents) and the Vehicle dataset (4 categories; 761 training documents; 85 testing documents). We found that the threshold for both the DNA dataset (micro-F1: 93.17%) and the Vehicle dataset (micro-F1: 82.9%) is also 0.3.
Thus, the observations indicate that a threshold between 0.2 and 0.3 yields the best results on the four datasets we used in our experimentation.

6. CONCLUSION
We explored an effective feature selection algorithm, Ambiguity Measure (AM), and applied AM to SVM text classification. With an ever-increasing number of digital documents, many traditional text classification techniques fail to handle the scale of this data due to their time complexity and space requirements. In this paper, we have shown that the AM feature selection method can reduce the computation time of the SVM text classifier without hurting the effectiveness of the classifier. We performed experiments on two standard benchmark datasets, Reuters 21578 and 20 Newsgroups. We showed that AM performs statistically significantly better than the currently published state-of-the-art feature selection algorithms on SVM. Furthermore, we provided an analysis of how micro-F1 is affected as we set more stringent thresholds for feature selection. We demonstrated that as the threshold for selecting features is increased, the micro-F1 measure improves up to a specific threshold, while the time taken for training the classifier is much lower than when no feature selection is used. Increasing the threshold beyond that point decreases the effectiveness of the text classifier.

7. REFERENCES
[1] Chang C.C., Lin C.J., LIBSVM: a library for support vector machines, 2001.
[2] Chih H.B., Kulathuramaiyer N., An Empirical Study of Feature Selection for Text Categorization based on Term Weightage. IEEE/WIC/ACM International Conference on Web Intelligence, 2004.
[3] Cortes C., Vapnik V., Support-vector networks. Machine Learning, Volume 20, Number 3, September 1995, pg 273-297.
[4] Joachims T., Making large-scale support vector machine learning practical. In B. Schölkopf et al. (Eds.), Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.
[5] Joachims T., Text Categorization with Support Vector Machines: Learning with many relevant features.
10th European Conference on Machine Learning, 1998.
[6]
[7] Lang K., Original 20 Newsgroups Dataset. people.csail.mit.edu/jrennie/20Newsgroups.
[8] Lewis D., Reuters-21578. www.daviddlewis.com/resources/testcollections/reuters21578.
[9] Mengle S., Goharian N., Platt A., FACT: Fast Algorithm for Categorizing Text. IEEE 5th International Conference on Intelligence and Security Informatics, 2007.
[10] Michie D., Spiegelhalter D., Taylor C., Machine Learning, Neural and Statistical Classification. Prentice Hall, 1994.
[11] Mladenić D., Brank J., Grobelnik M., Milic-Frayling N., Feature Selection using Linear Classifier Weights: Interaction with Classification Models. 27th ACM SIGIR Conference on Research and Development in Information Retrieval, 2004.
[12] Novovičová J., Malík A., Information-theoretic feature selection algorithms for text classification. IEEE International Joint Conference on Neural Networks (IJCNN), 2005, Volume 5.
[13] Wenqian S., Houkuan H., Haibin Z., Yongmin L., Youli Q., Zhihai W., A novel feature selection algorithm for text classification. Expert Systems with Applications: An International Journal, Volume 33, Issue 1, 2007, pg 1-5.
[14] Yan J., Liu N., Zhang B., Yan S., Chen Z., Cheng Q., Fan W., Ma W., OCFS: optimal orthogonal centroid feature selection for text categorization. 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005.
[15] Yang Y., Pedersen J., A comparative study on feature selection in text categorization. 14th International Conference on Machine Learning, 1997, pg 412-420.
[16] Yang Y., Zhang J., Kisiel B., A scalability analysis of classifiers in text categorization. 26th ACM SIGIR Conference on Research and Development in Information Retrieval, 2003.
[17] Zheng Z., Srihari R., Optimally Combining Positive and Negative Features for Text Categorization. In Proceedings of the ICML Workshop on Learning from Imbalanced Datasets II, Washington DC, 2003.


More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 48 CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 3.1 INTRODUCTION The raw mcroarray data s bascally an mage wth dfferent colors ndcatng hybrdzaton (Xue

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study

Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study Arabc Text Classfcaton Usng N-Gram Frequency Statstcs A Comparatve Study Lala Khresat Dept. of Computer Scence, Math and Physcs Farlegh Dcknson Unversty 285 Madson Ave, Madson NJ 07940 Khresat@fdu.edu

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Informaton Retreval Systems Jm Martn! Lecture 11 9/29/2011 Today 9/29 Classfcaton Naïve Bayes classfcaton Ungram LM 1 Where we are... Bascs of ad hoc retreval Indexng Term weghtng/scorng Cosne

More information

Associative Based Classification Algorithm For Diabetes Disease Prediction

Associative Based Classification Algorithm For Diabetes Disease Prediction Internatonal Journal of Engneerng Trends and Technology (IJETT) Volume-41 Number-3 - November 016 Assocatve Based Classfcaton Algorthm For Dabetes Dsease Predcton 1 N. Gnana Deepka, Y.surekha, 3 G.Laltha

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton We-Chh Hsu, Tsan-Yng Yu E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton

More information

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University CAN COMPUTERS LEARN FASTER? Seyda Ertekn Computer Scence & Engneerng The Pennsylvana State Unversty sertekn@cse.psu.edu ABSTRACT Ever snce computers were nvented, manknd wondered whether they mght be made

More information

An Anti-Noise Text Categorization Method based on Support Vector Machines *

An Anti-Noise Text Categorization Method based on Support Vector Machines * An Ant-Nose Text ategorzaton Method based on Support Vector Machnes * hen Ln, Huang Je and Gong Zheng-Hu School of omputer Scence, Natonal Unversty of Defense Technology, hangsha, 410073, hna chenln@nudt.edu.cn,

More information

Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Incremental Learning with Support Vector Machines and Fuzzy Set Theory The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines A Modfed Medan Flter for the Removal of Impulse Nose Based on the Support Vector Machnes H. GOMEZ-MORENO, S. MALDONADO-BASCON, F. LOPEZ-FERRERAS, M. UTRILLA- MANSO AND P. GIL-JIMENEZ Departamento de Teoría

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Feature Selection as an Improving Step for Decision Tree Construction

Feature Selection as an Improving Step for Decision Tree Construction 2009 Internatonal Conference on Machne Learnng and Computng IPCSIT vol.3 (2011) (2011) IACSIT Press, Sngapore Feature Selecton as an Improvng Step for Decson Tree Constructon Mahd Esmael 1, Fazekas Gabor

More information

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval Proceedngs of the Thrd NTCIR Workshop Descrpton of NTU Approach to NTCIR3 Multlngual Informaton Retreval Wen-Cheng Ln and Hsn-Hs Chen Department of Computer Scence and Informaton Engneerng Natonal Tawan

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Detection of an Object by using Principal Component Analysis

Detection of an Object by using Principal Component Analysis Detecton of an Object by usng Prncpal Component Analyss 1. G. Nagaven, 2. Dr. T. Sreenvasulu Reddy 1. M.Tech, Department of EEE, SVUCE, Trupath, Inda. 2. Assoc. Professor, Department of ECE, SVUCE, Trupath,

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information

Data Mining: Model Evaluation

Data Mining: Model Evaluation Data Mnng: Model Evaluaton Aprl 16, 2013 1 Issues: Evaluatng Classfcaton Methods Accurac classfer accurac: predctng class label predctor accurac: guessng value of predcted attrbutes Speed tme to construct

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Selecting Query Term Alterations for Web Search by Exploiting Query Contexts

Selecting Query Term Alterations for Web Search by Exploiting Query Contexts Selectng Query Term Alteratons for Web Search by Explotng Query Contexts Guhong Cao Stephen Robertson Jan-Yun Ne Dept. of Computer Scence and Operatons Research Mcrosoft Research at Cambrdge Dept. of Computer

More information

Feature Selection for Natural Language Call Routing Based on Self-Adaptive Genetic Algorithm

Feature Selection for Natural Language Call Routing Based on Self-Adaptive Genetic Algorithm IOP Conference Seres: Materals Scence and Engneerng PAPER OPEN ACCESS Feature Selecton for Natural Language Call Routng Based on Self-Adaptve Genetc Algorthm To cte ths artcle: A Koromyslova et al 017

More information

Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies

Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies Deep Classfer: Automatcally Categorzng Search Results nto Large-Scale Herarches Dkan Xng 1, Gu-Rong Xue 1, Qang Yang 2, Yong Yu 1 1 Shangha Jao Tong Unversty, Shangha, Chna {xaobao,grxue,yyu}@apex.sjtu.edu.cn

More information

Web Document Classification Based on Fuzzy Association

Web Document Classification Based on Fuzzy Association Web Document Classfcaton Based on Fuzzy Assocaton Choochart Haruechayasa, Me-Lng Shyu Department of Electrcal and Computer Engneerng Unversty of Mam Coral Gables, FL 33124, USA charuech@mam.edu, shyu@mam.edu

More information

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures A Novel Adaptve Descrptor Algorthm for Ternary Pattern Textures Fahuan Hu 1,2, Guopng Lu 1 *, Zengwen Dong 1 1.School of Mechancal & Electrcal Engneerng, Nanchang Unversty, Nanchang, 330031, Chna; 2. School

More information

Random Kernel Perceptron on ATTiny2313 Microcontroller

Random Kernel Perceptron on ATTiny2313 Microcontroller Random Kernel Perceptron on ATTny233 Mcrocontroller Nemanja Djurc Department of Computer and Informaton Scences, Temple Unversty Phladelpha, PA 922, USA nemanja.djurc@temple.edu Slobodan Vucetc Department

More information

Feature Kernel Functions: Improving SVMs Using High-level Knowledge

Feature Kernel Functions: Improving SVMs Using High-level Knowledge Feature Kernel Functons: Improvng SVMs Usng Hgh-level Knowledge Qang Sun, Gerald DeJong Department of Computer Scence, Unversty of Illnos at Urbana-Champagn qangsun@uuc.edu, dejong@cs.uuc.edu Abstract

More information

Multiclass Object Recognition based on Texture Linear Genetic Programming

Multiclass Object Recognition based on Texture Linear Genetic Programming Multclass Object Recognton based on Texture Lnear Genetc Programmng Gustavo Olague 1, Eva Romero 1 Leonardo Trujllo 1, and Br Bhanu 2 1 CICESE, Km. 107 carretera Tjuana-Ensenada, Mexco, olague@ccese.mx,

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors Onlne Detecton and Classfcaton of Movng Objects Usng Progressvely Improvng Detectors Omar Javed Saad Al Mubarak Shah Computer Vson Lab School of Computer Scence Unversty of Central Florda Orlando, FL 32816

More information

Using Neural Networks and Support Vector Machines in Data Mining

Using Neural Networks and Support Vector Machines in Data Mining Usng eural etworks and Support Vector Machnes n Data Mnng RICHARD A. WASIOWSKI Computer Scence Department Calforna State Unversty Domnguez Hlls Carson, CA 90747 USA Abstract: - Multvarate data analyss

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Journal of Process Control

Journal of Process Control Journal of Process Control (0) 738 750 Contents lsts avalable at ScVerse ScenceDrect Journal of Process Control j ourna l ho me pag e: wwwelsevercom/locate/jprocont Decentralzed fault detecton and dagnoss

More information

SRBIR: Semantic Region Based Image Retrieval by Extracting the Dominant Region and Semantic Learning

SRBIR: Semantic Region Based Image Retrieval by Extracting the Dominant Region and Semantic Learning Journal of Computer Scence 7 (3): 400-408, 2011 ISSN 1549-3636 2011 Scence Publcatons SRBIR: Semantc Regon Based Image Retreval by Extractng the Domnant Regon and Semantc Learnng 1 I. Felc Raam and 2 S.

More information

Clustering of Words Based on Relative Contribution for Text Categorization

Clustering of Words Based on Relative Contribution for Text Categorization Clusterng of Words Based on Relatve Contrbuton for Text Categorzaton Je-Mng Yang, Zh-Yng Lu, Zhao-Yang Qu Abstract Term clusterng tres to group words based on the smlarty crteron between words, so that

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(6):2860-2866 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A selectve ensemble classfcaton method on mcroarray

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

ISSN: International Journal of Engineering and Innovative Technology (IJEIT) Volume 1, Issue 4, April 2012

ISSN: International Journal of Engineering and Innovative Technology (IJEIT) Volume 1, Issue 4, April 2012 Performance Evoluton of Dfferent Codng Methods wth β - densty Decodng Usng Error Correctng Output Code Based on Multclass Classfcaton Devangn Dave, M. Samvatsar, P. K. Bhanoda Abstract A common way to

More information

A Knowledge Management System for Organizing MEDLINE Database

A Knowledge Management System for Organizing MEDLINE Database A Knowledge Management System for Organzng MEDLINE Database Hyunk Km, Su-Shng Chen Computer and Informaton Scence Engneerng Department, Unversty of Florda, Ganesvlle, Florda 32611, USA Wth the exploson

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

A MODIFIED K-NEAREST NEIGHBOR CLASSIFIER TO DEAL WITH UNBALANCED CLASSES

A MODIFIED K-NEAREST NEIGHBOR CLASSIFIER TO DEAL WITH UNBALANCED CLASSES A MODIFIED K-NEAREST NEIGHBOR CLASSIFIER TO DEAL WITH UNBALANCED CLASSES Aram AlSuer, Ahmed Al-An and Amr Atya 2 Faculty of Engneerng and Informaton Technology, Unversty of Technology, Sydney, Australa

More information

A Powerful Feature Selection approach based on Mutual Information

A Powerful Feature Selection approach based on Mutual Information 6 IJCN Internatonal Journal of Computer cence and Network ecurty, VOL.8 No.4, Aprl 008 A Powerful Feature electon approach based on Mutual Informaton Al El Akad, Abdelall El Ouardgh, and Drss Aboutadne

More information

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques Enhancement of Infrequent Purchased Product Recommendaton Usng Data Mnng Technques Noraswalza Abdullah, Yue Xu, Shlomo Geva, and Mark Loo Dscplne of Computer Scence Faculty of Scence and Technology Queensland

More information

Available online at Available online at Advanced in Control Engineering and Information Science

Available online at   Available online at   Advanced in Control Engineering and Information Science Avalable onlne at wwwscencedrectcom Avalable onlne at wwwscencedrectcom Proceda Proceda Engneerng Engneerng 00 (2011) 15000 000 (2011) 1642 1646 Proceda Engneerng wwwelsevercom/locate/proceda Advanced

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

Semantic Image Retrieval Using Region Based Inverted File

Semantic Image Retrieval Using Region Based Inverted File Semantc Image Retreval Usng Regon Based Inverted Fle Dengsheng Zhang, Md Monrul Islam, Guoun Lu and Jn Hou 2 Gppsland School of Informaton Technology, Monash Unversty Churchll, VIC 3842, Australa E-mal:

More information

An Evaluation of Divide-and-Combine Strategies for Image Categorization by Multi-Class Support Vector Machines

An Evaluation of Divide-and-Combine Strategies for Image Categorization by Multi-Class Support Vector Machines An Evaluaton of Dvde-and-Combne Strateges for Image Categorzaton by Mult-Class Support Vector Machnes C. Demrkesen¹ and H. Cherf¹, ² 1: Insttue of Scence and Engneerng 2: Faculté des Scences Mrande Galatasaray

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Modular PCA Face Recognition Based on Weighted Average

Modular PCA Face Recognition Based on Weighted Average odern Appled Scence odular PCA Face Recognton Based on Weghted Average Chengmao Han (Correspondng author) Department of athematcs, Lny Normal Unversty Lny 76005, Chna E-mal: hanchengmao@163.com Abstract

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

Discriminative Dictionary Learning with Pairwise Constraints

Discriminative Dictionary Learning with Pairwise Constraints Dscrmnatve Dctonary Learnng wth Parwse Constrants Humn Guo Zhuoln Jang LARRY S. DAVIS UNIVERSITY OF MARYLAND Nov. 6 th, Outlne Introducton/motvaton Dctonary Learnng Dscrmnatve Dctonary Learnng wth Parwse

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Informaton Search and Management Prof. Chrs Clfton 15 September 2017 Materal adapted from course created by Dr. Luo S, now leadng Albaba research group Retreval Models Informaton Need Representaton

More information

Recommended Items Rating Prediction based on RBF Neural Network Optimized by PSO Algorithm

Recommended Items Rating Prediction based on RBF Neural Network Optimized by PSO Algorithm Recommended Items Ratng Predcton based on RBF Neural Network Optmzed by PSO Algorthm Chengfang Tan, Cayn Wang, Yuln L and Xx Q Abstract In order to mtgate the data sparsty and cold-start problems of recommendaton

More information

Specialized Weighted Majority Statistical Techniques in Robotics (Fall 2009)

Specialized Weighted Majority Statistical Techniques in Robotics (Fall 2009) Statstcal Technques n Robotcs (Fall 09) Keywords: classfer ensemblng, onlne learnng, expert combnaton, machne learnng Javer Hernandez Alberto Rodrguez Tomas Smon javerhe@andrew.cmu.edu albertor@andrew.cmu.edu

More information

Virtual Machine Migration based on Trust Measurement of Computer Node

Virtual Machine Migration based on Trust Measurement of Computer Node Appled Mechancs and Materals Onlne: 2014-04-04 ISSN: 1662-7482, Vols. 536-537, pp 678-682 do:10.4028/www.scentfc.net/amm.536-537.678 2014 Trans Tech Publcatons, Swtzerland Vrtual Machne Mgraton based on

More information

Classification and clustering using SVM

Classification and clustering using SVM Lucan Blaga Unversty of Sbu Hermann Oberth Engneerng Faculty Computer Scence Department Classfcaton and clusterng usng SVM nd PhD Report Thess Ttle: Data Mnng for Unstructured Data Author: Danel MORARIU,

More information

Parallel Sequential Minimal Optimization for the Training. of Support Vector Machines

Parallel Sequential Minimal Optimization for the Training. of Support Vector Machines Parallel Sequental Mnmal Optmzaton for the Tranng of Sport Vector Machnes 1 L.J. Cao a, S.S. Keerth b, C.J. Ong b, P. Uvaraj c, X.J. Fu c and H.P. Lee c, J.Q. Zhang a a Fnancal Studes of Fudan Unversty,

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information