A Balanced Ensemble Approach to Weighting Classifiers for Text Classification


Gabriel Pui Cheong Fung 1, Jeffrey Xu Yu 1, Haixun Wang 2, David W. Cheung 3, Huan Liu 4

1 The Chinese University of Hong Kong, Hong Kong, China, {pcfung,yu}@se.cuhk.edu.hk
2 IBM T. J. Watson Research Center, New York, USA, haixun@us.ibm.com
3 The University of Hong Kong, Hong Kong, China, dcheung@cs.hku.hk
4 Arizona State University, Arizona, USA, hliu@asu.edu

Abstract

This paper studies the problem of constructing an effective heterogeneous ensemble classifier for text classification. One major challenge of this problem is to formulate a good combination function, which combines the decisions of the individual classifiers in the ensemble. We show that classification performance is affected by three weight components, and that all three should be included in deriving an effective combination function: (1) Global effectiveness, which measures the effectiveness of a member classifier in classifying a set of unseen documents; (2) Local effectiveness, which measures the effectiveness of a member classifier in classifying the particular domain of an unseen document; and (3) Decision confidence, which describes how confident a classifier is when making a decision on a specific unseen document. We propose a new balanced combination function, called Dynamic Classifier Weighting (DCW), that incorporates the aforementioned three components. The empirical study demonstrates that the new combination function is highly effective for text classification.

1 Introduction

Let U be a set of unseen documents and C be a set of predefined categories. Automated text classification is the process of labeling U with C, such that every d ∈ U is assigned to some of the categories in C. Note that d may be assigned to none of the categories in C. If the number of categories in C is more than two (|C| > 2), it is a multi-label text classification problem. Since every multi-label text classification problem can be transformed into a binary-label text classification problem, we focus on the binary problem in this paper (|C| = 2). Let c ∈ C.
Binary-label text classification is to construct a binary classifier, denoted by Φ(·), for c such that:

    Φ(d) = { 1   if f(d) > 0,
           { -1  otherwise,                                  (1)

where Φ(d) = 1 indicates that d belongs to c and Φ(d) = -1 indicates that d does not belong to it. f(·) ∈ R is a decision function. Every classifier Φ has its own decision function f(·); if there are m different classifiers, there are m different decision functions. The goal of constructing a binary classifier Φ(·) is to approximate the unknown true target function Φ̃(·), so that Φ(·) and Φ̃(·) coincide as much as possible [17].

In order to improve effectiveness, ensemble classifiers (a.k.a. classifier committees) were proposed [1, 3, 5, 6, 7, 8, 9, 15, 16, 17, 18, 19]. An ensemble classifier is constructed by grouping a number of member classifiers. If the decisions of the member classifiers are combined properly, the ensemble is robust and effective. There are two kinds of ensemble classifiers: homogeneous and heterogeneous. A homogeneous ensemble classifier contains binary classifiers that are all constructed by the same learning algorithm; bagging and boosting [19] are two common techniques [1, 15, 16, 18]. A heterogeneous ensemble classifier contains binary classifiers constructed by different learning algorithms (e.g., one SVM classifier and one kNN classifier grouped together) [19]. The individual decisions of the classifiers in the ensemble are combined (e.g., through stacking [19]):

    Θ(d) = { 1   if g(Φ_1(d), Φ_2(d), ..., Φ_m(d)) > 0,
           { -1  otherwise,                                  (2)

where Θ(·) is an ensemble classifier and g(·) is a combination function that combines the outputs of all Φ_i(·). The effectiveness of the ensemble classifier Θ(·) depends on the effectiveness of g(·). In this paper, we concentrate on analyzing heterogeneous ensemble classifiers; our problem is thus to examine how to formulate a good g(·). Four widely used g(·) are: (1) Majority voting (MV) [8, 9]; (2) Weighted linear combination (WLC) [7]; (3) Dynamic classifier selection (DCS) [3, 8, 6, 5]; and (4) Adaptive classifier combination (ACC) [8, 9].
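To make the structure of Eqs. (1) and (2) concrete, the following is a minimal sketch in Python. The decision functions here are made-up toy stand-ins (not the classifiers used in this paper), and the combination function shown is plain majority voting:

```python
# Sketch of Eq. (1) and Eq. (2): member classifiers Phi_i are built from
# decision functions f_i, and an ensemble Theta combines their votes via g.
# The decision functions below are illustrative toys, not real learned models.

def make_classifier(f):
    """Wrap a decision function f into a binary classifier Phi (Eq. 1)."""
    return lambda d: 1 if f(d) > 0 else -1

def make_ensemble(classifiers, g):
    """Build Theta from member classifiers and a combination function g (Eq. 2)."""
    return lambda d: 1 if g([phi(d) for phi in classifiers]) > 0 else -1

def majority_vote(decisions):
    """Majority voting (MV): an unweighted sum of the +1/-1 votes."""
    return sum(decisions)

# Toy decision functions standing in for, e.g., SVM, kNN and Rocchio scores.
f1 = lambda d: d["score"] - 0.5
f2 = lambda d: 2.0 * d["score"] - 0.8
f3 = lambda d: -d["score"] + 0.2

theta = make_ensemble([make_classifier(f) for f in (f1, f2, f3)], majority_vote)
print(theta({"score": 0.7}))  # prints 1: two of the three members vote +1
```

Swapping `majority_vote` for a weighted g(·) is exactly the design space the four combination functions above explore.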
Except for MV, the other three functions assign different weights to the classifiers in the ensemble; the bigger the weight, the more effective that classifier is taken to be.

Figure 1. Illustration of local effectiveness and decision confidence.

In MV, all classifiers in the ensemble are equally weighted. It can end up with a wrong decision if the minority votes are significant. WLC assigns static weights to the classifiers based on their performance on validation data. However, a generally well-performing classifier can perform poorly in some specific domains. For instance, on the benchmark Reuters21578 the micro-F1 score of SVM is higher than that of Naive Bayes (NB); in this sense, SVM excels NB. Yet, for the categories Potato and Retail in Reuters21578, the F1 scores for NB are both 0.667, but are both 0.0 for SVM. DCS and ACC weight the classifiers by partitioning the validation data (domain specific); they do not combine the classifiers' decisions, but select one of the classifiers from the ensemble and rely on it solely. We will show in the experiments that this leads to inferior results.

In this paper, we propose a new combination function called Dynamic Classifier Weighting (DCW). We consider three components when combining classifiers: (1) Global effectiveness, the effectiveness of a classifier in an ensemble when it classifies a set of unseen documents; (2) Local effectiveness, the effectiveness of a classifier in an ensemble when it classifies the particular domain of the unseen document; and (3) Decision confidence, the confidence of a classifier when it makes a decision for a specific unseen document.

2 Motivations

Let Φ_1(·), Φ_2(·), ..., Φ_m(·) be m different binary classifiers and f_1(·), f_2(·), ..., f_m(·) be their corresponding decision functions. Conceptually, Φ_i(·) divides the entire domain into two parts according to f_i(·). Figure 1 illustrates this idea. The dashed lines are the decision boundaries. If the unseen document d falls into the upper (lower) triangle, it is labeled as positive (negative). Usually, the further d is from the decision boundary, the more confident the decision of Φ_i(d). Every classifier has different effectiveness.
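The geometry just described can be sketched with a linear decision function, where the magnitude |f(d)| grows with the distance of the document from the boundary. The weight vector and points below are illustrative, not taken from the paper:

```python
# Sketch of the geometry behind Figure 1: for a linear decision function
# f(d) = w . x + b, the magnitude |f(d)| is proportional to the distance of
# the feature vector x from the decision boundary, so documents farther from
# the dashed line receive more confident decisions.

def f(x, w=(1.0, -1.0), b=0.0):
    """A linear decision function; its sign is the class, |f| the confidence."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

near = (0.75, 0.5)  # just above the boundary x1 = x2
far  = (2.0, -1.0)  # well inside the positive region

print(f(near))  # 0.25 -> positive, but barely
print(f(far))   # 3.0  -> positive, with high confidence
```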
For instance, Support Vector Machines (SVM) are regarded as more accurate (effective) than Naive Bayes (NB) [20]. Although this does not imply that every decision made by SVM must be superior to that of NB, it does imply that we should value the judgment of SVM more highly than that of NB in general. In this paper, we term this kind of effectiveness the global effectiveness of a classifier, denoted by α (e.g., α_SVM > α_NB). α gives us good insight into how to weight the classifiers in an ensemble. Intuitively, if we construct an ensemble classifier by grouping Φ_a(·) and Φ_b(·) together, where α_a > α_b, then we should value Φ_a(·) more highly than Φ_b(·). Yet, a globally effective classifier may sometimes perform poorly on some specific dataset (domain). As an example, consider two classifiers, SVM and NB. According to the benchmark Reuters21578, the micro-F1 score of SVM is higher than that of NB. Unfortunately, the F1 score of SVM when classifying Retail (Retail ∈ Reuters21578) is 0.0, whereas it is 0.667 for NB. As a result, an effective classifier may not always perform well in all domains (e.g., SVM performs poorly on Retail). This can be further illustrated in Figure 1. The two ovals, A and B, represent two different domains. Oval A covers the decision boundary, whereas Oval B resides in the lower triangle. All of the documents within the domain of Oval A are aligned near the decision boundary; an unseen document belonging to this domain may easily be classified wrongly. On the other hand, the documents within the domain of Oval B are well separated by the decision boundary; an unseen document belonging to this domain will most likely be classified correctly. So, the effectiveness of the classifier also relies on the domain of the unseen data. We term this kind of effectiveness the local effectiveness of the classifier, denoted by β. β helps us adjust the weights of the classifiers in the ensemble. If the α of Φ_i is very high but Φ_i is not effective in classifying the domain of the unseen document, we should reconsider its effectiveness.
For every decision a classifier makes, one may ask how confident the classifier is about that decision. Consider two unseen documents, document 1 and document 2, in the same domain (Oval B) in Figure 1. While both reside near the border of their domain, document 2 is located closer to the decision boundary (the dashed line) whereas document 1 is located far away from it. Since both documents belong to the same domain, the local effectiveness of the classifier upon them is the same. Yet, the confidence in making a correct decision for document 1 should be higher than that for document 2, as document 1 is further away from the decision boundary (d_1 > d_2). In this paper, we term this the decision confidence. It is estimated according to the distance between the unseen document and the decision boundary.

We summarize the need for the above components as follows. If we ignore α, over-fitting may result, as we neglect the combined influence of all domains. If we ignore β, over-generalization may ensue, as we neglect the domain where the unseen document appears. Since neither α nor β measures the classifier's decision confidence, γ is proposed; it indicates how much confidence a classifier has when it classifies the unseen documents.

3 Dynamic Classifier Weighting (DCW)

In the previous section, we explained why the three weight components (α, β and γ) are helpful in constructing an effective combination function g(·). We now describe how they are estimated and how they are combined in an ensemble classifier.

α_i is the effectiveness of classifier Φ_i when we use it to classify a set of unseen documents. During the training phase, although we do not have a set of labeled unseen documents, we can estimate α_i from the training data D by 10-fold cross-validation. While our experience suggests that estimating the effectiveness of a classifier by cross-validation always yields a more optimistic result than evaluating it on unseen data, this is not a problem in our situation: we are not targeting the real global effectiveness of the classifiers, but aiming at obtaining their relative global effectiveness. We normalize α_i such that 0 < α_i < 1 and Σ_{i=1}^{m} α_i = 1.

β_i is the effectiveness of classifier Φ_i when we use it to classify the domain of the unseen document d. For an unseen document, we never know the true domain of d. As above, we can only estimate its domain from the training data D. Let D' be a subset of documents in the training data, i.e., D' ⊆ D. We find the domain of the unseen document d by using D' to collect the documents in D that are similar to d. Accordingly, the extraction of D' is based on a nearest-neighbor strategy: we extract the top n documents most similar to d from D. The value n can readily be obtained through a validation dataset. The similarities among these n documents are measured by the cosine coefficient [13]. Since D' is a subset of the training data (D' ⊆ D), we know precisely the labels of the documents in D'. We estimate β_i by evaluating Φ_i on D' using the F1 score.
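As an illustrative sketch (documents represented as {term: weight} dicts, helper names our own rather than the paper's implementation), the nearest-neighbor domain extraction and the F1 evaluation behind β might look like:

```python
import math

# Sketch of the local-effectiveness estimate: find the domain D' of an unseen
# document d as its n nearest training neighbors under the cosine coefficient,
# then score a classifier on D' with F1. The {term: weight} representation and
# the helper names are illustrative assumptions.

def cosine(u, v):
    """Cosine coefficient between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def domain_of(d, train_docs, n):
    """D': the top-n training documents most similar to d."""
    return sorted(train_docs, key=lambda x: cosine(d, x["vec"]), reverse=True)[:n]

def f1_on_domain(phi, domain):
    """F1 of classifier phi over D' (training labels are known exactly)."""
    tp = sum(1 for x in domain if phi(x["vec"]) == 1 and x["label"] == 1)
    fp = sum(1 for x in domain if phi(x["vec"]) == 1 and x["label"] == -1)
    fn = sum(1 for x in domain if phi(x["vec"]) == -1 and x["label"] == 1)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

train = [
    {"vec": {"a": 1.0}, "label": 1},
    {"vec": {"b": 1.0}, "label": -1},
    {"vec": {"a": 0.5, "b": 0.5}, "label": 1},
]
domain = domain_of({"a": 1.0}, train, n=2)
print(len(domain))  # 2
```

Because D' is drawn from the training data, the labels needed for the F1 score are known exactly.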
β_i is normalized such that 0 < β_i < 1 and Σ_{i=1}^{m} β_i = 1.

γ_i is a measure of how confident classifier Φ_i is when it makes a decision on d. From Eq. (1), the classification decision of the classifier Φ_i(·) is based on its decision function f_i(·). In most cases, if not all, the higher the magnitude of f_i(·), the more confident its decision. Consequently, we can compute γ_i by using the decision function f_i(·). Unfortunately, the range of f_i(·) varies among different algorithms. For example, Φ_i(·) may have f_i(·) in the range of [-1, 1], whereas Φ_j(·) may have another f_j(·) in the range of (-∞, +∞). Since different decision functions have different ranges, a direct comparison among them is inappropriate. We solve the problem as follows. Let D' be the domain of the unseen document, obtained by the technique described previously. We compute γ_i as:

    γ_i = |f_i(d)| / µ_i,                                    (3)

    µ_i = (1 / |D'|) Σ_{d' ∈ D'} |f_i(d')|,                  (4)

where µ_i is the average confidence of the decisions made by f_i(·) among the documents in D'. Since D' ⊆ D, we can presume that µ_i is non-zero. When γ_i > 1, f_i(d) has more than average confidence to make a correct classification of d, where d is far away from the decision boundary (e.g., document 1 in Figure 1). When γ_i < 1, the decision function f_i(d) has less than average confidence to make a correct classification of d, where d is closer to the decision boundary (e.g., document 2 in Figure 1). We normalize γ_i such that 0 < γ_i < 1 and Σ_{i=1}^{m} γ_i = 1.

We now present how α_i, β_i and γ_i are combined. Assume that there are m classifiers in the ensemble. In its simplest form, the combination function g(·) is:

    g(·) = Σ_{i=1}^{m} decision_i,                           (5)

where decision_i = Φ_i(d) ∈ {1, -1} (Eq. (1)). Here, all classifiers in the ensemble are equally weighted (i.e., MV). In DCW, since a confidence γ_i is associated with each decision, we instead take:

    g(·) = Σ_{i=1}^{m} decision_i × γ_i.                     (6)

Yet, even for a confident decision, we need to review whether the classifier that makes this decision is effective in the ensemble. Consequently:

    g(·) = Σ_{i=1}^{m} decision_i × γ_i × effectiveness_i.   (7)
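Under the assumptions that decision scores are plain floats and that the per-classifier effectiveness weights are precomputed, Eqs. (3)-(7) can be sketched as:

```python
# Sketch of Eqs. (3)-(7): gamma compares the magnitude of f_i on the unseen
# document with the average magnitude of f_i over its domain D', and g(.)
# sums the +1/-1 decisions weighted by confidence and effectiveness. The
# effectiveness weights are assumed precomputed; all numbers are made up.

def gamma_raw(f_d, f_scores_on_domain):
    """|f_i(d)| / mu_i, where mu_i averages |f_i| over D' (Eqs. 3 and 4)."""
    mu = sum(abs(s) for s in f_scores_on_domain) / len(f_scores_on_domain)
    return abs(f_d) / mu

def normalize(weights):
    """Scale positive weights so they sum to one, as done for alpha/beta/gamma."""
    total = sum(weights)
    return [w / total for w in weights]

def g_weighted(decisions, gammas, effectiveness):
    """Eq. (7): sum_i decision_i * gamma_i * effectiveness_i."""
    return sum(d * c * e for d, c, e in zip(decisions, gammas, effectiveness))

# A score twice the domain average gives gamma > 1; half the average, gamma < 1.
print(gamma_raw(2.0, [1.0, -1.0, 1.0]))  # 2.0
print(gamma_raw(0.5, [1.0, -1.0, 1.0]))  # 0.5

# Two confident, effective members voting +1 outweigh one dissenting member.
print(g_weighted([-1, 1, 1], [0.2, 0.4, 0.4], [0.1, 0.5, 0.4]))
```

Setting effectiveness_i to the product of the global and local effectiveness gives the full DCW combination function.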
Since there are two kinds of effectiveness for each classifier (α_i and β_i), we have:

    g(·) = Σ_{i=1}^{m} Φ_i(d) × α_i × β_i × γ_i.             (8)

4 Experimental Study

The purpose of the experiments is twofold. (1) We want to examine how effective Dynamic Classifier Weighting (DCW) is when compared with the other kinds of heterogeneous ensemble classifiers. As such, we implemented four existing ensemble classifiers for comparison:

Majority voting (MV) [8, 9], Weighted linear combination (WLC) [7], Dynamic classifier selection (DCS) [3, 8, 6, 5], and Adaptive classifier combination (ACC) [8, 9]. We report the results in Section 4.1. (2) We want to understand how significant the results are whenever one of the ensemble classifiers outperforms the others. As such, we performed a pairwise significance test in Section 4.2.

Table 1. The micro-F1 results of the different ensemble classifiers (MV, WLC, DCS, ACC, DCW) on Reuters21578 and Newsgroup20 for eleven combinations of member classifiers: (1) S+N; (2) S+R; (3) S+K; (4) R+N; (5) K+N; (6) K+R; (7) S+K+R; (8) S+K+N; (9) S+R+N; (10) K+R+N; (11) S+K+R+N.

In the experiments, two benchmarks are used: Reuters21578 and Newsgroup20. For Reuters21578, we separate the dataset into training data and testing data using the ModApte split [2]. For Newsgroup20, for each of the categories, we randomly select 80% of the postings as training data and the remainder as testing data. For data preprocessing, punctuation, numbers, web page addresses, and email addresses are removed. All features are stemmed, converted to lower case, and weighted using the standard tf-idf schema [14]. Features that appear in only one document are ignored. All features are ranked based on the NGL coefficient [12], and the top X features are selected. This X is tuned for different classifiers and for different benchmarks.

For creating the ensemble classifiers, different combinations of four kinds of classifiers are used: (1) Support Vector Machines (SVM); (2) k-Nearest Neighbor (kNN); (3) Rocchio (ROC); (4) Naive Bayes (NB). Their default settings are as follows. For SVM, we use a linear kernel with C = 1.0; no feature selection is required [4]. For kNN, we set k = 50 and select 2,750 and 4,900 features for Reuters21578 and Newsgroup20, respectively. For ROC, we implement the version in [11] and select 2,750 and 7,500 features for Reuters21578 and Newsgroup20, respectively.
For NB, we implement the multinomial version [10] and select 2,750 and 9,500 features for Reuters21578 and Newsgroup20, respectively.

4.1 Effectiveness Analysis

Table 1 shows the micro-F1 scores of all ensemble classifiers (MV, WLC, DCS, ACC and DCW) when they are created using different combinations of the binary classifiers on both benchmarks. The leftmost column denotes which binary classifiers are used for creating the corresponding ensemble classifier. We use S, K, R and N to denote SVM, kNN, Rocchio and Naive Bayes, respectively. For example, S+K+R represents an ensemble classifier comprised of SVM, kNN and Rocchio. Note that MV cannot be created if the number of binary classifiers in the ensemble is even, hence the missing entries in Table 1.

At first glance, the results are promising. DCW, the proposed approach, dominates all the other approaches when they are created using the same set of binary classifiers. Similar results are obtained when we use the macro-F1 score. The only case where DCW performs worse is case 6, where DCW is created from kNN and Rocchio (K+R) and evaluated on Reuters21578; its micro-F1 is 0.831, which is lower than that of DCS. Nevertheless, such a difference is negligible.

Concerning DCW, the best combination of binary classifiers in the ensemble for Reuters21578 is SVM and Rocchio (case 2); it is also the best result obtained among all of the ensemble classifiers that we have evaluated. For Newsgroup20, the best result is obtained by combining SVM, kNN and Rocchio (case 8); it is also the best result obtained among all approaches.

The philosophy of MV is to take the majority agreement among the binary classifiers in the ensemble. Hence, the number of binary classifiers must be odd, so we can only create MV using three different binary classifiers. Interestingly, all combinations perform similarly.

Concerning WLC, for its best combination for Reuters21578 (case 2), its micro-F1 score is 0.883, which is higher than that of all the other ensemble classifiers (except DCW).
For Newsgroup20, similar observations are made, where its best combination is case 8. Although the idea of WLC is very simple (it assigns static weights to the classifiers in the ensemble according to their global effectiveness and combines them linearly), it performs surprisingly well. Another interesting finding is that when SVM is included in the ensemble, the effectiveness of WLC increases dramatically. This suggests that the choice of the classifiers in WLC is particularly important.

Concerning DCS, its best micro-F1 score for Reuters21578 (case 2) lags far behind all the other approaches, and for Newsgroup20 none of its F1 scores is competitive. We believe DCS performs poorly because: (1) it does not combine the classifiers' decisions, but selects one classifier from the ensemble and relies on it completely; and (2) it pays attention neither to the global effectiveness of the classifiers nor to the decision confidence.

ACC performs slightly better than DCS. This may be because the decision strategy of ACC is more sophisticated than that of DCS. The best ensembles for Reuters21578 and Newsgroup20 are both case 7. However, these results are all inferior to both WLC and our DCW.

4.2 Significance Test

In this section, we conduct a pairwise comparison among the ensemble classifiers using the significance test [20]. Given two classifiers, Φ_A(·) and Φ_B(·), the significance test determines whether Φ_A(·) performs better than Φ_B(·) based on the errors that Φ_A(·) and Φ_B(·) make. Let N be the total number of unseen documents, and let a_i ∈ {0, 1} (b_i ∈ {0, 1}) indicate whether Φ_A(·) (Φ_B(·)) makes a correct classification on the i-th unseen document: a_i = 0 means Φ_A(·) classifies it incorrectly, whereas a_i = 1 means Φ_A(·) classifies it correctly; the same definition applies to b_i. Let d_a be the number of times that Φ_A(·) performs better than Φ_B(·) (a_i = 1, b_i = 0), and d_b the number of times that Φ_B(·) performs better than Φ_A(·) (a_i = 0, b_i = 1). In this test, the null hypothesis is that both classifiers perform the same (H0: d_a = d_b); the alternative is that Φ_A(·) and Φ_B(·) perform differently (H1: d_a ≠ d_b).

Table 2 shows the results of comparing the performance of DCW with the other ensemble classifiers. A ≫ B means A performs significantly better than B (P-value ≤ 0.01). A > B means A performs slightly better than B.
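One standard way to test H0: d_a = d_b is an exact two-sided sign test over the documents on which the two classifiers disagree in correctness. The paper does not spell out its exact statistic, so the following is an assumption-laden sketch of the d_a/d_b counting and one such test:

```python
import math

# Sketch of the pairwise significance test as a two-sided sign test: only
# documents where exactly one of the two classifiers is correct count, and
# under H0 each such document is a fair coin flip. The exact binomial p-value
# below is one standard realization; the paper's precise statistic may differ.

def sign_test(a, b):
    """a, b: per-document correctness (1 correct, 0 incorrect) for A and B."""
    d_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # A right, B wrong
    d_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # B right, A wrong
    n, k = d_a + d_b, min(d_a, d_b)
    if n == 0:
        return d_a, d_b, 1.0
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    p = min(1.0, 2 * tail)  # two-sided p-value
    return d_a, d_b, p

print(sign_test([1] * 9 + [0], [0] * 9 + [1]))  # (9, 1, 0.021484375)
```

With d_a = 9 and d_b = 1, for instance, this test rejects H0 at the 0.05 level but not at 0.01.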
A ~ B means there is no evidence that A and B differ in terms of the errors they make. A summary is given below:

Reuters21578: {DCW, WLC} > {MV, ACC} ≫ DCS
Newsgroup20: DCW > WLC > ACC ~ MV ≫ DCS

Table 2. Results of the significance test (pairwise comparisons of DCW with MV, WLC, DCS and ACC on Reuters21578 and Newsgroup20).

5 Conclusions

In order to formulate an effective combination function for a heterogeneous ensemble classifier, three weight components are necessary: global effectiveness, local effectiveness, and decision confidence. We compared DCW with four other kinds of heterogeneous ensemble classifiers using two benchmarks. The results indicate that DCW can effectively balance the contributions of the three components and outperforms the existing approaches.

References

[1] W. W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems (TOIS), 17(2).
[2] F. Debole and F. Sebastiani. An analysis of the relative hardness of Reuters-21578 subsets. Journal of the American Society for Information Science and Technology, 56(6).
[3] G. Giacinto and F. Roli. Adaptive selection of image classifiers. In Proceedings of the 9th International Conference on Image Analysis and Processing (ICIAP 97), pages 38-45, Florence, Italy.
[4] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML 98), Chemnitz, Germany.
[5] K. Woods, W. P. Kegelmeyer Jr., and K. Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 19(4).
[6] W. Lam and K.-Y. Lai. A meta-learning approach for text categorization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 01), New Orleans, Louisiana, USA.
[7] L. S. Larkey and W. B. Croft. Combining classifiers in text categorization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 96), Zurich, Switzerland.
[8] Y. H. Li and A. K. Jain. Classification of text documents. The Computer Journal, 41(8).
[9] R. Liere and P. Tadepalli. Active learning with committees for text categorization. In Proceedings of the 14th National Conference on Artificial Intelligence (AAAI 97), Providence, Rhode Island.
[10] A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification. In The 15th National Conference on Artificial Intelligence (AAAI 98) Workshop on Learning for Text Categorization.
[11] A. Moschitti. A study on optimal parameter tuning for Rocchio text classifier. In Proceedings of the 25th European Conference on Information Retrieval Research (ECIR 03), Pisa, Italy.
[12] H. T. Ng, W. B. Goh, and K. L. Low. Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 97), pages 67-73, Philadelphia, PA, USA.
[13] E. Rasmussen. Clustering algorithms. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms. Prentice Hall PTR.
[14] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management (IPM), 24(5).
[15] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3).
[16] R. E. Schapire, Y. Singer, and A. Singhal. Boosting and Rocchio applied to text filtering. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 98), Melbourne, Australia.
[17] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47.
[18] S. M. Weiss, C. Apte, F. J. Damerau, D. E. Johnson, F. J. Oles, T. Goetz, and T. Hampp. Maximizing text-mining performance. IEEE Intelligent Systems, 14(4):63-69.
[19] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, second edition.
[20] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 99), pages 42-49, Berkeley, California, USA.

Optimally Combining Positive and Negative Features for Text Categorization

Optimally Combining Positive and Negative Features for Text Categorization Optally Cobnng Postve and Negatve Features for Text Categorzaton Zhaohu Zheng ZZHENG3@CEDAR.BUFFALO.EDU Rohn Srhar ROHINI@CEDAR.BUFFALO.EDU CEDAR, Dept. of Coputer Scence and Engneerng, State Unversty

More information

What is Object Detection? Face Detection using AdaBoost. Detection as Classification. Principle of Boosting (Schapire 90)

What is Object Detection? Face Detection using AdaBoost. Detection as Classification. Principle of Boosting (Schapire 90) CIS 5543 Coputer Vson Object Detecton What s Object Detecton? Locate an object n an nput age Habn Lng Extensons Vola & Jones, 2004 Dalal & Trggs, 2005 one or ultple objects Object segentaton Object detecton

More information

Using Gini-Index for Feature Selection in Text Categorization

Using Gini-Index for Feature Selection in Text Categorization 3rd Internatonal Conference on Inforaton, Busness and Educaton Technology (ICIBET 014) Usng Gn-Index for Feature Selecton n Text Categorzaton Zhu Wedong 1, Feng Jngyu 1 and Ln Yongn 1 School of Coputer

More information

Merging Results by Using Predicted Retrieval Effectiveness

Merging Results by Using Predicted Retrieval Effectiveness Mergng Results by Usng Predcted Retreval Effectveness Introducton Wen-Cheng Ln and Hsn-Hs Chen Departent of Coputer Scence and Inforaton Engneerng Natonal Tawan Unversty Tape, TAIWAN densln@nlg.cse.ntu.edu.tw;

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Solutions to Programming Assignment Five Interpolation and Numerical Differentiation

Solutions to Programming Assignment Five Interpolation and Numerical Differentiation College of Engneerng and Coputer Scence Mechancal Engneerng Departent Mechancal Engneerng 309 Nuercal Analyss of Engneerng Systes Sprng 04 Nuber: 537 Instructor: Larry Caretto Solutons to Prograng Assgnent

More information

On-line Scheduling Algorithm with Precedence Constraint in Embeded Real-time System

On-line Scheduling Algorithm with Precedence Constraint in Embeded Real-time System 00 rd Internatonal Conference on Coputer and Electrcal Engneerng (ICCEE 00 IPCSIT vol (0 (0 IACSIT Press, Sngapore DOI: 077/IPCSIT0VNo80 On-lne Schedulng Algorth wth Precedence Constrant n Ebeded Real-te

More information

Low training strength high capacity classifiers for accurate ensembles using Walsh Coefficients

Low training strength high capacity classifiers for accurate ensembles using Walsh Coefficients Low tranng strength hgh capacty classfers for accurate ensebles usng Walsh Coeffcents Terry Wndeatt, Cere Zor Unv Surrey, Guldford, Surrey, Gu2 7H t.wndeatt surrey.ac.uk Abstract. If a bnary decson s taken

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

A Cluster Tree Method For Text Categorization

A Cluster Tree Method For Text Categorization Avalable onlne at www.scencedrect.co Proceda Engneerng 5 (20) 3785 3790 Advanced n Control Engneerngand Inforaton Scence A Cluster Tree Meod For Text Categorzaton Zhaoca Sun *, Yunng Ye, Weru Deng, Zhexue

More information

Pose Invariant Face Recognition using Hybrid DWT-DCT Frequency Features with Support Vector Machines

Pose Invariant Face Recognition using Hybrid DWT-DCT Frequency Features with Support Vector Machines Proceedngs of the 4 th Internatonal Conference on 7 th 9 th Noveber 008 Inforaton Technology and Multeda at UNITEN (ICIMU 008), Malaysa Pose Invarant Face Recognton usng Hybrd DWT-DCT Frequency Features

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

Large Margin Nearest Neighbor Classifiers

Large Margin Nearest Neighbor Classifiers Large Margn earest eghbor Classfers Sergo Bereo and Joan Cabestany Departent of Electronc Engneerng, Unverstat Poltècnca de Catalunya (UPC, Gran Captà s/n, C4 buldng, 08034 Barcelona, Span e-al: sbereo@eel.upc.es

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

Performance Analysis of Coiflet Wavelet and Moment Invariant Feature Extraction for CT Image Classification using SVM

Performance Analysis of Coiflet Wavelet and Moment Invariant Feature Extraction for CT Image Classification using SVM Perforance Analyss of Coflet Wavelet and Moent Invarant Feature Extracton for CT Iage Classfcaton usng SVM N. T. Renukadev, Assstant Professor, Dept. of CT-UG, Kongu Engneerng College, Perundura Dr. P.

More information

Multiple Instance Learning via Multiple Kernel Learning *

Multiple Instance Learning via Multiple Kernel Learning * The Nnth nternatonal Syposu on Operatons Research and ts Applcatons (SORA 10) Chengdu-Juzhagou, Chna, August 19 23, 2010 Copyrght 2010 ORSC & APORC, pp. 160 167 ultple nstance Learnng va ultple Kernel

More information

Identifying Key Factors and Developing a New Method for Classifying Imbalanced Sentiment Data

Identifying Key Factors and Developing a New Method for Classifying Imbalanced Sentiment Data Identfyng Key Factors and Developng a New Method for Classfyng Ibalanced Sentent Data Long-Sheng Chen* and Kun-Cheng Sun Abstract Bloggers opnons related to coercal products/servces ght have a sgnfcant

More information

Relevance Feedback Document Retrieval using Non-Relevant Documents

Relevance Feedback Document Retrieval using Non-Relevant Documents Relevance Feedback Document Retreval usng Non-Relevant Documents TAKASHI ONODA, HIROSHI MURATA and SEIJI YAMADA Ths paper reports a new document retreval method usng non-relevant documents. From a large

More information

A Novel System for Document Classification Using Genetic Programming

A Novel System for Document Classification Using Genetic Programming Journal of Advances n Inforaton Technology Vol. 6, No. 4, Noveber 2015 A Novel Syste for Docuent Classfcaton Usng Genetc Prograng Saad M. Darwsh, Adel A. EL-Zoghab, and Doaa B. Ebad Insttute of Graduate

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Color Image Segmentation Based on Adaptive Local Thresholds

Color Image Segmentation Based on Adaptive Local Thresholds Color Iage Segentaton Based on Adaptve Local Thresholds ETY NAVON, OFE MILLE *, AMI AVEBUCH School of Coputer Scence Tel-Avv Unversty, Tel-Avv, 69978, Israel E-Mal * : llero@post.tau.ac.l Fax nuber: 97-3-916084

More information

Handwritten English Character Recognition Using Logistic Regression and Neural Network

Handwritten English Character Recognition Using Logistic Regression and Neural Network Handwrtten Englsh Character Recognton Usng Logstc Regresson and Neural Network Tapan Kuar Hazra 1, Rajdeep Sarkar 2, Ankt Kuar 3 1 Departent of Inforaton Technology, Insttute of Engneerng and Manageent,

More information

UB at GeoCLEF Department of Geography Abstract

Discriminative Classifiers for Image Recognition (CS 376 Lecture 22, 4/14/2011)

An Efficient Fault-Tolerant Multi-Bus Data Scheduling Algorithm Based on Replication and Deallocation

Optimization Methods: Integer Programming Integer Linear Programming 1. Module 7 Lecture Notes 1. Integer Linear Programming

Survey of Classification Techniques in Data Mining

Comparative Study between different Eigenspace-based Approaches for Face Recognition

A Novel Fuzzy Classifier Using Fuzzy LVQ to Recognize Online Persian Handwriting

A Semantic Model for Video Based Face Recognition

Adaptive Sampling with Optimal Cost for Class-Imbalance Learning

Linear Regression: Testing for Non-linearity

Experiments in Text Categorization Using Term Selection by Distance to Transition Point

Pruning Training Corpus to Speedup Text Classification

Leslie Lamport's Time, Clocks & the Ordering of Events in a Distributed System

Human Face Recognition Using Radial Basis Function Neural Network

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

Generating Fuzzy Term Sets for Software Project Attributes using Fuzzy C-Means and Real Coded Genetic Algorithms

Generating Fuzzy Term Sets for Software Project Attributes using and Real Coded Genetic Algorithms Generatng Fuzzy Ter Sets for Software Proect Attrbutes usng Fuzzy C-Means C and Real Coded Genetc Algorths Al Idr, Ph.D., ENSIAS, Rabat Alan Abran, Ph.D., ETS, Montreal Azeddne Zah, FST, Fes Internatonal

More information

A system based on a modified version of the FCM algorithm for profiling Web users from access log

A system based on a modified version of the FCM algorithm for profiling Web users from access log A syste based on a odfed verson of the FCM algorth for proflng Web users fro access log Paolo Corsn, Laura De Dosso, Beatrce Lazzern, Francesco Marcellon Dpartento d Ingegnera dell Inforazone va Dotsalv,

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

A Fast and Effective Segmentation Algorithm for Undersea Hydrothermal Vent Image

Key-Words: - Under sear Hydrothermal vent image; grey; blue chroma; OTSU; FCM A Fast and Effectve Segentaton Algorth for Undersea Hydrotheral Vent Iage FUYUAN PENG 1 QIAN XIA 1 GUOHUA XU 2 XI YU 1 LIN LUO 1 Electronc Inforaton Engneerng Departent of Huazhong Unversty of Scence and

More information

Relevance Feedback in Content-based 3D Object Retrieval A Comparative Study

Relevance Feedback in Content-based 3D Object Retrieval A Comparative Study 753 Coputer-Aded Desgn and Applcatons 008 CAD Solutons, LLC http://www.cadanda.co Relevance Feedback n Content-based 3D Object Retreval A Coparatve Study Panagots Papadaks,, Ioanns Pratkaks, Theodore Trafals

More information

The Research of Support Vector Machine in Agricultural Data Classification

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Nighttime Motion Vehicle Detection Based on MILBoost

Nighttime Motion Vehicle Detection Based on MILBoost Sensors & Transducers 204 by IFSA Publshng, S L http://wwwsensorsportalco Nghtte Moton Vehcle Detecton Based on MILBoost Zhu Shao-Png,, 2 Fan Xao-Png Departent of Inforaton Manageent, Hunan Unversty of

More information

Reliable Negative Extracting Based on knn for Learning from Positive and Unlabeled Examples

Reliable Negative Extracting Based on knn for Learning from Positive and Unlabeled Examples 94 JOURNAL OF COMPUTERS, VOL. 4, NO. 1, JANUARY 2009 Relable Negatve Extractng Based on knn for Learnng from Postve and Unlabeled Examples Bangzuo Zhang College of Computer Scence and Technology, Jln Unversty,

More information

Using Ambiguity Measure Feature Selection Algorithm for Support Vector Machine Classifier

Using Ambiguity Measure Feature Selection Algorithm for Support Vector Machine Classifier Usng Ambguty Measure Feature Selecton Algorthm for Support Vector Machne Classfer Saet S.R. Mengle Informaton Retreval Lab Computer Scence Department Illnos Insttute of Technology Chcago, Illnos, U.S.A

More information

Subspace Clustering

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

Aircraft Engine Gas Path Fault Diagnosis Based on Fuzzy Inference

Aircraft Engine Gas Path Fault Diagnosis Based on Fuzzy Inference 202 Internatonal Conference on Industral and Intellgent Inforaton (ICIII 202) IPCSIT vol.3 (202) (202) IACSIT Press, Sngapore Arcraft Engne Gas Path Fault Dagnoss Based on Fuzzy Inference Changzheng L,

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Informaton Retreval Systems Jm Martn! Lecture 11 9/29/2011 Today 9/29 Classfcaton Naïve Bayes classfcaton Ungram LM 1 Where we are... Bascs of ad hoc retreval Indexng Term weghtng/scorng Cosne

More information

A New Scheduling Algorithm for Servers

A New Scheduling Algorithm for Servers A New Schedulng Algorth for Servers Nann Yao, Wenbn Yao, Shaobn Ca, and Jun N College of Coputer Scence and Technology, Harbn Engneerng Unversty, Harbn, Chna {yaonann, yaowenbn, cashaobn, nun}@hrbeu.edu.cn

More information

Local Subspace Classifiers: Linear and Nonlinear Approaches

Local Subspace Classifiers: Linear and Nonlinear Approaches Local Subspace Classfers: Lnear and Nonlnear Approaches Hakan Cevkalp, Meber, IEEE, Dane Larlus, Matths Douze, and Frederc Jure, Meber, IEEE Abstract he -local hyperplane dstance nearest neghbor (HNN algorth

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

Comparing High-Order Boolean Features

Comparing High-Order Boolean Features Brgham Young Unversty BYU cholarsarchve All Faculty Publcatons 2005-07-0 Comparng Hgh-Order Boolean Features Adam Drake adam_drake@yahoo.com Dan A. Ventura ventura@cs.byu.edu Follow ths and addtonal works

More information

Classifier Selection Based on Data Complexity Measures

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors Onlne Detecton and Classfcaton of Movng Objects Usng Progressvely Improvng Detectors Omar Javed Saad Al Mubarak Shah Computer Vson Lab School of Computer Scence Unversty of Central Florda Orlando, FL 32816

More information

THE CONDENSED FUZZY K-NEAREST NEIGHBOR RULE BASED ON SAMPLE FUZZY ENTROPY

THE CONDENSED FUZZY K-NEAREST NEIGHBOR RULE BASED ON SAMPLE FUZZY ENTROPY Proceedngs of the 20 Internatonal Conference on Machne Learnng and Cybernetcs, Guln, 0-3 July, 20 THE CONDENSED FUZZY K-NEAREST NEIGHBOR RULE BASED ON SAMPLE FUZZY ENTROPY JUN-HAI ZHAI, NA LI, MENG-YAO

More information

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

Predicting Power Grid Component Outage In Response to Extreme Events. S. BAHRAMIRAD ComEd USA

Predicting Power Grid Component Outage In Response to Extreme Events. S. BAHRAMIRAD ComEd USA 1, rue d Artos, F-75008 PARIS CIGRE US Natonal Cottee http : //www.cgre.org 016 Grd of the Future Syposu Predctng Power Grd Coponent Outage In Response to Extree Events R. ESKANDARPOUR, A. KHODAEI Unversty

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Multimodal Biometric System Using Face-Iris Fusion Feature

Multimodal Biometric System Using Face-Iris Fusion Feature JOURNAL OF COMPUERS, VOL. 6, NO. 5, MAY 2011 931 Multodal Boetrc Syste Usng Face-Irs Fuson Feature Zhfang Wang, Erfu Wang, Shuangshuang Wang and Qun Dng Key Laboratory of Electroncs Engneerng, College

More information

Investigating the Performance of Naïve-Bayes Classifiers and K-Nearest Neighbor Classifiers

Investigating the Performance of Naïve- Bayes Classifiers and K- Nearest Neighbor Classifiers Journal of Convergence Informaton Technology Volume 5, Number 2, Aprl 2010 Investgatng the Performance of Naïve- Bayes Classfers and K- Nearest Neghbor Classfers Mohammed J. Islam *, Q. M. Jonathan Wu,

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

An Optimal Algorithm for Prufer Codes

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Monte Carlo Evaluation of Classification Algorithms Based on Fisher's Linear Function in Classification of Patients With CHD

Monte Carlo Evaluation of Classification Algorithms Based on Fisher's Linear Function in Classification of Patients With CHD IOSR Journal of Matheatcs (IOSR-JM) e-issn: 2278-5728, p-issn: 2319-765X. Volue 13, Issue 1 Ver. IV (Jan. - Feb. 2017), PP 104-109 www.osrjournals.org Monte Carlo Evaluaton of Classfcaton Algorths Based

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

Efficient Text Classification by Weighted Proximal SVM

Efficient Text Classification by Weighted Proximal SVM * Effcent ext Classfcaton by Weghted Proxmal SVM * Dong Zhuang 1, Benyu Zhang, Qang Yang 3, Jun Yan 4, Zheng Chen, Yng Chen 1 1 Computer Scence and Engneerng, Bejng Insttute of echnology, Bejng 100081, Chna

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

MINING VERY LARGE DATASETS WITH SVM AND VISUALIZATION

Prediction of Dumping a Product in Textile Industry

Prediction of Dumping a Product in Textile Industry Int. J. Advanced Networkng and Applcatons Volue: 05 Issue: 03 Pages:957-96 (03) IN : 0975-090 957 Predcton of upng a Product n Textle Industry.V.. GANGA EVI Professor n MCA K..R.M. College of Engneerng

More information

STATIC MAPPING FOR OPENCL WORKLOADS IN HETEROGENEOUS COMPUTER SYSTEMS

A Robust Descriptor based on Weber's Law

A Robust Descriptor based on Weber s Law A Robust Descrptor based on Weber s Law Je Chen,2,3 Shguang Shan Guoyng Zhao 2 Xln Chen Wen Gao,3 Matt Petkänen 2 Key Laboratory of Intellgent Inforaton Processng, Chnese Acadey of Scences (CAS), Insttute

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Incremental Learning with Support Vector Machines and Fuzzy Set Theory The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and

More information

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems Determnng Fuzzy Sets for Quanttatve Attrbutes n Data Mnng Problems ATTILA GYENESEI Turku Centre for Computer Scence (TUCS) Unversty of Turku, Department of Computer Scence Lemmnkäsenkatu 4A, FIN-5 Turku

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Specialized Weighted Majority Statistical Techniques in Robotics (Fall 2009)

Specialized Weighted Majority Statistical Techniques in Robotics (Fall 2009) Statstcal Technques n Robotcs (Fall 09) Keywords: classfer ensemblng, onlne learnng, expert combnaton, machne learnng Javer Hernandez Alberto Rodrguez Tomas Smon javerhe@andrew.cmu.edu albertor@andrew.cmu.edu

More information

A Novel Term_Class Relevance Measure for Text Categorization

Clustering of Words Based on Relative Contribution for Text Categorization

Clustering of Words Based on Relative Contribution for Text Categorization Clusterng of Words Based on Relatve Contrbuton for Text Categorzaton Je-Mng Yang, Zh-Yng Lu, Zhao-Yang Qu Abstract Term clusterng tres to group words based on the smlarty crteron between words, so that

More information

An Anti-Noise Text Categorization Method based on Support Vector Machines

An Anti-Noise Text Categorization Method based on Support Vector Machines * An Ant-Nose Text ategorzaton Method based on Support Vector Machnes * hen Ln, Huang Je and Gong Zheng-Hu School of omputer Scence, Natonal Unversty of Defense Technology, hangsha, 410073, hna chenln@nudt.edu.cn,

More information

An Evaluation of Divide-and-Combine Strategies for Image Categorization by Multi-Class Support Vector Machines

An Evaluation of Divide-and-Combine Strategies for Image Categorization by Multi-Class Support Vector Machines An Evaluaton of Dvde-and-Combne Strateges for Image Categorzaton by Mult-Class Support Vector Machnes C. Demrkesen¹ and H. Cherf¹, ² 1: Insttue of Scence and Engneerng 2: Faculté des Scences Mrande Galatasaray

More information

A Selective Ensemble Classification Method on Microarray Data (Journal of Chemical and Pharmaceutical Research, 2014, 6(6))

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(6):2860-2866 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A selectve ensemble classfcaton method on mcroarray

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Classification / Regression Support Vector Machines

Classification / Regression Support Vector Machines Classfcaton / Regresson Support Vector Machnes Jeff Howbert Introducton to Machne Learnng Wnter 04 Topcs SVM classfers for lnearly separable classes SVM classfers for non-lnearly separable classes SVM

More information

Nearest-Neighbors & Support Vector Machines (Introduction to Artificial Intelligence, Lecture 24)

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

Sparse Kernel-Based Hyperspectral Anomaly Detection

ENSEMBLE learning has been widely used in data and IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 9, NO. 5, SEPTEMBER 2012 943 Sparse Kernel-Based Hyperspectral Anoaly Detecton Prudhv Gurra, Meber, IEEE, Heesung Kwon, Senor Meber, IEEE, andtothyhan Abstract

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines A Modfed Medan Flter for the Removal of Impulse Nose Based on the Support Vector Machnes H. GOMEZ-MORENO, S. MALDONADO-BASCON, F. LOPEZ-FERRERAS, M. UTRILLA- MANSO AND P. GIL-JIMENEZ Departamento de Teoría

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Face Recognition Based on SVM and 2DPCA

Face Recognition Based on SVM and 2DPCA Vol. 4, o. 3, September, 2011 Face Recognton Based on SVM and 2DPCA Tha Hoang Le, Len Bu Faculty of Informaton Technology, HCMC Unversty of Scence Faculty of Informaton Scences and Engneerng, Unversty

More information

Extraction of User Preferences from a Few Positive Documents

Extraction of User Preferences from a Few Positive Documents Extracton of User Preferences from a Few Postve Documents Byeong Man Km, Qng L Dept. of Computer Scences Kumoh Natonal Insttute of Technology Kum, kyungpook, 730-70,South Korea (Bmkm, lqng)@se.kumoh.ac.kr

More information

User Behavior Recognition based on Clustering for the Smart Home

User Behavior Recognition based on Clustering for the Smart Home 3rd WSEAS Internatonal Conference on REMOTE SENSING, Vence, Italy, Noveber 2-23, 2007 52 User Behavor Recognton based on Clusterng for the Sart Hoe WOOYONG CHUNG, JAEHUN LEE, SUKHYUN YUN, SOOHAN KIM* AND

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Joint Registration and Active Contour Segmentation for Object Tracking

Joint Registration and Active Contour Segmentation for Object Tracking Jont Regstraton and Actve Contour Segentaton for Object Trackng Jfeng Nng a,b, Le Zhang b,1, Meber, IEEE, Davd Zhang b, Fellow, IEEE and We Yu a a College of Inforaton Engneerng, Northwest A&F Unversty,

More information

X-Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

A MODIFIED K-NEAREST NEIGHBOR CLASSIFIER TO DEAL WITH UNBALANCED CLASSES

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton We-Chh Hsu, Tsan-Yng Yu E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton

More information

Web Document Classification Based on Fuzzy Association

Web Document Classification Based on Fuzzy Association Web Document Classfcaton Based on Fuzzy Assocaton Choochart Haruechayasa, Me-Lng Shyu Department of Electrcal and Computer Engneerng Unversty of Mam Coral Gables, FL 33124, USA charuech@mam.edu, shyu@mam.edu

More information