2016 International Conference on Electrical Engineering and Automation (ICEEA 2016) ISBN: 978-1-60595-407-3

Weighted Method to Improve the Centroid-based Classifier

Chuan LIU, Wen-yong WANG*, Guang-hui TU, Nan-nan LIU and Yu XING

Department of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

*Corresponding author

Keywords: Text categorization, Centroid-based classifier, Machine learning, Gravitation model.

Abstract. The Centroid-Based Classifier (CBC) is one of the most widely used text classification methods due to its theoretical simplicity and computational efficiency. However, the accuracy of CBC is unsatisfactory when it deals with skewed (class-imbalanced) data. In this paper, we propose a new classification model, named the Gravitation Model (GM), to address the model misfit of CBC. In the proposed model, each category is given a mass factor that indicates its distribution in the vector space, and this factor is learned from the training data. We compare GM with CBC and its improved variants on twelve real datasets; the results show that the proposed gravitation model consistently outperforms CBC. Furthermore, it reaches the same performance as the best centroid-based classifier while being more stable.

Introduction

The Centroid-Based Classifier (CBC) [1,2,3,4,5,6] is one of the most popular text categorization (TC) methods. The basic idea of CBC is that an unlabeled sample should be assigned to a particular class if the similarity of this sample to the centroid of that class is the largest. Compared with other TC methods, CBC is more efficient, since its computational complexity is linear in both the training and testing phases; this merit is important for online text classification tasks. Although it has been shown that CBC consistently outperforms methods such as k-nearest neighbors, Naive Bayes, and decision trees on a wide range of datasets [7], CBC often suffers from model misfit when the data are not well distributed, and model misfit leads to poor classification performance.
To solve the model misfit of CBC, numerous approaches have been proposed, such as the Class-Feature-Centroid classifier (CFC) [3], the Generalized Cluster Centroid based Classifier (GCCC) [8], DragPushing (DP) [6], and Large Margin DragPushing (LMDP) [5]. However, the existing variants of CBC focus on obtaining good centroids in the construction or training phase, and therefore cannot remove the inherent disadvantages of the CBC model. In this paper, a new centroid-based classification model is introduced that overcomes the inherent shortcomings (or bias) of CBC on class-imbalanced datasets. In the proposed model, each class is given a mass factor that indicates the data distribution of the corresponding class, and the value of the mass factor is learned from the training set. A new document is then assigned to the class that exerts the maximum gravitational force on it. The proposed method is empirically evaluated against three frequently used centroid-based methods (i.e., CBC, CFC, and DP) on twelve real text categorization datasets. The experimental results demonstrate that the proposed method works well on these datasets and clearly outperforms most state-of-the-art centroid-based approaches (e.g., CBC and CFC). The remainder of this paper is organized as follows: the proposed model is introduced in Section 2, its performance is evaluated in Section 3, and the paper closes with conclusions and future work in Section 4.

The Proposed Gravitation Model

The proposed Gravitation Model (GM) concentrates on adjusting the classification hyperplane to reduce the bias inherent in CBC, and it is entirely different from the previous works [3,4,5,6], which
obtain the centroids with good initial term weights in the construction phase or modify the positions of the centroids during the training phase.

The Motivation of the Gravitation Model

The gravitation model is motivated by Newton's law of universal gravitation, which states that every particle attracts every other particle in the universe by a force. The magnitude of the attractive force F between two objects is:

F = G * M * m / r^2,  (1)

where G denotes the gravitational constant, M is the mass of the first object, m is the mass of the second object, and r is the distance between the two objects. This formula indicates that the force is proportional to the product of the masses of the two objects and inversely proportional to the square of the distance between them.

Figure 1. The gravitational equilibrium point S0 between objects A and B.

The Definition of the Gravitation Model

Analogously to universal gravitation, as shown in Fig. 1, object A (or B) can be seen as class c_i (or c_j), and object S can be seen as an unknown sample d in the vector space. Hence, the label of sample d is determined by the attractive forces from classes c_i and c_j. For example, if sample d lies to the left of the Lagrange point S0, where the gravitational attraction of A on object S counteracts B's gravitational attraction on object S, then sample d is classified into category c_i because F_i > F_j. Conversely, if F_i < F_j, sample d is classified into category c_j. Therefore, the classification hyperplane in the gravitation model is no longer the hyperplane on which classes c_i and c_j have the same similarity to the unknown sample d, as in CBC, but the Lagrange hyperplane on which classes c_i and c_j exert the same attractive force. In the vector space, the Lagrange hyperplane between two classes is defined as follows:

M_i / r_i^2 = M_j / r_j^2,  (2)

where r_i (or r_j) is the distance from sample d to the centroid of class c_i (or class c_j). In fact, 1/r_i (or 1/r_j) can be regarded as a measure of the similarity between sample d and the centroid of class c_i (or class c_j).
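As a quick numeric check of the equilibrium condition in Eq. (2), the sketch below locates S0 on the segment between two class centroids and verifies that the two attractive forces balance there. The masses and distances are illustrative values of our own, not taken from the paper.

```python
# A numeric sketch of the equilibrium condition in Eq. (2).
# Masses and distances are illustrative values, not taken from the paper.

def force(mass, r, G=1.0):
    """Attractive force exerted by a class of the given mass on a sample at distance r."""
    return G * mass / (r ** 2)

# Two class centroids 9 units apart; class i is four times as "heavy" as class j.
m_i, m_j, d = 4.0, 1.0, 9.0

# On the segment between the centroids, S0 satisfies m_i / r_i^2 = m_j / r_j^2,
# i.e. r_i / r_j = sqrt(m_i / m_j).
ratio = (m_i / m_j) ** 0.5
r_i = d * ratio / (1 + ratio)  # distance from the centroid of class i to S0
r_j = d - r_i                  # distance from the centroid of class j to S0

# The two attractive forces balance at S0.
assert abs(force(m_i, r_i) - force(m_j, r_j)) < 1e-9
```

Note that enlarging M_i pushes S0 toward class c_j, which enlarges the region of the vector space claimed by class c_i; this is precisely the lever the learned mass factors provide.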
Therefore, Eq. (2) can be rewritten as:

M_i * sim^2(d, c_i) = M_j * sim^2(d, c_j),  (3)

where sim(d, c_i) (or sim(d, c_j)) is the measure of similarity between sample d and the centroid of class c_i (or c_j), and M_i (or M_j) is the unknown mass factor, which can be learned from the training set. The objective of learning the mass factors is to obtain the best Lagrange hyperplane, i.e., the one for which the classification error (1 - accuracy) is lowest. In the testing phase, the gravitation model first calculates the attractive force between an unidentified sample d and each class, using the following formula:
F(d, c_i) = M_i * sim^2(d, c_i),  i = 1, 2, ...,  (4)

where M_i denotes the mass factor of class c_i. The class of document d is then determined by assigning it to the class with the maximum attractive force:

Class(d) = arg max_i F(d, c_i).  (5)

Obviously, the mass factors introduced by the gravitation model indicate the class distribution in the training phase, and they are then applied to determine the category of unidentified samples in the testing phase. In this framework, learning the mass factors leads to a Voronoi-like partition that directly alleviates the class imbalance problem.

Learning the Mass Factors

The key to the gravitation model is determining the inherent mass factor of each class so that the model fits the training data well. The idea is to use the misclassified samples to guide the update of the factors until the number of misclassified samples falls to zero or to a small constant; for each misclassified sample, the two corresponding mass factors are updated. For instance, suppose a document d that belongs to class c_i is misclassified into class c_j. To classify d correctly, the attractive force of class c_i on sample d must be increased and the force between class c_j and sample d correspondingly decreased; thus, the mass factor M_i should be enlarged while M_j should be reduced. The concrete update formulas for M_i and M_j are:

M_i := M_i + λ,  (6)

M_j := M_j - λ,  (7)

where λ is a constant given by the user that controls the update strength. The mass factors associated with misclassified samples are updated until the difference in classification error between two adjacent iterations is less than a given threshold (denoted by ε), or the number of iterations reaches the maximum.

Experiments

In this section, the performance of the gravitation model is evaluated on a range of text collections by comparing it with CBC, CFC, and DP.

Datasets

The benchmark collections used in the experiments are 20-Newsgroups, Reuters-21578, and ten other datasets from the Karypis Lab.
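Before turning to the individual collections, the gravitation model described above can be sketched in code: force-based prediction (Eqs. (4)-(5)) and mass-factor learning (Eqs. (6)-(7)). This is a minimal illustration under our own assumptions: the additive update form, sim(d, c) = 1/r so that the force is M * sim^2, a precomputed similarity matrix `sims`, and parameter names `lam` (λ) and `eps` (ε).

```python
import numpy as np

def predict(sims, mass):
    """Eqs. (4)-(5): assign each sample to the class with the largest
    attractive force F(d, c_k) = M_k * sim(d, c_k)^2."""
    return np.argmax(mass * sims ** 2, axis=1)

def learn_mass_factors(sims, labels, lam=1e-4, eps=1e-4, max_iter=50):
    """Learn one mass factor per class from the misclassified training
    samples, using the additive updates of Eqs. (6)-(7).
    sims[n, k] holds the similarity of training sample n to the centroid
    of class k; labels is an integer array of true class indices."""
    mass = np.ones(sims.shape[1])  # all classes start with unit mass
    prev_err = 1.0
    for _ in range(max_iter):
        pred = predict(sims, mass)
        wrong = pred != labels
        err = wrong.mean()
        for true_c, pred_c in zip(labels[wrong], pred[wrong]):
            mass[true_c] += lam    # strengthen the true class (Eq. 6)
            mass[pred_c] -= lam    # weaken the wrongly winning class (Eq. 7)
        if abs(prev_err - err) < eps:  # stop when the error stabilizes
            break
        prev_err = err
    return mass
```

In effect this is a perceptron-style procedure over class weights: each misclassified sample nudges the Lagrange hyperplane between its true class and the class that wrongly won.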
These collections are concisely introduced as follows:

20-Newsgroups --- a benchmark corpus typically used in text categorization (and text clustering) research. The corpus consists of 19,997 articles organized into 20 different categories, and it is highly balanced, since each category has nearly 1,000 texts.

Reuters --- the Apte split of Reuters-21578 with 90 categories, containing 11,406 texts, is used in the experiments. The instance distribution over the 90 categories is highly imbalanced.

Datasets from the Karypis Lab --- ten datasets, namely cacmcisi, classic, fbis, hitech, new3, ohscal, re0, re1, tr12, and tr23, are taken from the Karypis collection. Detailed descriptions of each collection can be found in [9].

Experimental Design

A five-fold cross-validation scheme is applied to evaluate the performance of the algorithms. With respect to CFC, the denormalized prototype vector is used for prediction, and the parameter b is
fixed to e^1.5. Note that cross-validation was not applied in the evaluation of CFC in [3] (CFC was trained on the whole dataset and tested on three quarters of all samples). Thus, in these experiments, CFC is investigated in two ways, i.e., with five-fold cross-validation and with the method of [3], to give more reasonable results. For conciseness of reference, "CFC" denotes the original CFC as evaluated in [3], while "CFC-CV" denotes CFC evaluated with five-fold cross-validation. Regarding DP, the learning rate is set to 0.01, and the maximum number of iterations equals 50. For the gravitation model, the parameters λ and ε are set to 0.0001, and the maximum number of iterations is also set to 50 on all corpora.

Experimental Results and Analyses

The overall performance comparisons in MicroF1 and MacroF1 are listed in Tables 1 and 2. The maximum value in each row is highlighted in bold, excluding CFC, since the evaluation method of CFC differs from that of the other algorithms. From Tables 1 and 2, the following conclusions can be drawn.

GM consistently outperforms CBC in both MicroF1 and MacroF1. The MicroF1 of GM is respectively 9.0%, 5.2%, 4.6%, 4.3%, and 3.8% higher than that of CBC on Reuters, cacmcisi, classic, re0, and re1, and slightly better than CBC on the remaining datasets. Meanwhile, the MacroF1 of GM beats CBC by 6.48%, 3.9%, and 2.8% on Reuters, cacmcisi, and classic, respectively. On the rest of the datasets, the MacroF1 of GM also shows a small increase over CBC.

Table 1. The comparison of different classifiers in MicroF1.

Dataset      GM     CBC    CFC-CV  DP     CFC
Newsgroups   0.887  0.886  0.773   0.899  0.968
Reuters      0.874  0.783  0.705   0.893  0.95
cacmcisi     0.927  0.875  0.910   0.944  0.998
classic      0.930  0.884  0.811   0.948  0.887
fbis         0.820  0.794  0.719   0.800  0.901
hitech       0.710  0.704  0.53    0.690  0.969
new3         0.799  0.779  0.570   0.81   0.984
ohscal       0.773  0.761  0.549   0.773  0.934
re0          0.805  0.762  0.61    0.815  0.919
re1          0.853  0.815  0.680   0.849  0.978
tr12         0.908  0.899  0.674   0.908  1.000
tr23         0.814  0.790  0.637   0.859  0.987

Table 2. The comparison of different classifiers in MacroF1.
Dataset      GM     CBC    CFC-CV  DP     CFC
Newsgroups   0.883  0.883  0.765   0.899  0.97
Reuters      0.780  0.716  0.571   0.764  0.963
cacmcisi     0.921  0.882  0.91    0.938  0.998
classic      0.936  0.908  0.856   0.951  0.91
fbis         0.780  0.764  0.654   0.782  0.907
hitech       0.677  0.670  0.473   0.657  0.971
new3         0.809  0.797  0.603   0.83   0.988
ohscal       0.763  0.758  0.535   0.763  0.933
re0          0.794  0.78   0.440   0.800  0.93
re1          0.787  0.787  0.539   0.784  0.978
tr12         0.908  0.900  0.658   0.906  1.000
tr23         0.807  0.793  0.581   0.847  0.99

CFC shows the best performance, while CFC-CV shows the worst performance compared with the other algorithms on most of the datasets. The gap between CFC and CFC-CV arises because CFC uses the whole dataset to calculate its prototype vectors, whereas CFC-CV only uses the training set. The MicroF1 and MacroF1 of CFC-CV are lower than those of CBC on all datasets except cacmcisi. In contrast, the MicroF1 and MacroF1 of CFC have an overwhelming advantage
compared with the others on eleven datasets. This also indicates that CFC has low generalization ability (i.e., it overfits the training data).

DP is the best centroid-based classifier, and GM performs approximately as well as DP on most of the datasets. As shown in Table 1, GM and DP obtain the same MicroF1, which is the best performance, on ohscal and tr12. Excluding CFC, GM obtains the best MicroF1 on five of the twelve datasets, while DP does so on nine. A similar trend can be seen in Table 2, where GM leads on five datasets and DP takes the crown on eight. From Tables 1 and 2, it should be appreciated that GM is always a very close runner-up when DP is the winner. However, it is also worth noting that DP performs worse than CBC on some datasets, such as hitech and re1, which never happens with GM. Thus, DP cannot reduce the inherent bias of CBC in some particular situations, and GM undoubtedly proves itself to be more stable than DP.

Conclusions and Future Works

In this paper, we proposed a novel centroid classification model motivated by Newton's law of universal gravitation. To calculate the gravitational force, GM gives each category a mass factor that indicates the sample distribution of that category in the vector space. In particular, GM can fit a dataset well even when the distribution is skewed. An unlabeled sample is assigned to the class that exerts the largest gravitational force on it. The experimental results show that the proposed gravitation model is effective, and it is efficient due to its linear training and testing time.

Acknowledgement

This work has been supported by the MoE-CMCC (Ministry of Education of China - China Mobile Communications Corporation) Joint Science Fund under grant MCM20130661.

References

[1] Takçı H., Güngör T. A high performance centroid-based classification approach for language identification [J]. Pattern Recognition Letters, 2012, 33(16): 2077-2084.

[2] Dandan W., Qingcai C., Xiaolong W. A Framework of Centroid-Based Methods for Text Categorization [J]. IEICE Transactions on Information and Systems, 2014, 97(2): 245-254.
[3] Guan H., Zhou J., Guo M. A class-feature-centroid classifier for text categorization [C]//Proceedings of the 18th International Conference on World Wide Web, 2009: 201-210.

[4] Lertnattee V., Theeramunkong T. Class normalization in centroid-based text categorization [J]. Information Sciences, 2006, 176(12): 1712-1738.

[5] Tan S. Large margin DragPushing strategy for centroid text categorization [J]. Expert Systems with Applications, 2007, 33(1): 215-220.

[6] Tan S. An improved centroid classifier for text categorization [J]. Expert Systems with Applications, 2008, 35(1): 279-285.

[7] Han E.H.S., Karypis G. Centroid-based document classification: Analysis and experimental results [C]//European Conference on Principles of Data Mining and Knowledge Discovery. Springer Berlin Heidelberg, 2000: 424-431.

[8] Pang G., Jiang S. A generalized cluster centroid based classifier for text categorization [J]. Information Processing & Management, 2013, 49(2): 576-586.

[9] Zhao Y., Karypis G., Du D.Z. Criterion functions for document clustering [R]. Technical Report, 2005.