An Anti-Noise Text Categorization Method based on Support Vector Machines *

Chen Lin, Huang Jie and Gong Zheng-Hu

School of Computer Science, National University of Defense Technology, Changsha, 410073, China
chenlin@nudt.edu.cn, agnes_nudt@yahoo.com.cn

Abstract. With the rapid growth of online information, text categorization has become one of the key techniques for handling and organizing text data. Although the native features of SVM (Support Vector Machines) make it better suited than Naïve Bayes to text categorization in theory, the classification precision of SVM is lower than that of the Bayesian method in practice. This paper investigates the causes by analyzing the shortcomings of SVM, and presents an anti-noise SVM method. The improved method has two characteristics: 1) It chooses the classification space by defining the optimal n-dimension classifying hyperspace. 2) It separates noise samples by preprocessing, and trains the classifier on noise-free samples. Compared with the Naïve Bayes method, the classification precision of anti-noise SVM is increased by about 3 to 9 percent.

Keywords: Support Vector Machines; Outlier detection; Bayes method

1 Introduction

With the rapid growth of the Internet, text categorization has become one of the key techniques for handling and organizing text data. Text categorization classifies text documents into categories of like documents, which reduces the handling overhead and provides smaller domains in which users may explore similar documents. Since building text classifiers by hand is difficult and time-consuming, researchers have more recently explored the use of machine learning techniques to automatically associate documents with categories, using a training set to adapt the classifier. Many statistical classification and machine learning techniques have been applied to text categorization, including Naïve Bayes models [1-4], nearest neighbor classifiers [5], decision trees [6][7], neural networks [8][9], symbolic rule learning [10] and SVM learning [11-13].

* This work is supported by the National Grand Fundamental Research 973 Program of China under Grant No. 2003CB314802.
In this paper, we aim to find out how to improve the precision of SVM by comparing it with the Naïve Bayes method in text categorization. The native virtues of SVM make it more appropriate for text categorization than the Bayesian method in theory. However, when the training samples contain noise, the constructed hyperplane deviates badly from the real optimal hyperplane. For example, there may be a positive sample whose characteristic is closer to the negative samples. The classification precision of SVM then declines sharply, even below that of the Bayesian method. To solve this problem, the paper presents an anti-noise classifying method based on SVM. The improved method first optimizes the high dimension space, and then builds the classifier after removing noise from the training samples. Experiments show that the classifying precision of anti-noise SVM is about 3 to 9 percent higher than that of the Bayesian method.

The rest of the paper is organized as follows. Section 2 introduces the theories of SVM and the Naïve Bayes method. Section 3 measures the precision of SVM and the Bayesian method, and then analyzes the shortcomings of SVM in text categorization. Section 4 presents an optimal hyperspace choosing method and an anti-noise SVM classification method. Simulated experiments are offered in Section 5. Section 6 concludes the paper.

2 Related works

2.1 SVM (Support Vector Machines)

SVM [13] solves two-class classification problems by finding a separation between hyperplanes defined by classes of data. Label the training data $\{x_i, y_i\}$, $i = 1, \ldots, l$, $y_i \in \{-1, +1\}$, $x_i \in R^d$. Suppose we have some hyperplane which separates the positive from the negative examples (a separating hyperplane). The points $x$ which lie on the hyperplane satisfy $w \cdot x + b = 0$, where $w$ is normal to the hyperplane, $|b| / \|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|$ is the Euclidean norm of $w$. Let $d_+$ ($d_-$) be the shortest distance from the separating hyperplane to the closest positive (negative) example.
Define the margin of a separating hyperplane to be $d_+ + d_-$. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with the largest margin, which is called the optimal hyperplane. This can be formulated as follows: suppose that all the training data satisfy the following constraints:

$w \cdot x_i + b \geq +1$, for $y_i = +1$  (1)

$w \cdot x_i + b \leq -1$, for $y_i = -1$  (2)

These can be combined into one set of inequalities:
$y_i (w \cdot x_i + b) - 1 \geq 0, \quad i = 1, \ldots, l$  (3)

Thus we introduce positive Lagrange multipliers $\alpha_i$, $i = 1, \ldots, l$, one for each of the inequality constraints; for equality constraints, the Lagrange multipliers would be unconstrained. This gives the Lagrangian:

$L_P = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i \{ y_i (x_i \cdot w + b) - 1 \}$  (4)

We must now minimize $L_P$ with respect to $w$ and $b$, subject to the constraints $\alpha_i \geq 0$. Requiring that the gradient of $L_P$ with respect to $w$ and $b$ vanish gives the conditions:

$w = \sum_i \alpha_i y_i x_i$  (5)

$\sum_i \alpha_i y_i = 0$  (6)

Since these are equality constraints in the dual formulation, we can substitute them into Eq. (4) to give

$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$  (7)

Applying the KKT (Karush-Kuhn-Tucker) conditions, the solution must satisfy:

$\alpha_i \{ y_i (w \cdot x_i + b) - 1 \} = 0, \quad i = 1, 2, \ldots, l$  (8)

There is a Lagrange multiplier $\alpha_i$ for every training point. In the solution, the points with $\alpha_i > 0$ are called support vectors; all other points have $\alpha_i = 0$ and are unused in training. $b^*$ can be obtained from $b^* = y_i - w^* \cdot x_i$ by choosing any support vector. At last, the classifier can classify texts with the following function:

$H(x) = \mathrm{sgn}(w^* \cdot x + b^*)$  (9)

2.2 Naïve Bayes classifier

The Naïve Bayes classifier learns from training data the conditional probability of each attribute $A_i$ given the class label $C$. Classification is then done by applying the Bayesian rule to compute the probability of $C$ given the particular instance of $A_1, \ldots, A_n$, and then predicting the class with the highest posterior probability. This computation is
rendered feasible by making a strong independence assumption: all the attributes $A_i$ are conditionally independent given the value of the class $C$. By independence we mean probabilistic independence, that is, $A$ is independent of $B$ given $C$ whenever $\Pr(A \mid B, C) = \Pr(A \mid C)$ for all possible values of $A$, $B$ and $C$, whenever $\Pr(C) > 0$ [2].

2.3 SVM is better than the Bayesian method in theory

Thorsten Joachims [11] provides several advantages of SVM for text categorization. We compare them with the Bayesian method.

1) SVM has the potential to handle high dimensional input spaces. The number of different words potentially used in text is very large, so the input space of a text classifier consists of many features. Since SVMs use over-fitting protection, which does not necessarily depend on the number of features, they have the potential to handle these large feature spaces. The Bayesian method, by contrast, must calculate posterior probabilities from prior probabilities, and in a high dimensional space it may suffer from over-fitting. Therefore, SVM is more efficient than the Bayesian method, for it can use the raw statistical values.

2) SVM can process relevant features effectively. Naïve Bayes classification relies on discarding irrelevant features. Unfortunately, there are very few irrelevant features in text categorization, so feature selection is more difficult in the Bayesian method: when features are wrongly assumed irrelevant, classification precision decreases. SVM avoids this, since it can process irrelevant features and relevant ones together.

3) SVM is born to classify two kinds of samples. Most text categorization problems are linearly separable, and SVM can completely separate two kinds of samples by finding an optimal hyperplane under linearly separable conditions. Multi-class classification can be transformed into multiple two-class classification problems, which the Bayesian method can deal with straightforwardly as well.

4) SVM is well suited to problems with dense concepts and sparse instances [13]. Document vectors are sparse: for each document, the corresponding document vector contains only a few entries that are not zero.
It has been proved that SVM is well suited to problems with dense concepts and sparse instances. When document vectors are sparse, the results of Naïve Bayes, which relies on statistical theory, are poor. The native features of SVM thus make it more appropriate for text categorization than the Bayesian method.
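As a concrete illustration of Sect. 2.2, the two-class Naïve Bayes decision can be sketched in a few lines of Python. This is only a toy sketch, not the classifier used in the experiments; the example documents, the add-one (Laplace) smoothing and the log-space arithmetic are assumptions made for the illustration.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (word_list, label); returns priors, per-class counts, vocabulary."""
    priors = Counter(label for _, label in docs)
    counts = {label: Counter() for label in priors}
    for words, label in docs:
        counts[label].update(words)
    vocab = {w for c in counts.values() for w in c}
    return priors, counts, vocab

def classify_nb(words, priors, counts, vocab):
    """Predict the class with the highest posterior, assuming word independence."""
    total_docs = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label, prior in priors.items():
        n = sum(counts[label].values())
        # log P(C) + sum_i log P(w_i | C), with add-one smoothing
        lp = math.log(prior / total_docs)
        for w in words:
            lp += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [("cheap offer win money".split(), "spam"),
        ("meeting agenda draft report".split(), "ham"),
        ("win cheap prize".split(), "spam"),
        ("project report review".split(), "ham")]
model = train_nb(docs)
print(classify_nb("cheap win".split(), *model))   # -> spam
```

The independence assumption shows up as the plain sum of per-word log probabilities; no interaction between attributes is modeled.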
3 Measurements and analysis

3.1 Measurements

We choose 1000 texts about news and science as test samples, and select 200 texts from candidates as training samples. When comparing the two methods, the results in reality do not support the standpoint of Section 2.3. In the following tables, n represents the number of features we selected.

Table 1. Precision of SVM

                 n=300   n=800   n=1000  n=1500
true positives   85.3%   88.5%   90.9%   93.1%
false positives  86.2%   87.6%   92.6%   92.2%

Table 2. Precision of the Naïve Bayes method

                 n=300   n=800   n=1000  n=1500
true positives   87.1%   89.4%   93.9%   96.6%
false positives  88.7%   90.3%   94.1%   95.9%

These strange results made us look for what influences SVM; something must be going on when SVM is applied in the real world.

3.2 Shortcomings of SVM

SVM has better native features than Naïve Bayes, but in the real world it gets the opposite results. We try to find the cause by analyzing the shortcomings of SVM, and draw the following conclusions.

1) SVM has no criterion for feature choice. SVM can classify text perfectly. However, if it simply uses every word appearing in the text as a dimension of the hyperspace, computing the hyperplane is very difficult and classification precision is low. Thus one of our research emphases is how to choose important and useful features to optimize the multi-dimension space.

2) The anti-noise ability of SVM is weak. Although SVM is treated as a good text categorization method, its anti-noise ability is very weak. A support vector is a training sample with the shortest distance to the hyperplane. The number of support vectors is small, but they contain all the information needed for classification. The classifying effect is decided by the minority of support vectors among the samples, so removing or reducing the samples that are not support vectors has no influence on the classifier. If a noise sample is treated as a support vector, it will largely reduce the classification precision of SVM. If we first get rid of noise samples and then train the SVM on the optimized samples, we can achieve higher classifying precision.
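To make the hyperplane of Sect. 2.1 and its noise sensitivity concrete, here is a minimal linear SVM trained by stochastic sub-gradient descent on the soft-margin objective. The Pegasos-style solver and the toy data are assumptions made for the example; the paper does not specify how its SVM is trained.

```python
import random

def train_linear_svm(data, dim, lam=0.01, epochs=300, seed=0):
    """Return (w, b) for H(x) = sgn(w.x + b), fitted by a Pegasos-style
    stochastic sub-gradient sketch of the soft-margin SVM objective."""
    rng = random.Random(seed)
    w, b, t = [0.0] * dim, 0.0, 0
    for _ in range(epochs):
        rng.shuffle(data)                      # shuffles the list in place
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [(1 - eta * lam) * wi for wi in w]   # shrink (regularizer term)
            if margin < 1:                     # point violates y(w.x + b) >= 1
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def classify(w, b, x):
    """Eq. (9): sign of the decision function."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# two linearly separable clusters in R^2
data = [([2.0, 2.0], 1), ([2.5, 1.5], 1), ([3.0, 2.5], 1),
        ([-2.0, -2.0], -1), ([-1.5, -2.5], -1), ([-3.0, -1.0], -1)]
w, b = train_linear_svm(data, dim=2)
print(classify(w, b, [2.2, 1.8]), classify(w, b, [-2.2, -1.8]))
```

Only points with margin below 1 trigger an update, which mirrors the observation in Sect. 3.2: the fitted hyperplane is determined by the few boundary samples, so a single mislabeled point near the boundary can move it substantially.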
4 SVM-based anti-noise text categorization method

In order to obtain higher precision, we need to overcome the shortcomings of SVM. In this section, we enhance the method in two respects.

4.1 Constructing an optimal classifying hyperspace

The efficiency and effectiveness of SVM are largely influenced by the number of dimensions of the hyperspace and by the choice of each dimension. Although SVM has advantages in text classifying, it has no criterion for dimension choice. This section uses a statistical method to choose the most important features as dimensions of the classification space.

Texts consist of words, and the frequency of a word can be treated as one dimension of the hyperspace. Nevertheless, the number of words in texts is in general very large, and it is difficult to decide which words to choose as dimensions. As Figure 1 shows, upper dots denote samples in class $C$ and lower squares denote samples in class $\bar{C}$. The hyperspace in figure (b) is better than that in figure (a), for the difference between $C$ and $\bar{C}$ is more apparent.

Fig. 1. (a) n-dimension hyperspace HS1; (b) n-dimension hyperspace HS2

Therefore, we need a criterion to choose certain words according to the initial learning samples and construct an optimal hyperspace for classification. Assume HS is an n-dimension hyperspace, each dimension of which is the frequency of a word.

Definition 1. The barycentre of the samples belonging to class $C$ in HS is $B_C = \sum_{d \in Sample_C} t_d / |Sample_C|$, where $t_d = (Fr_d(w_1), Fr_d(w_2), \ldots, Fr_d(w_n))$ denotes a sample point in the n-dimension hyperspace HS, and $Fr_d(w_i)$ denotes the frequency of word $w_i$ in text $d$.

Definition 2. We call HS the optimal classifying n-dimension hyperspace about $C$ iff $\|B_C - B_{\bar{C}}\|$ over all samples is maximum under some word set $W$ whose cardinality is n.
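Definitions 1 and 2 can be tried out numerically: a hyperspace is better when the barycentres of the two classes lie further apart. The sketch below uses invented frequency vectors for two candidate hyperspaces; the helper names and the data are assumptions made for the example.

```python
def barycentre(points):
    """Component-wise mean of one class's sample points (Definition 1)."""
    n = len(points)
    return [sum(xs) / n for xs in zip(*points)]

def class_separation(pos, neg):
    """Euclidean distance between the two barycentres; Definition 2 prefers
    the n-dimension hyperspace that maximizes this value."""
    bp, bn = barycentre(pos), barycentre(neg)
    return sum((a - b) ** 2 for a, b in zip(bp, bn)) ** 0.5

# frequency vectors of the same documents in two candidate hyperspaces
hs1_pos = [[0.30, 0.28], [0.26, 0.31]]; hs1_neg = [[0.24, 0.25], [0.27, 0.22]]
hs2_pos = [[0.40, 0.05], [0.35, 0.08]]; hs2_neg = [[0.06, 0.33], [0.09, 0.30]]
print(class_separation(hs1_pos, hs1_neg) < class_separation(hs2_pos, hs2_neg))  # -> True
```

Here the second word pair separates the classes far better than the first, which is exactly the situation contrasted in Figure 1.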
Definition 3. The prior odds on class $C$ are $O(C) = P(C) / P(\bar{C})$; $O(C)$ measures the predictive or prospective support accorded to $C$ by background knowledge alone [14]. In practice, we can calculate the prior odds on $C$ by the following formula:

$O(C) = |\{ t \mid t \in Sample_C \}| \,/\, |\{ t \mid t \in Sample_{\bar{C}} \}|$  (10)

Definition 4. Define the likelihood ratio of word $w$ on $C$ as:

$L(w \mid C) = P(w \mid C) / P(w \mid \bar{C})$  (11)

$L(w \mid C)$ denotes the retrospective support given to $C$ by the evidence actually observed; $P(w \mid C)$ denotes the average frequency of word $w$ in the sample texts of $C$.

Theorem 1. The posterior odds are given by the product:

$O(C \mid w) = L(w \mid C) \, O(C)$  (12)

In practice, we can calculate $P(w \mid C)$ from the frequency of $w$ in the samples of $C$, and $P(w \mid \bar{C})$ from its frequency in $\bar{C}$. We can then work out $O(C \mid w)$ from equations (10)-(12). $O(C \mid w)$ represents the effect of classifying according to the frequency of $w$.

Theorem 2. Choosing the n words with the maximum $O(C \mid w)$, we can construct the optimal hyperspace HS with each corresponding $Fr(w_i)$ as a dimension. HS is the hyperspace in which the difference between $C$ and $\bar{C}$ is most apparent. Text $d$ in HS is represented as $t_d^{HS} = (Fr(w_1), Fr(w_2), \ldots, Fr(w_n))$, where each $w_i$ is one of the n words with maximum $O(C \mid w)$.

4.2 Improving the anti-noise ability of SVM

SVM has high classification precision under noise-free conditions; under noisy conditions, the precision drops sharply. As Figure 2 shows, point x is a noise sample in an n-dimension hyperspace. Although x belongs to the positive samples, it differs greatly from the other positive samples. If we consider x as a support vector when computing the optimal hyperplane, the hyperplane will deviate largely from the real optimal hyperplane, and classification precision is affected seriously.
Fig. 2. Noise sample x affects the optimal hyperplane

Although x is a positive sample, its characteristic is much more different from the other positive samples and may be close to the negative samples under some conditions. That is, the corresponding point x in the high dimension space is an outlier. Noise in the negative samples has the same characteristic. If we eliminate such noise from the samples before training the SVM, the classification precision will increase largely. As Figure 3 shows, we get a more reasonable optimal hyperplane after ignoring the influence of x during training.

Fig. 3. The optimal hyperplane when ignoring noise sample x

In order to construct an anti-noise text classifier, we present a method that filters noise samples by outlier detection in the high dimensional space before training the SVM. Suppose D is a classified sample set, o, p, q are samples in D, and d(p, q) represents the distance between samples p and q [15].

Definition 5. (k-distance of sample p, k-dist(p)) Let d(p, o) be the distance between sample p and a sample o in set D. If at least k samples o' ∈ D satisfy d(p, o') ≤ d(p, o), and at most (k−1) samples o' ∈ D satisfy d(p, o') < d(p, o), then d(p, o) is called the k-distance of sample p, k-dist(p).

Definition 6. (k-nearest neighbors of sample p, N_k(p)) The set of samples in D whose distance to p does not exceed k-dist(p): N_k(p) = { q ∈ D \ {p} | d(p, q) ≤ k-dist(p) }.

Definition 7. (Local density of sample p, den_k(p)) The local density of sample p is the reciprocal of the average k-distance over N_k(p), that is, den_k(p) = 1 / avg{ k-dist(q) | q ∈ N_k(p) }.

Definition 8. (Local outlier coefficient of sample p, LOF_k(p)) The local outlier coefficient of sample p is the ratio between the average density of N_k(p) and den_k(p), that is, LOF_k(p) = avg{ den_k(q) | q ∈ N_k(p) } / den_k(p). The local outlier coefficient reflects how isolated sample p is relative to the nearest neighbors around it.
In order to separate noise samples, we calculate LOF_k(t) for each text t in classes $C$ and $\bar{C}$; if LOF_k(t) is greater than a threshold θ, we conclude that t is an outlier, that is, text t is noise in the samples.
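The noise-filtering step (Definitions 5-8 plus the threshold θ) can be sketched as follows. One simplification is assumed: the density below is the reciprocal of p's average distance to its k-nearest neighbours, a simplified stand-in for Definition 7 that keeps the toy example well behaved; the data points and the threshold value are also invented for the example.

```python
def dist(p, q):
    """Euclidean distance d(p, q) between two sample vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def k_neighbors(D, p, k):
    """Definitions 5-6: the k-distance of p and its neighbor set N_k(p)."""
    ds = sorted((dist(p, q), q) for q in D if q is not p)
    kd = ds[k - 1][0]
    return kd, [q for d, q in ds if d <= kd]

def density(D, p, k):
    """Simplified local density: reciprocal of p's average distance to N_k(p)
    (an assumed simplification of Definition 7)."""
    _, nbrs = k_neighbors(D, p, k)
    return len(nbrs) / sum(dist(p, q) for q in nbrs)

def lof(D, p, k):
    """Definition 8: average neighbor density divided by p's own density."""
    _, nbrs = k_neighbors(D, p, k)
    return sum(density(D, q, k) for q in nbrs) / len(nbrs) / density(D, p, k)

# a tight cluster of document vectors plus one far-away "noise" sample
D = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
theta = 1.5
noise = [p for p in D if lof(D, p, k=2) > theta]
print(noise)   # the distant point is flagged as an outlier
```

Points inside the cluster get LOF close to 1 (their density matches their neighbors'), while the distant point gets a much larger value and would be removed before the SVM is trained.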
At last, we get a reasonable classification function $H(x) = \mathrm{sgn}(w^* \cdot x + b^*)$ by filtering noise samples.

5 Validity test

Considering the problem of classifying texts, we first partition the training samples into sets $C$ and $\bar{C}$ manually. Then we select n words according to Section 4.1, and remove noise samples according to threshold θ by calculating LOF_k(t) for each text in $C$ or $\bar{C}$. At last, the classification function $H(x) = \mathrm{sgn}(w^* \cdot x + b^*)$ is obtained.

We select 1000 test samples and 200 training samples as in Section 3.1. We test the method over the parameter n (the number of dimensions) with θ = 20%.

Table 3. Precision of the anti-noise method for different parameters n and θ = 20%

                 n=300   n=800   n=1000  n=1500
true positives   96.7%   97.8%   99.5%   99.8%
false positives  97.2%   98.1%   99.7%   99.9%

From Tables 1 and 2, we conclude that although SVM fits text categorization better in theory, its precision is worse than the Bayesian method's in practice. From Tables 1 and 3, we find that the precision of the classifier increased by about 6 to 11 percent after applying the anti-noise method. And from Tables 2 and 3, the anti-noise SVM method shows its advantage in text categorization: the precision of the classifier increased by about 3 to 9 percent compared with the Naïve Bayes method.

6 Conclusions

This paper enhances support vector machines for text categorization. Recognizing that SVM has better native features than the Naïve Bayes method, we concluded that SVM should be preferable, at least for text categorization. In practice, however, the classification precision of SVM is lower than that of Naïve Bayes. This strange result made us look for what influences SVM when it is applied in the real world. We found that SVM has no criterion for feature choice, so we construct an optimal hyperspace for classification by defining the optimal n-dimension classifying hyperspace. Moreover, since the anti-noise ability of SVM is weak, we separate noise samples by preprocessing and build a text classifier trained on noise-free samples.
In the overall comparison of anti-noise SVM and the Naïve Bayes method on 1000 test samples, the precision results over different parameters n indicate significant differences in performance between the two: the classification precision of anti-noise SVM increased by about 3 to 9 percent.
References

[1] Yang, Y. An evaluation of statistical approaches to text categorization. CMU Technical Report, CMU-CS-97-127, April 1997.
[2] Friedman, N., Goldszmidt, M. Building classifiers using Bayesian networks. In: Proc. National Conference on Artificial Intelligence, Menlo Park, CA: AAAI Press, 1996: 1277-1284.
[3] Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, George Paliouras and Constantine D. Spyropoulos. An Evaluation of Naive Bayesian Anti-Spam Filtering. 2000.
[4] Cross Validation for the Naive Bayes Classifier of SPAM. http://stat-www.berkeley.edu/users/nolan/stat133/spr04/projects/spampart2.pdf, 2004.3.
[5] Lewis, D.D. and Ringuette, M. A comparison of two learning algorithms for text categorization. In: Third Annual Symposium on Document Analysis and Information Retrieval, 81-93, 1994.
[6] Sholom M. Weiss, et al. Maximizing Text-Mining Performance. IEEE Intelligent Systems, 2-8, July/August 1999.
[7] Micheline K., Lara W., Wang G., et al. Generalization and decision tree induction: efficient classification in data mining. ftp://ftp.fas.sfu.ca/pub/cs/han/dd/rde97.ps, 1997-02-13.
[8] Zhou Z., Chen S., Chen Z. FANNC: A fast adaptive neural network classifier. International Journal of Knowledge and Information Systems, 2000, 2(1): 115-129.
[9] Lee, J., Tsai, J. On-line fault detection using integrated neural networks. In: Proc. of Applications of Artificial Neural Networks, SPIE, 1992: 436-446.
[10] J. Kivinen, M. Warmuth, and P. Auer. The perceptron algorithm vs. winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. In: Conference on Computational Learning Theory, 1995.
[11] Thorsten Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of ECML-98, 10th European Conference on Machine Learning, 1998.
[12] A. Basu, C. Watters, and M. Shepherd. Support Vector Machines for Text Categorization. Proceedings of the 36th Hawaii International Conference on System Sciences (HICSS'03).
[13] Burges, C. A Tutorial on Support Vector Machines for Pattern Recognition. Journal of Data Mining and Knowledge Discovery, 2(2), 121-167, 1998.
[14] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. ISBN 0-934613-73-7.
[15] Xu LongFei, Xiong JunLi, et al. Study on Algorithm for Rough Set based Outlier Detection in High Dimension Space. Computer Science, 2003, Vol. 30, No. 10 (in Chinese).