Efficient Text Classification by Weighted Proximal SVM *


Dong Zhuang 1, Benyu Zhang 2, Qiang Yang 3, Jun Yan 4, Zheng Chen 2, Ying Chen 1

1 Computer Science and Engineering, Beijing Institute of Technology, Beijing 100081, China, {zhuangdong, chenying1}@bit.edu.cn
2 Microsoft Research Asia, Beijing 100080, China, {byzhang, zhengc}@microsoft.com
3 Computer Science, Hong Kong University of Science and Technology, Hong Kong, qyang@cs.ust.hk
4 Department of Information Science, School of Mathematical Science, Peking University, yanjun@math.pku.edu.cn

Abstract

In this paper, we present an algorithm that can classify large-scale text data with high classification quality and fast training speed. Our method is based on a novel extension of the proximal SVM model [3]. Previous studies on proximal SVM have focused on classification of low dimensional data and did not consider unbalanced data. Such methods meet difficulties when classifying unbalanced and high dimensional data sets such as text documents. In this work, we extend the original proximal SVM by learning a weight for each training error. We show that the classification algorithm based on this model is capable of handling high dimensional and unbalanced data. In the experiments, we compare our method with the original proximal SVM (as a special case of our algorithm) and the standard SVM (such as SVMlight) on the recently published RCV1-v2 dataset. The results show that our proposed method has classification quality comparable with the standard SVM, while both the time and memory consumption of our method are less than those of the standard SVM.

1. Introduction

Automatic text classification involves first training a classifier on labeled documents and then using the classifier to predict the labels of unlabeled documents. Many methods have been proposed to solve this problem. The SVM (Support Vector Machine), which is based on statistical learning theory [11], has been shown to be one of the best methods for text classification problems [6][8]. Much research has been done to make SVM practical for classifying large-scale datasets [4][10]. The purpose of our work is to further advance the SVM classification technique for large-scale text data that are unbalanced. In particular, we show that when the text data are largely unbalanced, that is, when the positive and negative labeled data are in disproportion, the classification quality of the standard SVM deteriorates. This problem has previously been addressed using cross-validation based methods, but cross-validation is very inefficient due to its tedious parameter adjustment routines. In response, we propose a weighted proximal SVM (WPSVM) model, in which the weights can be adjusted, to solve the unbalanced data problem. Using this weighted proximal SVM method, we can achieve the same accuracy as the traditional SVM while requiring much less computational time.

Our WPSVM model is an extended version of the proximal SVM (PSVM) model, which was originally proposed in [3]. According to the experimental results of [3], when classifying low dimensional data, training a proximal SVM is much faster than training a standard SVM, and the classification quality of proximal SVM is comparable with the standard SVM. However, the original proximal SVM is not suitable for text classification for two reasons: 1) text data are high dimensional, but the method proposed in [3] is not suitable for training on high dimensional data; 2) data are often unbalanced in text classification, but proximal SVM does not work well in this situation. Moreover, in our experiments we found that the classification quality of proximal SVM deteriorates more quickly than that of the standard SVM as the training data becomes unbalanced.
* This work was done at Microsoft Research Asia.

In response, we propose the weighted proximal SVM (WPSVM) model in this paper. We show that this method can be successfully applied to classifying high dimensional and unbalanced text data through the following two modifications: 1) in WPSVM, we add a weight to each training error and develop a simple method to estimate the weights; adjusting the weights automatically solves the unbalanced data problem; 2) instead of solving the problem via the KKT (Karush-Kuhn-Tucker) conditions and the Sherman-Morrison-Woodbury formula as in [3], we use an iterative algorithm to solve WPSVM, which makes WPSVM suitable for classifying high dimensional data. Experimental results on RCV1-v2 [7][8] show that the classification quality of WPSVM is as accurate as the traditional SVM and more accurate than proximal SVM when the data are unbalanced. At the same time, WPSVM is much more computationally efficient than the traditional SVM.

The rest of this paper is organized as follows. In Section 2, we review the text classification problem and the SVM and proximal SVM algorithms. In Section 3, we propose the weighted proximal SVM model and explore how to solve it efficiently. In Section 4, we discuss implementation issues. Experimental results are given in Section 5. In Section 6, we give conclusions and future work.

2. Problem Definition and Related Work

2.1. Problem Definition

In our formulation, text documents are represented in the Vector Space Model [1]. In this model, each document is represented by a vector of weighted term frequencies using the TF*IDF [1] indexing schema. For simplicity we first consider the binary classification problem, where there are only two class labels in the training data: positive (+1) and negative (-1). Note that the multi-class classification problem can be solved by combining multiple binary classifiers; this will be done in our future work.

Suppose there are $m$ documents and $n$ terms in the training data. We use $\langle x_i, y_i \rangle$ to denote each training example, where $x_i \in R^n$, $i = 1, 2, \ldots, m$ are the training vectors and $y_i \in \{+1, -1\}$ are their corresponding class labels. The binary text classification problem can be formulated as follows: given a training dataset $\{\langle x_i, y_i \rangle \mid x_i \in R^n,\ y_i \in \{-1, +1\},\ i = 1, \ldots, m\}$, find a classifier $f(x): R^n \to \{+1, -1\}$ such that for any unlabeled data $x$ we can predict the label of $x$ by $f(x)$.

We first review the standard SVM and the proximal SVM; more details can be found in [2] and [3]. This paper follows the notation of [2], which may differ somewhat from that used in [3]. The SVM algorithms introduced in this paper all use the linear kernel; it is also possible to use non-linear kernels, but they offer no significant advantage for text classification.

2.2. Standard SVM Classifier

The standard SVM algorithm aims to find an optimal hyperplane $w \cdot x + b = 0$ and use this hyperplane to separate the positive and negative data. The classifier can be written as:

$$f(x) = \begin{cases} +1, & \text{if } w \cdot x + b \ge 0 \\ -1, & \text{if } w \cdot x + b < 0 \end{cases}$$

The separating hyperplane is determined by two parameters, $w$ and $b$. The objective of the SVM training algorithm is to find $w$ and $b$ from the information in the training data, which the standard SVM does by solving the following optimization problem:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \qquad (1)$$
$$\text{s.t.}\quad y_i (w \cdot x_i + b) + \xi_i \ge 1,\quad \xi_i \ge 0$$

The first term $\|w\|^2$ controls the margin between the positive and negative data, and $\xi_i$ represents the training error of the $i$-th training example. Minimizing the objective function of (1) means minimizing the training errors and maximizing the margin simultaneously. $C$ is a parameter that controls the tradeoff between the training errors and the margin.
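To make formulation (1) concrete, the sketch below minimizes its equivalent unconstrained form, $\frac{1}{2}\|w\|^2 + C\sum_i \max(0,\ 1 - y_i(w \cdot x_i + b))$, by plain subgradient descent on toy data. This is a minimal illustration of the objective only, not the authors' solver and no substitute for the optimized packages of [4][10]; the function name, hyperparameters and data are our own.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Toy subgradient descent on the hinge-loss form of objective (1):
    (1/2)||w||^2 + C * sum(max(0, 1 - y_i * (w.x_i + b)))."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                     # examples with nonzero xi_i
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# usage on two Gaussian clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b = train_linear_svm(X, y)
print("training accuracy:", (np.where(X @ w + b >= 0, 1, -1) == y).mean())
```

Only margin-violating examples contribute to the gradient, which mirrors how the slack variables $\xi_i$ enter (1).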

Figure 1. Standard SVM

The intuition of the standard SVM is shown in Figure 1. $w \cdot x + b = 1$ and $w \cdot x + b = -1$ are two bounding planes, and the distance between them is the margin. The optimization problem (1) can be converted to a standard Quadratic Programming problem, and many efficient methods have been proposed to solve it on large-scale data [2][4].

2.3. Proximal SVM Classifier

The proximal SVM also uses a hyperplane $w \cdot x + b = 0$ as the separating surface between positive and negative training examples, but the parameters $w$ and $b$ are determined by solving the following problem:

$$\min_{w,b,\xi}\ \frac{1}{2}(\|w\|^2 + b^2) + \frac{C}{2} \sum_{i=1}^{m} \xi_i^2 \qquad (2)$$
$$\text{s.t.}\quad y_i (w \cdot x_i + b) + \xi_i = 1$$

The main difference between the standard SVM (1) and the proximal SVM (2) is the constraints: the standard SVM employs an inequality constraint whereas the proximal SVM employs an equality constraint. The intuition of the proximal SVM is shown in Figure 2. The standard SVM only considers points on the wrong side of $w \cdot x + b = 1$ and $w \cdot x + b = -1$ as training errors, whereas the proximal SVM treats all points not located on these two planes as training errors. In this case the value of a training error $\xi_i$ in (2) may be positive or negative, so the second part of the objective function in (2) uses a squared loss $\xi_i^2$ instead of $\xi_i$ to capture this new notion of error.

Figure 2. Proximal SVM

The proximal SVM makes these modifications mainly for efficiency. [3] proposed an algorithm that solves (2) using the KKT conditions and the Sherman-Morrison-Woodbury formula. This algorithm is very fast and has effectiveness comparable with the standard SVM when the data dimension is far less than the number of training examples ($n \ll m$). However, in text classification $n$ usually has the same magnitude as $m$, so the condition $n \ll m$ no longer holds. To the best of our knowledge, little research has been conducted on the performance of proximal SVM on high dimensional data. Although the original PSVM algorithm of [3] is not suitable for high dimensional data, Formula (2) can be solved efficiently for high dimensional data using iterative methods. We applied the proximal SVM to text classification, but found that when the data are unbalanced, i.e., when the amount of positive data is much larger than that of the negative data or vice versa, the effectiveness of proximal SVM deteriorates more quickly than that of the standard SVM. Data unbalance is common in text classification, which motivates us to search for an extension of proximal SVM that deals with this problem.
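Because the constraints of (2) are equalities, the slack variables can be eliminated, which reduces (2) to a regularized least-squares problem with a closed-form solution. The sketch below illustrates this observation with a dense solve; it is our illustration, not the KKT plus Sherman-Morrison-Woodbury routine of [3], and is only sensible in the low dimensional regime that [3] targets. `train_psvm` is a hypothetical name.

```python
import numpy as np

def train_psvm(X, y, C=1.0):
    """Closed-form proximal SVM: substituting xi_i = 1 - y_i(w.x_i + b)
    into (2) gives regularized least squares over beta = [w; b]:
        ((1/C) I + A^T A) beta = A^T y,  with A = [X, e]."""
    m, n = X.shape
    A = np.hstack([X, np.ones((m, 1))])
    beta = np.linalg.solve(np.eye(n + 1) / C + A.T @ A, A.T @ y)
    return beta[:-1], beta[-1]                       # w, b

w, b = train_psvm(np.array([[0., 1.], [1., 2.], [3., 0.], [4., 1.]]),
                  np.array([1., 1., -1., -1.]))
print(np.sign(np.array([[0.5, 1.5], [3.5, 0.5]]) @ w + b))   # [ 1. -1.]
```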

3. Weighted Proximal SVM Model

In this section we show why the original proximal SVM is not suitable for classifying unbalanced data. Without loss of generality, suppose the amount of positive data is much smaller than that of the negative data. In this case the total accumulated error of the negative data is much higher than that of the positive data. Consequently, the bounding plane $w \cdot x + b = 1$ will shift away from the negative data to produce a larger margin, at the price of increasing the positive errors. Since the positive data are rare, this action lowers the value of the objective function (2). The separating plane is then biased toward the positive data, resulting in a higher precision and a lower recall on the positive training data.

To solve this problem, we assign a non-negative weight $\delta_i$ to each training error $\xi_i$ and convert the optimization problem (2) into the following form:

$$\min_{w,b,\xi}\ \frac{1}{2}\nu(\|w\|^2 + b^2) + \frac{1}{2} \sum_{i=1}^{m} \delta_i^2 \xi_i^2 \qquad (3)$$
$$\text{s.t.}\quad y_i (w \cdot x_i + b) + \xi_i = 1$$

The differences between (2) and (3) are:

1. Formula (2) assumes all the training errors $\xi_i$ are equally weighted, whereas in Formula (3) we use a non-negative parameter $\delta_i^2$ to represent the weight of each training error $\xi_i$.

2. In Formula (3), we let $\nu = 1/C$ and move the tradeoff parameter from the error term to $(\|w\|^2 + b^2)$. The purpose of this change is notational simplicity in the later development of our solution method.

Although (3) can be solved using the KKT conditions and the Sherman-Morrison-Woodbury formula as in [3], that strategy is inefficient for high dimensional data like text documents. Instead, we convert (3) into an unconstrained optimization problem that can be solved directly by iterative methods. The constraint of (3) can be written as:

$$\xi_i = 1 - y_i(w \cdot x_i + b), \qquad \xi_i^2 = \big(1 - y_i(w \cdot x_i + b)\big)^2 = \big(y_i - (w \cdot x_i + b)\big)^2 \qquad (4)$$

where the last equality uses $y_i^2 = 1$. Substituting (4) for $\xi_i$ in the objective function of (3), we get the unconstrained optimization problem:

$$\min_{w,b}\ f(w, b) = \frac{1}{2}\nu(\|w\|^2 + b^2) + \frac{1}{2} \sum_{i=1}^{m} \delta_i^2 \big(y_i - (w \cdot x_i + b)\big)^2 \qquad (5)$$

For notational simplicity, let $X \in R^{m \times n}$ denote the TF*IDF matrix of the documents, whose row vectors are the $x_i$, and let $e$ be a vector whose elements are all 1. Let $A = [X, e] \in R^{m \times (n+1)}$, $\beta = [w; b] \in R^{n+1}$, and let $\Delta \in R^{m \times m}$ denote the diagonal matrix with nonzero elements $\Delta_{ii} = \delta_i$. Then (5) can be written as:

$$\min_\beta\ f(\beta) = \frac{1}{2}\nu \|\beta\|^2 + \frac{1}{2}\|\Delta(y - A\beta)\|^2 \qquad (6)$$

The gradient of $f(\beta)$ is:

$$\nabla f(\beta) = \nu\beta - (\Delta A)^T(\Delta y - \Delta A \beta) = \big(\nu I + (\Delta A)^T(\Delta A)\big)\beta - (\Delta A)^T(\Delta y)$$

and the Hessian matrix of $f(\beta)$ is:

$$H = \nu I + (\Delta A)^T(\Delta A)$$

Since $\nu > 0$ and $(\Delta A)^T(\Delta A)$ is positive semi-definite, $H$ is positive definite. The solution of (6) is therefore found where $\nabla f(\beta) = 0$, that is:

$$\big(\nu I + (\Delta A)^T(\Delta A)\big)\,\beta = (\Delta A)^T(\Delta y) \qquad (7)$$

Equation (7) can be generically written as $(\text{shift} \cdot I + A'A)x = A'b$, where $A$ is a high dimensional sparse matrix. The CGLS/LSQR [9] algorithms are dedicated to solving this kind of problem efficiently.
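Equation (7) is exactly the normal-equation form of the damped least-squares problem $\min_\beta \|\Delta A \beta - \Delta y\|^2 + \nu\|\beta\|^2$, so an off-the-shelf LSQR solver can be applied to the sparse matrix $\Delta A$ directly, without ever forming $(\Delta A)^T(\Delta A)$. The sketch below uses SciPy's LSQR as a stand-in for the authors' CGLS implementation; `solve_wpsvm` is a hypothetical name.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def solve_wpsvm(X, y, delta, v, tol=1e-8, iter_lim=1000):
    """Solve Equation (7), (v I + (DA)^T DA) beta = (DA)^T D y, via the
    equivalent problem  min ||D A beta - D y||^2 + v ||beta||^2,
    where D = Delta. LSQR's damp parameter supplies the v term."""
    m = X.shape[0]
    A = sp.hstack([X, np.ones((m, 1))]).tocsr()      # A = [X, e]
    D = sp.diags(delta)                              # Delta (diagonal)
    beta = lsqr(D @ A, delta * y, damp=np.sqrt(v),
                atol=tol, btol=tol, iter_lim=iter_lim)[0]
    return beta[:-1], beta[-1]                       # w, b
```

The `iter_lim` and tolerance arguments play the role of the adjustable terminating condition discussed in Section 4.2 below.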
4. Algorithm Design

There are two main concerns in the algorithm design: how to set the parameters and how to solve Equation (7) efficiently. We address these concerns in this section.

4.1. Parameter Tuning

Several parameters need to be decided in the training algorithm. Parameter $\nu$ controls the tradeoff between maximizing the margin and minimizing the training errors. Parameters $\delta_i$, $i = 1, 2, \ldots, m$ control the relative error weights of the individual training examples. To simplify the parameter setting for the unbalanced data problem, we set the error weight of all positive training data to $\delta_+$ and of all negative training data to $\delta_-$. Then we only need to set three parameters: $\nu$, $\delta_+$ and $\delta_-$.

These parameters can be decided by statistical estimation methods on the training data, such as LOO (Leave-One-Out) cross-validation, k-fold cross-validation, etc. If we iteratively update the weights by the separating plane obtained from the previous round of training, we essentially obtain a boosting based method such as AdaBoost [13]. However, a disadvantage of these boosting based and cross-validation based methods is that they need too much training time for parameter estimation. To obtain a more efficient method, we have developed a simple way to estimate the parameters directly from the training data. It achieves effectiveness comparable to algorithms that use the standard SVM plus cross-validation techniques.

Our parameter estimation method is as follows. To get a balanced accumulated error on both positive and negative data, it is desirable to have the following condition:

$$\delta_+ \sum_{y_i = +1} \xi_i^2 = \delta_- \sum_{y_i = -1} \xi_i^2$$

If we assume the errors $\xi_i^2$ of both positive and negative training data have the same expectation, we get:

$$\delta_+ = \frac{N_-}{N_+}\,\delta_- \qquad (8)$$

where $N_+$ is the number of positive training examples and $N_-$ is the number of negative training examples. We then set the parameters $\delta_-$ and $\delta_+$ as follows:

Set $\delta_- = 1$
Set $ratio = N_-/N_+$
Set $\delta_+ = 1 + (ratio - 1)/2$

Notice that we do not set $\delta_+ = ratio$ to exactly satisfy Equation (8). Instead, we use a conservative setting strategy to make the precision of the minor class a little higher than its recall. This strategy usually results in higher accuracy for unbalanced data.

Parameter $\nu$ is set as follows:

$$\nu = 2 \cdot \operatorname{average}(\delta_i^2 \|x_i\|^2)$$

When the data are exactly balanced (the number of positive examples equals the number of negative examples), this method results in $\delta_- = \delta_+ = 1$ and makes WPSVM equal to PSVM. Therefore, PSVM can be viewed as a special case of WPSVM.

To give an intuitive example of the differences between WPSVM and PSVM, we manually generated a balanced dataset and an unbalanced dataset in a two dimensional space and calculated the separating planes of WPSVM and PSVM on each. The results are shown in Figure 3 and Figure 4. Figure 3 shows that the separating planes of PSVM and WPSVM are almost the same when the data are balanced. Figure 4 shows that when the data are unbalanced, the separating plane of WPSVM resides in the middle of the positive and negative examples, while the separating plane of PSVM is inclined toward the positive examples.

Figure 3. Separating planes for balanced data
Figure 4. Separating planes for unbalanced data

4.2. Training Algorithms

We tried several methods for solving Equation (7) and found that CGLS [9] has the best performance, though many other iterative optimization methods can also be used. The complexity of the training algorithm is dominated by the algorithm used for solving Equation (7). Usually this kind of algorithm has O(KZ) time complexity and O(Z) space complexity, where K is the number of iterations and Z is the number of non-zero elements in the training vectors.

An iterative method can only find an approximate solution to the problem. The more iterations are used, the longer the training takes and the closer the iterative solution is to the optimal solution. However, once the iteration count reaches a certain number, the classification result no longer changes as the number of iterations continues to increase. It is therefore important to select a good terminating condition to obtain a good tradeoff between training time and classification accuracy. Since the number of required iterations may vary between datasets, we made the terminating condition an adjustable parameter when implementing the WPSVM algorithm.
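Putting the two halves of this section together, a compact end-to-end training sketch might look as follows. This is our reconstruction of the described procedure, reusing `solve_wpsvm` from the sketch in Section 3; it assumes, as the recipe in Section 4.1 does, that the positive class is the minority in the one-vs-rest setup.

```python
import numpy as np

def estimate_parameters(X, y):
    """Default parameter setting from Section 4.1: delta- = 1,
    delta+ = 1 + (ratio - 1)/2 with ratio = N-/N+, and
    v = 2 * average(delta_i^2 * ||x_i||^2). X is a sparse TF*IDF matrix."""
    n_pos, n_neg = (y == 1).sum(), (y == -1).sum()
    ratio = n_neg / n_pos
    delta = np.where(y == 1, 1.0 + (ratio - 1.0) / 2.0, 1.0)
    row_norms = np.asarray(X.multiply(X).sum(axis=1)).ravel()  # ||x_i||^2
    v = 2.0 * np.mean(delta ** 2 * row_norms)
    return delta, v

def train_wpsvm(X, y):
    """WPSVM training: estimate (delta, v), then solve Equation (7)."""
    delta, v = estimate_parameters(X, y)
    return solve_wpsvm(X, y, delta, v)     # from the Section 3 sketch

def predict(X, w, b):
    return np.where(X @ w + b >= 0, 1, -1)
```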

5. Experiments

Rationale: Our experiments evaluate the relative merits of WPSVM and other SVM based methods. We verify the following hypotheses on text datasets:

1. WPSVM (with default parameter settings) has the same classification power as the standard SVM plus cross-validation, slightly better classification power than the standard SVM (with default parameter settings), and much better classification power than PSVM.

2. WPSVM is much more efficient than the standard SVM.

Data sets: We chose the textual dataset RCV1-v2 [8]. RCV1 (Reuters Corpus Volume I) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Lewis et al. [8] made some corrections to the RCV1 dataset, and the resulting new dataset is called RCV1-v2. The RCV1-v2 dataset contains a total of 804,414 documents. Benchmark results of SVM, weighted k-NN and Rocchio-style algorithms on RCV1-v2 are reported in [8]; they show that SVM is the best method on this dataset. To make our experimental results comparable with the benchmark results, we strictly follow the instructions of [8]. That is, we use the same vector files, training/test split and effectiveness measures as in [8].

Text data representation: The feature vector of a document was produced from the concatenation of the text in the <headline> and <text> tags, after tokenization, stemming and stopword removal. The 47,219 terms that appear in the training data are used as features. The features are weighted using the TF*IDF indexing schema and then cosine normalized. The resulting vectors are published at [7]; we use these vectors directly in our experiments.

Training/test split: The training/test split follows the publishing time of the documents. Documents published from August 20, 1996 to August 31, 1996 are treated as training data; documents published from September 1, 1996 to August 19, 1997 are treated as test data. This split produces 23,149 training documents and 781,265 test documents.

Categories and effectiveness measures: Each document can be assigned labels according to three different category sets: Topics, Industries or Regions. For each single category, the one-vs-rest strategy is used in the experiments. In other words, when classifying category X, all the examples labeled X are defined as positive examples, and the other examples are defined as negative examples. The F1 measure is used to evaluate the classification quality of the different methods. F1 is determined by Precision and Recall; for a single category they are defined as follows:

Precision = (# of correctly classified positive examples) / (# of examples predicted positive by the classifier)
Recall = (# of correctly classified positive examples) / (# of real positive examples)
F1 = (2 x Precision x Recall) / (Precision + Recall)

The average effectiveness is measured by the average micro-F1 and average macro-F1. Average macro-F1 is the mean of the per-category F1 values in the category set. Average micro-F1 is defined as follows:

microP = sum over categories i of (# of correctly predicted docs for category i) / sum over i of (# of docs predicted as category i)
microR = sum over categories i of (# of correctly predicted docs for category i) / sum over i of (# of docs that truly belong to category i)
Average micro-F1 = (2 x microP x microR) / (microP + microR)
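For reference, the measures defined above can be computed as in the following sketch (our own code; it assumes one-vs-rest boolean matrices with one column per category, matching the setup described here):

```python
import numpy as np

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0    # Precision
    r = tp / (tp + fn) if tp + fn else 0.0    # Recall
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: boolean arrays of shape (n_docs, n_categories)."""
    tp = (y_true & y_pred).sum(axis=0)        # per-category counts
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    macro = np.mean([f1(*c) for c in zip(tp, fp, fn)])
    micro = f1(tp.sum(), fp.sum(), fn.sum())  # pool counts over categories
    return micro, macro
```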
5.1. Experiments on WPSVM's Effectiveness

In the effectiveness experiments, we compare the F1 measure of the following methods:

WPSVM: our proposed algorithm, using the parameter estimation method presented in Section 4.1.
PSVM: all $\delta_i$ in the WPSVM model set to 1, which makes it equivalent to the proximal SVM algorithm.
SVMlight: SVMlight v6.01 [5] with default parameter settings.
SVM.1: a standard SVM plus threshold adjustment, a benchmark method used in [8]. In this algorithm, SVMlight is run with default parameter settings to produce a score, and the threshold is calculated by the SCutFBR.1 [12] algorithm.
SVM.2: a standard SVM plus LOO cross-validation, first introduced in [6] and named SVM.2 in [8]. In this algorithm, SVMlight is run multiple times with different -j parameters, and the best -j parameter is selected by LOO validation. The -j parameter controls the relative weighting of positive to negative examples; this approach addresses the data unbalance situation by selecting the best -j parameter.

The experiments were performed separately on each category using the one-vs-rest strategy. The dataset scale for each category is shown in Table 1.

Table 1. Dataset scale for each category

| Number of training examples | 23,149 |
| Number of test examples | 781,265 |
| Number of features | 47,219 |
| Average number of non-zero elements | 123.9 |

We first introduce the results on the Topics categories. There are in total 101 Topics categories with at least one positive example in the training data. We calculated the F1 value of the five algorithms on each category (the F1 values of SVM.1 and SVM.2 are calculated from the contingency tables published at [7]). Figure 5 shows how the F1 values change from unbalanced to balanced data for the five algorithms. Categories are sorted by training set frequency, which is shown on the x-axis. The F1 value for a category with frequency x has been smoothed by replacing it with the output of a local linear regression over the interval x-200 to x+200 (see the sketch below).

From the results we can see that when the training data are relatively balanced (the right part of Figure 5), the F1 measures of the five algorithms show no big differences. When the training data are unbalanced (the left part of Figure 5), the classification quality of WPSVM is between SVM.1 and SVM.2, and all three have better classification quality than SVMlight and PSVM. Figure 5 also shows that the classification quality of PSVM deteriorates more quickly than that of SVMlight as the data become unbalanced.
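The frequency smoothing used for Figure 5 can be sketched as follows; the window of x-200 to x+200 comes from the description above, while the implementation details are our assumptions.

```python
import numpy as np

def smooth_f1(freq, f1_vals, half_window=200):
    """Replace the F1 value of a category with frequency x by the
    prediction of a linear fit over all categories whose training-set
    frequency lies in [x - half_window, x + half_window]."""
    freq, f1_vals = np.asarray(freq, float), np.asarray(f1_vals, float)
    out = f1_vals.copy()
    for i, x in enumerate(freq):
        mask = np.abs(freq - x) <= half_window
        if np.unique(freq[mask]).size >= 2:   # need >= 2 distinct x values
            slope, intercept = np.polyfit(freq[mask], f1_vals[mask], 1)
            out[i] = slope * x + intercept
    return out
```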

Figure 5. F1 measure for five methods on 101 Topics categories

Table 2 shows the average F1 measures over the 101 categories. The results of SVM.1 and SVM.2 are the values reported in [8]. It can be seen that the overall performance of WPSVM, SVM.1 and SVM.2 is better than that of SVMlight and PSVM. SVM.1 has the best average effectiveness, especially in average macro-F1. This is mainly because when the training data are extremely unbalanced (e.g., the positive ratio is less than 0.1%), the threshold adjustment method is better than both WPSVM and SVM.2.

Table 2. Average F1 measure for Topics

| Algorithm | Average micro-F1 | Average macro-F1 |
| PSVM | 0.767 | 0.354 |
| SVMlight | 0.804 | 0.472 |
| WPSVM | 0.808 | 0.589 |
| SVM.2 | 0.810 | 0.557 |
| SVM.1 | 0.816 | 0.619 |

We also tested the effectiveness of WPSVM on the 313 Industries categories and the 228 Regions categories. The average F1 measures on these category sets are shown in Table 3; the SVM.1 values are those reported in [8]. We can see that on the Industries and Regions splits the effectiveness of WPSVM is also comparable with SVM.1.

Table 3. Average F1 for Industries and Regions

| Category set | Algorithm | Average micro-F1 | Average macro-F1 |
| Industries (313) | SVM.1 | 0.513 | 0.297 |
| | WPSVM | 0.502 | 0.301 |
| Regions (228) | SVM.1 | 0.874 | 0.601 |
| | WPSVM | 0.862 | 0.558 |

The effectiveness experiments show that the overall classification quality of WPSVM is comparable with SVM.1 and SVM.2, the best methods of [8], and better than SVMlight and PSVM. However, SVM.1 and SVM.2 require training many times to estimate a good parameter, whereas WPSVM requires training only once.

5.2. Experiments on Computational Efficiency

Computational efficiency is measured by actual training time and memory usage. Since SVM.1 and SVM.2 require running SVMlight many times, their efficiency must be lower than that of SVMlight; thus in these experiments we only compare the efficiency of WPSVM and SVMlight. We ran each algorithm on five training datasets of different sizes. The vector files of [8] are published as one training file and four test files. We use the training file as the first dataset and then incrementally append the remaining four test files to form the other four datasets. The numbers of training examples of the five datasets are 23,149, 222,477, 421,816, 621,392 and 804,414 respectively. Training time is measured in seconds. Both algorithms were run on an Intel Pentium 4 Xeon 3.06 GHz computer.

We found that for the same training size, SVMlight required more training time on balanced data than on unbalanced data. Thus we did two groups of efficiency experiments. One group uses category CCAT as positive examples; the ratio of CCAT is 47.4%, which makes this group a balanced example. The other group is an unbalanced example: it uses GDIP as positive examples, and the ratio of GDIP is 4.7%. Table 4 shows the training times of WPSVM and SVMlight v6.01 on the two groups. We can see that the training time of WPSVM is far less than that of SVMlight and is not affected by the data unbalance problem.

Table 4. Training time comparison (seconds)

| No. of training examples | WPSVM (CCAT) | SVMlight (CCAT) | WPSVM (GDIP) | SVMlight (GDIP) |
| 23,149 | 1.6 | 43.1 | 9.1 | |
| 222,477 | 4.7 | 1313 | 35 | 317 |
| 421,816 | 80.5 | 3306 | 100 | 884 |
| 621,392 | 194.4 | 5110 | 171 | 1599 |
| 804,414 | 273.4 | 10986 | 276 | 458 |

The memory usage of both WPSVM and SVMlight is determined by the training size, regardless of whether the data are balanced or unbalanced. Figure 6 shows the memory requirements of the two algorithms at different training sizes. The memory requirement of WPSVM is slightly less than that of SVMlight. This is because WPSVM needs little more than the memory to store the training data, while SVMlight requires additional working space.

Figure 6. Memory consumption comparison

6. Conclusion and Future Work

In this paper, we proposed a weighted proximal SVM model, which assigns a weight to each training error. We successfully applied the WPSVM model to the text classification problem through a simple parameter estimation method and an algorithm that solves the equations directly with iterative methods instead of using the KKT conditions and the Sherman-Morrison-Woodbury formula. The experiments showed that our proposed method achieves classification quality comparable to the standard SVM supplemented with validation techniques, while being more computationally efficient than the standard SVM.

We only validated the effectiveness of our algorithm on text classification in this paper. As a general linear SVM classification algorithm, it can also be used in other classification tasks. It is worth pointing out that we only demonstrated the advantage of WPSVM in solving the data unbalance problem; the WPSVM model may have other potential uses, since the relative importance of each training point can be adjusted based on other prior knowledge.

7. Acknowledgements

Qiang Yang is supported by a grant from Hong Kong RGC: HKUST6187/04E.

8. References

[1] Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval. Addison Wesley, 1999.
[2] Burges, C., A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 1998.
[3] Fung, G. and Mangasarian, O. L., Proximal Support Vector Machine Classifiers. In Proc. of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), 2001.
[4] Joachims, T., Making Large-Scale SVM Learning Practical. Advances in Kernel Methods: Support Vector Learning, 1999.
[5] Joachims, T., SVMlight: Support Vector Machine. Feb 9th, 2004. http://svmlight.joachims.org.
[6] Lewis, D. D., Applying support vector machines to the TREC-2001 batch filtering and routing tasks. In The Tenth Text REtrieval Conference (TREC 2001), pages 286-292, Gaithersburg, MD 20899-0001, 2002. National Institute of Standards and Technology.
[7] Lewis, D. D., RCV1-v2/LYRL2004: The LYRL2004 Distribution of the RCV1-v2 Text Categorization Test Collection (12-Apr-2004 Version). http://www.jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm
[8] Lewis, D. D., Yang, Y., Rose, T. and Li, F., RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004.
[9] Paige, C. C. and Saunders, M. A., Algorithm 583. LSQR: Sparse linear equations and least-squares problems. ACM TOMS 8(2), 195-209, 1982.
[10] Platt, J., Fast Training of Support Vector Machines using Sequential Minimal Optimization. Advances in Kernel Methods: Support Vector Learning, 1998.
[11] Vapnik, V. N., Statistical Learning Theory. John Wiley & Sons, 1998.
[12] Yang, Y., A study on thresholding strategies for text categorization. In Proceedings of the Twenty-Fourth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), 2001.
[13] Freund, Y. and Schapire, R., Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the Thirteenth International Conference (ICML 1996), 1996.