Credibility Adjusted Term Frequency: A Supervised Term Weighting Scheme for Sentiment Analysis and Text Classification

Yoon Kim
New York University
yhk255@nyu.edu

Owen Zhang
zhonghua.zhang2006@gmail.com

Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 79-83, Baltimore, Maryland, USA. June 27, 2014. © 2014 Association for Computational Linguistics

Abstract

We provide a simple but novel supervised weighting scheme for adjusting term frequency in tf-idf for sentiment analysis and text classification. We compare our method to baseline weighting schemes and find that it outperforms them on multiple benchmarks. The method is robust and works well on both snippets and longer documents.

1 Introduction

Baseline discriminative methods for text classification usually involve training a linear classifier over bag-of-words (BoW) representations of documents. In BoW representations (also known as Vector Space Models), a document is represented as a vector where each entry is a count (or binary count) of tokens that occurred in the document. Given that some tokens are more informative than others, a common technique is to apply a weighting scheme to give more weight to discriminative tokens and less weight to non-discriminative ones. Term frequency-inverse document frequency (tf-idf) (Salton and McGill, 1983) is an unsupervised weighting technique that is commonly employed. In tf-idf, each token $i$ in document $d$ is assigned the following weight,

$$ w_{i,d} = tf_{i,d} \cdot \log \frac{N}{df_i} \tag{1} $$

where $tf_{i,d}$ is the number of times token $i$ occurred in document $d$, $N$ is the number of documents in the corpus, and $df_i$ is the number of documents in which token $i$ occurred.

Many supervised and unsupervised variants of tf-idf exist (Debole and Sebastiani (2003); Martineau and Finin (2009); Wang and Zhang (2013)). The purpose of this paper is not to perform an exhaustive comparison of existing weighting schemes, and hence we do not list them here. Interested readers are directed to Paltoglou and Thelwall (2010) and Deng et al. (2014) for comprehensive reviews of the different schemes.

In the present work, we propose a simple but novel supervised method to adjust the term frequency portion in tf-idf by assigning a credibility adjusted score to each token. We find that it outperforms the traditional unsupervised tf-idf weighting scheme on multiple benchmarks. The benchmarks include both snippets and longer documents. We also compare our method against Wang and Manning (2012)'s Naive-Bayes Support Vector Machine (NBSVM), which has achieved state-of-the-art results (or close to it) on many datasets, and find that it performs competitively against NBSVM. We additionally find that the traditional tf-idf performs competitively against other, more sophisticated methods when used with the right scaling and normalization parameters.

2 The Method

Consider a binary classification task. Let $C_{i,k}$ be the count of token $i$ in class $k$, with $k \in \{-1, 1\}$. Denote by $C_i$ the count of token $i$ over both classes, and by $y^{(d)}$ the class of document $d$. For each occurrence of token $i$ in the training set, we calculate the following,

$$ s_i^{(j)} = \begin{cases} \dfrac{C_{i,1}}{C_i}, & \text{if } y^{(d)} = 1 \\ \dfrac{C_{i,-1}}{C_i}, & \text{if } y^{(d)} = -1 \end{cases} \tag{2} $$

Here, $j$ is the $j$-th occurrence of token $i$. Since there are $C_i$ such occurrences, $j$ indexes from 1 to $C_i$. We assign a score to token $i$ by,

$$ \hat{s}_i = \frac{1}{C_i} \sum_{j=1}^{C_i} s_i^{(j)} \tag{3} $$

Intuitively, $\hat{s}_i$ is the average likelihood of making the correct classification given token $i$'s occurrence in the document, if $i$ were the only token in the document. In the binary classification case, $C_{i,1}$ of the occurrences contribute $C_{i,1}/C_i$ each and $C_{i,-1}$ of them contribute $C_{i,-1}/C_i$ each, so this reduces to,

$$ \hat{s}_i = \frac{C_{i,1}^2 + C_{i,-1}^2}{C_i^2} \tag{4} $$

Note that by construction, the support of $\hat{s}_i$ is $[0.5, 1]$.

2.1 Credibility Adjustment

Suppose $\hat{s}_i = \hat{s}_j = 0.75$ for two different tokens $i$ and $j$, but $C_i = 5$ and $C_j = 100$. Intuition suggests that $\hat{s}_j$ is a more credible score than $\hat{s}_i$, and that $\hat{s}_i$ should be shrunk towards the population mean. Let $\hat{s}$ be the (weighted) population mean. That is,

$$ \hat{s} = \frac{\sum_i C_i \cdot \hat{s}_i}{C} \tag{5} $$

where $C$ is the count of all tokens in the corpus. We define the credibility adjusted score for token $i$ to be,

$$ \bar{s}_i = \frac{C_{i,1}^2 + C_{i,-1}^2 + \hat{s} \cdot \gamma}{C_i^2 + \gamma} \tag{6} $$

where $\gamma$ is an additive smoothing parameter. If the $C_{i,k}$'s are small, then $\bar{s}_i \approx \hat{s}$ (otherwise, $\bar{s}_i \approx \hat{s}_i$). This is a form of Buhlmann credibility adjustment from the actuarial literature (Buhlmann and Gisler, 2005). We subsequently define $\overline{tf}$, the credibility adjusted term frequency, to be,

$$ \overline{tf}_{i,d} = (0.5 + \bar{s}_i) \cdot tf_{i,d} \tag{7} $$

and $tf$ is replaced with $\overline{tf}$. That is,

$$ w_{i,d} = \overline{tf}_{i,d} \cdot \log \frac{N}{df_i} \tag{8} $$

We refer to the above as cred-tf-idf hereafter.

2.2 Sublinear Scaling

It is common practice to apply sublinear scaling to $tf$. A word occurring (say) ten times more in a document is unlikely to be ten times as important. Paltoglou and Thelwall (2010) confirm that sublinear scaling of term frequency results in significant improvements in various text classification tasks. We employ logarithmic scaling, where $tf$ is replaced with $\log(tf) + 1$. For our method, $\overline{tf}$ is simply replaced with $\log(\overline{tf}) + 1$. We found virtually no difference in performance between log scaling and other sublinear scaling methods (such as augmented scaling, where $tf$ is replaced with $0.5 + \frac{0.5 \cdot tf}{\max tf}$).

2.3 Normalization

Using normalized features resulted in substantial improvements in performance versus using un-normalized features. We thus use $\hat{x}^{(d)} = x^{(d)} / \lVert x^{(d)} \rVert_2$ in the SVM, where $x^{(d)}$ is the feature vector obtained from cred-tf-idf weights for document $d$.

2.4 Naive-Bayes SVM (NBSVM)

Wang and Manning (2012) achieve excellent (sometimes state-of-the-art) results on many benchmarks using binary Naive Bayes (NB) log-count ratios as features in an SVM.
In their framework,

$$ w_{i,d} = \mathbb{1}\{tf_{i,d}\} \cdot \log \frac{(df_{i,1} + \alpha) / \sum_{i'} (df_{i',1} + \alpha)}{(df_{i,-1} + \alpha) / \sum_{i'} (df_{i',-1} + \alpha)} \tag{9} $$

where $df_{i,k}$ is the number of documents that contain token $i$ in class $k$, $\alpha$ is a smoothing parameter, and $\mathbb{1}\{\cdot\}$ is the indicator function equal to one if $tf_{i,d} > 0$ and zero otherwise. As an additional benchmark, we implement NBSVM with $\alpha = 1.0$ and compare against our results.¹

3 Datasets and Experimental Setup

We test our method on both long and short text classification tasks, all of which were used to establish baselines in Wang and Manning (2012). Table 1 has summary statistics of the datasets. The snippet datasets are:

PL-sh: Short movie reviews with one sentence per review. Classification involves detecting whether a review is positive or negative (Pang and Lee, 2005).²

PL-sub: Dataset with short subjective movie reviews and objective plot summaries. Classification task is to detect whether the sentence is objective or subjective (Pang and Lee, 2004).

And the longer document datasets are:

PL-2k: 2000 full-length movie reviews that have become the de facto benchmark for sentiment analysis (Pang and Lee, 2004).

IMDB: 50k full-length movie reviews (25k training, 25k test), from IMDB (Maas et al., 2011).³

AthR, XGraph: The 20-Newsgroup dataset, 2nd version with headers removed. Classification task is to classify which topic a document belongs to. AthR: alt.atheism vs religion.misc, XGraph: comp.windows.x vs comp.graphics.

Dataset   Length   Pos     Neg     Test
PL-sh                              CV
PL-sub                             CV
PL-2k                              CV
IMDB               12.5k   12.5k   25k
AthR
XGraph

Table 1: Summary statistics for the datasets. Length is the average number of unigram tokens (including punctuation) per document. Pos/Neg is the number of positive/negative documents in the training set. Test is the number of documents in the test set (CV means that there is no separate test set for this dataset and thus a 10-fold cross-validation was used to calculate errors).

¹ Wang and Manning (2012) use the same $\alpha$ but they differ from our NBSVM in two ways. One, they use $\ell_2$ hinge loss (as opposed to $\ell_1$ loss in this paper). Two, they interpolate NBSVM weights with Multinomial Naive Bayes (MNB) weights to get the final weight vector. Further, their tokenization is slightly different. Hence our NBSVM results are not directly comparable. We list their results in Table 2.
² All the PL datasets are available here.
³ amaas/data/sentiment/index.html

3.1 Support Vector Machine (SVM)

For each document, we construct the feature vector $x^{(d)}$ using weights obtained from cred-tf-idf with log scaling and $\ell_2$ normalization. For cred-tf-idf, $\gamma$ is set to 1.0. NBSVM and tf-idf (also with log scaling and $\ell_2$ normalization) are used to establish baselines. Prediction for a test document is given by

$$ y^{(d)} = \operatorname{sign}(w^T x^{(d)} + b) \tag{10} $$

In all experiments, we use a Support Vector Machine (SVM) with a linear kernel and penalty parameter of $C = 1.0$. For the SVM, $w, b$ are obtained by minimizing,

$$ w^T w + C \sum_{d=1}^{N} \max\left(0,\; 1 - y^{(d)} (w^T x^{(d)} + b)\right) \tag{11} $$

using the LIBLINEAR library (Fan et al., 2008).

3.2 Tokenization

We lower-case all words but do not perform any stemming or lemmatization. We restrict the vocabulary to all tokens that occurred at least twice in the training set.

4 Results and Discussion

For PL datasets, there are no separate test sets and hence we use 10-fold cross validation (as do other published results) to estimate errors. The standard train-test splits are used on IMDB and Newsgroup datasets.

4.1 cred-tf-idf outperforms tf-idf

Table 2 has the comparison of results for the different datasets.
Our method outperforms the traditional tf-idf on all benchmarks for both unigrams and bigrams. While some of the differences in performance are significant at the 0.05 level (e.g. IMDB), some are not (e.g. PL-2k). The Wilcoxon signed ranks test is a non-parametric test that is often used in cases where two classifiers are compared over multiple datasets (Demsar, 2006). The Wilcoxon signed ranks test indicates that the overall outperformance is significant at the <0.01 level.

4.2 NBSVM outperforms cred-tf-idf

cred-tf-idf did not outperform Wang and Manning (2012)'s NBSVM (Wilcoxon signed ranks test p-value = 0.1). But it did outperform our own implementation of NBSVM, implying that the extra modifications by Wang and Manning (2012) (i.e. using squared hinge loss in the SVM and interpolating between NBSVM and MNB weights) are important contributions of their methodology. This was especially true in the case of shorter documents, where our uninterpolated NBSVM performed significantly worse than their interpolated NBSVM.

4.3 tf-idf still performs well

We find that tf-idf still performs remarkably well with the right scaling and normalization parameters. Indeed, the traditional tf-idf outperformed many of the more sophisticated methods that employ distributed representations (Maas et al. (2011); Socher et al. (2011)) or other weighting schemes (Martineau and Finin (2009); Deng et al. (2014)).
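As an illustration of the weighting evaluated in this section, the following is a minimal Python sketch of cred-tf-idf (Equations 4 to 8) with the log scaling of Section 2.2 and the l2 normalization of Section 2.3. It is not the authors' code: the function names, the token-list input format, and the choice to skip tokens that were never seen in training are assumptions made for the example, and Equation 7 is assumed to use the credibility adjusted score of Equation 6.

```python
import math
from collections import Counter


def credibility_scores(docs, labels, gamma=1.0):
    """Credibility adjusted score (Equation 6) for every training token.

    docs   : list of token lists (hypothetical input format)
    labels : list of +1 / -1 class labels, one per document
    """
    pos, neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos if y == 1 else neg).update(tokens)

    vocab = set(pos) | set(neg)
    total = {t: pos[t] + neg[t] for t in vocab}           # C_i
    sq = {t: pos[t] ** 2 + neg[t] ** 2 for t in vocab}    # C_{i,1}^2 + C_{i,-1}^2

    # Equation 5: count-weighted mean of s_hat_i, using C_i * s_hat_i = sq / C_i.
    corpus_count = sum(total.values())                    # C
    s_hat_mean = sum(sq[t] / total[t] for t in vocab) / corpus_count

    # Equation 6: shrink each raw score towards the population mean.
    return {t: (sq[t] + s_hat_mean * gamma) / (total[t] ** 2 + gamma) for t in vocab}


def document_frequencies(docs):
    """Number of training documents that contain each token (df_i)."""
    return Counter(t for tokens in docs for t in set(tokens))


def cred_tfidf(tokens, scores, df, n_docs):
    """l2-normalized cred-tf-idf weights for one document."""
    weights = {}
    for t, tf in Counter(tokens).items():
        if t not in scores:      # assumption: tokens unseen in training are skipped
            continue
        adj_tf = (0.5 + scores[t]) * tf                   # Equation 7
        scaled = math.log(adj_tf) + 1.0                   # log scaling (Section 2.2)
        weights[t] = scaled * math.log(n_docs / df[t])    # Equation 8
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm > 0 else weights


# Toy training set, invented purely for illustration.
docs = [["good", "good", "fun"], ["bad", "fun"], ["good"], ["bad", "bad"]]
labels = [1, -1, 1, -1]
scores = credibility_scores(docs, labels, gamma=1.0)
vec = cred_tfidf(docs[0], scores, document_frequencies(docs), len(docs))
```

In the toy data, "fun" appears once in each class, so its raw score of 0.5 is pulled towards the population mean, while "good" appears only in positive documents and keeps a score close to 1.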

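The NBSVM baseline discussed in Section 4.2 builds its features from the log-count ratio of Equation 9. A minimal Python sketch of that feature, under stated assumptions, is given here. The function names and the token-list input format are hypothetical, tokens unseen in training are dropped by assumption, and the sketch covers only the feature construction (no SVM training and no interpolation with MNB weights).

```python
import math
from collections import Counter


def nb_log_count_ratios(docs, labels, alpha=1.0):
    """Log-count ratio per token from smoothed per-class document frequencies."""
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # set(tokens) counts each token at most once per document (df, not tf).
        (df_pos if y == 1 else df_neg).update(set(tokens))

    vocab = set(df_pos) | set(df_neg)
    pos_total = sum(df_pos[t] + alpha for t in vocab)
    neg_total = sum(df_neg[t] + alpha for t in vocab)
    return {
        t: math.log(((df_pos[t] + alpha) / pos_total)
                    / ((df_neg[t] + alpha) / neg_total))
        for t in vocab
    }


def nbsvm_features(tokens, ratios):
    """Equation 9: binary indicator of presence times the log-count ratio."""
    return {t: ratios[t] for t in set(tokens) if t in ratios}


# Toy training set, invented purely for illustration.
docs = [["good", "good", "fun"], ["bad", "fun"], ["good"], ["bad", "bad"]]
labels = [1, -1, 1, -1]
ratios = nb_log_count_ratios(docs, labels, alpha=1.0)
features = nbsvm_features(docs[0], ratios)
```

Because the indicator is binary, the repeated "good" in the first toy document contributes the same feature value as a single occurrence would.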
Columns: Method, PL-sh, PL-sub, PL-2k, IMDB, AthR, XGraph
Our results: tf-idf-uni, tf-idf-bi, cred-tfidf-uni, cred-tfidf-bi, NBSVM-uni, NBSVM-bi
Wang & Manning: MNB-uni, MNB-bi, NBSVM-uni, NBSVM-bi
Other results: Appr. Tax.*, Str. SVM*, aug-tf-mi, Disc. Conn., Word Vec.*, LLR, RAE, MV-RNN

Table 2: Results of our method (cred-tf-idf) against baselines (tf-idf, NBSVM), using unigrams and bigrams. cred-tf-idf and tf-idf both use log scaling and $\ell_2$ normalization. Best results (that do not use external sources) are underlined, while top three are in bold. Rows 7-11 are MNB and NBSVM results from Wang and Manning (2012). Our NBSVM results are not directly comparable to theirs (see footnote 1). Methods with * use external data or software.
Appr. Tax: Uses appraisal taxonomies from WordNet (Whitelaw et al., 2005).
Str. SVM: Uses OpinionFinder to find objective versus subjective parts of the review (Yessenalina et al., 2010).
aug-tf-mi: Uses augmented term-frequency with mutual information gain (Deng et al., 2014).
Disc. Conn.: Uses discourse connectors to generate additional features (Trivedi and Eisenstein, 2013).
Word Vec.: Learns sentiment-specific word vectors to use as features combined with BoW features (Maas et al., 2011).
LLR: Uses log-likelihood ratio on features to select features (Aue and Gamon, 2005).
RAE: Recursive autoencoders (Socher et al., 2011).
MV-RNN: Matrix-Vector Recursive Neural Networks (Socher et al., 2012).

5 Conclusions and Future Work

In this paper we presented a novel supervised weighting scheme, which we call credibility adjusted term frequency, to perform sentiment analysis and text classification. Our method outperforms the traditional tf-idf weighting scheme on multiple benchmarks, which include both snippets and longer documents. We also showed that tf-idf is competitive against other state-of-the-art methods with the right scaling and normalization parameters. From a performance standpoint, it would be interesting to see if our method is able to achieve even better results on the above tasks with proper tuning of the $\gamma$ parameter.
Relatedly, our method could potentially be combined with other supervised variants of tf-idf, either directly or through ensembling, to improve performance further.

References

A. Aue, M. Gamon. 2005. Customizing sentiment classifiers to new domains: A case study. Proceedings of the International Conference on Recent Advances in NLP.

H. Buhlmann, A. Gisler. 2005. A Course in Credibility Theory and its Applications. Springer-Verlag, Berlin.

F. Debole, F. Sebastiani. 2003. Supervised Term Weighting for Automated Text Categorization. Proceedings of the 2003 ACM Symposium on Applied Computing.

J. Demsar. 2006. Statistical Comparison of classifiers over multiple data sets. Journal of Machine Learning Research, 7.

Z. Deng, K. Luo, H. Yu. 2014. A study of supervised term weighting scheme for sentiment analysis. Expert Systems with Applications. Volume 41, Issue 7.

R. Fan, K. Chang, J. Hsieh, X. Wang, C. Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, June.

A. Maas, R. Daly, P. Pham, D. Huang, A. Ng, C. Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of ACL.

J. Martineau, T. Finin. 2009. Delta TFIDF: An Improved Feature Space for Sentiment Analysis. Third AAAI International Conference on Weblogs and Social Media.

G. Paltoglou, M. Thelwall. 2010. A study of Information Retrieval weighting schemes for sentiment analysis. In Proceedings of ACL.

B. Pang, L. Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of ACL.

B. Pang, L. Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL.

R. Socher, J. Pennington, E. Huang, A. Ng, C. Manning. 2011. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. Proceedings of EMNLP.

R. Socher, B. Huval, C. Manning, A. Ng. 2012. Semantic Compositionality through Recursive Matrix-Vector Spaces. In Proceedings of EMNLP.

R. Trivedi, J. Eisenstein. 2013. Discourse Connectors for Latent Subjectivity in Sentiment Analysis. In Proceedings of NAACL.

G. Salton, M. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill.

S. Wang, C. Manning. 2012. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. In Proceedings of ACL.

D. Wang, H. Zhang. 2013. Inverse-Category-Frequency Based Supervised Term Weighting Schemes for Text Categorization. Journal of Information Science and Engineering 29.

C. Whitelaw, N. Garg, S. Argamon. 2005. Using appraisal taxonomies for sentiment analysis. In Proceedings of CIKM.

A. Yessenalina, Y. Yue, C. Cardie. 2010. Multi-level Structured Models for Document-level Sentiment Classification. In Proceedings of ACL.
