A Generation Model to Unify Topic Relevance and Lexicon-based Sentiment for Opinion Retrieval


Min Zhang
State Key Lab of Intelligent Tech. & Sys., Dept. of Computer Science, Tsinghua University, Beijing, 100084, China

Xingyao Ye
School of Software, Tsinghua University, Beijing, 100084, China

ABSTRACT
Opinion retrieval is a task of growing interest in social life and academic research, which is to find relevant and opinionated documents according to a user's query. One of the key issues is how to combine a document's opinion score (the ranking score of the extent to which it is subjective or objective) and its topic relevance score. Current solutions to document ranking in opinion retrieval are generally ad-hoc linear combinations, which lack a theoretical foundation and careful analysis. In this paper, we focus on lexicon-based opinion retrieval. A novel generation model that unifies topic relevance and opinion generation by a quadratic combination is proposed. With this model, the relevance-based ranking serves as the weighting factor of the lexicon-based sentiment ranking function, which is essentially different from the popular heuristic linear combination approaches. The effect of different sentiment dictionaries is also discussed. Experimental results on TREC blog datasets show the significant effectiveness of the proposed unified model. Improvements of 28.1% and 40.3% have been obtained in terms of MAP and p@10 respectively. The conclusion is not limited to the blog environment. Besides the unified generation model, another contribution is that our work demonstrates that in the opinion retrieval task, a Bayesian approach to combining multiple ranking functions is superior to using a linear combination. It is also applicable to other result re-ranking applications in similar scenarios.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval Models

General Terms: Algorithms, Experimentation, Theory

Keywords
Generation model, topic relevance, sentiment analysis, opinion retrieval, opinion generation model.

1. INTRODUCTION
In recent years, there has been growing interest in finding out people's opinions from web data.
In many cases, obtaining subjective attitudes towards some object, person or event is a stronger request than getting encyclopedia-like descriptions. General opinion retrieval is an important issue in practical activities such as product surveys, political opinion polls and advertisement analysis. Some researchers have observed this underrepresented information need and made attempts towards efficient detection, extraction and summarization of opinions from web data [7, 8, 15]. However, much of the work focused on presenting a comprehensive and detailed analysis of the sentiments expressed in the text, without studying how well each source document meets the need of the user. In addition, this branch of work seeks solutions for a specific data domain, such as product/movie review websites [7, 15] and weblogs [8], so it makes use of many field-dependent features, such as different aspects of a product, which are not present in other types of text data.

The rising prospects of research and implementation on opinion search are opened up by the explosive amount of user-centric data that has recently become available. People have been writing about their lives and thoughts more freely than ever on personal blogs, virtual communities and special interest forums. Driven by this trend and its intriguing research value, TREC started a special track on blog data in 2006 with a main task of retrieving personal opinions towards various topics, and it has been the track with the most participants. But how to combine the opinion score (the ranking score of the extent to which a document is subjective or objective) with the relevance score remains a key research problem.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'08, July 20-24, 2008, Singapore. Copyright 2008 ACM /08/07 $5.00.
* Supported by the Chinese National Key Foundation Research & Development Plan (2004CB3808), the Natural Science Foundation, and the National 863 High Technology Project (2006AA0Z4).

In previous work, there are many examples where existing methods of document opinion ranking provide no improvement over mere topic-relevance ranking [12]. Things improved in 2007, but there is still an interesting observation that the topic-relevance result outperforms most opinion-based approaches [26]. Ad-hoc solutions have been adopted to combine the relevance ranking and the opinion detection result, causing performance to suffer from a lack of adequate theoretical support.

In this paper, we focus on the problem of searching opinions over general topics, with the aim of presenting a ranked list of documents containing personal opinions towards the given query. We start from general statistics-based information retrieval, following the idea of treating the relevance estimation problem as query generation and document generation. Then, considering the opinion retrieval background, we introduce the new constraint of sentiment expression into the model. With probabilistic derivation, we come to a novel generation model that unifies the topic-relevance model and the opinion generation model by a quadratic combination. It is essentially different from the linear interpolation between the document's relevance score and its opinion score, which is popularly used in such tasks. With this proposed model, the relevance-based ranking criterion now serves as the weighting factor for the lexicon-based sentiment ranking function. Experimental results show the significant effectiveness of the proposed unified model. This is reasonable since the relevance score is a reliable indicator of whether the opinions, if any, expressed in the document are indeed directed at the wanted object. This notion is a novel characteristic of our model because in previous work the opinion score is always calculated independently of the topic-relevance degree. Furthermore, this process can be viewed as result re-ranking. Our work demonstrates that in IR and sentiment analysis, a Bayesian approach to combining multiple ranking functions is superior to using a linear combination. It is also applicable to other result re-ranking applications in similar scenarios.

This opinionated document ranking problem is of fundamental benefit to all opinion-related research issues, in that it can provide high-quality results for further feature extraction and user behavior learning.

Although the experiments in this paper are conducted on the TREC (Text REtrieval Conference) Blog 06 and 07 data sets, no characteristic of blog data has been used, such as feature extraction, blog spam filtering, or processing of blog feeds and comments. In addition, the lexicons used in this work are all domain-independent ones. Hence the conclusion is not limited to the blog environment, and the proposed approach is applicable to all opinion retrieval tasks on different kinds of resources.

The rest of the paper is organized as follows. We first review previous work in section 2. In section 3, we present our generation model for opinion retrieval, which unifies the topic relevance model and sentiment-based opinion generation. Details of estimating the model parameters are also discussed in that section.
After introducing the experimental settings in section 4, we test our generation model with comparative experiments in section 5, together with some further discussion. Finally, we summarize the paper and suggest avenues for future work in section 6.

2. RELATED WORK
There has long been interest in either the topics discussed or the opinions expressed in web documents. A popular approach to opinion identification is text classification [7, 15, 22]. Typically, a sentence classifier is learned from both opinionated and neutral web pages using language features such as local phrases [15] and domain-specific adjective-noun patterns [7]. In order to calculate an opinion score, the classification result is then combined with the topic-relevance score using a binary operator [12].

Another line of research on opinionated documents comes from natural language processing and deals with pure text without constraints on the source of the opinionated data. This work in general treats opinion detection as a text classification problem and uses linguistic features to determine the presence and the polarity of opinions [13, 17, 22]. Nevertheless, it either neglects the problem of retrieving valuable documents [13, 17] or adopts an intuitive solution to ranking that is detached from its opinion detection [22].

It was in Hurst and Nigam's work [4] that topicality and polarity were first fused together to form the notion of opinion retrieval, i.e. finding opinions about a given topic. However, in that work the emphasis is on how to judge the presence of such opinions, and no ranking strategy is put forward. The first opinion ranking formula was introduced by Eguchi and Lavrenko [2] as the cross entropy of topics and sentiments under a generation model. The instantiation of this formula, however, did not perform very well in the subsequent TREC opinion retrieval experiments, and no encouraging result has been obtained.

Opinion search systems that perform well empirically generally adopt a two-stage approach [12]. Topic-relevance search is carried out first using relevance ranking (e.g. TF*IDF ranking or language modeling).
Then heuristic opinion detection is used to re-rank the documents. One major method of identifying opinionated content is to match the documents against a sentiment word dictionary and calculate term frequency [6, 10, 11, 19]. The matching process is often performed multiple times for different dictionaries and different restrictions on matching. Dictionaries are constructed according to existing lexical categories [6, 10, 19] or the word distribution over the dataset [10, 11, 19]. Matching constraints often concern the distance between topic terms and opinion terms, which can be thought of as a sliding window. Some require the two types of words to be in the same sentence [10]; others set the maximum number of words allowed between them [19].

After the opinion score is calculated, an effective ranking formula is needed to combine multiple sources of information. Most existing approaches use a linear combination of relevance score and opinion score [6, 10, 19]. A typical example is shown below.

\[ \alpha \cdot Score_{rel} + \beta \cdot Score_{opn} \tag{1} \]

where α and β are combination parameters, which are often tuned by hand or learned to optimize a target metric such as binary preference [10]. Other alternatives include demoting the ranking of neutral documents [11].

Domain-specific information has long been studied by researchers. Mishne [22, 23] proposed three simple heuristics that improved opinion retrieval performance by using blog-specific properties. Other works make use of many field-dependent features, such as different aspects of a product or movie [7, 15], which are not present in other types of text data.

The TREC blog track is also an important research and experimental platform for opinion retrieval. Its major goal is to explore information seeking behavior in the blogosphere, with an emphasis on spam detection, blog structure analysis, etc. Hence submitted work often goes to great lengths to exploit the non-textual nature of a blog post [10, 12]. This approach makes strong assumptions about the problem domain and is difficult to generalize.

3. GENERATION MODEL FOR OPINION RETRIEVAL

3.1 A New Generation Model
The opinion retrieval task aims to find documents that contain relevant opinions according to a user's query. In existing probabilistic IR models, relevance is modeled with a binary random variable to estimate "What is the probability that this document is relevant to this query?". There are two different ways to factor the relevance probability, i.e. query generation and document generation [5].

In order to rank documents by their relevance, the posterior probability p(d|q) is generally estimated, which captures how well the document d fits the particular query q. According to Bayes' formula,

\[ p(d \mid q) \propto p(q \mid d)\, p(d) \tag{2} \]

where p(d) is the prior probability that a document d is relevant to any query, and p(q|d) denotes the probability of query q being generated by d. When a uniform document prior is assumed, the ranking function reduces to the likelihood of generating the expected query terms from the document.

However, when explicitly searching for opinions, the user's information need is restricted to only an opinionated subset of the relevant documents. This subset is characterized by sentiment expressions s towards topic q. Thus the ranking estimation for opinion retrieval changes to p(d|q,s).

In this paper, for simplicity, when we discuss lexicon-based sentiment analysis, the latent variable s is assumed to be a pre-constructed bag-of-words sentiment thesaurus S, and all sentiment words s_i in S are uniformly distributed. Then the probability that the document d contains relevant opinions to query q is given by

\[
\begin{aligned}
p(d \mid q, s) &= \frac{1}{|S|} \sum_{s_i \in S} p(d \mid q, s_i) \\
&\propto \frac{1}{|S|} \sum_{s_i \in S} p(q, s_i \mid d)\, p(d) \\
&= \frac{1}{|S|} \sum_{s_i \in S} p(s_i \mid d, q)\, p(q \mid d)\, p(d)
\end{aligned}
\tag{3}
\]

where |S| is the number of words in the sentiment thesaurus S. Referring to Equation 2, it is easy to see that Equation 3 combines two factors: the last part, p(q|d) p(d), gives the estimation of topic relevance, and the remainder shows how probably, given query q, a document d generates a sentiment word s_i. Equation 3 is then rewritten as:

\[ p(d \mid q, s) \propto I_{op}(d, q, s) \cdot I_{rel}(d, q), \quad \text{where } I_{op}(d, q, s) = \frac{1}{|S|} \sum_{s_i \in S} p(s_i \mid d, q), \quad I_{rel}(d, q) = p(q \mid d)\, p(d) \tag{4} \]

This is the generation model for opinion retrieval. In this model, I_rel(d,q) is the document generation probability used to estimate topic relevance, and I_op(d,q,s) is the opinion generation probability used for sentiment analysis. Essentially it presents a quadratic relationship between document sentiment and topic relevance, which is naturally induced from the opinion generation process and proves more effective in our experiments than the popular linear interpolation used in previous work, e.g.

\[ p(d \mid q, s) \stackrel{rank}{=} (1 - \lambda)\, p(s \mid d) + \lambda\, p(q \mid d) \tag{5} \]

where λ is the linear combination weight. This result is reasonable since the relevance score is a reliable indicator of whether the opinions, if any, expressed in the document are indeed directed at the wanted object.
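To make the contrast between Equation 4 and Equation 5 concrete, here is a minimal Python sketch. The function names, document labels and score values are hypothetical and chosen only for illustration; they are not taken from the experiments. It shows that under the product form a document with no topic relevance cannot be promoted by its opinion score, whereas under linear interpolation it can.

```python
def linear_combination(score_op, score_rel, lam):
    """Linear interpolation in the style of Equation 5."""
    return (1.0 - lam) * score_op + lam * score_rel


def generation_combination(score_op, score_rel):
    """Product form of Equation 4: relevance weights the opinion score."""
    return score_op * score_rel


def rank(docs, scorer):
    """Return document ids sorted by descending score.

    docs maps a document id to a (score_op, score_rel) pair.
    """
    return sorted(docs, key=lambda d: scorer(*docs[d]), reverse=True)


# Toy scores (assumed): a highly opinionated but off-topic post,
# and a moderately opinionated on-topic review.
toy_docs = {
    "off_topic_rant": (0.9, 0.0),
    "on_topic_review": (0.5, 0.6),
}

linear_order = rank(toy_docs, lambda op, rel: linear_combination(op, rel, lam=0.3))
product_order = rank(toy_docs, generation_combination)
```

With these toy values the linear form places the off-topic post first, while the product form places the on-topic review first.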
The dependence of the opinion score on relevance is a novel characteristic of our framework, in that previous work calculated p(s|d) independently of the topic-relevance degree. In the following two sections, we discuss the two sub-models of the generation model for opinion retrieval in turn.

3.2 Topic Relevance Ranking
In the topic relevance model, I_rel(d,q) is based on the notion of document generation. A classic probabilistic model, the Binary Independence Retrieval (BIR) model [5], is one of the most famous in this branch. The heuristic ranking function BM25 and its variants have been successfully applied in many IR experiments, including the TREC (Text REtrieval Conference) evaluations. Hence in this paper we adopt this BIR-based document generation model, by which the topic relevance score Score_Irel(d,q), given by the ranking function presented in [25], is:

\[ Score_{I_{rel}}(d, q) = \sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \times \frac{(k_1 + 1)\, c(w, d)}{k_1\left((1 - b) + b\,\frac{|d|}{avdl}\right) + c(w, d)} \times \frac{(k_3 + 1)\, c(w, q)}{k_3 + c(w, q)} \tag{6} \]

where c(w,d) is the count of word w in the document d, c(w,q) is the count of word w in the query q, N is the total number of documents in the collection, df(w) is the number of documents that contain word w, |d| is the length of document d, avdl is the average document length, and k_1 (from 1.0 to 2.0), b (usually 0.75) and k_3 (from 0 to 1000) are constants.

3.3 Opinion Generation Model Parameter Estimation
In the opinion generation model, I_op(d,q,s) focuses on the question of how probably, given query q, a document d generates a sentiment expression s. This model falls under the branch of query generation, in which language models have been shown to be quite effective in information retrieval in recent years. The sentiment expression s is a latent variable in our framework: it is not given in the query but is expected to appear in the search results. In this work, we assume s to be a bag-of-words sentiment thesaurus S whose sentiment words s_i are uniformly distributed.
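Returning briefly to the relevance component of Section 3.2, the BM25-style score of Equation 6 can be sketched as below. This is an illustrative sketch, not the code used in the experiments: the function name, the default constants (k1 = 1.2, k3 = 1000) and the toy statistics in the usage lines are assumptions, and the 0.5 terms in the inverse document frequency follow the common form of the formula.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, df, n_docs, avdl,
               k1=1.2, b=0.75, k3=1000.0):
    """Score one tokenized document against a tokenized query (Equation 6).

    df maps a word to its document frequency, n_docs is the collection
    size N, and avdl is the average document length.
    """
    c_q = Counter(query_terms)   # c(w, q)
    c_d = Counter(doc_terms)     # c(w, d)
    doc_len = len(doc_terms)     # |d|
    score = 0.0
    for w, qtf in c_q.items():
        tf = c_d.get(w, 0)
        if tf == 0:
            continue             # the sum runs over words in both q and d
        idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5))
        doc_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * doc_len / avdl) + tf)
        query_part = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf * doc_part * query_part
    return score


# Toy collection statistics (assumed values).
toy_df = {"oprah": 10, "show": 200}
on_topic = bm25_score(["oprah", "show"],
                      ["oprah", "show", "was", "great", "today"],
                      toy_df, n_docs=1000, avdl=5.0)
off_topic = bm25_score(["oprah", "show"],
                       ["the", "weather", "was", "great", "today"],
                       toy_df, n_docs=1000, avdl=5.0)
```

A document that shares no term with the query receives a score of zero, which is what lets the relevance score act as a gate in the unified model.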
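Under the uniform assumption on sentiment words,

\[ I_{op}(d, q, s) = \frac{1}{|S|} \sum_{s_i \in S} p(s_i \mid d, q) \tag{7} \]

Unlike the query generation-based language model in IR, where the number of query terms (|q|) is usually small (less than 100, and in most cases 1 or 2), in our opinion generation model the number of sentiment words (i.e. |S|) is large (generally several thousand), and the sparseness problem is prominent. Hence smoothing plays an important role in parameter estimation for the proposed model.

\[
p(s_i \mid d, q) =
\begin{cases}
p_{seen}(s_i \mid d, q) = p_s(s_i \mid d, q), & \text{if } s_i \text{ is seen in } d \\
p_{unseen}(s_i \mid d, q) = \alpha_d\, p(s_i \mid C, q), & \text{otherwise}
\end{cases}
\tag{8}
\]

where p_s(s_i|d,q) is the smoothed probability of a word s_i seen in the document d given query q, α_d is a coefficient controlling the probability mass assigned to unseen words, and p(s_i|C,q) is the collection language model given query q.

This unigram model can be estimated using any existing method. As illustrated in Zhai & Lafferty's study [20], Jelinek-Mercer smoothing is much more effective than the other two methods when the queries are long and more verbose. In the proposed opinion generation model, the queries are sentiment words. Therefore, under this similar scenario, we use maximum likelihood estimation (MLE) smoothed by the Jelinek-Mercer method.

According to Jelinek-Mercer smoothing,

\[ p_s(s_i \mid d, q) = (1 - \lambda)\, p_{ml}(s_i \mid d, q) + \lambda\, p(s_i \mid C, q), \qquad \alpha_d = \lambda \]

where λ is the smoothing parameter and p_ml(s_i|d,q) is the maximum likelihood estimate of p(s_i|d,q). Applying this smoothing to Equation 7 and Equation 8, we get the estimate:

\[
\begin{aligned}
\sum_{s_i \in S} p(s_i \mid d, q)
&= \sum_{s_i \in S \cap d} p_s(s_i \mid d, q) + \sum_{s_i \in S,\, s_i \notin d} \alpha_d\, p(s_i \mid C, q) \\
&= \sum_{s_i \in S \cap d} \left[ (1 - \lambda)\, p_{ml}(s_i \mid d, q) + \lambda\, p(s_i \mid C, q) \right] + \lambda \sum_{s_i \in S,\, s_i \notin d} p(s_i \mid C, q) \\
&= (1 - \lambda) \sum_{s_i \in S \cap d} p_{ml}(s_i \mid d, q) + \lambda \sum_{s_i \in S} p(s_i \mid C, q) \\
&= (1 - \lambda) \sum_{s_i \in S \cap d} p_{ml}(s_i \mid d, q) + \lambda
\end{aligned}
\tag{9}
\]

where the last step uses the fact that the collection model p(s_i|C,q) sums to one over the thesaurus S.

We use the co-occurrence of sentiment word s_i and query word q inside document d within a window W as the ranking measure of p_ml(s_i|d,q). Hence the sentiment score of a document d given by the opinion generation model is:

\[ Score_{I_{op}}(d, q, s) = (1 - \lambda) \sum_{s_i \in S \cap d} \frac{co(s_i, q \mid W)}{c(q, d) \cdot |W|} + \lambda \tag{10} \]

where co(s_i,q|W) is the frequency with which sentiment word s_i co-occurs with query q within window W, and c(q,d) is the query term frequency in the document.

3.4 Ranking function of generation model for opinion retrieval
Taking the topic-relevance rank (Equation 6) and the opinion-generation rank (Equation 10), we get the overall ranking function for the unified generation model:

\[
\begin{aligned}
p(d \mid q, s) &\stackrel{rank}{=} Score_{I_{op}}(d, q, s) \cdot Score_{I_{rel}}(d, q) \\
&= \left( (1 - \lambda) \sum_{s_i \in S \cap d} \frac{co(s_i, q \mid W)}{c(q, d) \cdot |W|} + \lambda \right) \cdot Score_{I_{rel}}(d, q) \\
&\stackrel{rank}{=}
\begin{cases}
\left( 1 + \lambda' \cdot TFCO(s, q, W) \right) \cdot Score_{I_{rel}}(d, q), & \text{if } \lambda' \neq 0 \\
Score_{I_{rel}}(d, q), & \text{if } \lambda' = 0
\end{cases}
\end{aligned}
\tag{11}
\]

\[ \text{where } \lambda' = \frac{1 - \lambda}{\lambda}, \qquad TFCO(s, q, W) = \sum_{s_i \in S \cap d} \frac{co(s_i, q \mid W)}{c(q, d) \cdot |W|} \]

Notice that this ranking function is not the precise quantitative estimation of p(d|q,s), because the proportion factor 1/|S| in the opinion-generation rank is ignored. But this factor has no effect on document ranking, and hence the approximation is order-preserving.

In this ranking function, we directly use the co-occurrence frequency as the factor to estimate the generation probability p_ml(s_i|d,q). But as mentioned in section 3.3, the number of query terms is generally relatively small, such as 1 or 2, while the size of the sentiment thesaurus is really large, e.g. several thousand or even tens of thousands of words. In order to reduce the impact of this imbalance, logarithm normalization is applied to the opinion ranking.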
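The scoring steps of Sections 3.3 and 3.4 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the co-occurrence counts are supplied by the caller, λ is taken to lie in (0, 1], and the log-damped variant uses the natural logarithm of TFCO + 1, where the base of the logarithm is an assumption.

```python
import math


def tfco(cooccurrence_counts, query_tf, window_size):
    """TFCO(s, q, W): summed co-occurrence counts over c(q, d) * |W|.

    cooccurrence_counts maps each sentiment word found in the document
    to co(s_i, q | W).
    """
    if query_tf == 0 or window_size == 0:
        return 0.0
    return sum(cooccurrence_counts.values()) / (query_tf * window_size)


def opinion_score(tfco_value, lam):
    """Jelinek-Mercer smoothed opinion score, Equation 10."""
    return (1.0 - lam) * tfco_value + lam


def unified_rank(score_rel, tfco_value, lam, use_log=False):
    """Rank-equivalent unified score in the form of Equation 11.

    With use_log=True the co-occurrence evidence is damped as
    log(TFCO + 1), the logarithm normalization described in the text.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must be in (0, 1]")
    lam_prime = (1.0 - lam) / lam
    evidence = math.log(tfco_value + 1.0) if use_log else tfco_value
    return (1.0 + lam_prime * evidence) * score_rel


# Toy values (assumed): three co-occurrences, query term seen twice,
# window of three tokens.
toy_tfco = tfco({"love": 2, "terrible": 1}, query_tf=2, window_size=3)
plain = unified_rank(score_rel=2.0, tfco_value=toy_tfco, lam=0.6)
damped = unified_rank(score_rel=2.0, tfco_value=toy_tfco, lam=0.6, use_log=True)
```

Dividing the product of Equation 10 and the relevance score by λ gives exactly the first case of Equation 11, which is why the two forms rank documents identically for any fixed λ > 0.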
With this logarithm normalization, the ranking function becomes:

\[
p(d \mid q, s) \stackrel{rank}{=}
\begin{cases}
\left[ 1 + \lambda' \cdot \log\left( TFCO(s, q, W) + 1 \right) \right] \cdot Score_{I_{rel}}(d, q), & \text{if } \lambda' \neq 0 \\
Score_{I_{rel}}(d, q), & \text{if } \lambda' = 0
\end{cases}
\tag{12}
\]

\[ \text{where } \lambda' = \frac{1 - \lambda}{\lambda}, \qquad TFCO(s, q, W) = \sum_{s_i \in S \cap d} \frac{co(s_i, q \mid W)}{c(q, d) \cdot |W|} \]

The experimental analysis of this logarithm relationship is given in section 5.3, which shows the effectiveness of the normalization.

4. EXPERIMENTAL SETUP

4.1 Data set
We test our opinion retrieval model on the TREC Blog06 and Blog07 corpus [12, 26], which is the most authoritative opinion retrieval dataset available to date. The corpus was collected from 100,649 blogs over a period of two and a half months. We focus on retrieving permalinks from this dataset since human evaluation results are only available for these documents. There are 50 topics (Topics 851~900) from the TREC 2006 blog opinion retrieval task and 50 topics (Topics 901~950) from the TREC 2007 blog track. Query terms are extracted from the title field using Porter stemming and standard stop word removal.

Generally, queries from Blog 06 are used for the parameter comparison study, including the selection of the sentiment thesaurus, the window size, and the effectiveness of different models. Queries from Blog 07 are used as the test set, where all parameters have been tuned on the Blog 06 data and no modification is made.

4.2 Evaluation
To make the experiments applicable to real-world applications and comparable to TREC evaluations, only short queries are used. The evaluation metrics are general IR measures, i.e. mean average precision (MAP), R-Precision (R-prec), and precision at the top 10 results (p@10).

In total, three approaches are compared in our experiments.

(1) General linear combination (shown as Linear Comb.):

\[ p(d \mid q, s) \stackrel{rank}{=} (1 - \lambda)\, Score_{I_{op}}(d, q, s) + \lambda\, Score_{I_{rel}}(d, q) \]

where Score_Iop(d,q,s) and Score_Irel(d,q) are computed in the same way as in Equation 11.

(2) Our proposed generation model with Jelinek-Mercer smoothing (shown as Generation Model). See Equation 11.

(3) Our proposed generation model with Jelinek-Mercer smoothing and logarithm normalization (shown as Generation, log).
See Equation 12.

4.3 Selection of Sentimental Lexicon
For lexicon-based opinion detection methods, the selection of the opinion thesaurus plays an important role. There are several public online dictionaries from the area of linguistics, such as WordNet [18] and General Inquirer [14]. We follow the general approach [6] of selecting a small seed list of sentiment words from WordNet and then incrementally enlarging the list with synonyms and antonyms. Another option is to rely on a self-constructed dictionary. Wilson et al. [17] manually selected 882 words as their sentiment lexicon, and it has been used in some other works. Esuli and Sebastiani [3] scored each word in WordNet regarding its positive, negative and neutral indications to obtain the SentiWordNet lexicon. Words with a positive or negative score above a threshold in SentiWordNet are used by some participants of the TREC opinion retrieval task. Furthermore, we seek help from other languages. HowNet [1] is a knowledge database of the Chinese language, and some of the words in the dictionary are marked as positive or negative. We use the English translations of those sentiment words provided by HowNet.

For comparison, sentiment words from HowNet, WordNet, General Inquirer and SentiWordNet are used as lexicons respectively. Table 1 shows detailed information on the lists.

Table 1. Sentiment thesauruses used in our experiments
No. | Thesaurus Name | Size | Description
1 | HowNet | - | English translation of positive/negative Chinese words
2 | WordNet | - | Selected words from WordNet
3 | Intersection | - | Words appearing in both 1 and 2
4 | Union | - | Words appearing in either 1 or 2
5 | General Inquirer | - | Words in the positive and negative categories
6 | SentiWordNet | - | Words with a positive or negative score above a threshold

5. EXPERIMENTAL RESULTS AND DISCUSSION

5.1 Effectiveness of Sentimental Lexicons
The retrieval performance under different sentiment thesauruses is presented in Figure 1. The cross-language HowNet dictionary performs better than all other candidates and is quite insensitive to the smoothing parameter. SentiWordNet and the Intersection thesaurus perform next best and close to each other. General Inquirer has the worst result.

There might be two reasons for the better performance of the words from HowNet compared with those from WordNet. First, the list generated from WordNet might lack diversity, since the words come from a limited set of initial seeds and only synonyms and antonyms are taken into consideration. Second, the English translations of the Chinese sentiment words are annotated by non-native speakers; hence most of them are common and popular terms, which are generally used in the Web environment.
Since the performance of SentiWordNet and HowNet shows no big difference when λ is higher, and SentiWordNet is openly available on the Internet, we choose SentiWordNet as the sentiment thesaurus in the following experiments to make them easier for other researchers to repeat.

Figure 1. MAP-λ curves with different thesauruses. (Blog 06)

5.2 Selection of Window Size
It is intuitive that opinion modifiers are less likely to be related to an object far away from them than to one close by in the text. Thus during the opinion term matching process, a proximity window is often used to restrict the valid distance between the sentiment words and the topic words. However, there is no consensus on how close the two types of words should be to each other, and this threshold is often set by hand with various justifications. In previous work, window sizes that represent the length of direct modification (e.g. 3 [11]), a sentence [10, 22] (e.g. 10~20), a paragraph (e.g. 30~50 [11]), or the whole document [6] have been used. We test retrieval performance under each of these settings to illustrate how this factor influences the opinion retrieval ability of our model. The result is given in Figure 2.

Figure 2. MAP vs. window size with different λ. (Blog 06)
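The window constraint discussed in this section can be illustrated with a small Python sketch. It is only one possible reading: the function name is hypothetical, the window is interpreted as a maximum token distance between a sentiment word and the nearest query term, and window=None stands for the whole document. The text does not fix these details, so they are assumptions.

```python
from collections import Counter


def cooccurrence_counts(tokens, query_terms, lexicon, window=None):
    """Count sentiment words that co-occur with a query term.

    A sentiment word occurrence is counted when some query term lies at
    most `window` tokens away; window=None uses the full document.
    """
    query_positions = [i for i, t in enumerate(tokens) if t in query_terms]
    counts = Counter()
    if not query_positions:
        return counts            # no query term, so nothing can co-occur
    for i, token in enumerate(tokens):
        if token not in lexicon:
            continue
        if window is None or any(abs(i - j) <= window for j in query_positions):
            counts[token] += 1
    return counts


# Toy sentence and lexicon (assumed values).
toy_tokens = "i love the new tivo but the remote is terrible".split()
toy_query = {"tivo"}
toy_lexicon = {"love", "terrible"}

narrow = cooccurrence_counts(toy_tokens, toy_query, toy_lexicon, window=3)
full = cooccurrence_counts(toy_tokens, toy_query, toy_lexicon, window=None)
```

In this toy sentence the narrow window keeps only "love", while the full-document window also keeps "terrible", which sits five tokens away from the query term.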

It is clear that the larger the window is, the better the performance. This tendency is invariant to different levels of smoothing. The result is reasonable since the distance between a query term and a sentiment word is generally used to indicate the relevance of the opinion to the topic, which has already been taken into consideration in this unified model by the quadratic combination with topic relevance. Moreover, in Web documents the opinion words may not always be located near the topic words. Therefore, we set the full document as the default window size in the following experiments.

5.3 Opinion Retrieval Model Comparison
Three opinion ranking formulas are tested in our experiments. Their performance is compared in Figure 3. We can see that the generation model is more effective than linear combination, especially when mild smoothing is performed. As the value of λ goes up, desired documents with only a few opinion terms are deprived of the discriminative ability contained in their opinion expressions, as this part of the probability is discounted to the whole document collection. The Generation, log model overcomes this problem and gives the best retrieval performance under all values of λ. This demonstrates the usefulness of our log-smoothing approach in the setting of opinion search. In addition, all three ranking schemes perform equivalent to or better than the best run at TREC 2006, owing to the careful selection of sentiment thesaurus and window size discussed above.

To further demonstrate the effectiveness of our opinion retrieval model, a comparison of opinion MAP with previous work is given in Table 2. The performance improvement after opinion re-ranking is shown in Figure 4 as precision-recall curves.

Figure 3. MAP-λ curves for different opinion ranking formulas.

Figure 4. Precision-recall curves before and after opinion re-ranking of the top 1000 relevant documents.

Table 2. Comparison of opinion retrieval performance
Data Set | Method | MAP | R-Prec | P@10
Blog 06 | Best run at Blog 06 | - | - | -
Blog 06 | Best title-run at Blog 06 | - | - | -
Blog 06 | Our Relevance Baseline | - | - | -
Blog 06 | Our Unified Model | - | - | -
Blog 07 | Most improvement at Blog 07 | - | 8.6% | 2.6%
Blog 07 | Our Relevance Baseline | - | - | -
Blog 07 | Our Unified Model * | - | - | -
Blog 07 | Improvement | 28.1% | 9.9% | 40.3%
*: on Blog 07 data, the same parameters as on Blog 06 data are used: λ=0.6, window=full, thesaurus: SentiWordNet. All our approaches use title-only runs.

In Figure 5, the per-topic gains in opinion MAP and p@10 are visualized on the Blog 07 data set. Notice that no characteristic of blog data has been used in this work, such as feature extraction, blog spam filtering, or processing of blog feeds and comments. In terms of MAP, 6 of the 50 topics receive an improvement of more than 50%, while only 5 topics result in a minor performance loss. The few topics that benefit the most from opinion re-ranking, such as topic 92 (44%) and topic 928 (35%), are those where only a few documents with relevant opinions are retrieved and ranked low in the first stage. Only 4 topics' performance decreases a little (less than 40%). In terms of p@10, even more significant results are obtained. Three topics get more than 200% improvement, such as topic 946 (+900%), and only 6 topics show a small drop in performance.

Table 3 gives detailed descriptions of two topics in Blog 06 and Blog 07. We can see that our re-ranking procedure successfully rescores almost all the target documents into the top 100 results. This shows our formula to be highly accurate in discriminating a few subjective texts from a large amount of factual description.
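The per-topic figures discussed above rest on average precision, precision at a cutoff, and the relative gain between two runs. A minimal sketch of these quantities is given below; the function names and the toy ranking are illustrative assumptions, and official TREC scores are computed with the track's own evaluation tooling rather than with code like this.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank holding a relevant document."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def relative_gain(before, after):
    """Relative improvement of `after` over `before` (0.5 means +50%)."""
    return (after - before) / before


# Toy topic (assumed): the two opinionated documents start at ranks 3 and 4
# and are moved to ranks 1 and 2 by re-ranking.
toy_relevant = {"d3", "d4"}
ap_before = average_precision(["d1", "d2", "d3", "d4"], toy_relevant)
ap_after = average_precision(["d3", "d4", "d1", "d2"], toy_relevant)
toy_gain = relative_gain(ap_before, ap_after)
```

MAP is the mean of average_precision over all topics, so a topic whose few relevant documents move from low ranks to the top can show a very large relative gain.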

Figure 5. Per-topic analysis: performance improvement over 50 topics after re-ranking on Blog 07 data. (a) MAP improvement, (b) p@10 improvement (in (b), the three topics whose improvement is much higher than the figure's upper bound have been annotated individually).

Table 3. Details of the best re-ranked topic examples

Topic: TREC 06-895 | Title: Oprah | Description: Find opinions about Oprah Winfrey's TV show
Run | MAP | Prec@10 | Prec@30 | Prec@100 | Prec@1000
Before re-ranking | 0.0687 | - | - | - | -
After re-ranking | - | - | - | - | -

Topic: TREC | Title: tivo | Description: Find opinions about TIVO brand digital video recorders
Run | MAP | Prec@10 | Prec@30 | Prec@100 | Prec@1000
Before re-ranking | - | - | - | - | -
After re-ranking | - | - | - | - | -

6. CONCLUSION AND FUTURE WORK
In this work we deal with the problem of opinion search towards general topics. Contrary to previous approaches that view fact retrieval and opinion detection as two distinct parts to be linearly combined, we propose a formal probabilistic generation model to unify the topic relevance score and the opinion score. A couple of opinion re-ranking formulas are derived using the language modeling approach with smoothing, together with a logarithm normalization paradigm. Furthermore, the effectiveness of different sentiment lexicons and of varying distances between sentiment words and query terms is compared and discussed empirically. Experiments show that bigger windows are better than smaller ones. According to the experiments, the proposed model yields much better results on the TREC Blog06 and Blog07 datasets.

The novelty of our work lies in a probabilistic generation model for opinion retrieval, which is general in motivation and flexible in practice. This work derives a unified model from the quadratic relation between opinion analysis and topic relevance, which is essentially different from general linear combination. Furthermore, in this work we do not make any assumption about the nature of blog-structured text. Therefore this approach is expected to generalize to all kinds of resources for the opinion retrieval task.

Future directions in opinion retrieval may go beyond merely document re-ranking. An opinion-oriented index, as well as deeper analysis of the structural information of opinion resources such as blogs and forums, could be helpful in understanding the nature of opinion-expressing behavior on the web. Another interesting topic is to automatically construct a collection-based sentiment lexicon, which has been a hot research topic [26], and to introduce this lexicon into our generation model.

7. REFERENCES
[1] Dong, Z. HowNet.
[2] Eguchi, K. and Lavrenko, V. Sentiment Retrieval using Generative Models. In Proceedings of Empirical Methods on Natural Language Processing (EMNLP) 2006.
[3] Esuli, A. and Sebastiani, F. Determining the semantic orientation of terms through gloss classification. In Proceedings of CIKM 2005.
[4] Hurst, M. and Nigam, K. Retrieving Topical Sentiments from Online Document Collections. Document Recognition and Retrieval XI.
[5] Lafferty, J. and Zhai, C. Probabilistic relevance models based on document and query generation. Language Modeling and Information Retrieval, Kluwer International Series on Information Retrieval, Vol. 13.
[6] Liao, X., Cao, D., Tan, S., Liu, Y., Ding, G., and Cheng, X. Combining Language Model with Sentiment Analysis for Opinion Retrieval of Blog-Post. Online Proceedings of Text Retrieval Conference (TREC).
[7] Liu, B., Hu, M., and Cheng, J. Opinion observer: analyzing and comparing opinions on the Web. WWW 2005.
[8] Mei, Q., Ling, X., Wondra, M., Su, H., and Zhai, C. Topic sentiment mixture: modeling facets and opinions in weblogs. WWW 2007: 171-180.
[9] Metzler, D., Strohman, T., Turtle, H., and Croft, W.B. Indri at TREC 2004: Terabyte Track. Online Proceedings of 2004 Text REtrieval Conference (TREC 2004), 2004.
[10] Mishne, G. Multiple Ranking Strategies for Opinion Retrieval in Blogs. Online Proceedings of TREC.
[11] Oard, D., Elsayed, T., Wang, J., and Wu, Y. TREC-2006 at Maryland: Blog, Enterprise, Legal and QA Tracks. Online Proceedings of TREC.
[12] Ounis, I., de Rijke, M., Macdonald, C., Mishne, G., and Soboroff, I. Overview of the TREC 2006 Blog Track. In Proceedings of TREC 2006.
[13] Pang, B., et al. Thumbs up? Sentiment Classification Using Machine Learning Techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2002.
[14] Stone, P., Dunphy, D., Smith, M., and Ogilvie, D. The General Inquirer: A Computer Approach to Content Analysis. MIT Press, Cambridge, 1966.
[15] Tong, R. An Operational System for Detecting and Tracking Opinions in on-line discussion. SIGIR Workshop on Operational Text Classification.
[16] Turtle, H. and Croft, W.B. Evaluation of an Inference Network-Based Retrieval Model. ACM Transactions on Information Systems, 9(3), 187-222, 1991.
[17] Wilson, T., Wiebe, J., and Hoffmann, P. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. In Proceedings of HLT/EMNLP.
[18] WordNet.
[19] Yang, K., Yu, N., Valerio, A., and Zhang, H. WIDIT in TREC Blog track. Online Proceedings of TREC.
[20] Zhai, C. and Lafferty, J. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (ACM TOIS), Vol. 22, No. 2.
[21] Zhai, C. A Brief Review of Information Retrieval Models. Technical report, Dept. of Computer Science, UIUC, 2007.
[22] Zhang, W. and Yu, C. UIC at TREC 2006 Blog Track. Online Proceedings of TREC.
[23] Mishne, G. and Glance, N. Leave a Reply: An analysis of Weblog Comments. In WWE 2006 (WWW 2006 Workshop on Weblogging Ecosystem).
[24] Mishne, G. Using blog properties to improve retrieval. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM).
[25] Singhal, A. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-43, 2001.
[26] Macdonald, C. and Ounis, I. Overview of the TREC-2007 Blog Track. Online Proceedings of the 16th Text Retrieval Conference (TREC 2007).
