Language-specific Models in Multilingual Topic Tracking


Leah S. Larkey, Fangfang Feng, Margaret Connell, Victor Lavrenko
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts
Amherst, MA
{larkey, feng, connell, lavrenko}@cs.umass.edu

ABSTRACT
Topic tracking is complicated when the stories in the stream occur in multiple languages. Typically, researchers have trained only English topic models because the training stories have been provided in English. In tracking, non-English test stories are then machine translated into English to compare them with the topic models. We propose a native language hypothesis stating that comparisons would be more effective in the original language of the story. We first test and support the hypothesis for story link detection. For topic tracking the hypothesis implies that it should be preferable to build separate language-specific topic models for each language in the stream. We compare different methods of incrementally building such native language topic models.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Indexing methods, Linguistic processing.

General Terms: Algorithms, Experimentation.

Keywords: classification, crosslingual, Arabic, TDT, topic tracking, multilingual

1. INTRODUCTION
Topic detection and tracking (TDT) is a research area concerned with organizing a multilingual stream of news broadcasts as it arrives over time. TDT investigations sponsored by the U.S. government include five different tasks: story link detection, clustering (topic detection), topic tracking, new event (first story) detection, and story segmentation. The present research focuses on topic tracking, which is similar to filtering in information retrieval. Topics are defined by a small number of (training) stories, typically one to four, and the task is to find all the stories on those topics in the incoming stream.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR 04, July 25-29, 2004, Sheffield, South Yorkshire, UK. Copyright 2004 ACM /04/0007 $5.00.

TDT evaluations have included stories in multiple languages since TDT2, which contained stories in English and Mandarin. TDT3 and TDT4 contained stories in English, Mandarin, and Arabic. Machine translations into English for all non-English stories were provided, allowing participants to ignore issues of story translation.

All TDT tasks have at their core a comparison of two text models. In story link detection, the simplest case, the comparison is between pairs of stories, to decide whether given pairs of stories are on the same topic or not. In topic tracking, the comparison is between a story and a topic, which is often represented as a centroid of story vectors, or as a language model covering several stories.

Our focus in this research was to explore the best ways to compare stories and topics when stories are in multiple languages. We began with the hypothesis that if two stories originated in the same language, it would be best to compare them in that language, rather than translating them both into another language for comparison. This simple assertion, which we call the native language hypothesis, is easily tested in the TDT story link detection task.

The picture gets more complex in a task like topic tracking, which begins with a small number of training stories (in English) to define each topic. New stories from a stream must be placed into these topics. The streamed stories originate in different languages, but are also available in English translation. The translations have been performed automatically by machine translation algorithms, and are inferior to manual translations.
At the beginning of the stream, native language comparisons cannot be performed because there are no native language topic models (other than English). However, later in the stream, once non-English documents have been seen, one can base subsequent tracking on native-language comparisons, by adaptively training models for additional languages. There are many ways this adaptation could be performed, and we suspect that it is crucial for the first few non-English stories to be placed into topics correctly, to avoid building non-English models from off-topic stories.

Previous research in multilingual TDT has not attempted to compare the building of multiple language-specific models with single-language topic models, or to obtain native-language models through adaptation. The focus of most multilingual work in TDT, for example [2][12][13], has been to compare the efficacy of machine translation of test stories into a base language, with other means of translation. Although these researchers normalize scores for the source language, all story comparisons are done within the base language. This is also true in multilingual filtering, which is a similar task [14].

The present research is an exploration of the native language hypothesis for multilingual topic tracking. We first present results on story link detection, to support the native language hypothesis in a simple, understandable task. Then we present experiments that test the hypothesis in the topic tracking task. Finally we consider several different ways to adapt topic models to allow native language comparisons downstream.

Although these experiments were carried out in service of TDT, the results should equally apply to other domains which require the comparison of documents in different languages, particularly filtering, text classification and clustering.

2. EXPERIMENTAL SETUP
Experiments are replicated with two different data sets, TDT3 and TDT4, and two very different similarity functions - cosine similarity, and another based on relevance modeling, described in the following two sections. Cosine similarity can be seen as a basic default approach, which performs adequately, and relevance modeling is a state of the art approach which yields top-rated performance. Confirming the native-language hypothesis in both systems would show its generality.

In the rest of this section, we describe the TDT data sets, then we describe how story link detection and topic tracking are carried out in cosine similarity and relevance modeling systems. Next, we describe the multilingual aspects of the systems.

2.1 TDT3 Data
TDT data consist of a stream of news in multiple languages and from different media - audio from television, radio, and web news broadcasts, and text from newswires. Two forms of transcription are available for the audio stream. The first form comes from automatic speech recognition and includes transcription errors made by such systems. The second form is a manual transcription, which has few if any errors. The audio stream can also be divided into stories automatically or manually (so-called reference boundaries). For all the research reported here, we used manual transcriptions and reference boundaries.

The characteristics of the TDT3 data sets for story link detection and topic tracking are summarized in Tables 1-3.

Table 1: Number of stories in TDT3 Corpus
        English   Arabic   Mandarin   Total
TDT3    37,526    15,928   13,657     67,111

Table 2: Characteristics of TDT3 story link detection data sets
Number of topics: 8
Number of link pairs     Same topic   Different topic
English-English
Arabic-Arabic
Mandarin-Mandarin
English-Arabic
English-Mandarin
Arabic-Mandarin
Total                    ,995

Table 3: Characteristics of TDT3 topic tracking data sets
Number of topics
Num. test stories   On-topic   All     On-topic   All
English             ,          ,373
Arabic              ,          ,563
Mandarin            ,          ,568
Total               ,593,      ,434,

2.2 Story Representation and Similarity

2.2.1 Cosine similarity
To compare two stories for link detection, or a story with a topic model for tracking, each story is represented as a vector of terms with tf·idf term weights:

\[ a_i = tf \cdot \frac{\log\left((N + 0.5) / df\right)}{\log(N + 1)} \qquad (1) \]

where $tf$ is the number of occurrences of the term in the story, $N$ is the total number of documents in the collection, and $df$ is the number of documents containing the term. Collection statistics $N$ and $df$ are computed incrementally, based on the documents already in the stream within a deferral period after the test story arrives. The deferral period was 10 for link detection and 1 for topic tracking. For link detection, story vectors were pruned to the 1000 terms with the highest term weights.

The similarity of two (weighted, pruned) vectors $\vec{a} = a_1, \ldots, a_n$ and $\vec{b} = b_1, \ldots, b_m$ is the inner product between the two vectors, normalized by their lengths:

\[ Sim_{cos} = \frac{\sum_i a_i b_i}{\sqrt{\left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)}} \qquad (2) \]

If the similarity of two stories exceeds a yes/no threshold, the stories are considered to be about the same topic.

For topic tracking, a topic model is a centroid, an average of the vectors for the $N_t$ training stories. Topic models are pruned to 100 terms based on the term weights. Story vectors pruned to 100 terms are compared to centroids using equation (2). If the similarity exceeds a yes/no threshold, the story is considered on-topic.

2.2.2 Relevance modeling
Relevance modeling is a statistical technique for estimating language models from extremely small samples, such as queries [9]. If $Q$ is a small sample of text, and $C$ is a large collection of documents, the language model for $Q$ is estimated as:

\[ P(w \mid Q) = \sum_{d \in C} P(w \mid M_d) \, P(M_d \mid Q) \qquad (3) \]

A relevance model, then, is a mixture of language models $M_d$ of every document $d$ in the collection, where the document models are weighted by the posterior probability of producing the query, $P(M_d \mid Q)$.
The posterior probability is computed as:

\[ P(M_d \mid Q) = \frac{P(d) \prod_{q \in Q} P(q \mid M_d)}{\sum_{d' \in C} P(d') \prod_{q \in Q} P(q \mid M_{d'})} \qquad (4) \]

Equation (4) assigns the highest weights to documents that are most likely to have generated $Q$, and can be interpreted as nearest-neighbor smoothing, or a massive query expansion technique.

To apply relevance modeling to story link detection, we estimate the similarity between two stories $A$ and $B$ by pruning the stories to short queries, estimating relevance models for the queries, and measuring the similarity between the two relevance models. Each story is replaced by a query consisting of the ten words in the story with the lowest probability of occurring by chance in randomly drawing $|A|$ words from the collection $C$:

\[ P_{chance}(A, w) = \frac{\binom{C_w}{A_w} \binom{|C| - C_w}{|A| - A_w}}{\binom{|C|}{|A|}} \qquad (5) \]

where $|A|$ is the length of the story $A$, $A_w$ is the number of times word $w$ occurs in $A$, $|C|$ is the size of the collection, and $C_w$ is the number of times word $w$ occurs in $C$. Story relevance models are estimated using equation (4). Similarity between relevance models is measured using the symmetrized clarity-adjusted divergence [11]:

\[ Sim_{RM}(A, B) = \sum_w P(w \mid Q_A) \log \frac{P(w \mid Q_B)}{P(w \mid GE)} + \sum_w P(w \mid Q_B) \log \frac{P(w \mid Q_A)}{P(w \mid GE)} \qquad (6) \]

where $P(w \mid Q_A)$ is the relevance model estimated for story $A$, and $P(w \mid GE)$ is the background (General English, Arabic, or Mandarin) probability of $w$, computed from the entire collection of stories in the language within the same deferral period used for cosine similarity.

To apply relevance modeling to topic tracking, the asymmetric clarity-adjusted divergence is used:

\[ Sim_{track}(T, S) = \sum_w P(w \mid T) \log \frac{P(w \mid S)}{P(w \mid GE)} \qquad (7) \]

where $P(w \mid T)$ is a relevance model of the topic $T$. Because of computational constraints, smoothed maximum likelihood estimates rather than relevance models are used for the story model $P(w \mid S)$. The topic model, based on Equation (3), is:

\[ P(w \mid T) = \frac{1}{|S_t|} \sum_{d \in S_t} P(w \mid M_d) \qquad (8) \]

where $S_t$ is the set of training stories. The topic model is pruned to 100 terms. More detail about applying relevance models to TDT can be found in [2].

2.3 Evaluation
TDT tasks are evaluated as detection tasks. For each test trial, the system attempts to make a yes/no decision. In story link detection, the decision is whether the two members of a story pair belong to the same topic. In topic tracking, the decision is whether a story in the stream belongs to a particular topic.

In all tasks, performance is summarized in two ways: a detection cost function ($C_{Det}$) and a decision error tradeoff (DET) curve. Both are based on the rates of two kinds of errors a detection system can make: misses, in which the system gives a no answer where the correct answer is yes, and false alarms, in which the system gives a yes answer where the correct answer is no.

The DET curve plots the miss rate ($P_{Miss}$) as a function of false alarm rate ($P_{Fa}$), as the yes/no decision threshold is swept through its range.
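The threshold sweep behind a DET curve can be illustrated with a short sketch. It is a minimal example under stated assumptions rather than the evaluation software used for TDT: the function name `det_points` and its arguments are hypothetical, a trial is answered yes when its score strictly exceeds the threshold, and the rates are taken over a single pool of trials rather than averaged per topic.

```python
from typing import List, Tuple


def det_points(scores: List[float], is_target: List[bool]) -> List[Tuple[float, float, float]]:
    """Return (threshold, miss rate, false alarm rate) for each candidate threshold.

    A miss is a target trial answered "no"; a false alarm is a
    non-target trial answered "yes". The answer is "yes" when score > threshold.
    """
    n_targets = sum(is_target)
    n_nontargets = len(is_target) - n_targets
    if n_targets == 0 or n_nontargets == 0:
        raise ValueError("need at least one target and one non-target trial")

    # Sweep from "everything is yes" up through every observed score.
    thresholds = [float("-inf")] + sorted(set(scores))
    points = []
    for threshold in thresholds:
        misses = sum(1 for s, t in zip(scores, is_target) if t and s <= threshold)
        false_alarms = sum(1 for s, t in zip(scores, is_target) if not t and s > threshold)
        points.append((threshold, misses / n_targets, false_alarms / n_nontargets))
    return points


if __name__ == "__main__":
    for point in det_points([0.9, 0.8, 0.4, 0.2], [True, False, True, False]):
        print(point)
```

Plotting the second element of each tuple against the third gives one DET curve; the lowest threshold yields no misses and the highest yields no false alarms.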
$P_{Miss}$ and $P_{Fa}$ are computed for each topic, and then averaged across topics to yield topic-weighted curves. An example can be seen in Figure 1 below. Better performance is indicated by curves more to the lower left of the graph.

The detection cost function is computed for a particular threshold as follows:

\[ C_{Det} = C_{Miss} \cdot P_{Miss} \cdot P_{Target} + C_{Fa} \cdot P_{Fa} \cdot (1 - P_{Target}) \qquad (9) \]

where:

\[ P_{Miss} = \#Misses / \#Targets \]
\[ P_{Fa} = \#False\ Alarms / \#NonTargets \]

$C_{Miss}$ and $C_{Fa}$ are the costs of a missed detection and false alarm, respectively, and are specified for the application, usually at 10 and 1, penalizing misses more than false alarms. $P_{Target}$ is the a priori probability of finding a target, an item where the answer should be yes, and is set by convention to a fixed value.

The cost function is normalized:

\[ (C_{Det})_{Norm} = C_{Det} / \min\left(C_{Miss} \cdot P_{Target},\; C_{Fa} \cdot (1 - P_{Target})\right) \qquad (10) \]

and averaged over topics. Each point along the detection error tradeoff curve has a value of $(C_{Det})_{Norm}$. The minimum value found on the curve is known as the $\min(C_{Det})_{Norm}$. It can be interpreted as the value of $(C_{Det})_{Norm}$ at the best possible threshold. This measure allows us to separate performance on the task from the choice of yes/no threshold. Lower cost scores indicate better performance. More information about these measures can be found in [5].

2.4 Language-specific Comparisons
English stories were lower-cased and stemmed using the kstem stemmer [6]. Stop words were removed. For native Arabic comparisons, stories were converted from Unicode UTF-8 to Windows (CP1256) encoding, then normalized and stemmed with a light stemmer [7]. Stop words were removed. For native Mandarin comparisons, overlapping character bigrams were compared.

3. STORY LINK DETECTION
In this section we present experimental results for story link detection, comparing a native condition with an English baseline. In the English baseline, all comparisons are in English, using machine translation (MT) for Arabic and Mandarin stories. Corpus statistics are computed incrementally for all the English and translated-into-English stories.
In the Native condition, two stories originating in the same language are compared in that language. Corpus statistics are computed incrementally for the stories in the language of the comparison. Cross-language pairs in the native condition are compared in English using MT, as in the baseline.

Figure 1: DET curve for TDT3 link detection based on English versions of stories, or native language versions, for cosine and relevance model similarity

Table 4: Min$(C_{Det})_{Norm}$ for TDT3 story link detection
Similarity        English   Native
Cosine
Relevance Model

Figure 1 shows the DET curves for the TDT3 story link detection task, and Table 4 shows the minimum cost. The figure and table show that native language comparisons (dotted) consistently outperform comparisons based on machine-translated English (solid). This difference holds both for the basic cosine similarity system (first row) (black curves), and for the relevance modeling system (second row) (gray curves). These results support the general conclusion that when two stories originate in the same language, it is better to carry out similarity comparisons in that language, rather than translating them into a different language.

4. TOPIC TRACKING
In tracking, the system decides whether stories in a stream belong to predefined topics. Similarity is measured between a topic model and a story, rather than between two stories. The native language hypothesis for tracking predicts better performance if incoming stories are compared in their original language with topic models in that language, and worse performance if translated stories are compared with English topic models.

The hypothesis can only be tested indirectly, because Arabic and Mandarin training stories were not available for all tracking topics. In this first set of experiments, we chose to obtain native language training stories from the stream of test stories using topic adaptation, that is, gradual modification of topic models to incorporate test stories that fit the topic particularly well.

Adaptation begins with the topic tracking scenario described above in section 2.2, using a single model per topic based on a small set of training stories in English. Each time a story is compared to a topic model to determine whether it should be classed as on-topic, it is also compared to a fixed adaptation threshold $\theta_a = 0.5$ (not to be confused with the yes/no threshold mentioned in section 2.2.1). If the similarity score is greater than $\theta_a$, the story is added to the topic set, and the topic model recomputed.
For clarity, we use the phrase topic set to refer to the set of stories from which the topic model is built, which grows under adaptation. The training set includes only the original $N_t$ training stories for each topic. For cosine similarity, adaptation consists of computing a new centroid for the topic set and pruning to 100 terms. For relevance modeling, a new topic model is computed according to Equation (8). At most 100 stories are placed in each topic set.

We have just described global adaptation, in which stories are added to global topic models in English. Stories that originated in Arabic or Mandarin are compared and added in their machine-translated version.

Native adaptation differs from global adaptation in making separate topic models for each source language. To decide whether a test story should be added to a native topic set, the test story is compared in its native language with the native model, and added to the native topic set for that language if its similarity score exceeds $\theta_a$. The English version of the story is also compared to the global topic model, and if its similarity score exceeds $\theta_a$, it is added to the global topic set. (Global models continue to adapt for other languages which may not yet have a native model, or for smoothing, discussed later.)

At the start there are global topic models and native English topic models based on the training stories, but no native Arabic or Mandarin topic models. When there is not yet a native topic model in the story's original language, the translated story is compared to the global topic model. If the similarity exceeds $\theta_a$, the native topic model is initialized with the untranslated story.

Yes/no decisions for topic tracking can then be based on the untranslated story's similarity to the native topic model if one exists. If there is no native topic model yet for that language and topic, the translated story is compared to the global topic model.

We have described three experimental conditions: global adapted, native adapted, and a baseline. The baseline, described in Section 2.2, can also be called global unadapted. The baseline uses a single English model per topic based on the small set of training stories.
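The native and global adaptation procedure described above can be sketched for the cosine system as follows. This is a minimal illustration under stated assumptions, not the system used in the experiments: the names `TopicTracker`, `track`, `centroid` and the language keys are hypothetical, story vectors are assumed to be already weighted and pruned term dictionaries, and every story is assumed to arrive with both its native vector and the vector of its English (machine-translated) version. The sketch returns the tracking score and leaves the yes/no threshold to the caller.

```python
import math
from collections import defaultdict
from typing import Dict, List

Vector = Dict[str, float]

THETA_A = 0.5       # adaptation threshold from the text
MAX_TERMS = 100     # topic models are pruned to 100 terms
MAX_STORIES = 100   # at most 100 stories per topic set


def cosine(a: Vector, b: Vector) -> float:
    """Length-normalized inner product of two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values()) * sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Average the vectors and keep the MAX_TERMS heaviest terms."""
    total: Dict[str, float] = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    top = sorted(total.items(), key=lambda kv: kv[1], reverse=True)[:MAX_TERMS]
    return {term: weight / len(vectors) for term, weight in top}


class TopicTracker:
    """One topic: a global (English) model plus one native model per language."""

    def __init__(self, english_training: List[Vector]) -> None:
        # Global and native English models both start from the training stories.
        self.sets: Dict[str, List[Vector]] = {
            "global": list(english_training),
            "en": list(english_training),
        }
        self.models: Dict[str, Vector] = {k: centroid(v) for k, v in self.sets.items()}

    def _maybe_adapt(self, key: str, vec: Vector, score: float) -> None:
        """Add the story to a topic set and rebuild the model if score > THETA_A."""
        if score <= THETA_A:
            return
        topic_set = self.sets.setdefault(key, [])
        if len(topic_set) < MAX_STORIES:
            topic_set.append(vec)
            self.models[key] = centroid(topic_set)

    def track(self, lang: str, native_vec: Vector, english_vec: Vector) -> float:
        """Score one story, then adapt the native and global topic sets."""
        global_score = cosine(english_vec, self.models["global"])
        if lang in self.models:
            # A native model exists: compare in the original language.
            score = cosine(native_vec, self.models[lang])
            self._maybe_adapt(lang, native_vec, score)
        else:
            # No native model yet: fall back to the global model, and
            # initialize the native model with the untranslated story.
            score = global_score
            self._maybe_adapt(lang, native_vec, global_score)
        self._maybe_adapt("global", english_vec, global_score)
        return score
```

In this sketch a language acquires a native model only after one of its stories clears the adaptation threshold against the global model, which is why the first few non-English stories matter so much.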
A fourth possible condition, native unadapted, is problematic and not included here. There is no straightforward way to initialize native language topic models without adaptation when training stories are provided only in English.

Figure 2: DET curves for TDT3 tracking, cosine similarity (above) and relevance models (below), $N_t$=4 training stories, global unadapted baseline, global adapted, and native adapted

Table 5: Min$(C_{Det})_{Norm}$ for TDT3 topic tracking.
          Baseline   Adapted Global   Adapted Native   Baseline   Adapted Global   Adapted Native
Cosine
RM

The TDT3 tracking results on three conditions, replicated with the two different similarity measures (cosine similarity and relevance modeling) and two different training set sizes ($N_t$=2 and 4) can be seen in Table 5. DET curves for $N_t$=4 are shown in Figure 2, for cosine similarity (above) and relevance modeling (RM) (below).

Table 5 shows a robust adaptation effect for cosine and relevance model experiments, and for 2 or 4 training stories. Native and global adaptation are always better (lower cost) than baseline unadapted tracking. In addition, relevance modeling produces better results than cosine similarity. However, results do not show the predicted advantage for native adapted topic models over global adapted topic models. Only cosine similarity, $N_t$=4, seems to show the expected difference (shaded cells), but the difference is very small. The DET curve in Figure 2 shows no sign of a native language effect.

Table 6 shows minimum cost figures computed separately for English, Mandarin, and Arabic test sets. Only English shows a pattern similar to the composite results of Table 5 (see the shaded cells). For cosine similarity, there is not much difference between global and native English topic models. For relevance modeling, native English topic models are slightly worse than global models. Arabic and Mandarin appear to show a native language advantage for all cosine similarity conditions and most relevance model conditions. However, DET curves comparing global and native adapted models separately for English, Arabic, and Mandarin (Figure 3) show no real native language advantage.

Table 6: Min$(C_{Det})_{Norm}$ for TDT3 topic tracking; breakdown by original story language
                    Baseline   Adapted Global   Adapted Native   Baseline   Adapted Global   Adapted Native
English    Cosine
           RM
Arabic     Cosine
           RM
Mandarin   Cosine
           RM

Figure 3: DET curves for TDT3 tracking, cosine similarity, $N_t$=4 training stories, global adapted vs. native adapted breakdown for English, Arabic, and Mandarin

In trying to account for the discrepancy between the findings on link detection and tracking, we suspected that the root of the problem was the quality of native models for Arabic and Mandarin. For English, adaptation began with 2 or 4 on-topic stories. However, Mandarin and Arabic models did not begin with on-topic stories; they could begin with off-topic stories, which should hurt tracking performance. A related issue is data sparseness. When a native topic model is first formed, it is based on one story, which is a poorer basis for tracking than $N_t$ stories.

In the next three sections we pursue different aspects of these suspicions. In section 5 we perform a best-case experiment, initializing native topic sets with on-topic stories, and smoothing native scores with global scores to address the sparseness problem. If these conditions do not show a native language advantage, we would reject the native language hypothesis. In section 6 we explore the role of the adaptation threshold. In section 7 we compare some additional methods of initializing native language topic models.

5. ON-TOPIC NATIVE CENTROIDS
In this section, we consider a best-case scenario, where we take the first $N_t$ stories in each language relevant to each topic, to initialize adaptation of native topic models. While this is cheating, and not a way to obtain native training documents in a realistic tracking scenario, it demonstrates what performance can be attained if native training documents are available. More realistic approaches to adapting native topic models are considered in subsequent sections.

The baseline and global adapted conditions were carried out as in Section 4, and the native adapted condition was similar except in the way adaptation of native topics began. If there were not yet $N_t$ native stories in the topic set for the current test story in its native language, the story was added to the topic set if it was relevant. Once a native topic model had $N_t$ stories, we switched to the usual non-cheating mode of adaptation, based on similarity score and adaptation threshold.
To address the data sparseness problem, we also smoothed the native similarity scores with the global similarity scores:

\[ Sim_{smooth}(T, S) = \lambda \, Sim_{native}(T, S) + (1 - \lambda) \, Sim_{global}(T, S) \qquad (11) \]

The parameter $\lambda$ was not tuned, but set to a fixed value of 0.5.

The results can be seen in Table 7. Shaded cell pairs indicate confirmation of the native language hypothesis, where language-specific topic models outperform global models.

Table 7: Min$(C_{Det})_{Norm}$ for TDT3 topic tracking, using $N_t$ on-topic native training stories and smoothing native scores
          Baseline   Adapted Global   Adapted Native   Baseline   Adapted Global   Adapted Native
Cosine
Rel

Figure 4: DET curve for TDT3 tracking, initializing native adaptation with relevant training stories during adaptation, cosine similarity, $N_t$=4

Figure 4 shows the DET curves for the cosine, $N_t$=4 case. When the native models are initialized with on-topic stories, the advantage to native models is clearly seen in the tracking performance.

DET curves showing results computed separately for the three languages can be seen in Figure 5, for the cosine, $N_t$=4 case. It can be clearly seen that English tracking remains about the same but the Arabic and Mandarin native tracking show a large native language advantage.

6. ADAPTATION THRESHOLD
The adaptation threshold was set to 0.5 in the experiments described above without any tuning. The increase in global tracking performance after adaptation shows that the value is at least acceptable. However, an analysis of the details of native adaptation showed that many Arabic and Mandarin topics were not adapting. A summary of some of this analysis can be seen in Table 8.

Table 8: Number of topics receiving new stories during native adaptation, breakdown by language
                                            Topics receiving more stories
Similarity        N_t   Total Topics   English   Arabic   Mandarin
Cosine
Relevance Model

Fewer than a third of the topics received adapted stories. This means that for most topics, native tracking was based on the global models. In order to determine whether this was due to the adaptation threshold, we performed an experiment varying the adaptation threshold from .3 to .65 in steps of .05. The results can be seen in Figure 6, which shows the minimum cost, $\min(C_{Det})_{Norm}$, across the range of adaptation threshold values.
Although we see that the original threshold, .5, was not always the optimal value, it is also clear that the pattern we saw at .5 (and in Figure 6) does not change as the threshold is varied, that is, tracking with native topic models is not better than tracking with global models. An improperly tuned adaptation threshold was therefore not the reason that the native language hypothesis was not confirmed for tracking. We suspect that different adaptation thresholds may be needed for the different languages, but it would be better to handle this problem by language-specific normalization of similarity scores.

Figure 6: Effect of adaptation threshold on $\min(C_{Det})_{Norm}$ on TDT3 tracking with adaptation. Panels: Cosine Similarity and Relevance Model; x-axis: Threshold; y-axis: Min Cost; series: Global $N_t$=2, Global $N_t$=4, Native $N_t$=2, Native $N_t$=4.

Figure 5: DET curve for TDT3 tracking initializing native adaptation with relevant training stories during adaptation and smoothing, vs. global adaptation, cosine similarity, $N_t$=4, separate analyses for English, Arabic, and Mandarin.

7. IMPROVING NATIVE TOPIC MODELS
In the previous two sections we showed that when native topic models are initialized with language-specific training stories that are truly on-topic, then topic tracking is indeed better with native models than with global models. However, in the context of the TDT test situation, the way we obtained our language-specific training stories was cheating. In this section we experiment with 2 different legal ways to initialize better native language models:
(1) Use both global and native models, and smooth native similarity scores with global similarity scores.
(2) Initialize native models with dictionary or other translations of the English training stories into the other language.

Smoothing was carried out in the native adapted condition according to Equation (11), setting $\lambda$=0.5, without tuning. The comparison with unadapted and globally adapted tracking can be seen in Table 9. The smoothing improves the native topic model performance relative to unsmoothed native topic models (cf. Table 5), and brings the native model performance to roughly the same level as the global. In other words, smoothing improves performance, but we still do not have strong support for the native language hypothesis. This is apparent in Figure 7. Native adapted tracking is not better than global adapted tracking.

Table 9: Min$(C_{Det})_{Norm}$ for TDT3 topic tracking, smoothing native scores with global scores
          Baseline   Adapted Global   Adapted Native Smooth   Baseline   Adapted Global   Adapted Native Smooth
Cosine
RM

Training story translations into Arabic used an English/Arabic probabilistic dictionary derived from the Linguistic Data Consortium's UN Arabic/English parallel corpus, developed for our cross-language information retrieval work. Each English word has many different Arabic translations, each with a translation probability $p(a \mid e)$. The Arabic words, but not the English words, have been stemmed according to a light stemming algorithm. To translate an English story, English stop words were removed, and each English word occurrence was replaced by all of its dictionary translations, weighted by their translation probabilities. Weights were summed across all the occurrences of each Arabic word, and the resulting Arabic term vector was truncated to retain only terms above a threshold weight.

We translated training stories only into Arabic, because we did not have a method to produce good quality English to Mandarin translation.
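The dictionary-based story translation just described can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `translate_story`, the nested-dictionary layout of the probabilistic dictionary, and the `min_weight` parameter are hypothetical, the truncation threshold used in the experiments is not given in the text, and the romanized target terms in the usage example are placeholders rather than entries from the actual dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def translate_story(
    tokens: Iterable[str],
    dictionary: Dict[str, Dict[str, float]],
    stop_words: Set[str],
    min_weight: float,
) -> Dict[str, float]:
    """Map an English token stream to a weighted target-language term vector.

    dictionary[e][a] is the translation probability p(a | e).
    """
    weights: Dict[str, float] = defaultdict(float)
    for token in tokens:
        if token in stop_words:
            continue  # English stop words are removed before translation
        # Replace the occurrence by all of its translations, weighted by p(a | e),
        # summing weights over every occurrence of each target term.
        for target_term, prob in dictionary.get(token, {}).items():
            weights[target_term] += prob
    # Truncate: keep only terms whose summed weight is above the threshold.
    return {term: w for term, w in weights.items() if w > min_weight}


if __name__ == "__main__":
    toy_dictionary = {
        "earthquake": {"zilzal": 0.7, "hazza": 0.3},
        "turkey": {"turkiya": 0.9, "dik": 0.1},
    }
    story = ["the", "earthquake", "turkey", "earthquake"]
    print(translate_story(story, toy_dictionary, {"the"}, min_weight=0.2))
```

Words missing from the dictionary are simply dropped in this sketch; how the original system treated out-of-vocabulary words, such as proper names, is not stated in the text.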
The results for Arabic can be seen in Table 10. For translation, it makes sense to include an unadapted native condition, labeled translated in the table.

Table 10: Min$(C_{Det})_{Norm}$ for Arabic TDT3 topic tracking, initializing native topic models with dictionary-translated training stories
                          Unadapted                Adapted
                          Baseline   Translated    Global   Native
Arabic N_t=2   Cosine
               RM
Arabic N_t=4   Cosine
               RM

Figure 7: DET curve for TDT3 tracking with smoothing, cosine similarity, $N_t$=4 training stories

The final method of initializing topic models for different languages would be to translate the English training stories into the other languages required. We did not have machine translation from English into Arabic or Mandarin available for these experiments. However, we have had success with dictionary translations for Arabic. In [2] we found that dictionary translations from Arabic into English resulted in comparable performance to the machine translations on tracking, and better performance on link detection. Such translated stories would not be native language training stories, but might be a better starting point for language-specific adaptation anyway.

Figure 8: DET curve for TDT3 tracking initializing native topics with dictionary-translated training stories, cosine similarity, $N_t$=4, Arabic only

The results are mixed. First of all, this case is unusual in that adaptation does not improve translated models. Further analysis revealed that very little adaptation was taking place. Because of this lack of native adaptation, global adaptation consistently outperformed native adaptation here. However, in the unadapted conditions, translated training stories outperformed the global models for Arabic in three of the four cases - cosine $N_t$=4 and relevance models for $N_t$=2 and $N_t$=4 (the shaded baseline-translated pairs in Table 10).

The DET curve for the cosine $N_t$=4 case can be seen in Figure 8. The native unadapted curve is better (lower) than the global unadapted curve. The translated stories were very different from the test stories, so their similarity scores almost always fell below the adaptation threshold. We believe the need to normalize scores between native stories and dictionary translations is part of the problem, but we also need to investigate the compatibility of the dictionary translations with the native Arabic stories.

8. CONCLUSIONS
We have confirmed the native language hypothesis for story link detection. For topic tracking, the picture is more complicated. When native language training stories are available, good native language topic models can be built for tracking stories in their original language. Smoothing the native models with global models improves performance slightly. However, if training stories are not available in the different languages, it is difficult to form native models, by adaptation or by translation of training stories, that perform better than the adapted global models.

Why should language-specific comparisons be more accurate than comparisons based on machine translation? Machine translations are not always good translations. If the translation distorts the meaning of the original story, it is unlikely to be similar to the topic model, particularly if proper names are incorrect, or spelled differently in the machine translations than they are in the English training stories, a common problem in English translations from Mandarin or Arabic. Secondly, even if the translations are correct, the choice of words, and hence the language models, are likely to be different across languages. The second problem could be handled by normalizing for source language, as in [12]. But normalization cannot compensate for poor translation.

We were surprised that translating the training stories into Arabic to make Arabic topic models did not improve tracking, but again, our dictionary-based translations of the topic models were different from native Arabic stories.
We intend to try the same experiment with manual translations of the training stories into Arabic and Mandarin. We are also planning to investigate the best way to normalize scores for different languages. When TDT4 relevance judgments are available we intend to replicate some of these experiments on TDT4 data.

9. ACKNOWLEDGMENTS
This work was supported in part by the Center for Intelligent Information Retrieval and in part by SPAWARSYSCEN-SD grant number N. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.

10. REFERENCES
[1] Allan, J. Introduction to topic detection and tracking. In Topic detection and tracking: Event-based information organization, J. Allan (ed.): Kluwer Academic Publishers, 1-16.
[2] Allan, J., Bolivar, A., Connell, M., Cronen-Townsend, S., Feng, A., Feng, F., Kumaran, G., Larkey, L., Lavrenko, V., Raghavan, H. UMass TDT 2003 Research Summary. In Proceedings of TDT 2003 evaluation, unpublished.
[3] Chen, H.-H. and Ku, L. W. An NLP & IR approach to topic detection. In Topic detection and tracking: Event-based information organization, J. Allan (ed.). Boston, MA: Kluwer.
[4] Chen, Y.-J. and Chen, H.-H. NLP and IR approaches to monolingual and multilingual link detection. Presented at Proceedings of 19th International Conference on Computational Linguistics, Taipei, Taiwan.
[5] Fiscus, J. G. and Doddington, G. R. Topic detection and tracking evaluation overview. In Topic detection and tracking: Event-based information organization, J. Allan (ed.). Boston, MA: Kluwer, 17-32.
[6] Krovetz, R. Viewing morphology as an inference process. In Proceedings of SIGIR '93.
[7] Larkey, Leah S. and Connell, Margaret E. (2003) Structured Queries, Language Modeling, and Relevance Modeling in Cross-Language Information Retrieval. To appear in Information Processing and Management Special Issue on Cross Language Information Retrieval.
[8] Larkey, L. S., Ballesteros, L., and Connell, M. E. Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of SIGIR 2002.
[9] Lavrenko, V. and Croft, W. B. Relevance-based language models. In Proceedings of SIGIR. New Orleans: ACM.
[10] Lavrenko, V. and Croft, W. B. Relevance models in information retrieval. In Language modeling for information retrieval, W. B. Croft and J. Lafferty (eds.). Boston: Kluwer, 11-56.
[11] Lavrenko, V., Allan, J., DeGuzman, E., LaFlamme, D., Pollard, V., and Thomas, S. Relevance models for topic detection and tracking. In Proceedings of the Conference on Human Language Technology.
[12] Leek, T., Schwartz, R. M., and Sista, S. Probabilistic approaches to topic detection and tracking. In Topic detection and tracking: Event-based information organization, J. Allan (ed.). Boston, MA: Kluwer, 67-83.
[13] Levow, G.-A. and Oard, D. W. Signal boosting for translingual topic tracking: Document expansion and n-best translation. In Topic detection and tracking: Event-based information organization, J. Allan (ed.). Boston, MA: Kluwer.
[14] Oard, D. W. Adaptive vector space text filtering for monolingual and cross-language applications. PhD dissertation, University of Maryland, College Park.
