Language-specific Models in Multilingual Topic Tracking


Leah S. Larkey, Fangfang Feng, Margaret Connell, Victor Lavrenko
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts
Amherst, MA
{larkey, feng, connell, lavrenko}@cs.umass.edu

ABSTRACT
Topic tracking is complicated when the stories in the stream occur in multiple languages. Typically, researchers have trained only English topic models because the training stories have been provided in English. In tracking, non-English test stories are then machine translated into English to compare them with the topic models. We propose a native language hypothesis stating that comparisons would be more effective in the original language of the story. We first test and support the hypothesis for story link detection. For topic tracking the hypothesis implies that it should be preferable to build separate language-specific topic models for each language in the stream. We compare different methods of incrementally building such native language topic models.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Indexing methods, Linguistic processing.

General Terms: Algorithms, Experimentation.

Keywords: classification, crosslingual, Arabic, TDT, topic tracking, multilingual

1. INTRODUCTION
Topic detection and tracking (TDT) is a research area concerned with organizing a multilingual stream of news broadcasts as it arrives over time. TDT investigations sponsored by the U.S. government include five different tasks: story link detection, clustering (topic detection), topic tracking, new event (first story) detection, and story segmentation. The present research focuses on topic tracking, which is similar to filtering in information retrieval. Topics are defined by a small number of (training) stories, typically one to four, and the task is to find all the stories on those topics in the incoming stream.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR '04, July 25-29, 2004, Sheffield, South Yorkshire, UK. Copyright 2004 ACM /04/0007 $5.00.

TDT evaluations have included stories in multiple languages since TDT2, which contained stories in English and Mandarin. TDT3 and TDT4 contained stories in English, Mandarin, and Arabic. Machine translations into English for all non-English stories were provided, allowing participants to ignore issues of story translation.

All TDT tasks have at their core a comparison of two text models. In story link detection, the simplest case, the comparison is between pairs of stories, to decide whether given pairs of stories are on the same topic or not. In topic tracking, the comparison is between a story and a topic, which is often represented as a centroid of story vectors, or as a language model covering several stories.

Our focus in this research was to explore the best ways to compare stories and topics when stories are in multiple languages. We began with the hypothesis that if two stories originated in the same language, it would be best to compare them in that language, rather than translating them both into another language for comparison. This simple assertion, which we call the native language hypothesis, is easily tested in the TDT story link detection task.

The picture gets more complex in a task like topic tracking, which begins with a small number of training stories (in English) to define each topic. New stories from a stream must be placed into these topics. The streamed stories originate in different languages, but are also available in English translation. The translations have been performed automatically by machine translation algorithms, and are inferior to manual translations.
At the beginning of the stream, native language comparisons cannot be performed because there are no native language topic models (other than English). However, later in the stream, once non-English documents have been seen, one can base subsequent tracking on native-language comparisons, by adaptively training models for additional languages. There are many ways this adaptation could be performed, and we suspect that it is crucial for the first few non-English stories to be placed into topics correctly, to avoid building non-English models from off-topic stories.

Previous research in multilingual TDT has not attempted to compare the building of multiple language-specific models with single-language topic models, or to obtain native-language models through adaptation. The focus of most multilingual work in TDT, for example [2][12][13], has been to compare the efficacy of machine translation of test stories into a base language with other means of translation. Although these researchers normalize scores for the source language, all story comparisons are done within the base language. This is also true in multilingual filtering, which is a similar task [14].

The present research is an exploration of the native language hypothesis for multilingual topic tracking. We first present results on story link detection, to support the native language hypothesis in a simple, understandable task. Then we present experiments that test the hypothesis in the topic tracking task. Finally we consider several different ways to adapt topic models to allow native language comparisons downstream.

Although these experiments were carried out in service of TDT, the results should equally apply to other domains which require the comparison of documents in different languages, particularly filtering, text classification and clustering.

2. EXPERIMENTAL SETUP
Experiments are replicated with two different data sets, TDT3 and TDT4, and two very different similarity functions - cosine similarity, and another based on relevance modeling, described in the following two sections. Cosine similarity can be seen as a basic default approach, which performs adequately, and relevance modeling is a state of the art approach which yields top-rated performance. Confirming the native-language hypothesis in both systems would show its generality.

In the rest of this section, we describe the TDT data sets, then we describe how story link detection and topic tracking are carried out in cosine similarity and relevance modeling systems. Next, we describe the multilingual aspects of the systems.

2.1 TDT3 Data
TDT data consist of a stream of news in multiple languages and from different media - audio from television, radio, and web news broadcasts, and text from newswires. Two forms of transcription are available for the audio stream. The first form comes from automatic speech recognition and includes transcription errors made by such systems. The second form is a manual transcription, which has few if any errors. The audio stream can also be divided into stories automatically or manually (so-called reference boundaries). For all the research reported here, we used manual transcriptions and reference boundaries. The characteristics of the TDT3 data sets for story link detection and topic tracking are summarized in Tables 1-3.

Table 1: Number of stories in TDT3 Corpus
          English   Arabic   Mandarin   Total
TDT3      37,526    15,928   13,657     67,111

Table 2: Characteristics of TDT3 story link detection data sets
(Number of topics; number of same-topic and different-topic link pairs for English-English, Arabic-Arabic, Mandarin-Mandarin, English-Arabic, English-Mandarin, and Arabic-Mandarin pairs, and in total.)

Table 3: Characteristics of TDT3 topic tracking data sets
(Number of topics, and numbers of on-topic and all test stories for English, Arabic, Mandarin, and in total.)

2.2 Story Representation and Similarity

2.2.1 Cosine similarity
To compare two stories for link detection, or a story with a topic model for tracking, each story is represented as a vector of terms with tf·idf term weights:

$$a_i = tf \cdot \frac{\log\left((N + 0.5)/df\right)}{\log(N + 1)} \tag{1}$$

where tf is the number of occurrences of the term in the story, N is the total number of documents in the collection, and df is the number of documents containing the term. Collection statistics N and df are computed incrementally, based on the documents already in the stream within a deferral period after the test story arrives. The deferral period was 10 for link detection and 1 for topic tracking. For link detection, story vectors were pruned to the 1000 terms with the highest term weights. The similarity of two (weighted, pruned) vectors a = a_1, ..., a_n and b = b_1, ..., b_m is the inner product between the two vectors, normalized by their lengths:

$$Sim_{cos}(\vec{a}, \vec{b}) = \frac{\sum_i a_i b_i}{\sqrt{\left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)}} \tag{2}$$

If the similarity of two stories exceeds a yes/no threshold, the stories are considered to be about the same topic.

For topic tracking, a topic model is a centroid, an average of the vectors for the N_t training stories. Topic models are pruned to 100 terms based on the term weights. Story vectors pruned to 100 terms are compared to centroids using equation (2). If the similarity exceeds a yes/no threshold, the story is considered on-topic.

2.2.2 Relevance modeling
Relevance modeling is a statistical technique for estimating language models from extremely small samples, such as queries [9]. If Q is a small sample of text, and C is a large collection of documents, the language model for Q is estimated as:

$$P(w \mid Q) = \sum_{d \in C} P(w \mid M_d)\, P(M_d \mid Q) \tag{3}$$

A relevance model, then, is a mixture of language models M_d of every document d in the collection, where the document models are weighted by the posterior probability of producing the query, P(M_d|Q).
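A minimal sketch of the mixture in Equation (3), on a toy collection, may make the idea concrete. It is not the authors' implementation: the function names, the add-mu smoothing of document models, and the uniform prior over documents are all assumptions made for illustration. The posterior weights are obtained by normalizing the query likelihood under each document model (Bayes' rule).

```python
from collections import Counter
from math import prod


def doc_model(tokens, vocab, mu=0.5):
    """Smoothed unigram model P(w|M_d). Add-mu smoothing is an assumption of this sketch."""
    counts = Counter(tokens)
    total = len(tokens) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}


def relevance_model(query, docs):
    """P(w|Q) as a posterior-weighted mixture of document models, as in Equation (3)."""
    vocab = sorted({w for d in docs for w in d} | set(query))
    models = [doc_model(d, vocab) for d in docs]
    # Likelihood of the query under each document model.
    likelihoods = [prod(m[q] for q in query) for m in models]
    # Bayes' rule with a uniform document prior (assumed): normalize the likelihoods.
    z = sum(likelihoods)
    posteriors = [lik / z for lik in likelihoods]
    return {w: sum(p * m[w] for p, m in zip(posteriors, models)) for w in vocab}


docs = [
    ["election", "vote", "ballot"],
    ["football", "match", "goal"],
    ["election", "vote", "recount"],
]
rm = relevance_model(["election", "vote"], docs)
```

In this toy run the two election documents dominate the mixture, so words such as "ballot" and "recount" receive more probability than "goal" even though they do not appear in the query, which is the query expansion effect.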
The posterior probability is computed as:

$$P(M_d \mid Q) = \frac{P(d) \prod_{q \in Q} P(q \mid M_d)}{\sum_{d' \in C} P(d') \prod_{q \in Q} P(q \mid M_{d'})} \tag{4}$$

Equation (4) assigns the highest weights to documents that are most likely to have generated Q, and can be interpreted as nearest-neighbor smoothing, or a massive query expansion technique.

To apply relevance modeling to story link detection, we estimate the similarity between two stories A and B by pruning the stories to short queries, estimating relevance models for the queries, and measuring the similarity between the two relevance models. Each story is replaced by a query consisting of the ten words in the story with the lowest probability of occurring by chance in randomly drawing |A| words from the collection C:

$$P_{chance}(A \mid w) = \frac{\binom{C_w}{A_w} \binom{|C| - C_w}{|A| - A_w}}{\binom{|C|}{|A|}} \tag{5}$$

where |A| is the length of the story A, A_w is the number of times word w occurs in A, |C| is the size of the collection, and C_w is the number of times word w occurs in C. Story relevance models are estimated using equation (4). Similarity between relevance models is measured using the symmetrized clarity-adjusted divergence [11]:

$$Sim_{RM}(A, B) = \sum_w P(w \mid Q_A) \log \frac{P(w \mid Q_B)}{P(w \mid GE)} + \sum_w P(w \mid Q_B) \log \frac{P(w \mid Q_A)}{P(w \mid GE)} \tag{6}$$

where P(w|Q_A) is the relevance model estimated for story A, and P(w|GE) is the background (General English, Arabic, or Mandarin) probability of w, computed from the entire collection of stories in the language within the same deferral period used for cosine similarity.

To apply relevance modeling to topic tracking, the asymmetric clarity-adjusted divergence is used:

$$Sim_{track}(T, S) = \sum_w P(w \mid T) \log \frac{P(w \mid S)}{P(w \mid GE)} \tag{7}$$

where P(w|T) is a relevance model of the topic T. Because of computational constraints, smoothed maximum likelihood estimates rather than relevance models are used for the story model P(w|S). The topic model, based on Equation (3), is:

$$P(w \mid T) = \frac{1}{|S_t|} \sum_{d \in S_t} P(w \mid M_d) \tag{8}$$

where S_t is the set of training stories. The topic model is pruned to 100 terms. More detail about applying relevance models to TDT can be found in [2].

2.3 Evaluation
TDT tasks are evaluated as detection tasks. For each test trial, the system attempts to make a yes/no decision. In story link detection, the decision is whether the two members of a story pair belong to the same topic. In topic tracking, the decision is whether a story in the stream belongs to a particular topic. In all tasks, performance is summarized in two ways: a detection cost function (C_Det) and a detection error tradeoff (DET) curve. Both are based on the rates of two kinds of errors a detection system can make: misses, in which the system gives a no answer where the correct answer is yes, and false alarms, in which the system gives a yes answer where the correct answer is no.

The DET curve plots the miss rate (P_Miss) as a function of false alarm rate (P_Fa), as the yes/no decision threshold is swept through its range.
P_Miss and P_Fa are computed for each topic, and then averaged across topics to yield topic-weighted curves. An example can be seen in Figure 1 below. Better performance is indicated by curves more to the lower left of the graph.

The detection cost function is computed for a particular threshold as follows:

$$C_{Det} = C_{Miss} \cdot P_{Miss} \cdot P_{Target} + C_{Fa} \cdot P_{Fa} \cdot (1 - P_{Target}) \tag{9}$$

where:

$$P_{Miss} = \#\text{Misses} / \#\text{Targets}$$
$$P_{Fa} = \#\text{False Alarms} / \#\text{NonTargets}$$

C_Miss and C_Fa are the costs of a missed detection and false alarm, respectively, and are specified for the application, usually at 10 and 1, penalizing misses more than false alarms. P_Target is the a priori probability of finding a target, an item where the answer should be yes, and its value is set by convention. The cost function is normalized:

$$(C_{Det})_{Norm} = \frac{C_{Det}}{\min\left(C_{Miss} \cdot P_{Target},\; C_{Fa} \cdot (1 - P_{Target})\right)} \tag{10}$$

and averaged over topics. Each point along the detection error tradeoff curve has a value of (C_Det)_Norm. The minimum value found on the curve is known as the min(C_Det)_Norm. It can be interpreted as the value of (C_Det)_Norm at the best possible threshold. This measure allows us to separate performance on the task from the choice of yes/no threshold. Lower cost scores indicate better performance. More information about these measures can be found in [5].

2.4 Language-specific Comparisons
English stories were lower-cased and stemmed using the kstem stemmer [6]. Stop words were removed. For native Arabic comparisons, stories were converted from Unicode UTF-8 to Windows (CP1256) encoding, then normalized and stemmed with a light stemmer [7]. Stop words were removed. For native Mandarin comparisons, overlapping character bigrams were compared.

3. STORY LINK DETECTION
In this section we present experimental results for story link detection, comparing a native condition with an English baseline. In the English baseline, all comparisons are in English, using machine translation (MT) for Arabic and Mandarin stories. Corpus statistics are computed incrementally for all the English and translated-into-English stories.
In the Native condition, two stories originating in the same language are compared in that language. Corpus statistics are computed incrementally for the stories in the language of the comparison. Cross language pairs in the native condition are compared in English using MT, as in the baseline.

Figure 1: DET curve for TDT3 link detection based on English versions of stories, or native language versions, for cosine and relevance model similarity
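Because results are reported as minimum normalized detection cost, a small sketch of how that number can be obtained from Equations (9) and (10) may be useful. This is an illustration for a single topic, not the official scoring tool. The costs of 10 and 1 follow the text; the value `p_target=0.02` is an assumption chosen for the example, since the text does not state it. The function names are hypothetical, and topic-weighted averaging is left out.

```python
def normalized_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.02):
    """Equations (9) and (10). p_target=0.02 is an assumed value for illustration."""
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return c_det / min(c_miss * p_target, c_fa * (1.0 - p_target))


def min_normalized_cost(scores, labels, **costs):
    """Sweep the yes/no threshold over the observed scores; return the lowest cost.

    labels[i] is 1 for a target trial (correct answer yes) and 0 otherwise.
    Assumes at least one target and one non-target trial.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    # A threshold above every score answers "no" everywhere: all misses, no false alarms.
    best = normalized_cost(1.0, 0.0, **costs)
    for threshold in sorted(set(scores)):
        answers = [s >= threshold for s in scores]
        misses = sum(1 for yes, lab in zip(answers, labels) if lab and not yes)
        false_alarms = sum(1 for yes, lab in zip(answers, labels) if yes and not lab)
        cost = normalized_cost(misses / n_target, false_alarms / n_nontarget, **costs)
        best = min(best, cost)
    return best


separable = min_normalized_cost([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
overlapping = min_normalized_cost([0.9, 0.3, 0.6, 0.1], [1, 1, 0, 0])
```

With these settings a system that always answers no has a normalized cost of 1, so values below 1 indicate that the scores carry useful information.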

Table 4: Min(C_Det)_Norm for TDT3 story link detection
(Rows: Cosine, Relevance Model. Columns: English, Native.)

Figure 1 shows the DET curves for the TDT3 story link detection task, and Table 4 shows the minimum cost. The figure and table show that native language comparisons (dotted) consistently outperform comparisons based on machine-translated English (solid). This difference holds both for the basic cosine similarity system (first row) (black curves), and for the relevance modeling system (second row) (gray curves). These results support the general conclusion that when two stories originate in the same language, it is better to carry out similarity comparisons in that language, rather than translating them into a different language.

4. TOPIC TRACKING
In tracking, the system decides whether stories in a stream belong to predefined topics. Similarity is measured between a topic model and a story, rather than between two stories. The native language hypothesis for tracking predicts better performance if incoming stories are compared in their original language with topic models in that language, and worse performance if translated stories are compared with English topic models.

The hypothesis can only be tested indirectly, because Arabic and Mandarin training stories were not available for all tracking topics. In this first set of experiments, we chose to obtain native language training stories from the stream of test stories using topic adaptation, that is, gradual modification of topic models to incorporate test stories that fit the topic particularly well.

Adaptation begins with the topic tracking scenario described above in section 2.2, using a single model per topic based on a small set of training stories in English. Each time a story is compared to a topic model to determine whether it should be classed as on-topic, it is also compared to a fixed adaptation threshold θ_a = 0.5 (not to be confused with the yes/no threshold mentioned in section 2.2.1). If the similarity score is greater than θ_a, the story is added to the topic set, and the topic model recomputed.
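The adaptation loop just described can be sketched for the cosine system of Section 2.2.1 as follows. This is a simplified illustration rather than the system used in the experiments: stories are assumed to arrive already converted to dictionaries of tf·idf weights, the function names are hypothetical, and the yes/no threshold value in the example is arbitrary.

```python
from math import sqrt


def cosine(a, b):
    """Equation (2): inner product of two sparse vectors, normalized by their lengths."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = sqrt(sum(w * w for w in a.values()) * sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def prune(vector, k=100):
    """Keep the k terms with the highest weights."""
    return dict(sorted(vector.items(), key=lambda item: item[1], reverse=True)[:k])


def centroid(vectors, k=100):
    """Average the story vectors in the topic set and prune to k terms."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return prune({t: w / len(vectors) for t, w in total.items()}, k)


def track_with_adaptation(training, stream, yes_no_threshold, theta_a=0.5):
    """Return yes/no decisions for the stream and the topic set after adaptation."""
    topic_set = list(training)
    model = centroid(topic_set)
    decisions = []
    for story in stream:
        score = cosine(prune(story), model)
        decisions.append(score > yes_no_threshold)
        if score > theta_a:  # the story fits well enough to be added to the topic set
            topic_set.append(story)
            model = centroid(topic_set)
    return decisions, topic_set


training = [{"quake": 2.0, "rescue": 1.0}]
stream = [
    {"quake": 1.0, "rescue": 1.0, "aid": 1.0},
    {"football": 1.0},
    {"aid": 1.0, "rescue": 0.2},
]
decisions, topic_set = track_with_adaptation(training, stream, yes_no_threshold=0.3)
```

In this toy stream the first story is added to the topic set, which brings the term "aid" into the centroid. The third story then scores about 0.37 and is accepted, whereas against the unadapted centroid it would have scored about 0.09.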
For clarity, we use the phrase topic set to refer to the set of stories from which the topic model is built, which grows under adaptation. The training set includes only the original N_t training stories for each topic. For cosine similarity, adaptation consists of computing a new centroid for the topic set and pruning to 100 terms. For relevance modeling, a new topic model is computed according to Equation (8). At most 100 stories are placed in each topic set.

We have just described global adaptation, in which stories are added to global topic models in English. Stories that originated in Arabic or Mandarin are compared and added in their machine-translated version.

Native adaptation differs from global adaptation in making separate topic models for each source language. To decide whether a test story should be added to a native topic set, the test story is compared in its native language with the native model, and added to the native topic set for that language if its similarity score exceeds θ_a. The English version of the story is also compared to the global topic model, and if its similarity score exceeds θ_a, it is added to the global topic set. (Global models continue to adapt for other languages which may not yet have a native model, or for smoothing, discussed later.)

At the start there are global topic models and native English topic models based on the training stories, but no native Arabic or Mandarin topic models. When there is not yet a native topic model in the story's original language, the translated story is compared to the global topic model. If the similarity exceeds θ_a, the native topic model is initialized with the untranslated story.

Yes/no decisions for topic tracking can then be based on the untranslated story's similarity to the native topic model if one exists. If there is no native topic model yet for that language and topic, the translated story is compared to the global topic model.

We have described three experimental conditions: global adapted, native adapted, and a baseline. The baseline, described in Section 2.2, can also be called global unadapted. The baseline uses a single English model per topic based on the small set of training stories.
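The routing logic of native adaptation can be summarized in a short sketch. It is deliberately abstract and is not the authors' code: `similarity`, `update`, and `new_model` are hypothetical callables standing in for the cosine or relevance model machinery, and the toy stand-ins in the example (token sets compared by Jaccard overlap, with placeholder tokens such as `ar_quake` for untranslated Arabic terms) are assumptions made only so the block runs.

```python
def track_native(stream, global_model, native_models, similarity, update, new_model,
                 yes_no_threshold, theta_a=0.5):
    """Native adaptation for one topic.

    Each story is a dict with keys "lang", "english" (the MT or English text),
    and "native" (the untranslated text). native_models maps language -> model.
    """
    decisions = []
    for story in stream:
        lang = story["lang"]
        global_score = similarity(story["english"], global_model)
        if lang in native_models:
            # A native model exists: decide and adapt in the original language.
            native_score = similarity(story["native"], native_models[lang])
            decisions.append(native_score > yes_no_threshold)
            if native_score > theta_a:
                native_models[lang] = update(native_models[lang], story["native"])
        else:
            # No native model yet: fall back to the translated story and global model.
            decisions.append(global_score > yes_no_threshold)
            if global_score > theta_a:
                native_models[lang] = new_model(story["native"])
        if global_score > theta_a:
            # The global model keeps adapting on the English version.
            global_model = update(global_model, story["english"])
    return decisions, global_model, native_models


# Toy stand-ins (assumptions): models are token sets compared by Jaccard overlap.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


stream = [
    {"lang": "ar", "english": {"quake", "rescue", "aid"}, "native": {"ar_quake", "ar_rescue"}},
    {"lang": "ar", "english": {"football"}, "native": {"ar_quake", "ar_rescue", "ar_aid"}},
    {"lang": "zh", "english": {"match"}, "native": {"zh_match"}},
]
decisions, global_model, native_models = track_native(
    stream,
    global_model={"quake", "rescue"},
    native_models={"en": {"quake", "rescue"}},
    similarity=jaccard,
    update=lambda model, story: model | story,
    new_model=set,
    yes_no_threshold=0.4,
)
```

In the example the first Arabic story matches the global model through its translation and seeds a native Arabic model. The second Arabic story is then accepted on the strength of its native comparison even though its (deliberately poor) translation shares nothing with the global model.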
A fourth possible condition, native unadapted, is problematic and not included here. There is no straightforward way to initialize native language topic models without adaptation when training stories are provided only in English.

Figure 2: DET curves for TDT3 tracking, cosine similarity (above) and relevance models (below), N_t=4 training stories, global unadapted baseline, global adapted, and native adapted

Table 5: Min(C_Det)_Norm for TDT3 topic tracking.
(Rows: Cosine, RM. Columns, for each of N_t=2 and N_t=4: Baseline, Adapted Global, Adapted Native.)

The TDT3 tracking results on three conditions, replicated with the two different similarity measures (cosine similarity and relevance modeling) and two different training set sizes (N_t=2 and 4) can be seen in Table 5. DET curves for N_t=4 are shown in Figure 2, for cosine similarity (above) and relevance modeling (RM) (below).

Table 5 shows a robust adaptation effect for cosine and relevance model experiments, and for 2 or 4 training stories. Native and global adaptation are always better (lower cost) than baseline unadapted tracking. In addition, relevance modeling produces better results than cosine similarity. However, results do not show the predicted advantage for native adapted topic models over global adapted topic models. Only cosine similarity, N_t=4, seems to show the expected difference (shaded cells), but the difference is very small. The DET curve in Figure 2 shows no sign of a native language effect.

Table 6 shows minimum cost figures computed separately for English, Mandarin, and Arabic test sets. Only English shows a pattern similar to the composite results of Table 5 (see the shaded cells). For cosine similarity, there is not much difference between global and native English topic models. For relevance modeling, native English topic models are slightly worse than global models. Arabic and Mandarin appear to show a native language advantage for all cosine similarity conditions and most relevance model conditions. However, DET curves comparing global and native adapted models separately for English, Arabic, and Mandarin (Figure 3) show no real native language advantage.

Table 6: Min(C_Det)_Norm for TDT3 topic tracking; breakdown by original story language
(Rows: Cosine, RM, for each of English, Arabic, and Mandarin. Columns, for each of N_t=2 and N_t=4: Baseline, Adapted Global, Adapted Native.)

Figure 3: DET curves for TDT3 tracking, cosine similarity, N_t=4 training stories, global adapted vs. native adapted breakdown for English, Arabic, and Mandarin

In trying to account for the discrepancy between the findings on link detection and tracking, we suspected that the root of the problem was the quality of native models for Arabic and Mandarin. For English, adaptation began with 2 or 4 on-topic stories. However, Mandarin and Arabic models did not begin with on-topic stories; they could begin with off-topic models, which should hurt tracking performance. A related issue is data sparseness. When a native topic model is first formed, it is based on one story, which is a poorer basis for tracking than N_t stories.

In the next three sections we pursue different aspects of these suspicions. In section 5 we perform a best-case experiment, initializing native topic sets with on-topic stories, and smoothing native scores with global scores to address the sparseness problem. If these conditions do not show a native language advantage, we would reject the native language hypothesis. In section 6 we explore the role of the adaptation threshold. In section 7 we compare some additional methods of initializing native language topic models.

5. ON-TOPIC NATIVE CENTROIDS
In this section, we consider a best-case scenario, where we take the first N_t stories in each language relevant to each topic, to initialize adaptation of native topic models. While this is cheating, and not a way to obtain native training documents in a realistic tracking scenario, it demonstrates what performance can be attained if native training documents are available. More realistic approaches to adapting native topic models are considered in subsequent sections.

The baseline and global adapted conditions were carried out as in Section 4, and the native adapted condition was similar except in the way adaptation of native topics began. If there were not yet N_t native stories in the topic set for the current test story in its native language, the story was added to the topic set if it was relevant. Once a native topic model had N_t stories, we switched to the usual non-cheating mode of adaptation, based on similarity score and adaptation threshold.
To address the data sparseness problem, we also smoothed the native similarity scores with the global similarity scores:

$$Sim_{smooth}(T, S) = \lambda\, Sim_{native}(T, S) + (1 - \lambda)\, Sim_{global}(T, S) \tag{11}$$

The parameter λ was not tuned, but set to a fixed value of 0.5. The results can be seen in Table 7. Shaded cell pairs indicate confirmation of the native language hypothesis, where language-specific topic models outperform global models.

Table 7: Min(C_Det)_Norm for TDT3 topic tracking, using N_t on-topic native training stories and smoothing native scores
(Rows: Cosine, Rel. Columns, for each of N_t=2 and N_t=4: Baseline, Adapted Global, Adapted Native.)

Figure 4: DET curve for TDT3 tracking, initializing native adaptation with relevant training stories during adaptation, cosine similarity, N_t=4

Figure 4 shows the DET curves for the cosine, N_t=4 case. When the native models are initialized with on-topic stories, the advantage to native models is clearly seen in the tracking performance.

DET curves showing results computed separately for the three languages can be seen in Figure 5, for the cosine, N_t=4 case. It can be clearly seen that English tracking remains about the same but the Arabic and Mandarin native tracking show a large native language advantage.

6. ADAPTATION THRESHOLD
The adaptation threshold was set to 0.5 in the experiments described above without any tuning. The increase in global tracking performance after adaptation shows that the value is at least acceptable. However, an analysis of the details of native adaptation showed that many Arabic and Mandarin topics were not adapting. A summary of some of this analysis can be seen in Table 8.

Table 8: Number of topics receiving new stories during native adaptation, breakdown by language
(Columns: Similarity, N_t, Total Topics, and topics receiving more stories for English, Arabic, and Mandarin. Rows: Cosine, Relevance Model.)

Fewer than a third of the topics received adapted stories. This means that for most topics, native tracking was based on the global models. In order to determine whether this was due to the adaptation threshold, we performed an experiment varying the adaptation threshold from .3 to .65 in steps of .05. The results can be seen in Figure 6, which shows the minimum cost, min(C_Det)_Norm, across the range of adaptation threshold values.
Although we see that the original threshold, .5, was not always the optimal value, it is also clear that the pattern we saw at .5 (and in Figure 6) does not change as the threshold is varied, that is, tracking with native topic models is not better than tracking with global models. An improperly tuned adaptation threshold was therefore not the reason that the native language hypothesis was not confirmed for tracking. We suspect that different adaptation thresholds may be needed for the different languages, but it would be better to handle this problem by language-specific normalization of similarity scores.

Figure 6: Effect of adaptation threshold on min(C_Det)_Norm on TDT3 tracking with adaptation. (Panels: Cosine Similarity and Relevance Model; minimum cost plotted against threshold for Global and Native models with N_t=2 and N_t=4.)

Figure 5: DET curve for TDT3 tracking initializing native adaptation with relevant training stories during adaptation and smoothing, vs. global adaptation, cosine similarity, N_t=4, separate analyses for English, Arabic, and Mandarin.

7. IMPROVING NATIVE TOPIC MODELS
In the previous two sections we showed that when native topic models are initialized with language specific training stories that are truly on-topic, then topic tracking is indeed better with native models than with global models. However, in context of the TDT test situation, the way we obtained our language-specific training stories was cheating. In this section we experiment with 2 different legal ways to initialize better native language models:
(1) Use both global and native models, and smooth native similarity scores with global similarity scores.
(2) Initialize native models with dictionary or other translations of the English training stories into the other language.

Smoothing was carried out in the native adapted condition according to Equation (11), setting λ=0.5, without tuning. The comparison with unadapted and globally adapted tracking can be seen in Table 9. The smoothing improves the native topic model performance relative to unsmoothed native topic models (cf. Table 5), and brings the native model performance to roughly the same level as the global. In other words, smoothing improves performance, but we still do not have strong support for the native language hypothesis. This is apparent in Figure 7. Native adapted tracking is not better than global adapted tracking.

Table 9: Min(C_Det)_Norm for TDT3 topic tracking, smoothing native scores with global scores
(Rows: Cosine, RM. Columns, for each of N_t=2 and N_t=4: Baseline, Adapted Global, Adapted Native Smooth.)

Training story translations into Arabic used an English/Arabic probabilistic dictionary derived from the Linguistic Data Consortium's UN Arabic/English parallel corpus, developed for our cross-language information retrieval work. Each English word has many different Arabic translations, each with a translation probability p(a|e). The Arabic words, but not the English words, have been stemmed according to a light stemming algorithm. To translate an English story, English stop words were removed, and each English word occurrence was replaced by all of its dictionary translations, weighted by their translation probabilities. Weights were summed across all the occurrences of each Arabic word, and the resulting Arabic term vector was truncated to retain only terms above a threshold weight. We translated training stories only into Arabic, because we did not have a method to produce good quality English to Mandarin translation.
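The dictionary-based translation of a training story described above can be sketched in a few lines. The dictionary entries, the placeholder tokens `A1` to `A4` standing in for stemmed Arabic terms, the stop word list, and the threshold value are all invented for illustration; only the procedure (drop stop words, expand each occurrence into weighted translations, sum, truncate) follows the text.

```python
from collections import Counter, defaultdict


def translate_story(tokens, dictionary, stop_words, min_weight=0.1):
    """Turn an English token list into a weighted term vector in the target language.

    dictionary maps an English word to {target_term: p(a|e)}.
    min_weight is a hypothetical truncation threshold.
    """
    counts = Counter(t for t in tokens if t not in stop_words)
    weights = defaultdict(float)
    for english, count in counts.items():
        # Every occurrence contributes each translation, weighted by its probability.
        for target, prob in dictionary.get(english, {}).items():
            weights[target] += count * prob
    # Retain only terms above the threshold weight.
    return {term: w for term, w in weights.items() if w > min_weight}


# Toy probabilistic dictionary with placeholder target terms (assumed data).
dictionary = {
    "peace": {"A1": 0.7, "A2": 0.3},
    "talks": {"A3": 0.6, "A4": 0.05},
}
vector = translate_story(["the", "peace", "talks", "peace"], dictionary, stop_words={"the"})
```

Here "peace" occurs twice, so its translations receive twice their dictionary probability, and the low-probability translation `A4` falls below the threshold and is dropped. The resulting vector could then serve as the initial native topic model in place of an untranslated story.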
The results for Arabic can be seen in Table 10. For translation, it makes sense to include an unadapted native condition, labeled translated in the table.

Table 10: Min(C_Det)_Norm for Arabic TDT3 topic tracking, initializing native topic models with dictionary-translated training stories
(Rows: Cosine, RM, for each of Arabic N_t=2 and Arabic N_t=4. Columns: Unadapted Baseline, Unadapted Translated, Adapted Global, Adapted Native.)

Figure 7: DET curve for TDT3 tracking with smoothing, cosine similarity, N_t=4 training stories

The final method of initializing topic models for different languages would be to translate the English training stories into the other languages required. We did not have machine translation from English into Arabic or Mandarin available for these experiments. However, we have had success with dictionary translations for Arabic. In [2] we found that dictionary translations from Arabic into English resulted in comparable performance to the machine translations on tracking, and better performance on link detection. Such translated stories would not be native language training stories, but might be a better starting point for language-specific adaptation anyway.

Figure 8: DET curve for TDT3 tracking initializing native topics with dictionary-translated training stories, cosine similarity, N_t=4, Arabic only

The results are mixed. First of all, this case is unusual in that adaptation does not improve translated models. Further analysis revealed that very little adaptation was taking place. Because of this lack of native adaptation, global adaptation consistently outperformed native adaptation here. However, in the unadapted conditions, translated training stories outperformed the global models for Arabic in three of the four cases - cosine N_t=4 and relevance models for N_t=2 and N_t=4 (the shaded baseline-translated pairs in Table 10). The DET curve for the cosine N_t=4 case can be seen in Figure 8. The native unadapted curve is better (lower) than the global unadapted curve. The translated stories were very different from the test stories, so their similarity scores almost always fell below the adaptation threshold. We believe the need to normalize scores between native stories and dictionary translations is part of the problem, but we also need to investigate the compatibility of the dictionary translations with the native Arabic stories.

8. CONCLUSIONS
We have confirmed the native language hypothesis for story link detection. For topic tracking, the picture is more complicated. When native language training stories are available, good native language topic models can be built for tracking stories in their original language. Smoothing the native models with global models improves performance slightly. However, if training stories are not available in the different languages, it is difficult to form native models, whether by adaptation or by translation of training stories, that perform better than the adapted global models.

Why should language specific comparisons be more accurate than comparisons based on machine translation? Machine translations are not always good translations. If the translation distorts the meaning of the original story, it is unlikely to be similar to the topic model, particularly if proper names are incorrect, or spelled differently in the machine translations than they are in the English training stories, a common problem in English translations from Mandarin or Arabic. Secondly, even if the translations are correct, the choice of words, and hence the language models, are likely to be different across languages. The second problem could be handled by normalizing for source language, as in [12]. But normalization cannot compensate for poor translation.

We were surprised that translating the training stories into Arabic to make Arabic topic models did not improve tracking, but again, our dictionary based translations of the topic models were different from native Arabic stories.
We intend to try the same experiment with manual translations of the training stories into Arabic and Mandarin. We are also planning to investigate the best way to normalize scores for different languages. When TDT4 relevance judgments are available we intend to replicate some of these experiments on TDT4 data.

9. ACKNOWLEDGMENTS
This work was supported in part by the Center for Intelligent Information Retrieval and in part by SPAWARSYSCEN-SD grant number N. Any opinions, findings and conclusions or recommendations expressed in this material are the author(s)' and do not necessarily reflect those of the sponsor.

10. REFERENCES
[1] Allan, J. Introduction to topic detection and tracking. In Topic detection and tracking: Event-based information organization, J. Allan (ed.): Kluwer Academic Publishers, 1-16.
[2] Allan, J., Bolivar, A., Connell, M., Cronen-Townsend, S., Feng, A, Feng, F., Kumaran, G., Larkey, L., Lavrenko, V., Raghavan, H. UMass TDT 2003 Research Summary. In Proceedings of TDT 2003 evaluation, unpublished.
[3] Chen, H.-H. and Ku, L. W. An NLP & IR approach to topic detection. In Topic detection and tracking: Event-based information organization, J. Allan (ed.). Boston, MA: Kluwer.
[4] Chen, Y.-J. and Chen, H.-H. NLP and IR approaches to monolingual and multilingual link detection. Presented at Proceedings of 19th International Conference on Computational Linguistics, Taipei, Taiwan.
[5] Fiscus, J. G. and Doddington, G. R. Topic detection and tracking evaluation overview. In Topic detection and tracking: Event-based information organization, J. Allan (ed.). Boston, MA: Kluwer, 17-32.
[6] Krovetz, R. Viewing morphology as an inference process. In Proceedings of SIGIR '93.
[7] Larkey, Leah S. and Connell, Margaret E. (2003) Structured Queries, Language Modeling, and Relevance Modeling in Cross-Language Information Retrieval. To appear in Information Processing and Management Special Issue on Cross Language Information Retrieval.
[8] Larkey, L. S., Ballesteros, L., and Connell, M. E. Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of SIGIR 2002.
[9] Lavrenko, V. and Croft, W. B. Relevance-based language models. In Proceedings of SIGIR. New Orleans: ACM.
[10] Lavrenko, V. and Croft, W. B. Relevance models in information retrieval. In Language modeling for information retrieval, W. B. Croft and J. Lafferty (eds.). Boston: Kluwer, 11-56.
[11] Lavrenko, V., Allan, J., DeGuzman, E., LaFlamme, D., Pollard, V., and Thomas, S. Relevance models for topic detection and tracking. In Proceedings of the Conference on Human Language Technology.
[12] Leek, T., Schwartz, R. M., and Sista, S. Probabilistic approaches to topic detection and tracking. In Topic detection and tracking: Event-based information organization, J. Allan (ed.). Boston, MA: Kluwer, 67-83.
[13] Levow, G.-A. and Oard, D. W. Signal boosting for translingual topic tracking: Document expansion and n-best translation. In Topic detection and tracking: Event-based information organization, J. Allan (ed.). Boston, MA: Kluwer.
[14] Oard, D. W. Adaptive vector space text filtering for monolingual and cross-language applications. PhD dissertation, University of Maryland, College Park.


The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

An Efficient Scanning Pattern for Layered Manufacturing Processes

An Efficient Scanning Pattern for Layered Manufacturing Processes Proceengs of the 2 IEEE Internatonal Conference on Robotcs & Automaton Seoul, Korea May 2-26, 2 An Effcent Scannng Pattern for Layere Manufacturng Processes Y.Yang, J.Y.H Fuh 2, H.T.Loh 2 Department of

More information

Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies

Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies Deep Classfer: Automatcally Categorzng Search Results nto Large-Scale Herarches Dkan Xng 1, Gu-Rong Xue 1, Qang Yang 2, Yong Yu 1 1 Shangha Jao Tong Unversty, Shangha, Chna {xaobao,grxue,yyu}@apex.sjtu.edu.cn

More information

A Method of Hot Topic Detection in Blogs Using N-gram Model

A Method of Hot Topic Detection in Blogs Using N-gram Model 84 JOURNAL OF SOFTWARE, VOL. 8, NO., JANUARY 203 A Method of Hot Topc Detecton n Blogs Usng N-gram Model Xaodong Wang College of Computer and Informaton Technology, Henan Normal Unversty, Xnxang, Chna

More information

A Webpage Similarity Measure for Web Sessions Clustering Using Sequence Alignment

A Webpage Similarity Measure for Web Sessions Clustering Using Sequence Alignment A Webpage Smlarty Measure for Web Sessons Clusterng Usng Sequence Algnment Mozhgan Azmpour-Kv School of Engneerng and Scence Sharf Unversty of Technology, Internatonal Campus Ksh Island, Iran mogan_az@ksh.sharf.edu

More information

Analysis of Continuous Beams in General

Analysis of Continuous Beams in General Analyss of Contnuous Beams n General Contnuous beams consdered here are prsmatc, rgdly connected to each beam segment and supported at varous ponts along the beam. onts are selected at ponts of support,

More information

Correlative features for the classification of textural images

Correlative features for the classification of textural images Correlatve features for the classfcaton of textural mages M A Turkova 1 and A V Gadel 1, 1 Samara Natonal Research Unversty, Moskovskoe Shosse 34, Samara, Russa, 443086 Image Processng Systems Insttute

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

An Application of Computational Intelligence Technique for Predicting Surface Roughness in End Milling of Inconel-718

An Application of Computational Intelligence Technique for Predicting Surface Roughness in End Milling of Inconel-718 An Applcaton of Computatonal Intellgence Technque for Prectng Roughness n En Mllng of Inconel-718 Abhjt Mahapatra 1 an Shbenu Shekhar Roy 2, 1 Vrtual Prototypng & Immerse Vsualzaton Laboratory, Central

More information

Optimal Workload-based Weighted Wavelet Synopses

Optimal Workload-based Weighted Wavelet Synopses Optmal Workload-based Weghted Wavelet Synopses Yoss Matas School of Computer Scence Tel Avv Unversty Tel Avv 69978, Israel matas@tau.ac.l Danel Urel School of Computer Scence Tel Avv Unversty Tel Avv 69978,

More information

Faces Recognition with Image Feature Weights and Least Mean Square Learning Approach

Faces Recognition with Image Feature Weights and Least Mean Square Learning Approach Faces Recognton wth Image Feature Weghts an Least Mean Square Learnng Approach We-L Fang, Yng-Kue Yang an Jung-Kue Pan Dept. of Electrcal Engneerng, Natonal Tawan Un. of Sc. & Technology, Tape, Tawan Emal:

More information