Resolving Surface Forms to Wikipedia Topics

Yiping Zhou, Lan Nie, Omid Rouhani-Kalleh, Flavian Vasile, Scott Gaffney
Yahoo! Labs at Sunnyvale
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, August 2010.

Abstract

Ambiguity of entity mentions and concept references is a challenge to mining text beyond surface-level keywords. We describe an effective method of disambiguating surface forms and resolving them to Wikipedia entities and concepts. Our method employs an extensive set of features mined from Wikipedia and other large data sources, and combines the features using a machine learning approach with automatically generated training data. Based on a manually labeled evaluation set containing over 1000 news articles, our resolution model has 85% precision and 87.8% recall. The performance is significantly better than three baselines based on traditional context similarities or sense commonness measurements. Our method can be applied to other languages and scales well to new entities and concepts.

1 Introduction

Ambiguity in natural language is prevalent and, as such, it can be a difficult challenge for information retrieval systems and other text mining applications. For example, a search for Ford in Yahoo! News retrieves about 40 thousand articles containing Ford referring to a company (Ford Motors), an athlete (Tommy Ford), a place (Ford City), etc. Due to reference ambiguity, even if we knew the user was only interested in the company, they would still have to contend with articles referring to the other concepts as well.

In this paper we focus on the problem of resolving references to named entities and concepts in natural language through their textual surface forms. Specifically, we present a method of resolving surface forms in general text documents to Wikipedia entries. The tasks of resolution and disambiguation are nearly identical; we make the distinction that resolution specifically applies when a known set of referent concepts is given a priori.

Our approach differs from others in multiple aspects, including the following. 1) We employ a rich set of disambiguation features leveraging mining results from large-scale data sources. We calculate context-sensitive features by extensively mining the categories, links and contents of the entire Wikipedia corpus. Additionally, we make use of context-independent data mined from various data sources including Web user-behavioral data and Wikipedia. Our features also capture the one-to-one relationship between a surface form and its referent. 2) We use machine learning methods to train resolution models with a large, automatically labeled training set. Both ranking-based and classification-based resolution approaches are explored. 3) Our method disambiguates both entities and word senses. It scales well to new entities and concepts, and it can be easily applied to other languages.

We propose an extensive set of metrics to evaluate not only overall resolution performance but also out-of-Wikipedia prediction. Our systems for the English language are evaluated using real-world test sets and compared with a number of baselines. Evaluation results show that our systems consistently and significantly outperform others across all test sets.

The paper is organized as follows. We first describe related research in Section 2, followed by an introduction to Wikipedia in Section 3. We then introduce our learning method in Section 4 and our features in Section 5. We show our experimental results in Section 6, and finally close with a discussion of future work.

2 Related Work

Named entity disambiguation research can be divided into two categories: some works (Bagga and Baldwin, 1998; Mann and Yarowsky, 2003; Pedersen et al., 2005; Fleischman and Hovy, 2004; Ravin and Kazi, 1999) aim to cluster ambiguous surface forms into different groups, with each group representing a unique entity; others (Cucerzan, 2007; Bunescu and Paşca, 2006; Han and Zhao, 2009; Milne and Witten, 2008a; Milne and Witten, 2008b) resolve a surface form to an entity or concept extracted from existing knowledge bases. Our work falls into the second category.

Looking specifically at resolution, Bunescu and Paşca (2006) built a taxonomy SVM kernel to enrich a surface form's representation with words from Wikipedia articles in the same category. Cucerzan (2007) employed context vectors consisting of phrases and categories extracted from Wikipedia. The system also attempted to disambiguate all surface forms in a context simultaneously, with the constraint that their resolved entities should be globally consistent on the category level as much as possible. Milne and Witten (2008a, 2008b) proposed to use Wikipedia's link structure to capture the relatedness between Wikipedia entities, so that a surface form is resolved to an entity based on its relatedness to the surface form's surrounding entities. Besides relatedness, they also define a commonness feature that captures how common it is for a surface form to link to a particular entity in general. Han and Zhao (2009) defined a novel alignment strategy to calculate similarity between surface forms based on semantic relatedness in the context.

Milne and Witten's work is most closely related to what we propose here, in that we also employ features similar to their relatedness and commonness features. However, we add to this a much richer set of features extracted from Web-scale data sources beyond Wikipedia, and we develop a machine learning approach to blend our features using completely automatically generated training data.

3 Wikipedia

Wikipedia has more than 200 language editions, and the English edition has more than 3 million articles as of March 2010. Newsworthy events are often added to Wikipedia within days of occurrence; Wikipedia has bi-weekly snapshots available for download.

Each article in Wikipedia is uniquely identified by its title, which is usually the most common surface form of an entity or concept. Each article includes body text, outgoing links and categories. Here is a sample sentence from the article titled Aristotle, in wikitext format:

Together with Plato and [[Socrates]] (Plato's teacher), Aristotle is one of the most important founding figures in [[Western philosophy]].

Near the end of the article, there are category links such as [[Category:Ancient Greek mathematicians]]. The double brackets annotate outgoing links to other Wikipedia articles with the specified titles. The category names are created by authors. Articles and category names have many-to-many relationships.

In addition to normal articles, Wikipedia also has special types of articles such as redirect articles and disambiguation articles. A redirect article's title is an alternative surface form for a Wikipedia entry. A disambiguation article lists links to similarly named articles, and usually its title is a commonly used surface form for multiple entities and concepts.

4 Method of Learning

Our goal is to resolve surface forms to entities or concepts described in Wikipedia. To this end, we first need a recognizer to detect the surface forms to be resolved. Then we need a resolver to map a surface form to the most probable entry in Wikipedia (or to out-of-Wiki) based on the context.
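
As an illustration of the link markup just described, the short sketch below pulls outgoing links and their anchor texts out of a wikitext fragment with a regular expression. The function and its handling of piped and category links are simplifying assumptions for illustration, not the extraction pipeline used in the paper.

import re

# Matches [[Target]] and [[Target|anchor text]]; category links are skipped below.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:\|([^\[\]]+))?\]\]")

def extract_links(wikitext):
    """Return (target_title, anchor_text) pairs found in a wikitext fragment."""
    links = []
    for match in WIKILINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        if target.lower().startswith("category:"):
            continue  # category links are metadata, not entity mentions
        anchor = (match.group(2) or target).strip()
        links.append((target, anchor))
    return links

sentence = ("Together with Plato and [[Socrates]] (Plato's teacher), Aristotle is one of "
            "the most important founding figures in [[Western philosophy]].")
print(extract_links(sentence))
# [('Socrates', 'Socrates'), ('Western philosophy', 'Western philosophy')]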

Recognizer: We first create a set of Wikipedia (article) entries E = {e1, e2, ...} to which we want to resolve surface forms. Each entry's surface forms are mined from multiple data sources. Then we use simple string matching to recognize surface forms in text documents.

First, among all Wikipedia entries, we exclude those with low importance. In our experiments, we removed the entries that would not interest general Web users, such as stop words and punctuation marks. Second, we collect surface forms for the entries in E using Wikipedia and Web search query click logs, based on the following assumptions:

- Each Wikipedia article title is a surface form for the entry.
- Redirect titles are taken as alternative surface forms for the target entry.
- The anchor text of a link from one article to another is taken as an alternative surface form for the linked-to entry.
- Web search engine queries resulting in user clicks on a Wikipedia article are taken as alternative surface forms for the entry.

As a result, we get a number of surface forms for each entry ei. If we let sij denote the j-th surface form for entry ei, then we can represent our entry dictionary as EntSfDct = {<e1, (s11, s12, ...)>, <e2, (s21, s22, ...)>, ...}.
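
The recognizer can be pictured as a dictionary lookup followed by greedy longest-string matching. The sketch below is a minimal illustration with a toy dictionary, not the production recognizer, and it ignores tokenization and capitalization details.

def build_ent_sf_dict(entries):
    """Invert {entry: [surface forms]} into {surface form: set of candidate entries}."""
    sf_to_entries = {}
    for entry, surface_forms in entries.items():
        for sf in surface_forms:
            sf_to_entries.setdefault(sf, set()).add(entry)
    return sf_to_entries

def recognize(tokens, sf_to_entries, max_len=5):
    """Greedy longest-match recognition of known surface forms in a token list."""
    mentions, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in sf_to_entries:
                mentions.append((i, candidate, sf_to_entries[candidate]))
                i += n
                break
        else:
            i += 1  # no known surface form starts here
    return mentions

# Toy dictionary; real surface forms come from titles, redirects, anchors and query clicks.
ent_sf = build_ent_sf_dict({
    "Ford Motor Company": ["Ford", "Ford Motor Company"],
    "Tommy Ford":         ["Ford", "Tommy Ford"],
})
print(recognize("Tommy Ford drove a Ford".split(), ent_sf))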

Resolver: We first build a labeled training set automatically, and then use supervised learning methods to learn models that resolve among Wikipedia entries. In the rest of this section we describe the resolver in detail.

4.1 Automatically Labeled Data

To learn accurate models, supervised learning methods require training data of both large quantity and high quality, which often takes a lot of human labeling effort. However, in Wikipedia, links provide a supervised mapping from surface forms to article entries. We use these links to automatically generate training data. If a link's anchor text is a surface form in EntSfDct, we extract the anchor text as surface form s and the link's destination article as Wikipedia entry e, then add the pair (s, e) with a positive judgment to our labeled example set. Continuing, we use EntSfDct to find the other Wikipedia entries for which s is a surface form, create negative examples for these, and add them to our labeled example set. If e does not exist in EntSfDct (for example, if the link points to a Wikipedia article about a stop word), then a negative training example is created for every Wikipedia entry to which s may resolve. We use oow (out-of-Wiki) to denote this case.

Instead of article-level coreference resolution, we only match partial names with full names, based on the observation that surface forms for named entities are usually capitalized word sequences in English, and a named entity is often mentioned by a long surface form followed by mentions of short forms in the same article. For each pair (s, e) in the labeled example set, if s is a partial name of a full name s' occurring earlier in the same document, we replace (s, e) with (s', e) in the labeled example set.

Using this methodology we created 2.4 million labeled examples from only 1% of English Wikipedia articles. The abundance of data made it possible for us to experiment on the impact of training set size on model accuracy.

4.2 Learning Algorithms

In our experiments we explored both Gradient Boosted Decision Trees (GBDT) and Gradient Boosted Ranking (GBRank) to learn resolution models. Both can easily combine features of different scales and with missing values. Other supervised learning methods are to be explored in the future.

GBDT: We use the stochastic variant of GBDT (Friedman, 2001) to learn a binary logistic regression model with the judgments as the target. GBDT computes a function approximation by performing a numerical optimization in function space. It proceeds in multiple stages, with each stage modeling the residuals of the model from the previous stage using a small decision tree. A brief summary is given in Algorithm 1. In the stochastic version of GBDT, one sub-samples the training data instead of using the entire training set to compute the loss function.

Algorithm 1 GBDT
Input: training data {(x_i, y_i)}, i = 1, ..., N; loss function L[y, f(x)]; the number of nodes for each tree J; the number of trees M.
1: Initialize f(x) = f_0.
2: For m = 1 to M:
   2.1: For i = 1 to N, compute the negative gradient by taking the derivative of the loss with respect to f(x_i) and substituting y_i and f_{m-1}(x_i).
   2.2: Fit a J-node regression tree to the components of the negative gradient.
   2.3: Find the within-node updates a_{jm} for j = 1 to J by performing J univariate optimizations of the node contributions to the estimated loss.
   2.4: Update f_m(x) = f_{m-1}(x) + r a_{jm}, where j is the node that x belongs to and r is the learning rate.
3: End for.
4: Return f_M.

In our setting, the loss function is a negative binomial log-likelihood, x_i is the feature vector for a surface-form and Wikipedia-entry pair (s_i, e_i), and y_i is +1 for positive judgments and -1 for negative judgments.
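
The models in this paper were trained with an internal GBDT implementation. Purely as an illustration of how the hyperparameters of Algorithm 1 map onto a publicly available implementation, the sketch below uses scikit-learn's GradientBoostingClassifier on made-up data; the parameter values shown are not the tuned ones.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy feature matrix: one row per (surface form, candidate entry) pair,
# label 1 if the pair is correct, 0 otherwise (scikit-learn expects 0/1 classes).
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + 0.3 * rng.rand(200) > 0.7).astype(int)

model = GradientBoostingClassifier(
    n_estimators=300,    # M, the number of trees
    max_leaf_nodes=8,    # J, the number of (leaf) nodes per tree
    learning_rate=0.05,  # r in Algorithm 1
    subsample=0.5,       # < 1.0 gives the stochastic variant
    random_state=0,
)
# The default loss for binary classification is the binomial deviance,
# i.e. a negative binomial log-likelihood.
model.fit(X, y)

# Scores used to rank the candidate entries of a surface form: P(correct | features).
scores = model.predict_proba(X[:5])[:, 1]
print(scores)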

GBRank: From a given surface form's judgments we can infer that the correct Wikipedia entry is preferred over the other entries. This allows us to derive pair-wise preference judgments from absolute judgments and train a model to rank all the candidate Wikipedia entries for each surface form. Let S = {(x_i, x_i') | l(x_i) > l(x_i'), i = 1, ..., N} be the set of preference judgments, where x_i and x_i' are the feature vectors for two surface-form and Wikipedia-entry pairs, and l(x_i) and l(x_i') are their absolute judgments respectively. GBRank (Zheng et al., 2007) tries to learn a function h such that h(x_i) >= h(x_i') for every (x_i, x_i') in S. A sketch of the algorithm is given in Algorithm 2.

Algorithm 2 GBRank
1: Initialize h = h_0.
2: For k = 1 to K:
   2.1: Use h_{k-1} as an approximation of h and compute
        S+ = {(x_i, x_i') in S | h_{k-1}(x_i) >= h_{k-1}(x_i') + τ},
        S- = {(x_i, x_i') in S | h_{k-1}(x_i) < h_{k-1}(x_i') + τ},
        where τ = α (l(x_i) - l(x_i')).
   2.2: Fit a regression function g_k using GBDT and the incorrectly predicted examples {(x_i, h_{k-1}(x_i') + τ), (x_i', h_{k-1}(x_i) - τ) | (x_i, x_i') in S-}.
   2.3: Update h_k(x) = (k h_{k-1}(x) + η_k g_k(x)) / (k + 1), where η_k is the learning rate.
3: End for.
4: Return h_K.

We use a tuning set independent from the training set to select the optimal parameters for GBDT and GBRank. This includes the number of trees M, the number of nodes J, the learning rate r, and the sampling rate for GBDT; for GBRank we select K, α and η.

The feature importance measurement given by GBDT and GBRank is computed by keeping track of the reduction in the loss function at each feature variable split and then computing the total reduction of loss along each explanatory feature variable. We use it to analyze feature effectiveness.

4.3 Prediction

After applying a resolution model to the given test data, we obtain a score for each surface-form and Wikipedia-entry pair (s, e). Among all the pairs containing s, we find the pair with the highest score, denoted by (s, ẽ). It is very common that a surface form refers to an entity or concept not defined in Wikipedia, so it is important to correctly predict whether the given surface form cannot be mapped to any Wikipedia entry in EntSfDct. We apply a threshold to the scores from the resolution models. If the score for (s, ẽ) is lower than the threshold, then the prediction is oow (see Section 4.1); otherwise ẽ is predicted to be the entry referred to by s. We select thresholds based on F1 (see Section 6.2) on a tuning set that is independent from our training set and test set.
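
The prediction step amounts to an argmax over the scored candidate pairs of a surface form, followed by a rejection threshold for the oow case. A minimal sketch follows; the threshold value and the data layout are illustrative, not the F1-tuned ones.

def resolve(scored_candidates, threshold=0.35):
    """scored_candidates: list of (entry, model_score) pairs for one surface form.
    Returns the best-scoring entry, or 'oow' if no score clears the threshold."""
    if not scored_candidates:
        return "oow"
    best_entry, best_score = max(scored_candidates, key=lambda pair: pair[1])
    return best_entry if best_score >= threshold else "oow"

print(resolve([("Ford Motor Company", 0.81), ("Tommy Ford", 0.12)]))  # Ford Motor Company
print(resolve([("Ford City", 0.22), ("Tommy Ford", 0.18)]))           # oow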

5 Features

For each surface-form and Wikipedia-entry pair (s, e), we create a feature vector including features that capture the context surrounding s and features that are independent of the context. These are the context-dependent and context-independent features, respectively. Various data sources are mined to extract these features, including Wikipedia articles, Web search query-click logs, and Web-user browsing logs. In addition, (s, e) is compared to all pairs containing s on the above features, and the derived features are called differentiation features.

5.1 Context-dependent Features

These features measure whether resolving the given surface form s to the given Wikipedia entry e would make the given document more coherent. They are based on 1) the vector representation of e, and 2) the vector representation of the context of s in a document d.

Representation of e: By thoroughly mining Wikipedia and other large data sources, we extract contextual clues for each Wikipedia entry e and formulate its representation in the following ways.

1) Background representation. The overall background description of e is given in the corresponding Wikipedia article, denoted as A_e. Naturally, a bag of terms and surface forms in A_e can represent e, so we represent e by a background word vector E_bw and a background surface form vector E_bs, in which each element is the occurrence count of a word or a surface form in the first paragraph of A_e.

2) Co-occurrence representation. The terms and surface forms frequently co-occurring with e capture its contextual characteristics. We first identify all the Wikipedia articles linking to A_e. Then, for each link pointing to A_e, we extract the surrounding words and surface forms within a window centered on the anchor text. The window size is set to 10 words in our experiments. Finally, we select the words and surface forms with the top co-occurrence frequency, and represent e by a co-occurring word vector E_cw and a co-occurring surface form vector E_cs, in which each element is the co-occurrence frequency of a selected word or surface form.

3) Relatedness representation. We analyzed the relatedness between Wikipedia entries from different data sources using various measurements, and we computed over 20 types of relatedness scores in our experiments. In the following we discuss three types as examples. The first type is computed based on the overlap between two Wikipedia entries' categories. The second type is mined from Wikipedia inter-article links. (In our experiments, two Wikipedia entries are considered related if the two articles are mutually linked to each other or are co-cited by many Wikipedia articles.) The third type is mined from Web-user browsing data, based on the assumption that two Wikipedia articles co-occurring in the same browsing session are related. We used approximately one year of Yahoo! user data in our experiments. A number of different metrics are used to measure the relatedness. For example, we apply the Google distance algorithm (Milne and Witten, 2008b) to Wikipedia links to calculate the Wikipedia link-based relatedness, and use mutual information for the browsing-session-based relatedness. In summary, we represent e by a related-entry vector E_r for each type of relatedness, in which each element is the relatedness score between e and a related entry.

Representation of s: We represent a surface form's context as a vector, then calculate a context-dependent feature for a pair (s, e) by applying a similarity function Sim to two vectors. Here are examples of context representation. 1) s is represented by a word vector S_w and a surface form vector S_s, in which each element is the occurrence count of a word or a surface form surrounding s. We calculate each vector's similarity with the background and co-occurrence representations of e, which results in Sim(S_w, E_bw), Sim(S_w, E_cw), Sim(S_s, E_bs) and Sim(S_s, E_cs). 2) s is represented by a Wikipedia entry vector S_e, in which each element is a Wikipedia entry to which a surrounding surface form could resolve. We calculate its similarity with the relatedness representation of e, which results in Sim(S_e, E_r).

In the above description, similarity is calculated by dot product or in a summation-of-maximum fashion. In our experiments we extracted surrounding words and surface forms for s from the whole document or from a text window of 55 tokens centered on s, which resulted in two sets of features. We created around 50 context-dependent features in total.
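
To make the similarity functions concrete, here is one plausible reading of the two ways Sim can be computed (dot product and summation-of-maximum). The vectors and the toy term-similarity table below are made-up illustrations, not mined values.

def dot_similarity(v1, v2):
    """Dot product between two sparse count vectors, e.g. Sim(S_w, E_bw)."""
    return sum(w * v2.get(term, 0.0) for term, w in v1.items())

def sum_of_max_similarity(v1, v2, term_sim):
    """Summation-of-maximum: each element of v1 is matched to its most similar
    (weighted) element of v2, and the best matches are summed, e.g. Sim(S_e, E_r)."""
    total = 0.0
    for t1, w1 in v1.items():
        best = max((term_sim(t1, t2) * w2 for t2, w2 in v2.items()), default=0.0)
        total += w1 * best
    return total

S_w  = {"motor": 2, "dearborn": 1, "suv": 1}        # words around the surface form
E_bw = {"motor": 3, "automaker": 2, "dearborn": 1}  # background words of the entry
print(dot_similarity(S_w, E_bw))                    # 2*3 + 1*1 = 7.0

S_e = {"Dearborn, Michigan": 1.0}                   # entries of surrounding surface forms
E_r = {"Henry Ford": 0.9}                           # related-entry vector of the entry
rel = {("Dearborn, Michigan", "Henry Ford"): 0.6}   # toy relatedness table
print(sum_of_max_similarity(S_e, E_r, lambda a, b: rel.get((a, b), 0.0)))  # 0.54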

5.2 Context-independent Features

These features are extracted from data beyond the document containing s. Here are examples. During the process of building the dictionary EntSfDct as described in Section 4, we count how often s maps to e and estimate the probability of s mapping to e for each data source. These are the commonness features. The number of Wikipedia entries that s could map to is a feature describing the ambiguity of s. The string similarity between s and the title of A_e is used as a feature; in our experiments, string similarity was based on word overlap.

5.3 Differentiation Features

Among all surface-form and Wikipedia-entry pairs that contain s, at most one pair gets the positive judgment. Based on this observation, we created differentiation features to represent how (s, e) compares to the other pairs for s. They are derived from the context-dependent and context-independent features described above. For example, we compute the difference between the string similarity for (s, e) and the maximum string similarity over all pairs containing s. The derived feature value is zero if (s, e) has larger string similarity than the other pairs containing s.
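
A differentiation feature of this kind can be derived in one pass over the candidate pairs of a surface form. The sketch below follows the string-similarity example above; the candidate entries and scores are made up.

def differentiation_feature(base_scores):
    """base_scores: {entry: value of one base feature for the pair (s, entry)}.
    Returns {entry: value - max over all candidates of s}; the value is zero for
    the leading candidate and negative for the rest, as described above."""
    if not base_scores:
        return {}
    best = max(base_scores.values())
    return {entry: value - best for entry, value in base_scores.items()}

string_sim = {"Ford Motor Company": 0.9, "Tommy Ford": 0.4, "Ford City": 0.5}
print(differentiation_feature(string_sim))
# {'Ford Motor Company': 0.0, 'Tommy Ford': -0.5, 'Ford City': -0.4}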

6 Experimental Results

In our experiments we used the Wikipedia snapshot of March 6th, 2010. Our dictionary EntSfDct contains 3.5 million Wikipedia entries and 6.5 million surface forms. A training set was created from randomly selected Wikipedia articles using the process described in Section 4.1. We varied the number of Wikipedia articles from 500 to 40,000, but the performance did not increase much beyond 5,000 articles. The experimental results reported in this paper are based on the training set generated from 5,000 articles. It contains around 1.4 million training examples. There are approximately 300,000 surface forms, of which 28,000 are the oow case. Around 400 features were created in total, and 200 of them were selected by GBDT and GBRank to be used in our resolution models.

6.1 Evaluation Datasets

Three datasets from different data sources are used in evaluation.

1) Wikipedia hold-out set. Using the same process as for generating training data, and excluding the surface forms appearing in the training data, we built the hold-out set from approximately 15,000 Wikipedia articles, containing around 600,000 labeled instances. There are 400,000 surface forms, of which 46,000 do not resolve to any Wikipedia entry.

2) MSNBC News test set. This entity disambiguation data set was introduced by Cucerzan (2007). It contains 200 news articles collected from ten MSNBC news categories as of January 2, 2007. Surface forms were manually identified and mapped to Wikipedia entities. The data set contains 756 surface forms. Only 589 of them are contained in our dictionary EntSfDct, mainly because EntSfDct excludes surface forms of out-of-Wikipedia entities and concepts. Since the evaluation task is focused on resolution performance rather than recognition, we exclude the missing surface forms from the labeled example set. The final dataset contains 4,151 labeled instances. There are 589 surface forms, and 40 of them do not resolve to any Wikipedia entry.

3) Yahoo! News set. One limitation of the MSNBC test set is its small size. We built a much larger data set by randomly sampling around 1,000 news articles from Yahoo! News over 2008 and had them manually annotated. The experts first identified person, location and organization names, then mapped each name to a Wikipedia article if the article is about the entity referred to by the name. We did not include more general concepts in this data set, to make the manual effort easier. This data set contains around 100,000 labeled instances. It includes 15,387 surface forms, and 3,532 of them cannot be resolved to any Wikipedia entity. We randomly split the data set into two parts of equal size. One part is used to tune the parameters of GBDT and GBRank and to select thresholds based on the F1 value. The evaluation results presented in this paper are based on the remaining part of the Yahoo! News set.

6.2 Metrics

The possible outcomes from comparing a resolution system's prediction with the ground truth can be categorized into the following types.

- True Positive (TP): the predicted e was correctly referred to by s.
- True Negative (TN): s was correctly predicted as resolving to oow.
- Mismatch (MM): the predicted e was not correctly referred to by s and should have been another entry from EntSfDct.
- False Positive (FP): the predicted e was not correctly referred to by s and should have been oow.
- False Negative (FN): the predicted oow is not correct and should have been an entry from EntSfDct.

Similar to the widely used metrics for classification systems, we use the following metrics to evaluate disambiguation performance:

precision = TP / (TP + FP + MM)
recall = TP / (TP + FN + MM)
F1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FP + TN + FN + MM)

In the Yahoo! News test set, 23.5% of the surface forms do not resolve to any Wikipedia entry, and in the other two test sets the percentage of oow cases is between 10% and 20%. This demonstrates that it is necessary in real-world applications to explicitly measure oow prediction. We propose the following metrics:

precision_oow = TN / (TN + FN)
recall_oow = TN / (TN + FP)
F1_oow = 2 * precision_oow * recall_oow / (precision_oow + recall_oow)
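
The metrics above follow directly from the five outcome counts. The helper below mirrors the formulas, including the oow variants; the counts in the example are made up, not the paper's results.

def resolution_metrics(tp, tn, mm, fp, fn):
    """Compute the evaluation metrics of Section 6.2 from outcome counts."""
    precision = tp / (tp + fp + mm)
    recall = tp / (tp + fn + mm)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn + mm)
    precision_oow = tn / (tn + fn)
    recall_oow = tn / (tn + fp)
    f1_oow = 2 * precision_oow * recall_oow / (precision_oow + recall_oow)
    return {"precision": precision, "recall": recall, "F1": f1, "accuracy": accuracy,
            "precision_oow": precision_oow, "recall_oow": recall_oow, "F1_oow": f1_oow}

print(resolution_metrics(tp=800, tn=150, mm=60, fp=40, fn=50))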

6.3 Evaluation Results

With our training set we trained one resolution model using GBDT (named WikiRes-c) and another resolution model using GBRank (named WikiRes-r). The models were evaluated along with the following systems. 1) Baseline-r: each surface form s is randomly mapped to oow or to a candidate entry for s in EntSfDct. 2) Baseline-p: each surface form s is mapped to the candidate entry e for s with the highest commonness score. The commonness score is a linear combination of the probabilities of s being mapped to e estimated from different data sources; it is among the features used in WikiRes-c and WikiRes-r. 3) Baseline-m: we implemented the approach proposed by Cucerzan (2007) based on our best understanding. Since we use a different version of Wikipedia and a different entity recognition approach, the evaluation result differs from the result presented in their paper, but we believe our implementation follows the algorithm described there.

In Table 1 we present the performance of each system on the Yahoo! News test set and the MSNBC test set. The performance of WikiRes-c and WikiRes-r is computed after we apply the thresholds selected on the tuning set described in Section 6.1. In the upper half of Table 1, the three baselines use the thresholds that lead to the best F1 on the Yahoo! News test set; in the lower half, the three baselines use the thresholds that lead to the best F1 on the MSNBC test set.

Among the three baselines, Baseline-r has the lowest performance. Baseline-m uses a few context-sensitive features and Baseline-p uses a context-independent feature. These two types of features are both useful, but Baseline-p shows better performance, probably because the surface forms in our test sets are dominated by common senses. In our resolution models, these features are combined together with many other features calculated from different large-scale data sources and at different granularity levels. As shown in Table 1, both of our resolution solutions substantially outperform the other systems. Furthermore, WikiRes-c and WikiRes-r have similar performance.

Table 1. Performance (precision, recall, F1, accuracy and significance-test p-value) of Baseline-r, Baseline-p, Baseline-m, WikiRes-r and WikiRes-c on the Yahoo! News test set and the MSNBC test set.

Figure 1. Precision-recall curves on the Yahoo! News test set and the MSNBC test set.

We compared WikiRes-c with each competitor, and from the statistical significance test results in the last column of Table 1 we see that on the Yahoo! News test set WikiRes-c significantly outperforms the others. The p-values for the MSNBC test set are much higher than for the Yahoo! News test set because the MSNBC test set is much smaller. Addressing this point, we note that the F1 values of WikiRes on the MSNBC test set and on the Yahoo! News test set differ by only a couple of percentage points, although these test sets were created independently. This suggests the objectivity of our method for creating the Yahoo! News test set and provides a way to measure resolution model performance on what would occur in a general news corpus in a statistically significant manner.

In Figure 1 we present the precision-recall curves on the Yahoo! News and MSNBC test sets. We see that our resolution models are substantially better than the other two baselines at any particular precision or recall value on both test sets. Baseline-r is not included in the comparison since it does not have a tradeoff between precision and recall. We find that the precision-recall curve of WikiRes-r is very similar to that of WikiRes-c in the lower precision area, but its recall is much lower than the other systems' after precision reaches around 90%. So, in Figure 1 the curves of WikiRes-r are truncated in the high precision area.

In Table 2 we compare the performance of out-of-Wikipedia prediction. The comparison is done on the Yahoo! News test set only, since there are only 40 oow surface forms in the MSNBC test set. Each system's threshold is the same as that used for the upper half of Table 1. The results show that our models have substantially higher precision and recall than Baseline-p and Baseline-m. From the statistical significance test results in the last column, we can see that WikiRes-c significantly outperforms Baseline-p and Baseline-m. Also, our current approaches still have room to improve in the area of out-of-Wikipedia prediction.

Table 2. Performance (precision, recall, F1 and p-value) of out-of-Wikipedia prediction for Baseline-p, Baseline-m, WikiRes-r and WikiRes-c on the Yahoo! News test set.

We also evaluated our models on the Wikipedia hold-out set. The model performance is higher than that obtained on the previous two test sets because the hold-out set is more similar to the training data source itself. Again, our models perform better than the others. From the feature importance lists of our GBDT model and GBRank model, we find that the commonness features, the features based on Wikipedia entries' co-occurrence representation, and the corresponding differentiation features are the most important.

7 Conclusions

We have described a method of learning to resolve surface forms to Wikipedia entries. Using this method we can enrich unstructured documents with structured knowledge from Wikipedia, the largest knowledge base in existence. The enrichment makes it possible to represent a document as a machine-readable network of senses instead of just a bag of words. This can supply critical semantic information useful for next-generation information retrieval systems and other text mining applications.

Our resolution models use an extensive set of novel features and are trained with a machine learning approach that depends only on a purely automated training data generation facility. Our methodology can be applied to any other language for which Wikipedia and Web data are available (after modifying the simple capitalization rules in Section 4.1). Our resolution models can be easily and quickly retrained with updated data when Wikipedia and the relevant Web data change.

For future work, it will be important to investigate other approaches to better predict oow. Adding global constraints on resolutions of the same term at multiple locations in the same document may also be important. Of course, developing new features (such as part-of-speech, named entity type, etc.) and improving training data quality is always critical, especially for social content sources such as Twitter. Finally, directly demonstrating the degree of applicability to other languages is interesting, especially when accounting for the fact that the quality of Wikipedia varies across languages.

References

Bagga, Amit and Breck Baldwin. 1998. Entity-based cross-document coreferencing using the Vector Space Model. Proceedings of the 17th International Conference on Computational Linguistics.

Bunescu, Razvan and Marius Paşca. 2006. Using Encyclopedic Knowledge for Named Entity Disambiguation. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2006).

Cucerzan, Silviu. 2007. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Fleischman, Michael Ben and Eduard Hovy. 2004. Multi-Document Person Name Resolution. Proceedings of the Association for Computational Linguistics.

Friedman, J. H. 2001. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38.

Han, Xianpei and Jun Zhao. 2009. Named Entity Disambiguation by Leveraging Wikipedia Semantic Knowledge. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Mann, Gideon S. and David Yarowsky. 2003. Unsupervised Personal Name Disambiguation. Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.

Milne, David and Ian H. Witten. 2008a. Learning to Link with Wikipedia. Proceedings of the ACM Conference on Information and Knowledge Management (CIKM 2008).

Milne, David and Ian H. Witten. 2008b. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. Proceedings of the First AAAI Workshop on Wikipedia and Artificial Intelligence.

Pedersen, Ted, Amruta Purandare and Anagha Kulkarni. 2005. Name Discrimination by Clustering Similar Contexts. Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics.

Ravin, Y. and Z. Kazi. 1999. Is Hillary Rodham Clinton the President? Association for Computational Linguistics Workshop on Coreference and its Applications.

Yarowsky, David. 1995. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics.

Zheng, Zhaohui, K. Chen, G. Sun, and H. Zha. 2007. A regression framework for learning ranking functions using relative relevance judgments. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.


More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15 CS434a/541a: Pattern Recognton Prof. Olga Veksler Lecture 15 Today New Topc: Unsupervsed Learnng Supervsed vs. unsupervsed learnng Unsupervsed learnng Net Tme: parametrc unsupervsed learnng Today: nonparametrc

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

Semantic Image Retrieval Using Region Based Inverted File

Semantic Image Retrieval Using Region Based Inverted File Semantc Image Retreval Usng Regon Based Inverted Fle Dengsheng Zhang, Md Monrul Islam, Guoun Lu and Jn Hou 2 Gppsland School of Informaton Technology, Monash Unversty Churchll, VIC 3842, Australa E-mal:

More information

Experiments in Text Categorization Using Term Selection by Distance to Transition Point

Experiments in Text Categorization Using Term Selection by Distance to Transition Point Experments n Text Categorzaton Usng Term Selecton by Dstance to Transton Pont Edgar Moyotl-Hernández, Héctor Jménez-Salazar Facultad de Cencas de la Computacón, B. Unversdad Autónoma de Puebla, 14 Sur

More information

Personalized Concept-Based Clustering of Search Engine Queries

Personalized Concept-Based Clustering of Search Engine Queries IEEE TRANSACTIONS ON JOURNAL NAME, MANUSCRIPT ID 1 Personalzed Concept-Based Clusterng of Search Engne Queres Kenneth Wa-Tng Leung, Wlfred Ng, and Dk Lun Lee Abstract The exponental growth of nformaton

More information

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK L-qng Qu, Yong-quan Lang 2, Jng-Chen 3, 2 College of Informaton Scence and Technology, Shandong Unversty of Scence and Technology,

More information

A Misclassification Reduction Approach for Automatic Call Routing

A Misclassification Reduction Approach for Automatic Call Routing A Msclassfcaton Reducton Approach for Automatc Call Routng Fernando Uceda-Ponga 1, Lus Vllaseñor-Pneda 1, Manuel Montes-y-Gómez 1, Alejandro Barbosa 2 1 Laboratoro de Tecnologías del Lenguaje, INAOE, Méxco.

More information

Information Retrieval

Information Retrieval Anmol Bhasn abhasn[at]cedar.buffalo.edu Moht Devnan mdevnan[at]cse.buffalo.edu Sprng 2005 #$ "% &'" (! Informaton Retreval )" " * + %, ##$ + *--. / "#,0, #'",,,#$ ", # " /,,#,0 1"%,2 '",, Documents are

More information

Learning Semantics-Preserving Distance Metrics for Clustering Graphical Data

Learning Semantics-Preserving Distance Metrics for Clustering Graphical Data Learnng Semantcs-Preservng Dstance Metrcs for Clusterng Graphcal Data Aparna S. Varde, Elke A. Rundenstener Carolna Ruz Mohammed Manruzzaman,3 Rchard D. Ssson Jr.,3 Department of Computer Scence Center

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University CAN COMPUTERS LEARN FASTER? Seyda Ertekn Computer Scence & Engneerng The Pennsylvana State Unversty sertekn@cse.psu.edu ABSTRACT Ever snce computers were nvented, manknd wondered whether they mght be made

More information

Alignment Results of SOBOM for OAEI 2010

Alignment Results of SOBOM for OAEI 2010 Algnment Results of SOBOM for OAEI 2010 Pegang Xu, Yadong Wang, Lang Cheng, Tany Zang School of Computer Scence and Technology Harbn Insttute of Technology, Harbn, Chna pegang.xu@gmal.com, ydwang@ht.edu.cn,

More information

Discriminative Dictionary Learning with Pairwise Constraints

Discriminative Dictionary Learning with Pairwise Constraints Dscrmnatve Dctonary Learnng wth Parwse Constrants Humn Guo Zhuoln Jang LARRY S. DAVIS UNIVERSITY OF MARYLAND Nov. 6 th, Outlne Introducton/motvaton Dctonary Learnng Dscrmnatve Dctonary Learnng wth Parwse

More information

Pruning Training Corpus to Speedup Text Classification 1

Pruning Training Corpus to Speedup Text Classification 1 Prunng Tranng Corpus to Speedup Text Classfcaton Jhong Guan and Shugeng Zhou School of Computer Scence, Wuhan Unversty, Wuhan, 430079, Chna hguan@wtusm.edu.cn State Key Lab of Software Engneerng, Wuhan

More information

Empirical Exploitation of Click Data for Query-Type-Based Ranking

Empirical Exploitation of Click Data for Query-Type-Based Ranking Emprcal Explotaton of Clck Data for Query-Type-Based Rankng Anle Dong Y Chang Shhao J Cya Lao Xn L Zhaohu Zheng Yahoo! Labs 701 Frst Avenue Sunnyvale, CA 94089 {anle,ychang,shhao,cyalao,xnl,zhaohu}@yahoo-nc.com

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

Ontology Mapping: As a Binary Classification Problem

Ontology Mapping: As a Binary Classification Problem Fourth Internatonal Conference on Semantcs, Knowledge and Grd Ontology Mappng: As a Bnary Classfcaton Problem Mng Mao SAP Research mng.mao@sap.com Yefe Peng Yahoo! ypeng@yahoo-nc.com Mchael Sprng U. of

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

A Gradient Difference based Technique for Video Text Detection

A Gradient Difference based Technique for Video Text Detection A Gradent Dfference based Technque for Vdeo Text Detecton Palaahnakote Shvakumara, Trung Quy Phan and Chew Lm Tan School of Computng, Natonal Unversty of Sngapore {shva, phanquyt, tancl }@comp.nus.edu.sg

More information