Cross-Lingual Taxonomy Alignment with Bilingual Biterm Topic Model

Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

Tianxing Wu 1, Guilin Qi 1, Haofen Wang 2, Kang Xu 1 and Xuan Cui 1
1 Key Laboratory of Computer Network and Information Integration of State Education Ministry, School of Computer Science and Engineering, Southeast University, China
{wutianxing, gqi, kxu, xcui}@seu.edu.cn
2 East China University of Science & Technology, China
whfcarter@ecust.edu.cn

Copyright 2016, Association for the Advancement of Artificial Intelligence. All rights reserved.

Abstract

As more and more multilingual knowledge becomes available on the Web, knowledge sharing across languages has become an important task to benefit many applications. One of the most crucial kinds of knowledge on the Web is taxonomy, which is used to organize and classify the Web data. To facilitate knowledge sharing across languages, we need to deal with the problem of cross-lingual taxonomy alignment, which discovers the most relevant category in the target taxonomy of one language for each category in the source taxonomy of another language. Current approaches for aligning cross-lingual taxonomies strongly rely on domain-specific information and the features based on string similarities. In this paper, we present a new approach to deal with the problem of cross-lingual taxonomy alignment without using any domain-specific information. We first identify the candidate matched categories in the target taxonomy for each category in the source taxonomy using the cross-lingual string similarity. We then propose a novel bilingual topic model, called Bilingual Biterm Topic Model (BBTM), to perform exact matching. BBTM is trained by the textual contexts extracted from the Web. We conduct experiments on two kinds of real world datasets. The experimental results show that our approach significantly outperforms the designed state-of-the-art comparison methods.

Introduction

With the advent of more and more multilingual resources, the Web has become a global information space. Thus, sharing knowledge across languages has become an important and challenging task.
One of the most crucial kinds of knowledge is taxonomy, which refers to a hierarchy of categories that entities are classified to (Prytkova, Weikum, and Spaniol 2015). Different kinds of taxonomies are everywhere on the Web, such as Web site directories (e.g. Yahoo Directory and Dmoz.org) and product catalogues (e.g. ebay.com and Google Product Taxonomy). To facilitate knowledge sharing across languages, we need to deal with the problem of cross-lingual taxonomy alignment, which is the task of discovering the most relevant category in the target taxonomy of one language for each category in the source taxonomy of another language. Cross-lingual taxonomy alignment not only contributes to globalizing knowledge sharing, but also benefits many applications, such as cross-lingual information retrieval (Potthast, Stein, and Anderka 2008; Nguyen et al. 2009) and multilingual knowledge base construction (Lehmann et al. 2014; Mahdisoltani, Biega, and Suchanek 2014).

The key step of aligning cross-lingual taxonomies is to measure the relevance between one category in the source taxonomy and another one in the target taxonomy. Once all the relevance scores have been determined, we can obtain the most relevant category in the target taxonomy for each category in the source taxonomy in an unsupervised way. However, since categories are described in different languages, traditional monolingual similarity metrics are not suitable in cross-lingual scenarios.

In order to overcome this problem, several approaches have been proposed. The work given in (Spohr, Hollink, and Cimiano 2011) first translates cross-lingual taxonomies into monolingual taxonomies, and then captures the linguistic features and structural features that rely on string similarities to predict the relevance score between categories. However, the translated label of a category in the source taxonomy may be dissimilar to its matched category in the target taxonomy.
For example, a Chinese category in JD.com is translated to "Outdoor/Sportswear" by Google Translate, but the translated string is totally different from that of its matched category "Athletic Apparel" in ebay.com. Thus, the features that rely on string similarities are insufficient to decide the relevance score between two categories of different languages, due to different language habits and improper translations. Another work (Prytkova, Weikum, and Spaniol 2015) tries to solve this problem by referring to Wikipedia. It strongly relies on domain-specific information (i.e. book instances) to map original categories in the book domain onto Wikipedia categories. Categories of different languages can then be directly compared using interwiki links. However, this approach cannot be easily extended to other kinds of taxonomies, because instance information is often unavailable.

In this paper, we study the problem of cross-lingual taxonomy alignment. The problem is non-trivial and poses the following challenges:

Feature. When domain-specific information is unavailable, the existing approach (Spohr, Hollink, and Cimiano 2011) only depends on string similarities to capture different kinds of features, resulting in rather poor performance. Since vector similarities have achieved great success in natural language processing tasks (Denhière and Lemaire 2004; Gabrilovich and Markovitch 2007; Li, Ji, and Yan 2015), can we introduce a new powerful feature which relies on vector similarities?

Representation. Vector similarities are based on rich textual information, but categories do not contain such information. Can we find a way to enrich the representation of categories with textual information? Can this new representation reveal the real meaning of each category, especially ambiguous categories?

Approach. The features that depend on string similarities do not work well in cross-lingual taxonomy alignment, but they still have positive impacts. Can we design an approach using both the features that rely on vector similarities and the ones that depend on string similarities?

To address the above challenges, we propose a new approach to cross-lingual taxonomy alignment that does not use any domain-specific information. Firstly, we identify the candidate matched categories in the target taxonomy for each category in the source taxonomy using a new linguistic feature, i.e. cross-lingual string similarity. Then, we propose a novel bilingual topic model, called Bilingual Biterm Topic Model (BBTM), to obtain the topic vector of the context for each category. BBTM is trained by the textual contexts extracted from the Web. Finally, the relevance score between each category in the source taxonomy and its candidate matched categories is computed as the cosine similarity between topic vectors. The experiments on two real world datasets show that our approach significantly outperforms the designed state-of-the-art comparison methods.

The rest of this paper is organized as follows. Section 2 outlines some related work. Section 3 introduces the proposed approach in detail.
Section 4 presents the experimental results, and finally Section 5 concludes this work and describes the future work.

Related Work

In this section, we review some related work on schema matching and multilingual knowledge alignment.

Schema Matching

Schema matching aims at identifying semantic correspondences between two schemas, including database schemas and ontologies (Do and Rahm 2007). (Rahm and Bernstein 2001; Berlin and Motro 2002; Do, Melnik, and Rahm 2003) introduce different methods for matching database schemas. However, the setting of matching database schemas is quite different from aligning heterogeneous taxonomies due to the difference in size and structure. Ontology matching (Shvaiko and Euzenat 2006) solves the problem of finding relationships (e.g. equivalence, subsumption) between discrete entities of ontologies, including classes, properties, etc. There exists plenty of work (Euzenat and Shvaiko 2007) on matching different kinds of ontologies. Several systems (Jiménez-Ruiz et al. 2014; Faria et al. 2014) can match multilingual ontologies with features that rely only on string similarities after machine translation, but their performance is not good. The difference between our work and ontology matching is that we focus on aligning general taxonomies rather than standard ontologies. Ontologies usually carry more internal information, such as properties, functions, axioms, etc. However, the taxonomies handled in this work do not contain such information to assist the matching operation. Another kind of related work is catalog integration (Agrawal and Srikant 2001; Ichise, Takeda, and Honiden 2003; Wang et al. 2014), but these methods do not align taxonomies in cross-lingual scenarios.

Multilingual Knowledge Alignment

There exists some work on multilingual knowledge alignment. (Wang et al. 2012) proposed a linkage factor graph model to link articles from English Wikipedia to those in Baidu Baike. They further proposed a concept annotation method and a regression-based learning model to iteratively predict new cross-lingual links (Wang, Li, and Tang 2013).
X-LiSA (Zhang and Rettinger 2014) is a semantic annotation system, which can annotate text documents and web pages in different languages using resources from Wikipedia and Linked Open Data. Different from our work, all of the above work only focuses on aligning cross-lingual data-level knowledge, i.e. cross-lingual entity linking.

The most relevant work is (Spohr, Hollink, and Cimiano 2011; Prytkova, Weikum, and Spaniol 2015). Both of them depend on domain-specific information and the features based on string similarities to align cross-lingual taxonomies in specific domains. Here, we focus on aligning more general cross-lingual and cross-domain taxonomies, such as product catalogues and Web site directories, without domain-specific information.

The Proposed Approach

In this section, we present our proposed approach in detail, which consists of two main steps: candidates identification and exact matching.

Candidates Identification

To avoid unnecessary comparisons of the categories between two given taxonomies, we aim to obtain all the possible matched categories in the target taxonomy for each category in the source taxonomy. The output of this step is taken as the input of exact matching.

The simplest way to represent a category is using its category label. However, this may not be desirable for category matching since synonymous categories may have totally different labels. For example, "Sports Clothing" and "Athletic Apparel" do not share any word, and the mismatch is even more severe for two synonymous categories of different languages. Besides, directly comparing the translated category labels of the same language also has limitations due to different language habits and improper translations.

To avoid the above problems, we capture cross-lingual string similarities between categories at the word level by using BabelNet (Navigli and Ponzetto 2010), a Web-scale multilingual synonym thesaurus. The key idea of candidates identification is that two categories of different languages may be relevant if they share the same or synonymous words. Given a category $c^s$ in the source taxonomy of language $s$ and a category $c^t$ in the target taxonomy of language $t$, each of them is segmented into a set of words. After removing stop words, $c^s$ and $c^t$ contain the sets of words $W_{c^s} = \{w_i^s\}_{i=1}^{m}$ and $W_{c^t} = \{w_j^t\}_{j=1}^{n}$, respectively. For each word $w_i^s$, we get its synonymous words of language $s$ and $t$ from BabelNet and add them to $W_{c^s}$. The same process is also applied to each word $w_j^t$, which gives the new $W_{c^t}$. The cross-lingual string similarity between $c^s$ and $c^t$ is defined as:

$$
CSS(c^s, c^t) =
\begin{cases}
1, & \text{if } W_{c^s} \cap W_{c^t} \neq \emptyset \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$

If $CSS(c^s, c^t)$ equals 1, $c^t$ will be taken as one of the candidate matched categories of $c^s$. Figure 1 illustrates why "Athletic Apparel" in ebay.com is taken as the candidate matched category of a category in JD.com.

Figure 1: An Example of Candidates Identification

Textual Context Extraction

Since categories do not have textual information to describe themselves, vector similarities cannot be applied to category matching until we find a way to enrich the representation of categories with textual information. There might exist some Web pages which contain the textual information associated with a given category, but manually finding appropriate Web pages is unrealistic. Therefore, we choose to acquire textual information by querying the Web with the search engine Google.

Categories with the same label in different structures may have different meanings. For example, category "Sports" occurs twice in Yahoo Directory. One is the child of category "Shopping and Services", which means sports goods, and the other is the child of category "Recreation", representing kinds of physical activities. However, the results (i.e. titles, snippets and URLs) returned by submitting only their own labels to Google are the same. To accurately get the relevant textual information returned by Google for each category, the labels of the given category $c$ and its parent category $pc$ are jointly submitted to Google. For example, in Yahoo Directory, the label of category "Sports" representing sports goods is submitted to Google jointly with its parent category label "Shopping and Services". In each returned snippet, we extract the words that co-occur with $c$ in the same sentence, except $pc$: because $pc$ is part of the query, it occurs very frequently. These extracted words are taken as the textual context to better reveal the meaning of the given category. Note that the root categories do not have parent categories, but they are usually unambiguous; otherwise users would be easily confused when exploring the taxonomy in a top-down manner. Thus, we simply submit the label of each root category to Google to get its textual context.

Exact Matching

After utilizing a linguistic feature (i.e. the cross-lingual string similarity) for candidates identification, we aim to perform exact matching by determining the relevance score between each category in the source taxonomy and its candidate matched categories in the target taxonomy with one or more features based on vector similarities.

The bag-of-words (BOW) model is the most common method to model text. However, the textual context of each category is extracted from snippets that vary a lot in wording style, because a snippet may be a tweet or a piece of news with more formal language expressions. For each category, it may be the case that the words extracted from different snippets are totally different, which means that many of the words are of low frequency. Therefore, the BOW model may not work well in this scenario.

To address the above problem, we try to discover the topics of the extracted textual contexts with a bilingual topic model. The textual context of each category is actually a set of short text documents extracted from the snippets.
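The candidates identification step built on the cross-lingual string similarity of Eq. (1) can be illustrated with a minimal Python sketch. It assumes that category labels are already segmented into words and replaces BabelNet with a tiny hand-written synonym dictionary; the dictionary entries, the stop-word list and the function names `expand`, `css` and `candidates` are all invented for illustration and are not part of the original approach.

```python
# Toy stand-in for BabelNet: each word maps to synonyms in both languages.
# All entries are invented for illustration; a real system would query BabelNet.
TOY_SYNONYMS = {
    "运动": {"sports", "athletic"},
    "服饰": {"clothing", "apparel"},
    "athletic": {"sports", "运动"},
    "apparel": {"clothing", "服饰"},
}
STOP_WORDS = {"and", "of", "the"}  # hypothetical stop-word list


def expand(words, synonyms=TOY_SYNONYMS, stop_words=STOP_WORDS):
    """Drop stop words, then add every synonym (of either language) of each word."""
    expanded = set()
    for word in words:
        if word in stop_words:
            continue
        expanded.add(word)
        expanded |= synonyms.get(word, set())
    return expanded


def css(source_words, target_words):
    """Eq. (1): 1 if the expanded word sets share at least one word, else 0."""
    return 1 if expand(source_words) & expand(target_words) else 0


def candidates(source_words, target_taxonomy):
    """Return the labels of all target categories whose CSS with the source is 1."""
    return [label for label, words in target_taxonomy.items()
            if css(source_words, words) == 1]


# Hypothetical segmented labels of two target categories.
target_taxonomy = {
    "Athletic Apparel": ["athletic", "apparel"],
    "Books": ["books"],
}
matched = candidates(["运动", "服饰"], target_taxonomy)
```

Because the similarity is binary, this step only prunes the search space; ranking the surviving candidates is left to exact matching.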
Given a short text document $d^s$ of language $s$, we first translate it into the document $d^t$ of language $t$ with Google Translate, and then construct a pair of bilingual documents $(d^s, d^t)$. After applying the same process to all the documents of language $s$, a paired bilingual document corpus $\{(d_i^s, d_i^t)\}_{i=1}^{N_d}$ will be generated. Then we can directly apply a widely used bilingual topic model, i.e. the Bilingual Latent Dirichlet Allocation (BLDA) model (Vulić et al. 2015), to model the corpus as a generation process (see Figure 3 (a)). However, this model will suffer from the data sparsity problem in short text documents (Hong and Davison 2010). Hence, we propose a new bilingual topic model, called Bilingual Biterm Topic Model (BBTM), to explicitly model the word co-occurrence in each pair of bilingual short text documents. BBTM can not only avoid the problems caused by applying the BOW model or BLDA, but also better uncover the topics of textual contexts for exact matching.

a) Bilingual Biterm Topic Model

BBTM is an extension of the Biterm Topic Model (BTM) (Yan et al. 2013; Cheng et al. 2014) for modeling the generation of biterms. The key idea is that if two words co-occur more frequently, they are more likely to belong to the same topic. Different from BTM, a biterm used in BBTM denotes an unordered word pair co-occurring in a pair of bilingual documents. Any two distinct words in a pair of bilingual documents construct a biterm. For example, given a pair of bilingual documents $(d^s, d^t)$, in which $d^s$ and $d^t$ respectively consist of $n$ distinct words of language $s$ and $m$ distinct words of language $t$, a total of $C_n^2 + C_m^2 + m \cdot n$ biterms will be generated. Figure 2 gives an example of biterm generation.

Figure 2: An Example of Biterm Generation

BBTM assumes that all biterms extracted from the whole corpus share the same topic distribution, and each topic consists of two discrete distributions over words of different languages. Given a paired bilingual document corpus, suppose it contains $B$ biterms $\mathbf{B} = \mathbf{B}^s \cup \mathbf{B}^{st} \cup \mathbf{B}^t = \{b_i^s\}_{i=1}^{B_s} \cup \{b_i^{st}\}_{i=1}^{B_{st}} \cup \{b_i^t\}_{i=1}^{B_t}$, with $b_i^s = (w_{i,1}^s, w_{i,2}^s)$ where each word is in language $s$, $b_i^{st} = (w_{i,1}^s, w_{i,2}^t)$ where the two words are in different languages, and $b_i^t = (w_{i,1}^t, w_{i,2}^t)$ where each word is in language $t$, as well as $K$ topics expressed over $W_s$ and $W_t$ distinct words of language $s$ and language $t$ respectively. The topic indicator variable $z \in [1, K]$ can be denoted as $z^s$, $z^{st}$ and $z^t$ for the three kinds of biterms. We represent the topics in the corpus by a $K$-dimensional multinomial distribution $\theta = \{\theta_k\}_{k=1}^{K}$ with $\theta_k = P(z = k)$. The word distributions of language $s$ and language $t$ are respectively represented by a $K \times W_s$ matrix $\varphi^s$ and a $K \times W_t$ matrix $\varphi^t$, where the $k$th rows $\varphi_k^s$ and $\varphi_k^t$ are respectively a $W_s$-dimensional multinomial distribution with entry $\varphi^s_{k,w^s} = P(w^s \mid z = k)$ and a $W_t$-dimensional multinomial distribution with entry $\varphi^t_{k,w^t} = P(w^t \mid z = k)$. Following the convention of BTM, the hyperparameters $\alpha$ and $\beta$ are the symmetric Dirichlet priors. Figure 3 (b) shows the graphical representation of BBTM and its generative process is described in Algorithm 1. Using BBTM, the probability of generating the whole corpus given hyperparameters $\alpha$ and $\beta$ can be expressed as:

$$
\begin{aligned}
P(\mathbf{B} \mid \alpha, \beta) ={}& \prod_{i=1}^{B_s} \iint \sum_{k=1}^{K} \theta_k \, \varphi^s_{k,w^s_{i,1}} \varphi^s_{k,w^s_{i,2}} \, d\theta \, d\varphi^s \\
&\times \prod_{i=1}^{B_{st}} \iiint \sum_{k=1}^{K} \theta_k \, \varphi^s_{k,w^s_{i,1}} \varphi^t_{k,w^t_{i,2}} \, d\theta \, d\varphi^s \, d\varphi^t \\
&\times \prod_{i=1}^{B_t} \iint \sum_{k=1}^{K} \theta_k \, \varphi^t_{k,w^t_{i,1}} \varphi^t_{k,w^t_{i,2}} \, d\theta \, d\varphi^t
\end{aligned}
\tag{2}
$$

b) Parameters Estimation

Since it is intractable to exactly solve the coupled parameters $\theta$, $\varphi^s$ and $\varphi^t$ by maximizing the likelihood in Eq. (2), we adopt collapsed Gibbs sampling (Liu 1994) to resolve this problem. $\theta$, $\varphi^s$ and $\varphi^t$ can be integrated out due to the use of conjugate priors.
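As a concrete illustration of the biterm construction described earlier in this section, the following minimal Python sketch enumerates the three kinds of biterms for one pair of bilingual documents and checks the count $C_n^2 + C_m^2 + m \cdot n$. The function name `bilingual_biterms` and the example words are assumptions made for illustration; documents are assumed to be already segmented and stripped of stop words.

```python
from itertools import combinations, product
from math import comb


def bilingual_biterms(doc_s, doc_t):
    """Build the biterms of one bilingual document pair (d^s, d^t).

    Returns three lists: same-language pairs in s, cross-lingual pairs,
    and same-language pairs in t. Only distinct words are paired.
    """
    words_s = sorted(set(doc_s))
    words_t = sorted(set(doc_t))
    b_s = list(combinations(words_s, 2))   # unordered pairs within language s
    b_t = list(combinations(words_t, 2))   # unordered pairs within language t
    b_st = list(product(words_s, words_t))  # one word from each language
    return b_s, b_st, b_t


# Invented example: n = 3 distinct Chinese words, m = 2 distinct English words.
doc_s = ["运动", "服饰", "户外", "运动"]
doc_t = ["sports", "apparel"]
b_s, b_st, b_t = bilingual_biterms(doc_s, doc_t)
n, m = len(set(doc_s)), len(set(doc_t))
total = len(b_s) + len(b_st) + len(b_t)
```

The cross-lingual pairs are what distinguish this construction from monolingual BTM: they tie the two vocabularies to shared topics.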
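Figure 3: Graphical representation of BLDA and BBTM: (a) BLDA, (b) BBTM

Algorithm 1: Generative Process of BBTM
    initialize: (1) set the number of topics $K$; (2) set values for Dirichlet priors $\alpha$ and $\beta$;
    sample: $K$ times $\varphi^s \sim Dir(\beta)$;
    sample: $K$ times $\varphi^t \sim Dir(\beta)$;
    sample: $\theta \sim Dir(\alpha)$ for all biterms;
    foreach biterm $b_i^s \in \mathbf{B}^s$ do
        sample: $z_i^s \sim Mult(\theta)$;
        sample: $w_{i,1}^s, w_{i,2}^s \sim Mult(\varphi^s_{z_i^s})$
    foreach biterm $b_i^{st} \in \mathbf{B}^{st}$ do
        sample: $z_i^{st} \sim Mult(\theta)$;
        sample: $w_{i,1}^s \sim Mult(\varphi^s_{z_i^{st}})$, $w_{i,2}^t \sim Mult(\varphi^t_{z_i^{st}})$
    foreach biterm $b_i^t \in \mathbf{B}^t$ do
        sample: $z_i^t \sim Mult(\theta)$;
        sample: $w_{i,1}^t, w_{i,2}^t \sim Mult(\varphi^t_{z_i^t})$

Because $\theta$, $\varphi^s$ and $\varphi^t$ are integrated out, we only need to sample the topic of each biterm. Due to space limits, we only show the derived Gibbs sampling formulas for $b_i^s \in \mathbf{B}^s$, $b_i^{st} \in \mathbf{B}^{st}$ and $b_i^t \in \mathbf{B}^t$ as follows:

$$
P(z_i^s = k \mid \mathbf{z}_{\neg b_i^s}, \mathbf{B}) \propto (n_{\neg b_i^s,k} + \alpha) \, \frac{(n_{\neg b_i^s, w_{i,1}^s|k} + \beta)(n_{\neg b_i^s, w_{i,2}^s|k} + \beta)}{(n_{\neg b_i^s, \cdot^s|k} + 1 + W_s\beta)(n_{\neg b_i^s, \cdot^s|k} + W_s\beta)}
\tag{3}
$$

$$
P(z_i^{st} = k \mid \mathbf{z}_{\neg b_i^{st}}, \mathbf{B}) \propto (n_{\neg b_i^{st},k} + \alpha) \, \frac{(n_{\neg b_i^{st}, w_{i,1}^s|k} + \beta)(n_{\neg b_i^{st}, w_{i,2}^t|k} + \beta)}{(n_{\neg b_i^{st}, \cdot^s|k} + W_s\beta)(n_{\neg b_i^{st}, \cdot^t|k} + W_t\beta)}
\tag{4}
$$

$$
P(z_i^t = k \mid \mathbf{z}_{\neg b_i^t}, \mathbf{B}) \propto (n_{\neg b_i^t,k} + \alpha) \, \frac{(n_{\neg b_i^t, w_{i,1}^t|k} + \beta)(n_{\neg b_i^t, w_{i,2}^t|k} + \beta)}{(n_{\neg b_i^t, \cdot^t|k} + 1 + W_t\beta)(n_{\neg b_i^t, \cdot^t|k} + W_t\beta)}
\tag{5}
$$

where $\mathbf{z}_{\neg b}$ denotes the topic assignments for all biterms except the biterm $b$, $n_{\neg b,k}$ is the number of biterms assigned to topic $k$ excluding $b$, $n_{\neg b, w^s|k}$ is the number of times word $w^s$ of language $s$ is assigned to topic $k$ excluding $b$, $n_{\neg b, w^t|k}$ is the number of times word $w^t$ of language $t$ is assigned to topic $k$ excluding $b$, and $n_{\neg b, \cdot^s|k} = \sum_{w^s} n_{\neg b, w^s|k}$ as well as $n_{\neg b, \cdot^t|k} = \sum_{w^t} n_{\neg b, w^t|k}$.

After a sufficient number of iterations, we can estimate the global topic distribution $\theta$ and the topic-word distributions $\varphi^s$, $\varphi^t$ by

$$
\theta_k = \frac{\alpha + n_k}{K\alpha + B}
\tag{6}
$$

$$
\varphi^s_{k,w^s} = \frac{\beta + n_{w^s|k}}{W_s\beta + n_{\cdot^s|k}}
\tag{7}
$$

$$
\varphi^t_{k,w^t} = \frac{\beta + n_{w^t|k}}{W_t\beta + n_{\cdot^t|k}}
\tag{8}
$$

where $n_k$ is the number of biterms assigned to topic $k$, $n_{w^s|k}$ is the number of times word $w^s$ of language $s$ is assigned to topic $k$, $n_{w^t|k}$ is the number of times word $w^t$ of language $t$ is assigned to topic $k$, and $n_{\cdot^s|k} = \sum_{w^s} n_{w^s|k}$ as well as $n_{\cdot^t|k} = \sum_{w^t} n_{w^t|k}$.

c) Context Topics Inference

To perform exact matching, we need to know the topic distribution of the context for each category. Given a category $c$, suppose it contains $N_c$ biterms $\{b_j\}_{j=1}^{N_c}$, which are extracted from all the pairs of bilingual documents of $c$. We utilize the following formula (Yan et al. 2013) to infer the topic distribution of the context for $c$:

$$
P(z = k \mid c) = \sum_{j=1}^{N_c} P(z = k \mid b_j) \, P(b_j \mid c)
\tag{9}
$$

In Eq. (9), $P(b_j \mid c)$ is estimated by the empirical distribution:

$$
P(b_j \mid c) = \frac{n(b_j)}{\sum_{j'=1}^{N_c} n(b_{j'})}
\tag{10}
$$

where $n(b_j)$ is the frequency of biterm $b_j$ in all the pairs of bilingual documents of $c$. Meanwhile, $P(z = k \mid b_j)$ can be computed via Bayes' formula based on the parameters learned in BBTM:

$$
P(z = k \mid b_j) =
\begin{cases}
\dfrac{\theta_k \, \varphi^s_{k,w^s_{j,1}} \varphi^s_{k,w^s_{j,2}}}{\sum_{k'=1}^{K} \theta_{k'} \, \varphi^s_{k',w^s_{j,1}} \varphi^s_{k',w^s_{j,2}}}, & \text{if } b_j \in \mathbf{B}^s \\[2ex]
\dfrac{\theta_k \, \varphi^s_{k,w^s_{j,1}} \varphi^t_{k,w^t_{j,2}}}{\sum_{k'=1}^{K} \theta_{k'} \, \varphi^s_{k',w^s_{j,1}} \varphi^t_{k',w^t_{j,2}}}, & \text{if } b_j \in \mathbf{B}^{st} \\[2ex]
\dfrac{\theta_k \, \varphi^t_{k,w^t_{j,1}} \varphi^t_{k,w^t_{j,2}}}{\sum_{k'=1}^{K} \theta_{k'} \, \varphi^t_{k',w^t_{j,1}} \varphi^t_{k',w^t_{j,2}}}, & \text{if } b_j \in \mathbf{B}^t
\end{cases}
\tag{11}
$$

After obtaining the topic distribution of the context for each category, we can represent categories of different languages in the same topic space. The final relevance score between each category in the source taxonomy and its candidate matched categories is computed as the cosine similarity between topic vectors.

Experiments

To facilitate knowledge sharing across languages on the Web, we evaluated our proposed approach on two different kinds of real world datasets, which are publicly available.

Experiment Settings

a) Tasks and Data Sets

Two kinds of cross-lingual and cross-domain taxonomies on the Web (i.e. product catalogue and Web site directory) were used to validate the proposed approach.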
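To make the estimation and inference steps of BBTM more tangible, here is a minimal, unoptimized Python sketch of collapsed Gibbs sampling following Eqs. (3)-(8), topic inference for a category following Eqs. (9)-(11), and the cosine relevance score. It is only a sketch under stated assumptions: the biterm encoding as `((lang, word), (lang, word))`, the function names and the toy corpus are invented for illustration, and every word is assumed to appear in the given vocabulary.

```python
import math
import random
from collections import Counter


def train_bbtm(biterms, vocab, K, alpha, beta, iters=100, seed=0):
    """Collapsed Gibbs sampling for BBTM; returns theta and phi (Eqs. 6-8).

    biterms: list of ((lang, word), (lang, word)) with lang in {"s", "t"}.
    vocab:   {"s": set of language-s words, "t": set of language-t words}.
    """
    rng = random.Random(seed)
    W = {lang: len(words) for lang, words in vocab.items()}
    n_k = [0] * K                                         # biterms per topic
    n_wk = {lang: [Counter() for _ in range(K)] for lang in vocab}
    n_dot = {lang: [0] * K for lang in vocab}             # words per topic, per language

    def update(biterm, k, delta):
        n_k[k] += delta
        for lang, word in biterm:
            n_wk[lang][k][word] += delta
            n_dot[lang][k] += delta

    z = [rng.randrange(K) for _ in biterms]               # random initial topics
    for biterm, k in zip(biterms, z):
        update(biterm, k, +1)

    for _ in range(iters):
        for i, biterm in enumerate(biterms):
            update(biterm, z[i], -1)                      # exclude the current biterm
            (l1, w1), (l2, w2) = biterm
            same = 1 if l1 == l2 else 0                   # the "+1" of Eqs. (3) and (5)
            weights = [
                (n_k[k] + alpha)
                * (n_wk[l1][k][w1] + beta) * (n_wk[l2][k][w2] + beta)
                / ((n_dot[l1][k] + same + W[l1] * beta) * (n_dot[l2][k] + W[l2] * beta))
                for k in range(K)
            ]
            z[i] = rng.choices(range(K), weights=weights)[0]
            update(biterm, z[i], +1)

    B = len(biterms)
    theta = [(alpha + n_k[k]) / (K * alpha + B) for k in range(K)]
    phi = {
        lang: [
            {w: (beta + n_wk[lang][k][w]) / (W[lang] * beta + n_dot[lang][k])
             for w in vocab[lang]}
            for k in range(K)
        ]
        for lang in vocab
    }
    return theta, phi


def infer_category_topics(category_biterms, theta, phi):
    """P(z = k | c) following Eqs. (9)-(11)."""
    K = len(theta)
    counts = Counter(category_biterms)                    # n(b_j)
    total = sum(counts.values())
    p_z = [0.0] * K
    for ((l1, w1), (l2, w2)), n_b in counts.items():
        joint = [theta[k] * phi[l1][k][w1] * phi[l2][k][w2] for k in range(K)]
        norm = sum(joint)
        for k in range(K):
            p_z[k] += (joint[k] / norm) * (n_b / total)
    return p_z


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Invented toy corpus: four biterms over two words per language.
vocab = {"s": {"运动", "服饰"}, "t": {"sports", "apparel"}}
biterms = [
    (("s", "运动"), ("s", "服饰")),
    (("s", "运动"), ("t", "sports")),
    (("s", "服饰"), ("t", "apparel")),
    (("t", "sports"), ("t", "apparel")),
]
theta, phi = train_bbtm(biterms, vocab, K=2, alpha=25.0, beta=0.1, iters=20)
source_vec = infer_category_topics(biterms[:2], theta, phi)
target_vec = infer_category_topics(biterms[2:], theta, phi)
relevance = cosine(source_vec, target_vec)
```

The sketch recomputes all K weights for every biterm in plain Python, which is fine for a toy corpus but would need vectorization or a compiled implementation at the scale of the experiments.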
The details of the tasks and datasets are given as follows:

Cross-lingual Product Catalogue Alignment. In this task, given a category in JD.com (i.e. one of the largest Chinese B2C online retailers), we aim to find the most relevant category in ebay.com. We collected 7,741 Chinese categories in JD.com and 7,782 English categories in ebay.com.

Cross-lingual Web Site Directory Alignment. In this task, given a category in Chinese Dmoz.org (i.e. the largest Chinese Web site directory), we intend to find the most relevant category in Yahoo Directory. We collected 2,084 Chinese categories in Chinese Dmoz.org and 2,353 English categories in Yahoo Directory.

To generate the ground truth data, given a pair of taxonomies, five annotators labelled the most relevant category in the target taxonomy (ebay.com or Yahoo Directory) for each of 100 randomly selected categories in the source taxonomy (JD.com or Chinese Dmoz.org). The labelled results are based on majority voting.

b) Evaluation Metrics

Similar to (Prytkova, Weikum, and Spaniol 2015; Spohr, Hollink, and Cimiano 2011), we take cross-lingual taxonomy alignment as a ranking problem in the experiments. For each category in the source taxonomy, we ranked all categories in the target taxonomy according to the relevance score predicted by our approach and the designed comparison methods. Thus, we evaluated the ranking results in terms of MRR (Mean Reciprocal Rank) (Craswell 2009).

c) Comparison Methods

We compared our approach with the following methods.

RSVM: The ranking SVM (RSVM) model is used in (Spohr, Hollink, and Cimiano 2011) for cross-lingual taxonomy alignment. After removing some domain-specific features, there still exist 20 linguistic features and 8 structural features for training the model. These features depend on string similarities after using machine translation.

BBTM: This is the proposed model for exact matching in our approach. Here, we only use BBTM to align cross-lingual taxonomies without the step of candidates identification.
In BBTM, we set $\alpha = 50/K$, $\beta = 0.1$ and $K = 120$ (the empirical tuning results will be presented in Section 4.2.2).

CSS+RSVM: This approach first uses our proposed cross-lingual string similarity (CSS) for candidates identification, and then utilizes the ranking SVM model trained in RSVM for exact matching.

CSS+BOW: This approach also uses CSS for candidates identification at first, and then applies the traditional bag-of-words (BOW) model to exact matching. Given a category $c$, after merging all pairs of bilingual short text documents of $c$ into one document $d$, each word in $d$ is weighted with TF-IDF (Baeza-Yates and Ribeiro-Neto 1999).

CSS+BLDA: After utilizing CSS for candidates identification, this approach leverages BLDA (Vulić et al. 2015) for exact matching. In BLDA, we set $\alpha = 50/K$, $\beta = 0.1$ and $K = 80$ (the empirical tuning results will be presented in Section 4.2.2).

Table 1: Overall results: MRR values of RSVM, BBTM, CSS+RSVM, CSS+BOW, CSS+BLDA and CSS+BBTM on the two tasks (JD.com and ebay.com; Chinese Dmoz.org and Yahoo Directory)

Result Analysis

a) Overall Performance

In our proposed approach (i.e. CSS+BBTM) and all designed comparison methods except RSVM and CSS+RSVM, we need to extract the textual context of each category from the Web. For each category, we extracted the snippets of the top 20 results returned by Google. In each snippet, we only kept the words that co-occur with the given category in the same sentence, except its parent category (mentioned in Section 3.2). Each processed Chinese (English) snippet was translated into English (Chinese) by Google Translate to construct a pair of bilingual short text documents. We further processed the documents via the following normalization steps:

Processing Chinese Documents. 1) Segmenting words with FudanNLP (Qiu, Zhang, and Huang 2013) and removing stop words; 2) removing words with document frequency less than 10; 3) filtering out documents with length less than 2.

Processing English Documents. 1) Removing non-Latin characters and stop words; 2) converting letters into lower case and stemming each word; 3) removing words with document frequency less than 10; 4) filtering out documents with length less than 2.

At last, for the textual contexts of categories in product catalogues (i.e. JD.com and ebay.com), we got 11,473 distinct Chinese words and 9,093 distinct English words. For the textual contexts of categories in Web site directories (i.e. Chinese Dmoz.org and Yahoo Directory), we got 26,473 distinct Chinese words and 16,852 distinct English words. For each task in our experiments, we trained a BBTM and a BLDA model. For each model, we ran 500 iterations of Gibbs sampling to converge.
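The frequency-based part of the normalization steps above can be sketched in a few lines of Python. The sketch assumes the documents are already segmented (and, for English, lower-cased and stemmed), since the segmentation and stemming tools are not reproduced here; the function name `normalize` and its parameters are hypothetical.

```python
from collections import Counter


def normalize(docs, stop_words, min_df=10, min_len=2):
    """Apply stop-word removal, document-frequency filtering and length filtering.

    docs: list of token lists (already segmented, lower-cased and stemmed).
    """
    # Remove stop words.
    docs = [[w for w in doc if w not in stop_words] for doc in docs]
    # Document frequency: number of documents in which each word occurs.
    df = Counter(w for doc in docs for w in set(doc))
    # Remove words whose document frequency is below the threshold.
    docs = [[w for w in doc if df[w] >= min_df] for doc in docs]
    # Filter out documents that have become too short.
    return [doc for doc in docs if len(doc) >= min_len]


# Tiny invented example with a lowered threshold so the effect is visible.
toy_docs = [["a", "b", "the"], ["a", "b"], ["a", "c"]]
cleaned = normalize(toy_docs, stop_words={"the"}, min_df=2)
```

With the thresholds reported above (document frequency 10, length 2), rare words are dropped before short documents are filtered, so a document can be removed because its words were rare rather than because it was short to begin with.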
Table 1 gives the overall results of our approach and the designed comparison methods, and we can see that:

RSVM, BBTM and CSS+RSVM perform worst. This means that approaches using only the features that rely on string similarities, or only the ones that depend on vector similarities, cannot work well in aligning real-world cross-lingual taxonomies.

CSS+BOW and CSS+BLDA share the same framework (i.e. candidate identification with string similarities and exact matching with vector similarities) as our approach (i.e. CSS+BBTM). These three approaches significantly outperform the others, which shows the superiority of the framework of our proposed approach.

After performing candidates identification with CSS, bilingual topic models do much better than the BOW model in exact matching. This demonstrates that, compared with the word vector generated by the BOW model, representing the textual context of each category as a topic vector is more suitable for exact matching.

Compared with CSS+BLDA, CSS+BBTM (i.e. our approach) achieves about a 4% MRR improvement on both of the tasks, which shows that BBTM can better discover the topics of textual contexts for exact matching.

b) Parameter Tuning

One important parameter in BBTM and BLDA is the number of topics $K$. Different numbers of topics may lead to different performance in cross-lingual taxonomy alignment. Thus, we performed an analysis by varying the number of topics in the BBTM and BLDA models.

Figure 4: MRR value vs. number of topics K: (a) BBTM, (b) BLDA

Figure 4 (a) shows the performance of BBTM with different numbers of topics $K$ on the two given datasets. The performance improves with increasing $K$ when $K < 120$. Figure 4 (b) shows the performance of BLDA with different numbers of topics $K$. The MRR value is the highest when $K = 80$ for both of the datasets. This suggests that, when aligning cross-lingual and cross-domain taxonomies with a sufficiently large number of categories, $K$ can be empirically set to 120 and 80 in BBTM and BLDA, respectively.
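For reference, the MRR metric behind these results can be computed as in the following minimal Python sketch. The data layout (a ranked list of target categories per source category and one gold label each) and the convention that a missing gold label contributes a reciprocal rank of 0 are assumptions made for illustration.

```python
def mean_reciprocal_rank(rankings, gold):
    """Average of 1 / rank of the gold target category over all source categories.

    rankings: {source category: list of target categories, best first}.
    gold:     {source category: labelled most relevant target category}.
    A gold category missing from the ranking contributes 0 (assumed convention).
    """
    total = 0.0
    for source, ranked_targets in rankings.items():
        if gold[source] in ranked_targets:
            total += 1.0 / (ranked_targets.index(gold[source]) + 1)
    return total / len(rankings)


# Invented example: gold at rank 1, at rank 2, and absent.
rankings = {"a": ["x", "y"], "b": ["y", "x"], "c": ["z"]}
gold = {"a": "x", "b": "x", "c": "q"}
mrr = mean_reciprocal_rank(rankings, gold)
```

In this toy example the reciprocal ranks are 1, 1/2 and 0, so the MRR is 0.5.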
Conclusions and Future Work

In this paper, we presented a new approach to address the problem of cross-lingual taxonomy alignment. We first proposed the cross-lingual string similarity for candidates identification. We then proposed a novel bilingual topic model to obtain the topic vector of the extracted textual context for each category. Finally, we obtained the alignment result by using the cosine similarity between topic vectors. We evaluated our approach on two kinds of real world taxonomies. The experimental results showed that our approach significantly outperforms the designed state-of-the-art comparison methods. Specifically, compared with the methods that combine CSS and other models, our approach got the best performance, which validates the advantage of our new bilingual topic model.

As for future work, we will validate our approach on some domain-specific taxonomies, such as the datasets in the OAEI MultiFarm track. We also plan to utilize the structured information in knowledge bases to enhance our approach for cross-lingual taxonomy alignment.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (NSFC), the 863 Program under Grant No. 2015AA and the Fundamental Research Funds for the Central Universities under Grant No. 22A

References

Agrawal, R., and Srikant, R. 2001. On integrating catalogs. In Proc. of WWW.
Baeza-Yates, R., and Ribeiro-Neto, B. 1999. Modern information retrieval, volume 463. ACM Press New York.
Berlin, J., and Motro, A. 2002. Database schema matching using machine learning with feature selection. In Proc. of CAiSE.
Cheng, X.; Yan, X.; Lan, Y.; and Guo, J. 2014. BTM: Topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering 26(12).
Craswell, N. 2009. Mean reciprocal rank. In Encyclopedia of Database Systems. Springer.
Denhière, G., and Lemaire, B. 2004. A computational model of children's semantic memory. In Proc. of CogSci.
Do, H.-H., and Rahm, E. 2007. Matching large schemas: Approaches and evaluation. Information Systems 32(6).
Do, H.-H.; Melnik, S.; and Rahm, E. 2003. Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems. Springer.
Euzenat, J., and Shvaiko, P. 2007. Ontology matching, volume 333. Springer.
Faria, D.; Martins, C.; Nanavaty, A.; Taheri, A.; Pesquita, C.; Santos, E.; F. Cruz, I.; and M. Couto, F. 2014. AgreementMakerLight results for OAEI. In Proc. of OM.
Gabrilovich, E., and Markovitch, S. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proc. of IJCAI, volume 7.
Hong, L., and Davison, B. D. 2010. Empirical study of topic modeling in Twitter. In Proc. of SOMA.
Ichise, R.; Takeda, H.; and Honiden, S. 2003. Integrating multiple internet directories by instance-based learning. In Proc. of IJCAI, volume 3.
Jiménez-Ruiz, E.; Grau, B. C.; Xia, W.; Solimando, A.; Chen, X.; Cross, V.; Gong, Y.; Zhang, S.; and Chennai-Thiagarajan, A. 2014. LogMap family results for OAEI. In Proc. of OM, volume 20.
Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P. N.; Hellmann, S.; Morsey, M.; van Kleef, P.; Auer, S.; et al. 2014. DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal 5:1-29.
Li, C.; Ji, L.; and Yan, J. 2015. Acronym disambiguation using word embedding. In Proc. of AAAI.
Liu, J. S. 1994. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association 89(427).
Mahdisoltani, F.; Biega, J.; and Suchanek, F. 2014. YAGO3: A knowledge base from multilingual Wikipedias. In Proc. of CIDR.
Navigli, R., and Ponzetto, S. P. 2010. BabelNet: Building a very large multilingual semantic network. In Proc. of ACL.
Nguyen, D.; Overwijk, A.; Hauff, C.; Trieschnigg, D. R.; Hiemstra, D.; and De Jong, F. 2009. WikiTranslate: query translation for cross-lingual information retrieval using only Wikipedia. In Evaluating Systems for Multilingual and Multimodal Information Access. Springer.
Potthast, M.; Stein, B.; and Anderka, M. 2008. A Wikipedia-based multilingual retrieval model. In Proc. of ECIR.
Prytkova, N.; Weikum, G.; and Spaniol, M. 2015. Aligning multi-cultural knowledge taxonomies by combinatorial optimization. In Proc. of WWW.
Qiu, X.; Zhang, Q.; and Huang, X. 2013. FudanNLP: A toolkit for Chinese natural language processing. In Proc. of ACL.
Rahm, E., and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10(4).
Shvaiko, P., and Euzenat, J. 2006. Tutorial on ontology matching. In Proc. of SWAP.
Spohr, D.; Hollink, L.; and Cimiano, P. 2011. A machine learning approach to multilingual and cross-lingual ontology matching. In Proc. of ISWC.
Vulić, I.; De Smet, W.; Tang, J.; and Moens, M.-F. 2015. Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications. Information Processing & Management 51(1).
Wang, Z.; Li, J.; Wang, Z.; and Tang, J. 2012. Cross-lingual knowledge linking across wiki knowledge bases. In Proc. of WWW.
Wang, H.; Wu, T.; Qi, G.; and Ruan, T. 2014. On publishing Chinese linked open schema. In Proc. of ISWC.
Wang, Z.; Li, J.; and Tang, J. 2013. Boosting cross-lingual knowledge linking via concept annotation. In Proc. of IJCAI.
Yan, X.; Guo, J.; Lan, Y.; and Cheng, X. 2013. A biterm topic model for short texts. In Proc. of WWW.
Zhang, L., and Rettinger, A. 2014. X-LiSA: cross-lingual semantic annotation. In Proc. of VLDB.


More information

Recommended Items Rating Prediction based on RBF Neural Network Optimized by PSO Algorithm

Recommended Items Rating Prediction based on RBF Neural Network Optimized by PSO Algorithm Recommended Items Ratng Predcton based on RBF Neural Network Optmzed by PSO Algorthm Chengfang Tan, Cayn Wang, Yuln L and Xx Q Abstract In order to mtgate the data sparsty and cold-start problems of recommendaton

More information

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data Malaysan Journal of Mathematcal Scences 11(S) Aprl : 35 46 (2017) Specal Issue: The 2nd Internatonal Conference and Workshop on Mathematcal Analyss (ICWOMA 2016) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES

More information

Simulation Based Analysis of FAST TCP using OMNET++

Simulation Based Analysis of FAST TCP using OMNET++ Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months

More information

A Novel Video Retrieval Method Based on Web Community Extraction Using Features of Video Materials

A Novel Video Retrieval Method Based on Web Community Extraction Using Features of Video Materials IEICE TRANS. FUNDAMENTALS, VOL.E92 A, NO.8 AUGUST 2009 1961 PAPER Specal Secton on Sgnal Processng A Novel Vdeo Retreval Method Based on Web Communty Extracton Usng Features of Vdeo Materals Yasutaka HATAKEYAMA

More information

CUM: An Efficient Framework for Mining Concept Units

CUM: An Efficient Framework for Mining Concept Units CUM: An Effcent Framework for Mnng Concept Unts P.Santh Thlagam Ananthanarayana V.S Department of Informaton Technology Natonal Insttute of Technology Karnataka - Surathkal Inda 575025 santh_soc@yahoo.co.n,

More information

Collaborative Topic Regression with Multiple Graphs Factorization for Recommendation in Social Media

Collaborative Topic Regression with Multiple Graphs Factorization for Recommendation in Social Media Collaboratve Topc Regresson wth Multple Graphs Factorzaton for Recommendaton n Socal Meda Qng Zhang Key Laboratory of Computatonal Lngustcs (Pekng Unversty) Mnstry of Educaton, Chna zqcl@pku.edu.cn Houfeng

More information