Cross-Lingual Taxonomy Alignment with Bilingual Biterm Topic Model

Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

Tianxing Wu 1, Guilin Qi 1, Haofen Wang 2, Kang Xu 1 and Xuan Cui 1
1 Key Laboratory of Computer Network and Information Integration of State Education Ministry, School of Computer Science and Engineering, Southeast University, China, {wutianxing, gqi, kxu, xcui}@seu.edu.cn
2 East China University of Science & Technology, China, whfcarter@ecust.edu.cn

Abstract

As more and more multilingual knowledge becomes available on the Web, knowledge sharing across languages has become an important task that benefits many applications. One of the most crucial kinds of knowledge on the Web is the taxonomy, which is used to organize and classify Web data. To facilitate knowledge sharing across languages, we need to deal with the problem of cross-lingual taxonomy alignment, which discovers the most relevant category in the target taxonomy of one language for each category in the source taxonomy of another language. Current approaches for aligning cross-lingual taxonomies strongly rely on domain-specific information and on features based on string similarities. In this paper, we present a new approach to cross-lingual taxonomy alignment that does not use any domain-specific information. We first identify the candidate matched categories in the target taxonomy for each category in the source taxonomy using the cross-lingual string similarity. We then propose a novel bilingual topic model, called Bilingual Biterm Topic Model (BBTM), to perform exact matching. BBTM is trained on the textual contexts extracted from the Web. We conduct experiments on two kinds of real-world datasets. The experimental results show that our approach significantly outperforms the designed state-of-the-art comparison methods.

Introduction

Nowadays, with the advent of more and more multilingual resources, the Web has become a global information space. Thus, sharing knowledge across languages has become an important and challenging task.
One of the most crucial kinds of knowledge is the taxonomy, which refers to a hierarchy of categories into which entities are classified (Prytkova, Weikum, and Spaniol 2015). Different kinds of taxonomies are everywhere on the Web, such as Web site directories (e.g. Yahoo Directory and Dmoz.org) and product catalogues (e.g. ebay.com and Google Product Taxonomy). To facilitate knowledge sharing across languages, we need to deal with the problem of cross-lingual taxonomy alignment, which is the task of discovering the most relevant category in the target taxonomy of one language for each category in the source taxonomy of another language. Cross-lingual taxonomy alignment not only contributes to globalizing knowledge sharing, but also benefits many applications, such as cross-lingual information retrieval (Potthast, Stein, and Anderka 2008; Nguyen et al. 2009) and multilingual knowledge base construction (Lehmann et al. 2014; Mahdisoltani, Biega, and Suchanek 2014).

Copyright (c) 2016, Association for the Advancement of Artificial Intelligence. All rights reserved.

The key step of aligning cross-lingual taxonomies is to measure the relevance between one category in the source taxonomy and another one in the target taxonomy. Once all the relevance scores have been determined, we can obtain the most relevant category in the target taxonomy for each category in the source taxonomy in an unsupervised way. However, since categories are described in different languages, traditional monolingual similarity metrics are not suitable in cross-lingual scenarios. To overcome this problem, several approaches have been proposed. The work in (Spohr, Hollink, and Cimiano 2011) first translates cross-lingual taxonomies into monolingual taxonomies, and then captures linguistic features and structural features that rely on string similarities to predict the relevance score between categories. However, the translated label of a category in the source taxonomy may be dissimilar to its matched category in the target taxonomy.
For example, a category in JD.com can be translated to "Outdoor/Sportswear" by Google Translate, but the translated string is totally different from that of its matched category "Athletic Apparel" in ebay.com. Thus, features that rely on string similarities are insufficient to decide the relevance score between two categories of different languages, due to different language habits and improper translations. Another work (Prytkova, Weikum, and Spaniol 2015) tries to solve this problem by referring to Wikipedia. It strongly relies on domain-specific information (i.e. book instances) to map original categories in the book domain onto Wikipedia categories. Categories of different languages can then be directly compared using interwiki links. However, this approach cannot be easily extended to other kinds of taxonomies, because instance information is often unavailable.

In this paper, we study the problem of cross-lingual taxonomy alignment. The problem is non-trivial and poses the following challenges:

Feature. When domain-specific information is unavailable, the existing approach (Spohr, Hollink, and Cimiano 2011) depends only on string similarities to capture different kinds of features, resulting in rather poor performance. Since vector similarities have achieved great success in natural language processing tasks (Denhière and Lemaire 2004; Gabrilovich and Markovitch 2007; Li, Ji, and Yan 2015), can we introduce a new powerful feature which relies on vector similarities?

Representation. Vector similarities are based on rich textual information, but categories do not contain such information. Can we find a way to enrich the representation of categories with textual information? Can this new representation reveal the real meaning of each category, especially of ambiguous categories?

Approach. Features that depend on string similarities do not work well in cross-lingual taxonomy alignment, but they still have a positive impact. Can we design an approach using both the features that rely on vector similarities and the ones that depend on string similarities?

To address the above challenges, we propose a new approach to cross-lingual taxonomy alignment that does not use any domain-specific information. Firstly, we identify the candidate matched categories in the target taxonomy for each category in the source taxonomy using a new linguistic feature, i.e. cross-lingual string similarity. Then, we propose a novel bilingual topic model, called Bilingual Biterm Topic Model (BBTM), to obtain the topic vector of the context for each category. BBTM is trained on the textual contexts extracted from the Web. Finally, the relevance score between each category in the source taxonomy and its candidate matched categories is computed as the cosine similarity between topic vectors. The experiments on two real-world datasets show that our approach significantly outperforms the designed state-of-the-art comparison methods.

The rest of this paper is organized as follows. Section 2 outlines some related work. Section 3 introduces the proposed approach in detail.
Section 4 presents the experimental results and finally Section 5 concludes this work and describes the future work.

Related Work

In this section, we review some related work on schema matching and multilingual knowledge alignment.

Schema Matching

Schema matching aims at identifying semantic correspondences between two schemas, including database schemas and ontologies (Do and Rahm 2007). (Rahm and Bernstein 2001; Berlin and Motro 2002; Do, Melnik, and Rahm 2003) introduce different methods for matching database schemas. However, the setting of matching database schemas is quite different from aligning heterogeneous taxonomies due to the difference in size and structure. Ontology matching (Shvaiko and Euzenat 2006) solves the problem of finding relationships (e.g. equivalence, subsumption) between discrete entities of ontologies, including classes, properties, etc. There exists plenty of work (Euzenat and Shvaiko 2007) on matching different kinds of ontologies. Several systems (Jiménez-Ruiz et al. 2014; Faria et al. 2014) can match multilingual ontologies with features that rely only on string similarities after machine translation, but their performance is not good. The difference between our work and ontology matching is that we focus on aligning general taxonomies rather than standard ontologies. Ontologies usually carry more internal information, such as properties, functions, axioms, etc. However, the taxonomies handled in this work do not contain such information to assist the matching operation. Another kind of related work is catalog integration (Agrawal and Srikant 2001; Ichise, Takeda, and Honiden 2003; Wang et al. 2014), but this work does not align taxonomies in cross-lingual scenarios.

Multilingual Knowledge Alignment

There exists some work on multilingual knowledge alignment. (Wang et al. 2012) proposed a linkage factor graph model to link articles from English Wikipedia to those in Baidu Baike. They further proposed a concept annotation method and a regression-based learning model to iteratively predict new cross-lingual links (Wang, Li, and Tang 2013).
X-LiSA (Zhang and Rettinger 2014) is a semantic annotation system, which can annotate text documents and web pages in different languages using resources from Wikipedia and Linked Open Data. Different from our work, all of the above work focuses only on aligning cross-lingual data-level knowledge, i.e. cross-lingual entity linking.

The most relevant work is (Spohr, Hollink, and Cimiano 2011; Prytkova, Weikum, and Spaniol 2015). Both of them depend on domain-specific information and on features based on string similarities to align cross-lingual taxonomies in specific domains. Here, we focus on aligning more general cross-lingual and cross-domain taxonomies without domain-specific information, such as product catalogues and Web site directories.

The Proposed Approach

In this section, we present our proposed approach in detail, which consists of two main steps: candidates identification and exact matching.

Candidates Identification

To avoid unnecessary comparisons of the categories between two given taxonomies, we aim to obtain all the possible matched categories in the target taxonomy for each category in the source taxonomy. The output of this step is taken as the input of exact matching.

The simplest way to represent a category is to use its category label. However, this may not be desirable for category matching, since synonymous categories may have totally different labels. For example, "Sports Clothing" and "Athletic Apparel" do not share any word, and the mismatch is even greater for two synonymous categories of different languages. Besides, directly comparing the translated category labels of the same language also has limitations due to different language habits and improper translations.

Figure 1: An Example of Candidates Identification

To avoid the above problems, we capture cross-lingual string similarities between categories at the word level by using BabelNet (Navigli and Ponzetto 2010), a Web-scale multilingual synonym thesaurus. The key idea of candidates identification is that two categories of different languages may be relevant if they share the same or synonymous words. Given a category $c^s$ in the source taxonomy of language $s$ and a category $c^t$ in the target taxonomy of language $t$, each of them is segmented into a set of words. After removing stop words, $c^s$ and $c^t$ contain the sets of words $W_{c^s} = \{w_i^s\}_{i=1}^{m}$ and $W_{c^t} = \{w_j^t\}_{j=1}^{n}$, respectively. For each word $w_i^s$, we get its synonymous words of languages $s$ and $t$ from BabelNet and add them to $W_{c^s}$. The same process is applied to each word $w_j^t$, which gives the new $W_{c^t}$. The cross-lingual string similarity between $c^s$ and $c^t$ is defined as:

\[
CSS(c^s, c^t) =
\begin{cases}
1, & \text{if } W_{c^s} \cap W_{c^t} \neq \emptyset \\
0, & \text{otherwise}
\end{cases}
\qquad (1)
\]

If $CSS(c^s, c^t)$ equals 1, $c^t$ will be taken as one of the candidate matched categories of $c^s$. Figure 1 illustrates why "Athletic Apparel" in ebay.com is taken as a candidate matched category of its counterpart in JD.com.

Textual Context Extraction

Since categories do not have textual information to describe themselves, vector similarities cannot be applied to category matching until we find a way to enrich the representation of categories with textual information. There might exist some Web pages which contain the textual information associated with a given category, but manually finding appropriate Web pages is unrealistic. Therefore, we choose to acquire textual information by querying the Web with the search engine Google.

Categories with the same label in different positions of a taxonomy may have different meanings. For example, the category "Sports" occurs twice in Yahoo Directory. One is the child of the category "Shopping and Services" and means sports goods, and the other is the child of the category "Recreation" and represents kinds of physical activities. However, the results (i.e.
titles, snippets and URLs) returned by submitting only their own labels to Google are the same. To accurately get the relevant textual information returned by Google for each category, the labels of the given category $c$ and its parent category $pc$ are jointly submitted to Google. For example, in Yahoo Directory, the label of the category "Sports" representing sports goods is submitted to Google jointly with its parent category label "Shopping and Services". In each returned snippet, we extract the words that co-occur with $c$ in the same sentence, except $pc$: because $pc$ is part of the query, it occurs very frequently. These extracted words are taken as the textual context to better reveal the meaning of the given category. Note that the root categories do not have parent categories, but they are usually unambiguous, otherwise users would be easily confused when exploring the taxonomy in a top-down manner. Thus, we simply submit the label of each root category to Google to get its textual context.

Exact Matching

After utilizing a linguistic feature (i.e. the cross-lingual string similarity) for candidates identification, we aim to perform exact matching by determining the relevance score between each category in the source taxonomy and its candidate matched categories in the target taxonomy with one or more features based on vector similarities.

The bag-of-words (BOW) model is the most common method to model text. However, the textual context of each category is extracted from snippets that vary a lot in wording style, because a snippet may be a tweet or a piece of news with more formal language. For each category, the words extracted from different snippets may be totally different, which means that many of the words have low frequency. Therefore, the BOW model may not work well in this scenario.

To address the above problem, we try to discover the topics of the extracted textual contexts with a bilingual topic model. The textual context of each category is actually a set of short text documents extracted from the snippets.
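As a rough illustration of how one such short text document might be obtained from a snippet, the sketch below keeps the words that share a sentence with the category label and drops the query terms. It is a minimal sketch under several assumptions that are not in the paper: the function name `extract_context` is hypothetical, sentences are split on `.`, `!` and `?`, tokens are lower-cased ASCII words (so it only handles English snippets; Chinese snippets would need a word segmenter), and the category label itself is dropped together with the parent label.

```python
import re


def extract_context(snippet, label, parent_label=None):
    """Collect words that co-occur with `label` in the same sentence of a snippet.

    Hypothetical helper: words of the category label and of the parent label
    (both part of the query) are excluded from the returned context.
    """
    excluded = set(label.lower().split())
    if parent_label:
        excluded |= set(parent_label.lower().split())

    context = []
    for sentence in re.split(r"[.!?]+", snippet):
        lowered = sentence.lower()
        if label.lower() not in lowered:
            continue  # the category label does not occur in this sentence
        words = re.findall(r"[a-z]+", lowered)
        context.extend(w for w in words if w not in excluded)
    return context


# Toy snippet, made up for illustration.
snippet = (
    "Buy sports shoes and equipment online. "
    "Free shipping on all orders. "
    "Shopping and Services for sports fans."
)
context = extract_context(snippet, "Sports", "Shopping and Services")
```

Stop-word removal and the frequency-based filtering reported in the experiments would be applied afterwards, so words such as "for" may still appear at this stage.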
Given a short text document $d^s$ of language $s$, we first translate it into the document $d^t$ of language $t$ with Google Translate, and then construct a pair of bilingual documents $(d^s, d^t)$. After applying the same process to all the documents of language $s$, a paired bilingual document corpus $\{(d_i^s, d_i^t)\}_{i=1}^{N_d}$ is generated. We could then directly apply a widely used bilingual topic model, i.e. the Bilingual Latent Dirichlet Allocation (BLDA) model (Vulić et al. 2015), to model the corpus as a generation process (see Figure 3 (a)). However, this model suffers from the data sparsity problem in short text documents (Hong and Davison 2010). Hence, we propose a new bilingual topic model, called Bilingual Biterm Topic Model (BBTM), to explicitly model the word co-occurrence in each pair of bilingual short text documents. BBTM not only avoids the problems caused by applying the BOW model or BLDA, but also better uncovers the topics of textual contexts for exact matching.

a) Bilingual Biterm Topic Model

BBTM is an extension of the Biterm Topic Model (BTM) (Yan et al. 2013; Cheng et al. 2014) for modeling the generation of biterms. The key idea is that if two words co-occur more frequently, they are more likely to belong to the same topic. Different from BTM, a biterm used in BBTM denotes an unordered word pair co-occurring in a pair of bilingual documents. Any two distinct words in a pair of bilingual documents construct a biterm. For example, given a pair of bilingual documents $(d^s, d^t)$, in which $d^s$ and $d^t$ respectively consist of $n$ distinct words of language $s$ and $m$ distinct words of language $t$, a total of

$C_n^2 + C_m^2 + mn$ biterms will be generated. Figure 2 gives an example of biterm generation.

Figure 2: An Example of Biterm Generation

BBTM assumes that all biterms extracted from the whole corpus share the same topic distribution, and that each topic consists of two discrete distributions over words of different languages. Given a paired bilingual document corpus, suppose it contains $|B|$ biterms $B = B^s \cup B^{st} \cup B^t = \{b_i^s\}_{i=1}^{|B^s|} \cup \{b_i^{st}\}_{i=1}^{|B^{st}|} \cup \{b_i^t\}_{i=1}^{|B^t|}$, with $b_i^s = (w_{i,1}^s, w_{i,2}^s)$ where each word is in language $s$, $b_i^{st} = (w_{i,1}^s, w_{i,2}^t)$ where the two words are in different languages, and $b_i^t = (w_{i,1}^t, w_{i,2}^t)$ where each word is in language $t$, as well as $K$ topics expressed over $W^s$ and $W^t$ distinct words of language $s$ and language $t$, respectively. The topic indicator variable $z \in [1, K]$ can be denoted as $z^s$, $z^{st}$ and $z^t$ for the three kinds of biterms. We represent the topics in the corpus by a $K$-dimensional multinomial distribution $\theta = \{\theta_k\}_{k=1}^{K}$ with $\theta_k = P(z = k)$. The word distributions of language $s$ and language $t$ are respectively represented by a $K \times W^s$ matrix $\phi^s$ and a $K \times W^t$ matrix $\phi^t$, where the $k$th rows $\phi_k^s$ and $\phi_k^t$ are respectively a $W^s$-dimensional multinomial distribution with entry $\phi^s_{k,w^s} = P(w^s \mid z = k)$ and a $W^t$-dimensional multinomial distribution with entry $\phi^t_{k,w^t} = P(w^t \mid z = k)$. Following the convention of BTM, the hyperparameters $\alpha$ and $\beta$ are the symmetric Dirichlet priors. Figure 3 (b) shows the graphical representation of BBTM, and its generative process is described in Algorithm 1. Using BBTM, the probability of generating the whole corpus given hyperparameters $\alpha$ and $\beta$ can be expressed as:

\[
\begin{aligned}
P(B \mid \alpha, \beta) = {} & \prod_{i=1}^{|B^s|} \int\!\!\int \sum_{k=1}^{K} \theta_k \, \phi^s_{k,w^s_{i,1}} \, \phi^s_{k,w^s_{i,2}} \, d\theta \, d\phi^s \\
& \times \prod_{i=1}^{|B^{st}|} \int\!\!\int\!\!\int \sum_{k=1}^{K} \theta_k \, \phi^s_{k,w^s_{i,1}} \, \phi^t_{k,w^t_{i,2}} \, d\theta \, d\phi^s \, d\phi^t \\
& \times \prod_{i=1}^{|B^t|} \int\!\!\int \sum_{k=1}^{K} \theta_k \, \phi^t_{k,w^t_{i,1}} \, \phi^t_{k,w^t_{i,2}} \, d\theta \, d\phi^t
\end{aligned}
\qquad (2)
\]

b) Parameters Estimation

Since it is intractable to exactly solve for the coupled parameters $\theta$, $\phi^s$ and $\phi^t$ by maximizing the likelihood in Eq. (2), we adopt collapsed Gibbs sampling (Liu 1994) to resolve this problem. $\theta$, $\phi^s$ and $\phi^t$ can be integrated out due to the use of conjugate priors.
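Thus, we only need to sample the topic of each biterm. Due to space limits, we only show the derived Gibbs sampling formulas for $b_i^s \in B^s$, $b_i^{st} \in B^{st}$ and $b_i^t \in B^t$ as follows:

\[
P(z_i^s = k \mid \mathbf{z}_{\neg b_i^s}, B) \propto (n_{\neg b_i^s,k} + \alpha) \,
\frac{(n_{\neg b_i^s, w_{i,1}^s \mid k} + \beta)(n_{\neg b_i^s, w_{i,2}^s \mid k} + \beta)}
{(n_{\neg b_i^s, \cdot^s \mid k} + 1 + W^s \beta)(n_{\neg b_i^s, \cdot^s \mid k} + W^s \beta)}
\qquad (3)
\]

\[
P(z_i^{st} = k \mid \mathbf{z}_{\neg b_i^{st}}, B) \propto (n_{\neg b_i^{st},k} + \alpha) \,
\frac{(n_{\neg b_i^{st}, w_{i,1}^s \mid k} + \beta)(n_{\neg b_i^{st}, w_{i,2}^t \mid k} + \beta)}
{(n_{\neg b_i^{st}, \cdot^s \mid k} + W^s \beta)(n_{\neg b_i^{st}, \cdot^t \mid k} + W^t \beta)}
\qquad (4)
\]

\[
P(z_i^t = k \mid \mathbf{z}_{\neg b_i^t}, B) \propto (n_{\neg b_i^t,k} + \alpha) \,
\frac{(n_{\neg b_i^t, w_{i,1}^t \mid k} + \beta)(n_{\neg b_i^t, w_{i,2}^t \mid k} + \beta)}
{(n_{\neg b_i^t, \cdot^t \mid k} + 1 + W^t \beta)(n_{\neg b_i^t, \cdot^t \mid k} + W^t \beta)}
\qquad (5)
\]

where $\mathbf{z}_{\neg b}$ denotes the topic assignments for all biterms except the biterm $b$, $n_{\neg b,k}$ is the number of biterms assigned to topic $k$ excluding $b$, $n_{\neg b, w^s \mid k}$ is the number of times word $w$ of language $s$ is assigned to topic $k$ excluding $b$, $n_{\neg b, w^t \mid k}$ is the number of times word $w$ of language $t$ is assigned to topic $k$ excluding $b$, and $n_{\neg b, \cdot^s \mid k} = \sum_{w^s} n_{\neg b, w^s \mid k}$ as well as $n_{\neg b, \cdot^t \mid k} = \sum_{w^t} n_{\neg b, w^t \mid k}$.

Figure 3: Graphical representation of BLDA and BBTM. (a) BLDA (b) BBTM

Algorithm 1: Generative Process of BBTM
  initialize: (1) set the number of topics K; (2) set values for Dirichlet priors α and β
  sample K times: φ^s ~ Dir(β)
  sample K times: φ^t ~ Dir(β)
  sample: θ ~ Dir(α) for all biterms
  foreach biterm b_i^s ∈ B^s do
    sample: z_i^s ~ Mult(θ)
    sample: w_{i,1}^s, w_{i,2}^s ~ Mult(φ^s_{z_i^s})
  foreach biterm b_i^st ∈ B^st do
    sample: z_i^st ~ Mult(θ)
    sample: w_{i,1}^s ~ Mult(φ^s_{z_i^st}), w_{i,2}^t ~ Mult(φ^t_{z_i^st})
  foreach biterm b_i^t ∈ B^t do
    sample: z_i^t ~ Mult(θ)
    sample: w_{i,1}^t, w_{i,2}^t ~ Mult(φ^t_{z_i^t})

After a sufficient number of iterations, we can estimate the global topic distribution $\theta$ and the topic-word distributions $\phi^s$,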

$\phi^t$ by

\[
\theta_k = \frac{\alpha + n_k}{K\alpha + |B|} \qquad (6)
\]

\[
\phi^s_{k,w^s} = \frac{\beta + n_{w^s \mid k}}{W^s \beta + n_{\cdot^s \mid k}} \qquad (7)
\]

\[
\phi^t_{k,w^t} = \frac{\beta + n_{w^t \mid k}}{W^t \beta + n_{\cdot^t \mid k}} \qquad (8)
\]

where $n_k$ is the number of biterms assigned to topic $k$, $n_{w^s \mid k}$ is the number of times word $w$ of language $s$ is assigned to topic $k$, $n_{w^t \mid k}$ is the number of times word $w$ of language $t$ is assigned to topic $k$, and $n_{\cdot^s \mid k} = \sum_{w^s} n_{w^s \mid k}$ as well as $n_{\cdot^t \mid k} = \sum_{w^t} n_{w^t \mid k}$.

c) Context Topics Inference

To perform exact matching, we need to know the topic distribution of the context for each category. Given a category $c$, suppose it contains $N_c$ biterms $\{b_j\}_{j=1}^{N_c}$, which are extracted from all the pairs of bilingual documents of $c$. We utilize the following formula (Yan et al. 2013) to infer the topic distribution of the context for $c$:

\[
P(z = k \mid c) = \sum_{j=1}^{N_c} P(z = k \mid b_j) \, P(b_j \mid c) \qquad (9)
\]

In Eq. (9), $P(b_j \mid c)$ is estimated by the empirical distribution:

\[
P(b_j \mid c) = \frac{n(b_j)}{\sum_{j=1}^{N_c} n(b_j)} \qquad (10)
\]

where $n(b_j)$ is the frequency of biterm $b_j$ in all the pairs of bilingual documents of $c$. Meanwhile, $P(z = k \mid b_j)$ can be computed via Bayes' formula based on the parameters learned in BBTM:

\[
P(z = k \mid b_j) =
\begin{cases}
\dfrac{\theta_k \, \phi^s_{k,w^s_{j,1}} \, \phi^s_{k,w^s_{j,2}}}{\sum_{k'=1}^{K} \theta_{k'} \, \phi^s_{k',w^s_{j,1}} \, \phi^s_{k',w^s_{j,2}}}, & \text{if } b_j \in B^s \\[2ex]
\dfrac{\theta_k \, \phi^s_{k,w^s_{j,1}} \, \phi^t_{k,w^t_{j,2}}}{\sum_{k'=1}^{K} \theta_{k'} \, \phi^s_{k',w^s_{j,1}} \, \phi^t_{k',w^t_{j,2}}}, & \text{if } b_j \in B^{st} \\[2ex]
\dfrac{\theta_k \, \phi^t_{k,w^t_{j,1}} \, \phi^t_{k,w^t_{j,2}}}{\sum_{k'=1}^{K} \theta_{k'} \, \phi^t_{k',w^t_{j,1}} \, \phi^t_{k',w^t_{j,2}}}, & \text{if } b_j \in B^t
\end{cases}
\qquad (11)
\]

After obtaining the topic distribution of the context for each category, we can represent categories of different languages in the same topic space. The final relevance score between each category in the source taxonomy and its candidate matched categories is computed as the cosine similarity between topic vectors.

Experiments

To facilitate knowledge sharing across languages on the Web, we evaluated our proposed approach on two different kinds of real-world datasets, which are publicly available.

Experiment Settings

a) Tasks and Data Sets

Two kinds of cross-lingual and cross-domain taxonomies on the Web (i.e. product catalogue and Web site directory) were used to validate the proposed approach.
The details of the tasks and datasets are given as follows:

Cross-lingual Product Catalogue Alignment. In this task, given a category in JD.com (i.e. one of the largest Chinese B2C online retailers), we aim to find the most relevant category in ebay.com. We collected 7,741 Chinese categories in JD.com and 7,782 English categories in ebay.com.

Cross-lingual Web Site Directory Alignment. In this task, given a category in Chinese Dmoz.org (i.e. the largest Chinese Web site directory), we intend to find the most relevant category in Yahoo Directory. We collected 2,084 Chinese categories in Chinese Dmoz.org and 2,353 English categories in Yahoo Directory.

To generate the ground truth data, given a pair of taxonomies, five annotators labelled the most relevant category in the target taxonomy (ebay.com or Yahoo Directory) for each of 100 randomly selected categories in the source taxonomy (JD.com or Chinese Dmoz.org). The labelled results are based on majority voting.

b) Evaluation Metrics

Similar to (Prytkova, Weikum, and Spaniol 2015; Spohr, Hollink, and Cimiano 2011), we treat cross-lingual taxonomy alignment as a ranking problem in the experiments. For each category in the source taxonomy, we ranked all categories in the target taxonomy according to the relevance score predicted by our approach and by the designed comparison methods. We then evaluated the ranking results in terms of MRR (Mean Reciprocal Rank) (Craswell 2009).

c) Comparison Methods

We compared our approach with the following methods.

RSVM: The ranking SVM (RSVM) model is used in (Spohr, Hollink, and Cimiano 2011) for cross-lingual taxonomy alignment. After removing some domain-specific features, there still exist 20 linguistic features and 8 structural features for training the model. These features depend on string similarities after machine translation.

BBTM: This is the proposed model for exact matching in our approach. Here, we only use BBTM to align cross-lingual taxonomies without the step of candidates identification.
In BBTM, we set α = 50/K, β = 0.1 and K = 120 (the empirical tuning results will be presented in Section 4.2.2).

CSS+RSVM: This approach first uses our proposed cross-lingual string similarity (CSS) for candidates identification, and then utilizes the ranking SVM model trained in RSVM for exact matching.

CSS+BOW: This approach also uses CSS for candidates identification first, and then applies the traditional bag-of-words (BOW) model to exact matching. Given a category c, after merging all pairs of bilingual short text documents of c into one document d, each word in d is weighted with TF-IDF (Baeza-Yates and Ribeiro-Neto 1999).

CSS+BLDA: After utilizing CSS for candidates identification, this approach applies BLDA (Vulić et al. 2015) to exact matching. In BLDA, we set α = 50/K, β = 0.1 and K = 80 (the empirical tuning results will be presented in Section 4.2.2).

Table 1: Overall results: MRR values of each approach (RSVM, BBTM, CSS+RSVM, CSS+BOW, CSS+BLDA, CSS+BBTM) on the two tasks, JD.com to ebay.com and Chinese Dmoz.org to Yahoo Directory.

Result Analysis

a) Overall Performance

In our proposed approach (i.e. CSS+BBTM) and in all designed comparison methods except RSVM and CSS+RSVM, we need to extract the textual context of each category from the Web. For each category, we extracted the snippets of the top 20 results returned by Google. In each snippet, we only kept the words that co-occur with the given category in the same sentence, except its parent category (as mentioned in Section 3.2). Each processed Chinese (English) snippet was translated into English (Chinese) by Google Translate to construct a pair of bilingual short text documents. We further processed the documents via the following normalization steps:

Processing Chinese Documents. 1) Segmenting words with FudanNLP (Qiu, Zhang, and Huang 2013) and removing stop words; 2) removing words with document frequency less than 10; 3) filtering out documents with length less than 2.

Processing English Documents. 1) Removing non-Latin characters and stop words; 2) converting letters into lower case and stemming each word; 3) removing words with document frequency less than 10; 4) filtering out documents with length less than 2.

In the end, for the textual contexts of categories in product catalogues (i.e. JD.com and ebay.com), we got 11,473 distinct Chinese words and 9,093 distinct English words. For the textual contexts of categories in Web site directories (i.e. Chinese Dmoz.org and Yahoo Directory), we got 26,473 distinct Chinese words and 16,852 distinct English words. For each task in our experiments, we trained a BBTM and a BLDA model. For each model, we ran 500 iterations of Gibbs sampling to reach convergence.
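To make the training step more concrete, the following is a minimal sketch of biterm generation and of a collapsed Gibbs sampler that follows the update rules in Eqs. (3)-(5) and the estimates in Eqs. (6)-(8). It is not the authors' implementation: the function names `bilingual_biterms` and `train_bbtm`, the count layout, the random initialization and the toy corpus are all assumptions made for illustration, and the sketch runs far fewer iterations than the 500 used in the experiments.

```python
import random
from collections import defaultdict
from itertools import combinations


def bilingual_biterms(words_s, words_t):
    """Every unordered pair of distinct words in one bilingual document pair."""
    tagged = [("s", w) for w in sorted(set(words_s))]
    tagged += [("t", w) for w in sorted(set(words_t))]
    return list(combinations(tagged, 2))  # C(n,2) + C(m,2) + m*n pairs


def train_bbtm(biterms, K, alpha, beta, iters, seed=0):
    """Collapsed Gibbs sampling over biterms; returns (theta, phi)."""
    rng = random.Random(seed)
    vocab = {"s": set(), "t": set()}
    for b in biterms:
        for lang, w in b:
            vocab[lang].add(w)
    W = {lang: len(words) for lang, words in vocab.items()}

    n_k = [0] * K              # number of biterms assigned to topic k
    n_wk = defaultdict(int)    # (lang, word, k) -> assignment count
    n_dot = defaultdict(int)   # (lang, k) -> total word count of that language

    def update(b, k, delta):
        n_k[k] += delta
        for lang, w in b:
            n_wk[lang, w, k] += delta
            n_dot[lang, k] += delta

    z = [rng.randrange(K) for _ in biterms]  # random initial topics
    for b, k in zip(biterms, z):
        update(b, k, +1)

    for _ in range(iters):
        for i, b in enumerate(biterms):
            update(b, z[i], -1)  # exclude the current biterm from the counts
            (l1, w1), (l2, w2) = b
            weights = []
            for k in range(K):
                num = (n_k[k] + alpha) * (n_wk[l1, w1, k] + beta) * (n_wk[l2, w2, k] + beta)
                d1 = n_dot[l1, k] + W[l1] * beta
                d2 = n_dot[l2, k] + W[l2] * beta
                if l1 == l2:     # same-language biterm: extra +1, Eqs. (3) and (5)
                    d1 += 1
                weights.append(num / (d1 * d2))
            z[i] = rng.choices(range(K), weights=weights)[0]
            update(b, z[i], +1)

    theta = [(alpha + n_k[k]) / (K * alpha + len(biterms)) for k in range(K)]
    phi = {
        lang: [
            {w: (beta + n_wk[lang, w, k]) / (W[lang] * beta + n_dot[lang, k])
             for w in vocab[lang]}
            for k in range(K)
        ]
        for lang in ("s", "t")
    }
    return theta, phi


# Toy corpus (made up for illustration): two bilingual document pairs.
pairs = [
    (["运动", "服装", "户外"], ["sports", "clothing"]),
    (["手机", "配件"], ["phone", "accessories", "case"]),
]
biterms = [b for ws, wt in pairs for b in bilingual_biterms(ws, wt)]
theta, phi = train_bbtm(biterms, K=2, alpha=50 / 2, beta=0.1, iters=50)
```

Keeping all counts in dictionaries keyed by language keeps the three biterm types in one code path; a practical implementation would more likely map words to integer ids and use arrays.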
Table 1 gives the overall results of our approach and the designed comparison methods, from which we can see the following.

RSVM, BBTM and CSS+RSVM have the worst performance. This means that approaches using only the features that rely on string similarities, or only the ones that depend on vector similarities, do not work well in aligning real-world cross-lingual taxonomies.

CSS+BOW and CSS+BLDA share the same framework as our approach (i.e. CSS+BBTM): candidates identification with string similarities and exact matching with vector similarities. These three approaches significantly outperform the others, which shows the superiority of the framework of our proposed approach.

After performing candidates identification with CSS, bilingual topic models do much better than the BOW model in exact matching. This demonstrates that, compared with the word vector generated by the BOW model, representing the textual context of each category as a topic vector is more suitable for exact matching.

Compared with CSS+BLDA, CSS+BBTM (i.e. our approach) achieves about a 4% MRR improvement on both tasks, which shows that BBTM can better discover the topics of textual contexts for exact matching.

Figure 4: MRR value vs. number of topics K. (a) BBTM (b) BLDA

b) Parameter Tuning

One important parameter in BBTM and BLDA is the number of topics K. Different numbers of topics may lead to different performance in cross-lingual taxonomy alignment. Thus, we performed an analysis by varying the number of topics in the BBTM and BLDA models. Figure 4 (a) shows the performance of BBTM with different numbers of topics K on the two given datasets. The performance improves with increasing K when K < 120. Figure 4 (b) shows the performance of BLDA with different numbers of topics K. The MRR value is the highest when K = 80 for both datasets. This shows that, when aligning cross-lingual and cross-domain taxonomies with a sufficiently large number of categories, K can be empirically set to 120 and 80 in BBTM and BLDA, respectively.
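For readers less familiar with the evaluation, the sketch below shows how candidates could be ranked by the cosine similarity between topic vectors and how the resulting rankings could be scored with MRR. It is an illustrative sketch only: the helper names, the two-dimensional topic vectors and the gold labels are invented for the example and are not taken from the experiments.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(source_vec, candidates):
    """Sort candidate names by descending cosine similarity to the source vector."""
    return sorted(candidates, key=lambda name: cosine(source_vec, candidates[name]), reverse=True)


def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the gold category; a missing gold category scores 0."""
    total = 0.0
    for query, ranked in rankings.items():
        if gold[query] in ranked:
            total += 1.0 / (ranked.index(gold[query]) + 1)
    return total / len(rankings)


# Toy topic vectors (K = 2), made up for illustration.
targets = {"X": [0.8, 0.2], "Y": [0.1, 0.9]}
sources = {"A": [0.9, 0.1], "B": [0.2, 0.8]}
gold = {"A": "X", "B": "X"}

rankings = {name: rank_candidates(vec, targets) for name, vec in sources.items()}
mrr = mean_reciprocal_rank(rankings, gold)
```

In this toy case the gold category is ranked first for A and second for B, so the reciprocal ranks are 1 and 1/2.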
Conclusions and Future Work

In this paper, we presented a new approach to the problem of cross-lingual taxonomy alignment. We first proposed the cross-lingual string similarity for candidates identification. We then proposed a novel bilingual topic model to obtain the topic vector of the extracted textual context for each category. Finally, we obtained the alignment result by using the cosine similarity between topic vectors. We evaluated our approach on two kinds of real-world taxonomies. The experimental results showed that our approach significantly outperforms the designed state-of-the-art comparison methods. Specifically, compared with the methods that combine CSS and other models, our approach achieved the best performance, which validates the advantage of our new bilingual topic model. As for future work, we will validate our approach

on some domain-specific taxonomies, such as the datasets in the OAEI Multifarm track. We also plan to utilize the structured information in knowledge bases to enhance our approach for cross-lingual taxonomy alignment.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (NSFC) under Grant No , No , the 863 Program under Grant No. 2015AA and the Fundamental Research Funds for the Central Universities under Grant No. 22A

References

Agrawal, R., and Srikant, R. 2001. On integrating catalogs. In Proc. of WWW.
Baeza-Yates, R., and Ribeiro-Neto, B. 1999. Modern information retrieval, volume 463. ACM press New York.
Berlin, J., and Motro, A. 2002. Database schema matching using machine learning with feature selection. In Proc. of CAiSE.
Cheng, X.; Yan, X.; Lan, Y.; and Guo, J. 2014. BTM: Topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering 26(12).
Craswell, N. 2009. Mean reciprocal rank. In Encyclopedia of Database Systems. Springer.
Denhière, G., and Lemaire, B. 2004. A computational model of children's semantic memory. In Proc. of CogSci.
Do, H.-H., and Rahm, E. 2007. Matching large schemas: Approaches and evaluation. Information Systems 32(6).
Do, H.-H.; Melnik, S.; and Rahm, E. 2003. Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems. Springer.
Euzenat, J., and Shvaiko, P. 2007. Ontology matching, volume 333. Springer.
Faria, D.; Martins, C.; Nanavaty, A.; Taheri, A.; Pesquita, C.; Santos, E.; F. Cruz, I.; and M. Couto, F. 2014. AgreementMakerLight results for OAEI 2014. In Proc. of OM.
Gabrilovich, E., and Markovitch, S. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proc. of IJCAI, volume 7.
Hong, L., and Davison, B. D. 2010. Empirical study of topic modeling in Twitter. In Proc. of SOMA.
Ichise, R.; Takeda, H.; and Honiden, S. 2003. Integrating multiple internet directories by instance-based learning. In Proc. of IJCAI, volume 3.
Jiménez-Ruiz, E.; Grau, B. C.; Xia, W.; Solimando, A.; Chen, X.; Cross, V.; Gong, Y.; Zhang, S.; and Chennai-Thiagarajan, A. 2014. LogMap family results for OAEI 2014. In Proc. of OM, volume 20.
Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P. N.; Hellmann, S.; Morsey, M.; van Kleef, P.; Auer, S.; et al. 2014. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal 5:1-29.
Li, C.; Ji, L.; and Yan, J. 2015. Acronym disambiguation using word embedding. In Proc. of AAAI.
Liu, J. S. 1994. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association 89(427).
Mahdisoltani, F.; Biega, J.; and Suchanek, F. 2014. YAGO3: A knowledge base from multilingual Wikipedias. In Proc. of CIDR.
Navigli, R., and Ponzetto, S. P. 2010. BabelNet: Building a very large multilingual semantic network. In Proc. of ACL.
Nguyen, D.; Overwijk, A.; Hauff, C.; Trieschnigg, D. R.; Hiemstra, D.; and De Jong, F. 2009. WikiTranslate: query translation for cross-lingual information retrieval using only Wikipedia. In Evaluating Systems for Multilingual and Multimodal Information Access. Springer.
Potthast, M.; Stein, B.; and Anderka, M. 2008. A Wikipedia-based multilingual retrieval model. In Proc. of ECIR.
Prytkova, N.; Weikum, G.; and Spaniol, M. 2015. Aligning multi-cultural knowledge taxonomies by combinatorial optimization. In Proc. of WWW.
Qiu, X.; Zhang, Q.; and Huang, X. 2013. FudanNLP: A toolkit for Chinese natural language processing. In Proc. of ACL.
Rahm, E., and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10(4).
Shvaiko, P., and Euzenat, J. 2006. Tutorial on ontology matching. In Proc. of SWAP.
Spohr, D.; Hollink, L.; and Cimiano, P. 2011. A machine learning approach to multilingual and cross-lingual ontology matching. In Proc. of ISWC.
Vulić, I.; De Smet, W.; Tang, J.; and Moens, M.-F. 2015. Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications. Information Processing & Management 51(1).
Wang, Z.; Li, J.; Wang, Z.; and Tang, J. 2012. Cross-lingual knowledge linking across wiki knowledge bases. In Proc. of WWW.
Wang, H.; Wu, T.; Qi, G.; and Ruan, T. 2014. On publishing Chinese linked open schema. In Proc. of ISWC.
Wang, Z.; Li, J.; and Tang, J. 2013. Boosting cross-lingual knowledge linking via concept annotation. In Proc. of IJCAI.
Yan, X.; Guo, J.; Lan, Y.; and Cheng, X. 2013. A biterm topic model for short texts. In Proc. of WWW.
Zhang, L., and Rettinger, A. 2014. X-LiSA: cross-lingual semantic annotation. In Proc. of VLDB.
