A Neighborhood Relevance Model for Entity Linking


Jeffrey Dalton
University of Massachusetts
140 Governors Drive
Amherst, MA, U.S.A.

Laura Dietz
University of Massachusetts
140 Governors Drive
Amherst, MA, U.S.A.

ABSTRACT
Entity Linking is the task of mapping a string in a document to its entity in a knowledge base. One of the crucial tasks is to identify disambiguating context; joint assignment models leverage the relationships within the knowledge base. We demonstrate how joint assignment models can be approximated with information retrieval. We introduce the neighborhood relevance model, which uses relevance feedback techniques to identify the salience of entity context using cross-document evidence. We show that this model is more effective than local document models for ranking KB entities. Experiments on the TAC KBP entity linking task demonstrate that our model is the best performing system for strings that are linkable to the knowledge base.

1. INTRODUCTION
Entity linking is important because most content created is unstructured text in the form of news, blogs, forums, and microblogs such as Twitter and Facebook. A key challenge is to link these unstructured text documents to the Web of Data. Entity linking bridges the structure gap between text documents and linked data by identifying mentions of entities in free text and linking them to knowledge bases. It enriches unstructured documents with links to people, places, and concepts in the world. Entity linking is a fundamental building block that supports a wide variety of information extraction, document summarization, and data mining tasks. For example, linked entities in documents can be used to expand existing knowledge base entries with new facts and relationships.

The major challenge in entity linking is ambiguity. An entity mention in text may be ambiguous for a wide variety of reasons: multiple entities share the same name (e.g. Michael Jordan), entities are referred to incompletely (e.g.
Justin for Justin Bieber), by pseudonyms or nicknames (Christopher George Latore Wallace is also known as The Notorious B.I.G.), and are often abbreviated (e.g. UW for the University of Wisconsin as well as University of Washington).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. OAIR 13, May 22-24, 2013, Lisbon, Portugal. Copyright 2013 CID

The entity linking problem has been studied over several years in the TAC Knowledge Base Population venue with the following task definition:

Entity Linking: Given a string mention q in a document, predict the entity c in the knowledge base which the string represents, or NIL if no such entity is available.

A typical entity linking system consists of four phases: 1) query expansion, 2) candidate generation, 3) entity ranking, and 4) handling NIL cases. The goal of the first two steps is to achieve a high-recall set of Wikipedia entities. Given the candidate set, most effective approaches, e.g., [15, 4, 16], leverage contextual entities as disambiguating evidence in step 3. One issue is that the candidate generation step is often performed using string matching heuristics, resulting in large candidate sets that may contain hundreds or thousands of entities for ambiguous matches. Candidate generation and ranking are often separate steps that are not well aligned.

We advocate an information retrieval approach that uses one probabilistic model to approach steps 1-3. We introduce our linking system, KB Bridge. Supplementary materials for this work are available on the KB Bridge website¹. Existing entity linking methods only employ IR to a minor degree.
We model the entire linking task as a retrieval problem, including formalising joint assignment models within the retrieval framework. The graphical modeling framework allows us to ground our work on models from both information extraction and information retrieval. For a given entity mention, the correct knowledge base entry is likely to share important pieces of contextual information: lexically similar names, shared topical similarity reflected in word usage, and shared named entities. Additional context could be included, but throughout the paper we focus on the previously described context: name variations, surrounding sentences, and neighboring entities.

Entity linking provides some unusual challenges. The typical IR setting addresses short queries by using relevance feedback [17] to expand the query model. In entity linking, the query is an entity string embedded in a longer document, providing an abundance of context which could be leveraged. However, not all context is equally helpful, either because of ambiguity, heterogeneity in topic, or spurious collocations. Consider the example "ABC shot the TV drama Lost in Australia." with the task of linking ABC to the entity American Broadcasting Company. The named entity span Australia is not relevant for the true answer. It might actually misguide the process to link to the wrong entity Australian Broadcasting Corporation. We introduce the neighborhood relevance model to estimate the salience of context with the goal of filtering and weighting (as opposed to expanding) the context model. The neighborhood relevance model is based on ideas of pseudo-relevance feedback [20] and latent concept expansion [13] to leverage co-occurrence evidence across topically similar documents.

¹ jdalton/kbbridge

Our main contributions are:
- An unsupervised approach to entity linking based upon the Markov Random Field information retrieval model that provides competitive performance out-of-the-box.
- A unified retrieval based approach to linking combining candidate generation and ranking in a single retrieval framework, with more than 95% recall in the highest ranked 10 entities.
- A mention-specific approach for identifying salient neighboring entities using across-document evidence based on relevance feedback.
- Demonstrating the benefits of the entity neighborhood relevance model in combination with a supervised learning to rank framework, resulting in the best ranking for entities linkable to the knowledge base for the TAC KBP task.

2. BACKGROUND AND RELATED WORK
2.1 Related Work
Early work on entity linking was performed by Bunescu and Pasca [2] and Cucerzan [3] to link mentions of topics to their Wikipedia pages. In contrast to their models, we focus on a retrieval approach that leverages text based ranking without exploiting extensive Wikipedia-specific structure. Our work is related to that of Gottipati and Jiang [5] who apply a language modeling approach to entity linking. They expand the original query mention with contextual information from the language model of the document. We use the local weighting as a starting point for estimating the entity salience and compare against it as a baseline.

Entity linking has been studied in a variety of recent venues.
At INEX the Link the Wiki task explored automatically discovering links that should be created in a Wikipedia article [7]. More recently, it is one of the principal tasks studied at the ongoing Text Analysis Conference Knowledge Base Population track (TAC KBP). Ji et al. [9, 8] provide an overview of the recent systems and approaches.

Instead of linking individual mentions one at a time, recent work [3, 16, 19, 11, 6] focuses on linking the set of mentions, M, that occur in the query document d. These models perform joint inference over the link assignments to identify a coherent assignment of KB entries. In our work we leverage the set of mentions M in the document as context in an information retrieval model. Rather than assigning all of them jointly, we focus on identifying salient entity mentions in the context, because mentions in the document may be spurious or only tangentially related. This is especially true if the document contains multiple topics.

2.2 Joint Neighborhood Assignment Model
Graphical models [10], such as Markov Random fields, are a popular tool in both information extraction and information retrieval. Graphical models provide the mathematical framework for formalizing intuitions on how available data (e.g., the query mention) and quantities of interest (e.g. entity in the knowledge base) are connected. After casting data and quantities of interest as random variables, dependencies between two (or more) variables are encoded by factor functions φ that assign a non-negative score to each combination of variable settings. Factors φ are often expressed by a log-linear function of a feature vector. The joint configuration of all variables is scored by the likelihood function L, which is represented by the normalized product over scores of all factor functions for the given variable settings. The factor functions φ are designed to encode our intuition on which variable choices go well together.
For instance, early approaches to entity linking [1] heuristically extract a set of entity candidates and use a graphical model with a single factor φ_e(q, c). The factor φ_e formalizes the intuition that query mentions q and entities c are compatible when their names match and the document surrounding q has terms similar to the article of the entity candidate c.

This basic model is extended in joint neighborhood assignment models [3, 16, 12]. The idea is that knowledge base entities which are mentioned in the same documents are also likely to be structurally related in the knowledge base. When the query mention is ambiguous, several candidate entities have an equally high score under the factor φ_e(q, c). Joint assignment models explicitly incorporate entity mentions from the same document as the query mention q, which we call neighbor mentions henceforth. Each neighbor mention m_i is associated with a latent variable z_i referring to the respective entity for the neighbor. We expect the neighbor entities z to help disambiguate the true candidate c. This expectation is formalized by a second factor function φ_ee(z, c), which captures intuitions on when entities z and c are mentioned in the context of each other. For the factor function φ_ee, Ratinov et al. [16] use set similarity of inlinks (and outlinks) for c and z; Cucerzan [4, 3] uses the overlap in Wikipedia categories and the presence of links from z to c and vice versa.

For each neighbor mention m_i a set of candidates is heuristically identified, where the variable z_i represents a choice of the candidate. The model is visualized in Figure 1a, where variables q and m_i are observed, but settings of variables c and z_i are chosen from respective candidate sets. A factor is represented as a small black box which connects its input variables. If each neighboring mention m_i is linked to its correct entity candidate z_i, the true candidate c will achieve a high score of φ_ee(z_i, c) for most of the neighbors z_i.
This information is combined across all neighbors and taken together with the name factor φ_e(q, c). The dilemma is that linking all m_i to z_i requires that we solve the entity linking problem as part of the solution. For this model, the joint inference problem does not have a closed-form solution and therefore requires approximate inference such as belief propagation [12] or iterative heuristics [4, 16]. As the task is to link only the single query mention, the neighboring entity links are marginalized out by integration over z. Candidates c for the query are scored by Equation 1.

$$ L(c) = \phi_e(q, c) \prod_{m} \left( \int \phi_e(m, z) \, \phi_{ee}(z, c) \, dz \right) \tag{1} $$

Figure 1: Neighborhood Models. (a) Joint Assignment Model. (b) Neighborhood Query Model.

3. QUERY MODEL
In this section we integrate graphical models for entity linking as developed in the information extraction community with graphical models used for information retrieval. We utilize the fact that query models, such as query likelihood and the sequential dependence model [14], have an underlying graphical model that gives rise to a score of a document.

3.1 Neighborhood Query Model
We demonstrate how the retrieval engine can be used to solve the joint inference problem of Equation 1 whenever the factor functions φ_e and φ_ee are expressible as query operators. This allows us to combine the candidate generation (step 2) with the entity ranking (step 3) and have it carried out by a retrieval engine that optimizes over all possible entities in the knowledge base at once. In contrast to the joint neighborhood assignment approaches, no candidate set has to be identified beforehand.

The key insight is that for particular choices of features for the name-match factor φ_e(m, z) and the entity-link factor φ_ee(z, c), it is possible to solve the integral over z in Equation 1 with smart preprocessing and indexing. For instance, consider the following simple factor functions for φ_e and φ_ee. We choose φ_e(m, z) to yield 1 if the mention m is a string match to the title of z and 0 otherwise; and φ_ee(z, c) to yield 1 if z links to c. In this case, we can replace the integral over z with a factor φ_me(m, c) that yields 1 if the mention m matches a title of an inlink of c. Such a factor can be analytically solved by transforming the Wikipedia snapshot so that the indexed article of entity c is enriched with titles of Wikipedia pages that link to c in a separate field (or extents). A search query for the string m in the field of inlinks represents a graphical model that represents the score of φ_me.
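The example above can be checked on a toy link graph. The following is a minimal sketch, not KB Bridge code: the entity titles, the link structure, and all function names (including `phi_me` for the replacement factor) are assumptions made for illustration. It compares the explicit sum over z (the discrete analogue of the integral) with a lookup in a precomputed field of inlink titles.

```python
# Toy knowledge base (hypothetical data): entity title -> titles it links to.
OUTLINKS = {
    "Lost (TV series)": {"American Broadcasting Company", "Hawaii"},
    "Australia": {"Australian Broadcasting Corporation", "Sydney"},
    "Hawaii": {"United States"},
}


def phi_e(mention, z):
    """Binary name-match factor: 1 if the mention equals the title of z."""
    return 1 if mention.lower() == z.lower() else 0


def phi_ee(z, c):
    """Binary entity-link factor: 1 if z links to c."""
    return 1 if c in OUTLINKS.get(z, set()) else 0


def marginal_score(mention, c):
    """Explicit sum over all entities z of phi_e(m, z) * phi_ee(z, c)."""
    return sum(phi_e(mention, z) * phi_ee(z, c) for z in OUTLINKS)


def build_inlink_field(outlinks):
    """Preprocessing step: for each entity c, store the lower-cased titles
    of all pages that link to c (the extra 'inlink' field of the index)."""
    field = {}
    for z, targets in outlinks.items():
        for c in targets:
            field.setdefault(c, set()).add(z.lower())
    return field


INLINK_TITLES = build_inlink_field(OUTLINKS)


def phi_me(mention, c):
    """Replacement factor: 1 if the mention matches the title of an inlink of c."""
    return 1 if mention.lower() in INLINK_TITLES.get(c, set()) else 0
```

Because titles are unique in this toy graph, the explicit sum is either 0 or 1 and coincides with the field lookup, so no per-mention candidate set for z is needed at query time.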
For many choices of factor functions for φ_e and φ_ee, the knowledge base can be transformed to allow for efficient replacement factors φ_me that can be directly optimized within the retrieval model framework. By integrating z out, the joint neighborhood assignment model depicted in Figure 1a becomes the neighborhood query model in Figure 1b with the following likelihood function.

$$ L(c) = \phi_e(q, c) \prod_{m} \phi_{me}(m, c) \tag{2} $$

Many complex factor functions are possible, but for the remainder of this publication we use two simple, yet effective factor functions φ:
- Factor φ_e(q, c) encodes matches of the string q and terms from the surrounding text in any field of the Wikipedia article of the candidate c; this includes names as well as the full text of the article.
- Factor φ_me(m, c) matches the string representation of m in the full text and titles of entities that link to (and are linked from) the Wikipedia entry of c.

3.2 Relevance-weighted Neighborhood Query Model
As pointed out in the introduction, not all contextual entities m are equally salient. For each contextual entity m, its salience for disambiguating query q is denoted by ρ_q(m), ranging on a scale between 0 and 1. If the salience ρ_q(m) is 0, we want to remove the effect of φ_me(m, c) on the likelihood function. Based on the geometric mean, which is the natural choice for averages of probabilities, we achieve the weighting with the geometric interpolated model of Equation 3.

$$ L(c) = \phi_e(q, c) \prod_{m} \left( \phi_{me}(m, c) \right)^{\rho_q(m)} \tag{3} $$

Notice that the unweighted model follows as a special case where all saliences are 1. We also introduce the parameter λ_M that allows trading off between the direct similarity of the query and candidate as expressed by φ_e(q, c) and the aggregated influence of the M contextual entities. Exploiting that the sort order induced by L is invariant with respect to logarithms, we cast the optimization in log-space as in Equation 4.

$$ \log L(c) = \log \phi_e(q, c) + \lambda_M \frac{1}{M} \sum_{m} \rho_q(m) \log \phi_{me}(m, c) \tag{4} $$

3.3 Context from Query Document
The factor φ_e formalizes our intuition on compatibility between the query mention q and the true candidate c. This includes name matches of the string representation of q with names listed in the knowledge base (e.g. title, redirect, anchor text). We further extend it to other similarity measures that are independent of the neighbor mentions m (which are represented by φ_me).

Name variations v of the query string can be extracted from the query document, to add robustness to the entity linking inference. This is especially important if the query string is an acronym or an ambiguous reference to the entity. We also incorporate the words, s, of the sentence that surrounds the query mention or one of its name variations to preserve verbs, adjectives, and standing expressions that might disambiguate the candidate.

We introduce separate weight parameters λ_Q, λ_V, λ_S to individually control the influence of name matches of the query mention, name matches of name variations v, and sentence context respectively. Accordingly, we model the factor function φ_e(q, c) by the likelihood of a graphical model itself, represented by a log-linear function of potential functions φ_name for name similarity and φ_sent for sentence context. The resulting optimization criterion of the candidate answer c for the query model given the query q, V name variations v, S contextual phrases s, and M neighbor mentions m is given in Equation 5.

$$ \log L(c) = \lambda_Q \log \phi_{name}(q, c) + \lambda_V \frac{1}{V} \sum_{v} \log \phi_{name}(v, c) + \lambda_S \frac{1}{S} \sum_{s} \log \phi_{sent}(s, c) + \lambda_M \frac{1}{M} \sum_{m} \rho_q(m) \log \phi_{me}(m, c) \tag{5} $$

3.4 Joint Inference with Galago
Using log-linear models for factors φ with features that are readily available in the Indri and Galago query languages, we can leverage the retrieval engine to optimize Equation 5. This is possible because the geometric interpolations with weights λ and ρ are expressed with the weighted #combine operator.

We rely on the sequential dependence model [14], which is a query model that captures dependencies between adjacent query words. It strikes a balance between a model that ignores the order between the query terms (e.g. query likelihood model) and a model that requires the exact ordering of the words (n-gram or phrase model). For a given sequence of terms t_1, t_2, ..., t_n, the sequential dependence model scores documents by a graphical model combining term factors φ_t(t_i), bi-gram factors φ_o(t_i, t_{i+1}), and factors capturing whether two subsequent query terms t_i and t_{i+1} occur within a window of 8 words.

We use simple factors in the remainder of the paper. The name-match factor φ_name(q, c) tests all of the entity's indexed document for the presence of the string representation of the query mention q. Because q consists of multiple terms, we use the sequential dependence model. We use a query likelihood operator for φ_sent. For the neighbor factor φ_me we also use the sequential dependence model. With these feature functions, the optimization criterion of Equation 5 is equivalent to the Galago query given in Figure 2.

We restrict ourselves to simple factor functions to demonstrate general applicability. Field-retrieval models provide further options, e.g., to let φ_e distinguish matches in different name fields (title, anchor text, etc.), and to let φ_me distinguish matches in the full text, the titles of in-links, and the titles of out-links.
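To make the structure of Equation 5 concrete, here is a minimal Python sketch that evaluates the criterion for one candidate when the factor values are already given as positive numbers. It is an illustration under stated assumptions, not the Galago implementation: the function names, the weight keys "Q", "V", "S", "M", and all numeric values are invented for the example, and in the real system the retrieval engine computes the (smoothed) factor scores.

```python
import math


def mean_log(values, weights=None):
    """Average of (optionally weighted) log factor values.
    An empty component contributes 0 to the score."""
    if not values:
        return 0.0
    if weights is None:
        weights = [1.0] * len(values)
    return sum(w * math.log(x) for w, x in zip(weights, values)) / len(values)


def log_score(phi_name_q, phi_name_v, phi_sent_s, phi_me_m, rho, lam):
    """Log-space criterion of Equation 5 for a single candidate c.

    phi_name_q : factor value for the query string q (positive float)
    phi_name_v : factor values for the V name variations
    phi_sent_s : factor values for the S sentences
    phi_me_m   : factor values for the M neighbor mentions
    rho        : salience weights rho_q(m), aligned with phi_me_m
    lam        : weights for the components "Q", "V", "S", "M"
    All factor values must be > 0 (for example smoothed probabilities).
    """
    return (lam["Q"] * math.log(phi_name_q)
            + lam["V"] * mean_log(phi_name_v)
            + lam["S"] * mean_log(phi_sent_s)
            + lam["M"] * mean_log(phi_me_m, rho))


# Hypothetical weights and factor values for the "ABC ... Lost in Australia" example.
lam = {"Q": 0.4, "V": 0.2, "S": 0.1, "M": 0.3}
rho_salient = [0.9, 0.1]   # neighbors: "Lost" (salient), "Australia" (not salient)
rho_uniform = [1.0, 1.0]

score_american = log_score(0.5, [0.5], [0.2], [0.4, 0.01], rho_salient, lam)
score_australian = log_score(0.5, [0.5], [0.2], [0.01, 0.4], rho_salient, lam)
```

With these made-up numbers the two candidates tie under uniform saliences, while the salience weights move the American Broadcasting Company ahead, which is the effect the weighted #combine operator is meant to achieve.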
The factor functions can make use of any feature that can be expressed in the Galago query language, while still being optimized inside the retrieval engine.

4. NEIGHBORHOOD RELEVANCE MODEL
In the previous section we introduced a query model containing salience-weighted entity mentions. We now discuss methods for estimating these salience weights ρ_q(m) in an unsupervised manner. The idea is to assume a high salience of m for the query mention q if both are frequently mentioned together. It is important to note that even unambiguous mentions are not necessarily useful for disambiguating other mentions.

    #combine:0=λ_Q:1=λ_V:2=λ_S:3=λ_M(
        #seqdep(q)
        #combine(#seqdep(v_0) ... #seqdep(v_V))
        #combine(#seqdep(s_0), ..., #seqdep(s_S))
        #combine:0=ρ(m_0):...:k=ρ(m_k)(
            #seqdep(m_0), ..., #seqdep(m_k)
        )
    )

Figure 2: Salience-weighted neighborhood query model in Galago syntax.

4.1 Local Document Neighborhood Model
As a first approach, the salience of the neighboring mentions m for the query mention can be estimated from the document surrounding the query. This technique was used by Gottipati and Jiang [5] to build a multinomial language model of entity mentions from the query document d_q with occurrence count n_{m,d_q}. We refer to this simple estimation technique as the local model.

$$ \rho^{local}_q(m) = \frac{n_{m,d_q}}{\sum_{m'} n_{m',d_q}} \tag{6} $$

Gottipati and Jiang also tested weighting schemes that incorporate distance, but found that these did not significantly improve the results. We suspect that especially when the query is not the main focus of the query document, many contextual entities are not relevant for disambiguation and may actually lead to worse performance (see experimental evaluation).

4.2 Across-document Neighborhood Relevance Model
We suggest the neighborhood relevance model which estimates entity saliences ρ from across-document evidence.
The idea is that a neighbor m is important if it occurs frequently in the context of the query mention within the document as well as across other documents that are topically related and contain mentions of the query mention q. Our method is based on pseudo-relevance feedback [13], which analyses the results of a first retrieval pass to expand the query with related words.

We first identify the query string q, with name variations v and neighborhood m, using the local document saliences ρ^local. We use the query model given in Equation 5 to search for documents d that contain coreferent mentions. We identify coreferent mentions by comparison to the name variations v and q; we call these the pseudo-coreferent mentions.

Given a set of retrieved pseudo-relevant documents D, we use the retrieval probability of the document as an estimate of how likely the pseudo-coreferent mention refers to the same entity as the query q. As most retrieval frameworks return only unnormalized (rank-equivalent) retrieval scores L(d), the estimate has to be approximated with L(d) / Σ_{d'∈D} L(d').

Counting the occurrence frequency n_{m,d} of string-identical mentions of m, we build a multinomial language model across the pseudo-relevant documents with relevance-model weighting as follows.

$$ \rho^{nr}_q(m) = \frac{1}{\sum_{d' \in D} L(d')} \sum_{d \in D} L(d) \, \frac{n_{m,d}}{\sum_{m'} n_{m',d}} \tag{7} $$

In other words, the salience of a mention in the neighborhood is expressed by accumulating relative retrieval probabilities of documents according to how often they contain the mention. Any corpus that contains documents with similar properties as the query document is a reasonable target corpus for the feedback query. In this work, we use the complete TAC source corpus, but other choices, such as restrictions to news or blogs depending on the query document, are possible.

Typically, relevance feedback models are used to expand the query with new terms. This model is capable of introducing new entity mentions that are not contained in the query document. However, since the context of the query document is already very rich, a preliminary experiment demonstrated that it is better to use the relevance model to reduce and weight the context found in the query document.

5. KB BRIDGE: ENTITY LINKING SYSTEM
In this section we describe KB Bridge, our retrieval-based entity linking system, which is implemented using the Galago search engine and the MRF information retrieval framework. The system links entity mentions in the source document to knowledge base entities. The ranking of the entities is a two-stage process. First, entities are ranked using the Galago retrieval model described in Figure 2. The ranking is then refined with a supervised learning to rank model using RankLib³. The final step is NIL handling, which determines if the mention is in the knowledge base or whether it is unknown.

5.1 Knowledge Base Representation
Our system addresses text-driven knowledge bases in which each entity is associated with free text, with relationships between entities from hyperlinks or other sources. Wikipedia is one representation of such a knowledge base, but our system would likely perform well on other knowledge bases. In order to efficiently search over knowledge bases with millions or billions of entities we use an information retrieval system.
For these experiments, we index the full text of the Wikipedia article, title, redirects, Freebase name variations, internal anchor text, and web anchor text.

5.2 Document Analysis
The first step in linking is to identify the entity query span q in the document and to find disambiguating contextual information for the query model introduced in Section 3. This includes name variations v, contextual sentences s, and other neighboring mentions m. In the TAC KBP challenge, the entities of type person, organization, or location are the main focus of the linking effort and so the system detects entities using standard named entity recognition tools, including UMass's factorie and Stanford CoreNLP. These provide the mention spans to derive query mentions q, name variations v, and neighboring entities m. Beyond the standard entity classes, our approach is general enough to also link other entity types if a suitable detector is incorporated.

³ vdang/ranklib.html

Given a target entity mention, q, the system needs to identify name variations, v, in the document, such as Steve to Steve Jobs or IOC to International Olympic Committee. The goal is to identify alternative names that are less ambiguous than the query mention. We use the within-document coreference tool from UMass's factorie, together with capitalized word sequences that contain the query string (ignoring capitalization and punctuation for the matching) to extract name variations v. From the set of coreferent mentions, we extract the sentences s they occur within. After removing stopwords, casing and punctuation they represent non-NER context such as verbs, adjectives, and multi-word phrases.

5.3 Cross-document evidence
Instead of analyzing one document in isolation, KB Bridge leverages topically similar entity mentions across documents. A full-text index of a corpus that contains similar documents is constructed. We then generate a query according to Figure 2, using local salience weights ρ^local, to retrieve documents from the collection.
The documents are ranked by the likelihood of containing relevant entity mentions for the original query mention, q. These documents are used in the neighborhood relevance model to identify salience weights, ρ^nr.

5.4 KB Entity Ranking
The query model with salience weights from local document analysis and the neighborhood relevance model ρ^nr(m) is executed against the search index of KB entries as shown in Equation 5. Our system supports any feature function expressible in Galago query notation. Beyond this initial ranking, we can further refine the ranking using more complex features in learning to rank models.

The ranking is refined using supervised machine learning in a learning to rank (LTR) model. The refinement employs more extensive feature comparisons which would be expensive to compute over the entire collection. For these experiments we use the LambdaMART ranking model, a type of gradient boosted decision tree that is state-of-the-art and captures non-linear dependencies in the data. The model includes dozens of features. A description of the features used in the model is found in Table 1.

5.5 NIL Handling
After the entities are ranked, the last step is to determine if the top-ranked entity for a mention is correct and should be linked to the KB entry or instead refers to an entity not in the knowledge base, in which case NIL should be returned. For these experiments, we use a simple NIL handling strategy. We return NIL if the supervised score of the top ranked entity is below a threshold τ. The NIL threshold τ is tuned on the training data. For the special case of the TAC KBP challenge, the reference knowledge base is a subset of Wikipedia. We exploit this fact by returning NIL whenever the top ranked Wikipedia entity is not contained in the reference knowledge base.

6. EXPERIMENTAL EVALUATION
6.1 Setup
We base our experimental evaluation on four data sets from the TAC KBP entity linking competition from 2009 to 2012.

Feature Set | Type | Description
Character Similarity | q, v | Lower-cased normalized string similarity: exact match, prefix match, Dice, Jaccard, Levenshtein, Jaro-Winkler
Token Similarity | q, v | Lower-cased normalized token similarity: exact match, Dice, Jaccard
Acronym match | q | Tests if query is an acronym, if first letters match, and if KB entry name is a possible acronym expansion
Field matches | q, v | Field counts and query likelihood probabilities for title, anchor text, redirects, alternative names fields
Link Probability | q, v | p(anchor | KB entry): the fraction of internal and external total anchor strings targeting the entity
Inlink count | document prior | Log of the number of internal and external links to the target KB entry
Text Similarity | document | Normalized text similarity of document and KB entity: cosine with TF-IDF, KL, JS, Jaccard token overlap
Neighborhood text similarity | document | Normalized neighborhood similarity: KL divergence, number of matches, match probability
Neighborhood link similarity | document | Neighborhood similarity with in/out links: KL divergence, Jensen-Shannon divergence, Dice overlap, Jaccard
Rank features | retrieval | Raw retrieval log likelihood, normalized posterior probability, 1/retrieval rank
Context Rank Features | retrieval | Retrieval scores for each contextual component: q, v, s, m nr, m local

Table 1: Features of the query mention and candidate Wikipedia entity.

Over the years, the TAC organizers and the Linguistic Data Consortium came up with evaluation queries with varying characteristics both in terms of ambiguity (average unique mentions per entity) and variety (average number of entities per mention).

6.1.1 Data
The TAC KBP Knowledge Base was constructed from a dump of English Wikipedia from October 2008 containing 818,741 entries. The source collection includes over 1.2 million newswire documents, approximately 500 thousand web documents and hundreds of transcribed spoken documents. Across all years there are a total of 12,130 query entity mentions.
We use all query mentions with odd numbered IDs as training data, and the even ones for evaluation. We inspected the distribution of the queries in the split to ensure they were representative for both NIL and queries with a ground truth entity ("in-KB") as well as the entity type distribution (Per/Org/GPE). The training set contains 6043 queries, 3034 with a ground truth entity c and 3009 NIL queries. The evaluation set contains 6087 queries with 3058 NIL and 3029 in-KB. This training set is used to learn parameters of our query model, as well as parameters of the supervised reranker. We learn across all years and evaluate year-by-year for comparison with previous results.

6.1.2 Wikipedia dump
For evaluating a retrieval approach to linking, we use a more recent dump of Wikipedia. We use a Freebase dump of the English Wikipedia from January 2012, which contains over 3.8 million articles. It includes the full text of the article along with metadata including redirects, disambiguation links, outgoing links, and anchor text. We also use the Google Cross-Wiki dictionary [18] for external link information from the web. We derive a mapping between our snapshot and the official TAC KBP knowledge base using title matches and article redirects. Using a more recent snapshot of Wikipedia is a common practice employed by the top performing linking systems in TAC. The full snapshot provides full text as well as rich category and link structure.

6.2 Context Modeling
We first evaluate the contributions of the different types of context for the entity query. The context includes the entity query string q, name variations v, the sentences s surrounding the query or name variations, as well as neighboring entity spans m. The combination of these context components is indicated by Q, V, S, or M in the method prefix. We evaluate three context weighting methods. The first is uniform weighting (M). Second is the local document model by Gottipati [5] (indicated by local).
Third is our neighborhood relevance model (indicated by the suffix nr). We compare both for estimating the salience ρ(m) of neighboring entity mentions.

Baselines are the methods using only the query string (Q), the combination of query and name variations (QV), as well as the local context weighting (M local). Our suggested methods are SM nr and M nr. These models are the full query model with neighborhood relevance weighting with and without sentences.

For each of the compared methods, we train separate λ parameters on the training data using a coordinate ascent learning algorithm. Estimated λ parameters differ across methods. For the SM nr model the estimated parameters are: λ_Q = 0.321, λ_V = 0.293, λ_S = 0.155, and λ_M = [illegible].

Figure 3 visualizes an ablation study for the context components using precision at rank one for evaluation. Figure 3a shows the cumulative improvements as context is added and weighted with the neighborhood relevance (M nr). M with uniform neighborhood weighting performs similarly to M local weighting (not shown). We observe that adding sentence context does not significantly improve performance. Figure 3b details the individual contributions of contextual components (omitting sentences). It is interesting that the QV method (entity name plus name variations) and M nr (entity name plus weighted entity spans) are comparable in effectiveness. This is useful when no high quality name variations are extractable from the text, as is the case in informal text from social media. The cumulative figure shows that when combined these features yield further improvement. Across all years the neighborhood relevance model achieves better effectiveness than the local model.

6.3 Ranking Distribution
In the previous section we examined the effectiveness of the different contextual components on the top-ranked result. Table 2 presents the contextual ranking models evaluated using mean reciprocal rank (MRR).
Similar to the previous results, it shows that the most effective models include the neighborhood relevance weighting scheme (nr). QVM_nr and QVSM_nr are significantly better than the baseline. The only exception is in 2010, when the queries are easier. In this case only the QVM_nr method is significantly better. Additionally, QVM_nr is significantly better than the local weighting (Gottipati) for [...]. However, there is no significant difference in 2012. We hypothesize that the neighborhood model does not improve over the local model in 2012 because the queries are significantly more ambiguous and the quality of the retrieved feedback documents is lower.

[Figure 3: Ablation study for the suggested method in terms of P@1, by year. (a) Cumulative. (b) Individual contributions.]

[Table 2: Ranking results on TAC by year with varying context methods, measured by mean reciprocal rank (MRR). The best results for each year are highlighted in bold. Results that are statistically significant with α = 5% over the baseline are indicated with *.]

[Table 3: Learning to rank refinement results, measured by mean reciprocal rank (MRR), for QVM_nr and QVM_nr LTR. All LTR results are statistically significant with α = 5% over the unsupervised QVM_nr.]

[Figure 4: Average recall at rank cutoff k for QVM_nr and QVM_nr LTR.]

We refine the retrieval ranking using a supervised learning to rank model. The features in the ranking model are described in Table 1. The top 100 results from the best ranking, QVM_nr, are reranked. The results are shown in Table 3. They show a significant improvement over the initial retrieval ranking, obtained by leveraging more features that perform a more extensive contextual comparison. The results for 2012 are still well below the other years, indicating the difficulty of these queries even when leveraging the more complex contextual features.
This indicates that a better feature representation is needed to address some of these difficult-to-resolve mentions.

Recall

The previous results use mean reciprocal rank to measure retrieval effectiveness. We now examine the rank distribution in more detail by looking at the recall at a given rank cutoff. The entity recall is critical because it is an upper bound on the effectiveness of downstream systems. Achieving a minimum recall threshold of 90% across all years requires hundreds of candidates for the query-only (Q) model, 20 for QV, 16 for QVM_nr, and only 3 for QVM_nr LTR. The learning to rank model achieves at least 95% recall across all years within 10 results.

6.4 TAC KBP results

In this section, we evaluate the ranking as part of the entire linking pipeline described in Section 5. We report micro-averaged accuracy because we do not focus on clustering NIL entity mentions. The results are in Table 4. The unsupervised retrieval QVM_nr performs well, above the median in 2012 and competitive in previous years. The supervised ranking models improve effectiveness significantly. The in-kb ranking results outperform the best performing systems in 2009 through 2011 and are comparable in 2012. The main focus of this work is ranking, and this shows the effectiveness of our approach.

We now examine the overall (all) results, including the NIL handling. The results show that QVM_nr with LTR reranking and NIL handling outperforms the top system in 2009 and is competitive with the best performing systems in subsequent years. Applying the score threshold improves the overall accuracy despite decreasing in-kb effectiveness. This is because some correctly linked entities are marked as NIL, but these errors are outweighed by the greater reduction in false positive entity links. The NIL handling strategy based on thresholding the ranking score is effective, but could be improved further. Other linking systems use a supervised NIL classifier for this step, allowing them to perform well despite less effective in-kb ranking.

[Table 4: TAC entity linking performance in micro-averaged accuracy, reported per year for in-kb, NIL, and all queries. Rows: QVM_nr, QVM_nr LTR, QVM_nr LTR NIL, Best Performer.]

7. CONCLUSION

In this paper we propose an approach to entity linking based upon the Markov Random Field information retrieval model (MRF-IR). We focus on the task of ranking knowledge base entities. We demonstrate how concepts from joint neighborhood models can be incorporated within the MRF-IR framework. Further, we propose a neighborhood relevance model (NRM) that uses relevance feedback techniques to identify salient entity context across documents. Our experiments on the TAC KBP entity linking data show that the neighborhood relevance model outperforms other contextual models. When the ranking is refined with a learning to rank model, the results beat the current best performing systems on in-kb ranking accuracy. Combined with a simple NIL handling strategy, the overall effectiveness on all mentions is comparable to, and sometimes better than, other state-of-the-art entity linking systems.

Acknowledgements

This work was supported in part by the Center for Intelligent Information Retrieval and in part under subcontract # from SRI International, prime contractor to DARPA contract #HR C. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

8. REFERENCES

[1] Entity disambiguation for knowledge base population. In COLING.
[2] Razvan Bunescu and Marius Pasca. Using encyclopedic knowledge for named entity disambiguation. In European Chapter of the Association for Computational Linguistics (EACL).
[3] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In EMNLP.
[4] S. Cucerzan. TAC entity linking by performing full-document entity extraction and disambiguation. In Proceedings of the Text Analysis Conference.
[5] Swapna Gottipati and Jing Jiang. Linking entities to a knowledge base with query expansion. In EMNLP.
[6] Johannes Hoffart, Mohamed A. Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust disambiguation of named entities in text. In EMNLP.
[7] Darren W. Huang, Yue Xu, Andrew Trotman, and Shlomo Geva. Focused Access to XML Documents, chapter Overview of INEX 2007 Link the Wiki Track.
[8] Heng Ji and Ralph Grishman. Knowledge base population: successful approaches and challenges. In ACL-HLT.
[9] Heng Ji, Ralph Grishman, and Hoa Dang. Overview of the TAC2011 knowledge base population track. In TAC 2011 Proceedings Papers.
[10] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
[11] Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. Collective annotation of Wikipedia entities in web text. In KDD.
[12] Paul McNamee, Veselin Stoyanov, James Mayfield, Tim Finin, Tim Oates, Tan Xu, Douglas W. Oard, and Dawn Lawrie. HLTCOE participation at TAC 2012: Entity linking and cold start knowledge base construction. In TAC KBP.
[13] D. Metzler and W. B. Croft. Latent concept expansion using Markov random fields. In SIGIR.
[14] Donald Metzler and W. Bruce Croft. A Markov random field model for term dependencies. In SIGIR.
[15] S. Monahan, J. Lehmann, T. Nyberg, J. Plymale, and A. Jung. Cross-lingual cross-document coreference with entity linking. In Proceedings of the Text Analysis Conference.
[16] L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to Wikipedia. In ACL.
[17] J. J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, chapter 14.
[18] Valentin I. Spitkovsky and Angel X. Chang. A cross-lingual dictionary for English Wikipedia concepts. In LREC.
[19] Veselin Stoyanov, James Mayfield, Tan Xu, Douglas W. Oard, Dawn Lawrie, Tim Oates, and Tim Finin. A context-aware approach to entity linking. In AKBC-WEKEX.
[20] J. Xu and W. B. Croft. Query expansion using local and global document analysis. In SIGIR, 1996.
