
MASTER THESIS

Bc. Pavel Taufer

Named Entity Recognition and Linking

Institute of Formal and Applied Linguistics

Supervisor of the master thesis: RNDr. Milan Straka, Ph.D.
Study programme: Informatics
Study branch: Artificial Intelligence

Prague 2017

I declare that I carried out this master thesis independently, and only with the cited sources, literature and other professional sources. I understand that my work relates to the rights and obligations under the Act No. 121/2000 Sb., the Copyright Act, as amended, in particular the fact that the Charles University has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 subsection 1 of the Copyright Act.

In... date... signature of the author

Title: Named Entity Recognition and Linking

Author: Bc. Pavel Taufer

Institute: Institute of Formal and Applied Linguistics

Supervisor: RNDr. Milan Straka, Ph.D., Institute of Formal and Applied Linguistics

Abstract: The goal of this master thesis is to design and implement a named entity recognition and linking algorithm. A part of this goal is to propose and create a knowledge base that will be used by the algorithm. Because of the limited amount of data for languages other than English, we want to be able to train our method on one language and then transfer the learned parameters to other languages (that do not have enough training data). The thesis consists of a description of available knowledge bases and existing methods, and of the design and implementation of our own knowledge base and entity linking method. Our method achieves state-of-the-art results on a few variants of the AIDA CoNLL-YAGO dataset. The method also obtains comparable results on a sample of Czech annotated data from the PDT dataset using the parameters trained on the English CoNLL dataset.

Keywords: named entities, named entity recognition, named entity linking

I would like to thank my thesis supervisor RNDr. Milan Straka, Ph.D. for many interesting insights and comments and for his help with writing this thesis.

Contents

1 Introduction
    Our goals
    Structure of the thesis
2 Existing Resources
    Knowledge bases
        Wikipedia
        DBpedia
        Wikidata
        Other knowledge bases
    Glossary
    Datasets
        AIDA CoNLL-YAGO Dataset
        PPRForNED
        PDT
        TAC
3 Named entity recognition and linking task
    NERL steps
        Named entity recognition
        Candidate generation
        Mention disambiguation
        NIL selection and clustering
    Related work
4 Proposed knowledge base
    Design goals
    Used knowledge bases
    Proposed knowledge base contents
    Construction
5 Proposed entity linking method
    τsim
    Candidate generation
        Features
        Weights learning algorithm
    Mention disambiguation
        PPRSim
        Our extensions of PPRSim
        Hyperparameter search
6 Experiments and Results
    Datasets preparation
    Candidate generation
        Evaluation
        Baselines
        Features
        Experiments
        Results
    Mention disambiguation
        Baselines
        Evaluation methods
        Mention disambiguation hyperparameters
        Baseline comparison results
        Comparison with other works
        PDT results
7 User documentation and attachment content
8 Conclusion
Bibliography
List of Figures
List of Tables

1. Introduction

Named entity recognition (NER) is the task of identifying so-called named entities, which are prominent words or sequences of words denoting names, locations, organizations, etc. In general, any title of a Wikipedia page can be considered a named entity. Furthermore, the task also consists of classifying each recognized named entity into one of the preselected categories, like personal names, organizations or locations. Entity linking (EL) is the task of linking named entities to a knowledge base (KB). A knowledge base is a list of named entities and information about them. For instance, Wikipedia can be considered one of the largest publicly well-known knowledge bases. Entity linking is also known as named entity disambiguation (NED). Both of these tasks together are called the named entity recognition and linking (NERL) task.

There are many raw texts that contain named entities but do not contain their annotations, for example news articles, web pages or books. Named entity recognition and linking algorithms can be used to extract named entity annotations and link them to a knowledge base. This in turn allows for smarter document searching, for example by distance using geo-coordinates. It is also possible to obtain all documents referring to a specific entity, for example to a person or a place. Another use case of entity linking is machine translation: when a KB contains multilingual data, we can translate named entities separately from the rest of the text and merge them after translation.

The main difficulty in EL is the ambiguity of named entities. In general, a single named entity can correspond to multiple entities in a KB. An example could be the named entity Washington, which could represent a city, a person or a state. In these ambiguous cases EL can use contextual information as well as coherence of the present named entities to try to disambiguate among the possible candidates.

Named entity recognition has been studied for a long time; the NER task was presented at the MUC-6 conference, which was held in November 1995. Since then, the NER task has been further studied, for example, in the CoNLL 2002

and the CoNLL 2003 shared tasks, or the NEEL challenge. The most studied language is English, but several other languages are also represented: for example, the CoNLL 2002 task contains texts in Spanish and Dutch, the CoNLL 2003 task contains texts in English and German, and TAC KBP 2016 focuses on English, Chinese and Spanish. The datasets published in the above challenges are still extensively studied today.

Methods for entity linking go back to the works of Dill et al. [2003], Cucerzan [2007] and Milne and Witten [2008b] and have been well studied since. The entity linking task has been suggested under several names during the years, for example Wikification [Ratinov et al., 2011], Grounding [Leidner et al., 2003] or Named-entity disambiguation [Hoffart et al., 2011]. There have been several contests in recent years aimed at performing both NER and EL, for example the TAC (Text Analysis Conference) KBP tasks. A few other datasets are publicly available for training and evaluation of EL algorithms, but most of them are in English. One of the most studied datasets is the AIDA CoNLL-YAGO dataset introduced in Hoffart et al. [2011], consisting of annotated English articles.

A good knowledge base is a necessity for an entity linking algorithm. The KB is not used only as a target for linking; some features present in the KB can help with the entity linking task itself by providing, for example, relations between entities, multilingual text data or ontology classification. There are several knowledge bases publicly available for download and machine processing.

1.1 Our goals

One of the goals of this thesis is to design and create a KB for our EL algorithm. Another goal is to create a named entity linking algorithm that will work with only a minimum of language-dependent knowledge or will be able to utilize supervised data from another language. The reason behind this is the availability of English supervised data and the lack of supervised data in other languages.

Our contributions

We have designed and performed the following tasks to accomplish our goals:

design and create a knowledge base from available sources,
acquire datasets for testing and modify them to link to our knowledge base,
design and implement an entity linking algorithm,
evaluate our method on the acquired datasets.

All the described tasks were performed exclusively by the thesis author. On several variants of the AIDA CoNLL-YAGO dataset our implemented algorithm achieves state-of-the-art results. Furthermore, on a sample of Czech annotated data from the PDT dataset we achieved satisfying results using parameters trained on the English CoNLL dataset.

1.2 Structure of the thesis

In Chapter 2, we explore existing available resources for entity linking. We examine knowledge bases, datasets and existing systems. In Chapter 3, we describe the related work on named entity recognition and linking. Our knowledge base, constructed from preexisting knowledge bases, is proposed in Chapter 4. The main part of this thesis is in Chapter 5 and contains the description of our entity linking method. Chapter 6 contains several experiments we performed on the available datasets and our results. The description of attached files and the user documentation is in Chapter 7. Finally, Chapter 8 is the conclusion of this thesis and contains a summary of the achieved outcomes.

2. Existing Resources

In this chapter, we describe available resources for entity linking. First we examine available knowledge bases and the data they contain. Then we inspect what datasets are available for training and evaluation of entity linking algorithms. Additionally, we establish a glossary of terms that will be used for the rest of this work.

2.1 Knowledge bases

A knowledge base in general consists of a list of entities, information about each of the present entities and data about relations among entities. The most commonly known knowledge base is probably Wikipedia. In the entity linking task we are mostly interested in entities that can occur in the texts that we want to process. We want the following properties from a KB:

Public availability - it can be used and verified (or even contributed to) by others,
Machine readability - algorithms can access information easily,
Persistent identifiers - entities should have an identifier that does not change over time and is language independent,
Credibility - the KB is either human created and verified/sourced or auto-created from some other knowledge base that already satisfies this property,
Multilingual text availability - we can use it for different languages without having to construct a new KB for each language in a different way,
Relations between entities - the knowledge base can tell us how related two entities are (for example using common incoming links).

We now examine several publicly available knowledge bases, starting with the Wikipedia, DBpedia and Wikidata knowledge bases and then mentioning a few other knowledge bases like YAGO or Freebase.

2.1.1 Wikipedia

Wikipedia was launched on January 15, 2001. It is owned by the nonprofit Wikimedia Foundation. Wikipedia contains 284 language editions with over 42 million pages in total (over 5 million articles for the English language) as of November 2016.

It consists of pages that are uniquely identified by their page title. Wikipedia uses a system of references and citations to try to make sure that the contents of articles are attributable to a reliable published source. Pages in Wikipedia are mainly of the following types:

Entity - pages about traditional encyclopedia topics, people, places, media, companies, events and more.
Disambiguation - used when a page title is ambiguous because it has multiple meanings. Contains links to different page titles for each meaning.
Redirect - points from one page title to another and is used when more than one possible page title exists. It may also point to a specific part of a page.
Administrative - user pages, internal Wikipedia information, templates.

As stated before, Wikipedia pages are uniquely identified by their page title. A single entity can have multiple distinct page titles over time and therefore does not have a persistent identifier. This is most prominent for people. For example, consider that there exists a famous sportsman John Doe. A page about this person is created under the page title John Doe. A URL link to this page can be used by anyone to refer to the sportsman John Doe. But later someone else named John Doe becomes notable as a singer and it is decided that they should have their page too. The page name John Doe is already taken, so for the second person the page name John Doe (singer) is chosen. At this moment an editor will probably rename the page John Doe to John Doe (sport) and create a disambiguation page John Doe with links to the other two pages. Now any old link to the page John Doe leads to a disambiguation page instead of the first John Doe. Page titles can also be changed for other reasons, or pages can even be deleted. Therefore, a Wikipedia page name is not a long-term reliable entity identifier.

Wikimedia provides a way to download a dump of Wikipedia for each language edition as an XML file. The contents of the XML dumps are mostly raw pages containing text and infoboxes, so the dumps are hard to use directly without preprocessing. The dumps are released twice a month. Table 2.1 summarizes page type counts for the Czech and English language editions of Wikipedia.
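To give an idea of the preprocessing involved, the following is a minimal Python sketch of streaming pages out of such an XML dump. It is only an illustration under stated assumptions: the dump file name is hypothetical, and the MediaWiki export namespace version may differ between dump releases.

```python
import bz2
import xml.etree.ElementTree as ET

# Namespace of the MediaWiki export schema; the exact version suffix is
# an assumption and should be checked against the dump at hand.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(dump_path):
    """Stream (title, redirect_target_or_None) pairs from a pages-articles
    dump without loading the whole multi-gigabyte file into memory."""
    with bz2.open(dump_path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                redirect = elem.find(NS + "redirect")
                target = redirect.get("title") if redirect is not None else None
                yield title, target
                elem.clear()  # free the processed subtree

# Example: count redirect pages in a (hypothetical) Czech dump file.
# redirects = sum(1 for _t, target in
#                 iter_pages("cswiki-pages-articles.xml.bz2") if target)
```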

Language          English       Czech
Pages total       12,737,…      …,507
Entities          4,841,…       …,457
Redirects         7,628,…       …,400
Disambiguations   267,491       8,647
Links             181,980,575   13,050,035

Table 2.1: Wikipedia overview as of September 2016

2.1.2 DBpedia

As described directly on the DBpedia about page:

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. We hope that this work will make it easier for the huge amount of information in Wikipedia to be used in some new interesting ways. Furthermore, it might inspire new mechanisms for navigating, linking, and improving the encyclopedia itself.

The project was started by the Free University of Berlin and Leipzig University, in collaboration with OpenLink Software, and the first publicly available dataset was published in 2007. DBpedia provides an extraction framework and a set of extractors to process Wikipedia dumps into a more machine-readable format. The maintainers of DBpedia keep the extractors up to date for multiple language editions. The extraction framework also allows other developers to create their own custom extractors. Extractors use a set of rules and heuristics to extract the wanted knowledge from Wikipedia pages. Because Wikipedia pages are not strictly structured, the output of the extractors can contain errors, either from imprecise heuristics or from malformed data in the page itself. The available extractors provide for example a list of texts present in hyperlinks to each page (also called anchor texts), a list of redirects to each page, disambiguation links from each disambiguation page, occurrences of geographic coordinates or infobox properties. DBpedia uses a resource URL derived from the page title as an entity identifier, so it has the same problems as Wikipedia regarding entity identifier persistence.
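As an illustration of consuming the extractor output, the following sketch loads a redirects dump serialized as N-Triples into a Python dictionary. The file name is a hypothetical example; the sketch assumes the dbo:wikiPageRedirects predicate that DBpedia uses for redirects, and the naive whitespace split relies on resource IRIs containing no spaces.

```python
import bz2

RESOURCE_PREFIX = "http://dbpedia.org/resource/"

def load_redirects(path):
    """Parse N-Triples lines of the form
    <.../resource/A> <.../ontology/wikiPageRedirects> <.../resource/B> .
    into a {source_title: target_title} dictionary. Requires Python 3.9+
    for str.removeprefix."""
    redirects = {}
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue  # comment line
            parts = line.split()
            if len(parts) < 4 or "wikiPageRedirects" not in parts[1]:
                continue  # not a redirect triple (or a malformed line)
            source = parts[0].strip("<>").removeprefix(RESOURCE_PREFIX)
            target = parts[2].strip("<>").removeprefix(RESOURCE_PREFIX)
            redirects[source] = target
    return redirects

# redirects = load_redirects("redirects_en.ttl.bz2")  # hypothetical file name
```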

The maintainers of DBpedia also maintain an online instance that is updated live from a continuous stream of updates from Wikipedia. Several official dataset dumps were made available online, the last one being from October 2016. Another way to obtain the DBpedia dataset is to download a Wikipedia dump and use the DBpedia extraction framework to extract the DBpedia dataset locally. There are also infrequently published full dumps of the live DBpedia instance and its incremental updates to keep up to date with live Wikipedia.

2.1.3 Wikidata

Wikidata is a collaboratively edited knowledge base operated by the Wikimedia Foundation. It was launched on October 29, 2012. It is intended to provide a common source of data to be used in other Wikimedia projects, and by anyone else, under a public domain licence. The content of Wikidata is made both by humans and by automatic scripts that, for example, take care of newly created pages in Wikipedia. As stated at the Wikidata introduction page:

Unlike Wikimedia Commons, which collects media files, and the Wikipedias, which produce encyclopedic articles, Wikidata will collect data, in a structured form. This will allow easy reuse of that data by Wikimedia projects and third parties, and will enable computers to easily process and understand it.

Wikidata is composed of entities of three types: items, properties and queries. The following descriptions of Wikidata entities, items and properties were retrieved from the Wikidata glossary page:

Entity is the data content of a Wikidata page, that may be either an item, a property or a query. Every entity is uniquely identified by an entity ID, which is a prefixed number, for example starting with the prefix Q for an item and P for a property. An entity is also identified by a unique combination of label and description in each language. Each entity also has a dereferenceable URI that follows the pattern http://www.wikidata.org/entity/ID where ID is its entity ID.

Item (in some languages translated to words for subject, object or element in the user interface) refers to a real-world object, concept, or event that is given an

identifier (an equivalent of a name) in Wikidata together with information about it. Each item has a corresponding Wikipage in the Wikidata main namespace. Items are identified by a prefixed id (like Q5), or by a sitelink to an external page, or by a unique combination of multilingual label and description. Items may also have aliases to ease lookup. The main data part of an item is the list of statements about the item. An item can be viewed as the subject part of a triplet in linked data.

Property (in some languages translated to attribute) is the descriptor for a data value, or some other relation or composite or possibly missing value, but not the data value or values themselves. Each statement at an item page links to a property, and assigns the property one or several values, or some other relation or composite or possibly missing value. The property is stored on a page in the Property namespace, and includes a declaration of the datatype for the property values. Compared to linked data, the property represents a triplet's predicate.

Each item consists of the following parts:

identifier - a unique number prefixed with the letter Q,
labels - a list of pairs of a language and a main name,
aliases - alternative names, also multilingual,
descriptions - more verbose descriptions, also multilingual,
statements - a list of statements about the item,
sitelinks - links to Wikipedia and Wikimedia pages about the given item in all occurring language editions.

Wikidata decomposes knowledge into statements. Statements in Wikidata can be more than a single triplet of (item, property, value) (as is common for example in linked data) and can contain information about how or when the value was recorded or measured, with optional references. An example of a Wikidata item is shown in Figure 2.1.

Wikidata is connected with Wikipedia pages through sitelinks that are present in items. It also contains references to identifiers from other knowledge bases through specific properties, for example to Freebase MIDs through statements with the property P646. Also, since February 2013, Wikidata is used for interlanguage link storage in Wikipedia.

As of September 2016, Wikidata contains over 23 million items. An overview of item count statistics for Wikidata by language is available in Table 2.2. Rows show statistics based on whether the item has a label in the given language or a link to the corresponding Wikipedia language edition. The coverage refers to the portion of Wikipedia pages that are covered by Wikidata.

Figure 2.1: Graphic representing the data model in Wikidata. Taken from https://commons.wikimedia.org/wiki/File:Datamodel_in_Wikidata.svg.

Language   Link to Wikipedia   Any language-specific information   Wikipedia coverage
English    4,975,671           16,060,…                            …%
Czech      345,709             4,068,…                             …%

Table 2.2: Wikidata overview by language (items containing the given kind of information) as of September 2016

Wikidata dumps are available for download as a single compressed file containing a JSON representation of all entities contained in Wikidata. The dumps are released once a week.

2.1.4 Other knowledge bases

There are also many other knowledge bases available. Here are a few notable mentions.

Freebase

Freebase used to be a collaborative knowledge base consisting of data composed mainly by its community members. It was developed by the American software company Metaweb and ran publicly since March 2007. The company was acquired by Google in 2010 and Freebase was one of the data sources used to power Google's Knowledge Graph. Its site was shut down on 2 May 2016 and most of its data were moved to Wikidata. Freebase used a unique identifier (MID) for its entities, and the MIDs are occasionally still used in annotated datasets and linked data. Freebase used to provide a dump for public download. Google provides a way to access Freebase data by MID through URLs in a fixed format, where $MID stands for the entity MID.

Knowledge Graph

Knowledge Graph is an online knowledge base provided by Google. However, it is not available for download, which is why we cannot use Knowledge Graph in this work.

YAGO

YAGO (Yet Another Great Ontology) is a knowledge base developed at the Max Planck Institute for Computer Science in Saarbrücken. As stated in the YAGO overview page, it is derived from Wikipedia, WordNet and GeoNames. Currently, YAGO has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities. YAGO is special in several ways:

1. The accuracy of YAGO has been manually evaluated, proving a confirmed accuracy of 95%. Every relation is annotated with its confidence value.
2. YAGO combines the clean taxonomy of WordNet with the richness of the Wikipedia category system, assigning the entities to more than 350,000 classes.
3. YAGO is an ontology that is anchored in time and space. YAGO attaches a temporal dimension and a spatial dimension to many of its facts and

entities.
4. In addition to a taxonomy, YAGO has thematic domains such as music or science from WordNet Domains.
5. YAGO extracts and combines entities and facts from 10 Wikipedias in different languages.

However, because YAGO only provides data in 10 main languages, and not in Czech, we decided not to use it.

BabelNet

BabelNet is both a multilingual encyclopedic dictionary and a semantic network which connects concepts and named entities in a very large network of semantic relations. It is developed at the Linguistic Computing Laboratory in the Department of Computer Science of the Sapienza University of Rome. BabelNet was presented in 2012 in Navigli and Ponzetto [2012]. The current version, BabelNet 3.7, covers 271 languages and is obtained from the automatic integration of many knowledge bases, such as WordNet, Wikipedia or Wikidata. BabelNet is provided as a stand-alone resource with its Java API, a SPARQL endpoint and a Linked Data interface. However, there is no way to download the full BabelNet for offline processing, therefore we decided not to use it.

2.2 Glossary

From now on we need to distinguish between an entity in a knowledge base and a named entity present in text. To do so, we introduce the following definitions:

named entity - a real-world object that can be denoted with a proper name, and which may or may not have a corresponding entity present in a knowledge base,
knowledge base entity - an entity present in a knowledge base, also referred to as entity,
entity mention - an occurrence of a named entity in a document, which consists of the text of the mention and the entity in the knowledge base it refers to, also referred to as mention,
candidate - an entity from a knowledge base that is considered by an entity linking algorithm as a corresponding entity for a given mention,

NIL - a special (virtual) entity used as a placeholder when the corresponding entity for a mention is not present in the knowledge base, also referred to as NME or null.

2.3 Datasets

We need some data to be able to train and evaluate named entity recognition and linking algorithms. A named entity recognition and linking dataset should consist of several documents containing tagged named entities with entity identifiers from a selected knowledge base. There are several such datasets available and we describe the most prominent ones.

2.3.1 AIDA CoNLL-YAGO Dataset

The AIDA CoNLL-YAGO dataset was created by Hoffart et al. [2011] because, as the authors state, there were no other established benchmarks at the time. The dataset is based on the CoNLL 2003 dataset [Tjong Kim Sang and De Meulder, 2003], which has become a de-facto standard of NER evaluation. We refer to this dataset as the CoNLL dataset in the rest of the thesis. This dataset has been used in many works since its creation, for example in Pershina et al. [2015], Luo et al. [2015], Francis-Landau et al. [2016], Yamada et al. [2016] or Nguyen et al. [2016].

As described by Hoffart et al. [2011], this dataset consists of proper noun annotations for 1,393 Reuters newswire articles. The authors hand-annotated all these proper nouns with corresponding entities in YAGO 2. Each mention was disambiguated by two students and resolved by the authors in case of conflict. The dataset was later updated to add Wikipedia URLs and Freebase MIDs for almost all entities.

The CoNLL dataset is publicly available for download, but the original CoNLL 2003 dataset uses the Reuters Corpus as the underlying text data, which is only available under a restrictive licence. The dataset is split into three sets: train/testa/testb. The train and testa sets are generally used for learning and hyperparameter selection, and testb is used for final evaluation and results comparison. Table 2.3 summarizes properties of the dataset.

dataset                                 train    testa   testb   total
articles                                946      216     231     1,393
mentions total                          23,396   5,917   5,616   34,929
mentions with correct entity            18,541   4,791   4,485   27,817
mentions without correct entity         4,855    1,126   1,131   7,112
distinct mentions                       8,011    2,795   2,606   11,015
words per article (avg)                 …        …       …       …
mentions per article (avg)              …        …       …       …

Table 2.3: CoNLL dataset properties

The authors also released the set of candidates that was used in their experiments, as a mapping between mention texts and candidate entities in YAGO 2. This set of candidates is usually referred to as CoNLL YAGO means. An overview of the properties of this candidate set is in Table 2.4. The reason why some mentions that were not marked as NIL have no correct entity present is that the corresponding articles were deleted from Wikipedia.

dataset                                 train    testa   testb   total
candidates per mention (avg)            …        …       …       …
mentions with correct entity present    18,539   4,791   4,484   27,814
mentions with correct entity present %  99.9%    100%    99.9%   99.9%

Table 2.4: CoNLL YAGO means dataset properties

2.3.2 PPRForNED

The authors of Pershina et al. [2015] published the set of candidates that they used when evaluating on the CoNLL dataset. The PPRForNED dataset covers all files from the CoNLL dataset. However, not all NIL mentions from CoNLL are present, as can be seen in the summary of PPRForNED dataset properties in Table 2.5. Correct entities were added to the candidate list even if they were not generated initially; this was performed for the testb set as well. The authors also provided a set of Freebase popularity scores that they used in their algorithm. Each named entity mention consists of the text of the mention and the Wikipedia URL of the correct entity. Each candidate entity consists of a name, a list of related candidates and a corresponding Wikipedia URL. This dataset was also used in other works, for example in Luo et al. [2015].

dataset                                 train    testa   testb   total
mentions total                          20,288   5,191   4,950   30,429
mentions with correct entity            18,541   4,791   4,485   27,817
mentions without correct entity         1,…      …       …       …,611
distinct mentions                       5,936    2,261   2,079   8,050
mentions per article (avg)              …        …       …       …
candidates per mention (avg)            …        …       …       …
missing NIL mentions                    3,…      …       …       …,500
mentions with correct entity present    18,538   4,791   4,485   27,814
mentions with correct entity present %  99.9%    100%    100%    99.9%

Table 2.5: PPRForNED dataset properties

2.3.3 PDT

PDT stands for the Prague Dependency Treebank. It is a Czech language dataset. The current version is 3.0. As described by Bejček et al. [2013]:

PDT 3.0 is a new version of Prague Dependency Treebank. It contains a large amount of Czech texts with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and semantic annotation (0.8 MW); in addition, certain properties of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations are annotated at the semantic level.

The original dataset does not contain annotations for named entity recognition and linking tasks, but there is currently an ongoing process which aims to annotate a small part of the PDT testing data. We were given access to a current small sample of the annotated data. The data use Wikipedia and BabelNet URLs as identifiers. A summary of the dataset is in Table 2.6.

total annotations        115
distinct mention texts   75
distinct entities        58
missing entities         2

Table 2.6: PDT dataset properties

2.3.4 TAC

TAC stands for Text Analysis Conference. As described at the TAC about page:

TAC is a series of evaluation workshops organized to encourage research in Natural Language Processing and related applications, by providing a large test collection, common evaluation procedures, and a forum for organizations to share their results. TAC comprises sets of tasks known as tracks, each of which focuses on a particular subproblem of NLP. TAC tracks focus on end-user tasks, but also include component evaluations situated within the context of end-user tasks.

There have been several TAC tracks concerning named entity linking, starting with TAC-KBP2009 [McNamee et al., 2009], where KBP stands for Knowledge Base Population. The most recent is KBP2016, which is currently in progress as of November 2016. However, the dataset itself is only available for the duration of the track and only to the track participants; therefore, we were unable to evaluate our algorithm on this dataset.

KBP tracks have recently started to focus on NIL clustering and cross-lingual entity linking. NIL clustering is a subtask of entity linking that aims to cluster the occurrences of a single named entity that is not present in the used knowledge base. Cross-lingual entity linking is a task where the documents can occur in multiple languages and are linked into a single-language knowledge base. The KBP2014 entity linking track focused both on mono-lingual English entity linking with NIL clustering and on cross-lingual Chinese/Spanish to English entity linking.

3. Named entity recognition and linking task

In this chapter we describe the steps of the named entity recognition and linking (NERL) task and summarize how they are approached. Furthermore, we examine what work has been done in the named entity recognition and linking field.

3.1 NERL steps

The named entity linking task has been researched under several names over the years, for example Wikification [Ratinov et al., 2011], Grounding [Leidner et al., 2003] or Named-entity disambiguation [Hoffart et al., 2011]. The usual approach can be generalized into the following steps:

find named entity mentions in the given text,
generate a set of candidates for each mention,
select the best candidate for each mention,
determine if a mention corresponds to a NIL and cluster corresponding NIL mentions together.

We already established what a mention, candidate, entity and NIL refer to in the previous chapter in Section 2.2. Now we are going to define what local and global information mean within the context of this thesis. Information from a knowledge base used in named entity recognition and linking is commonly split into two types: local and global information. Local information consists of features related to a single mention, such as similarity between the document and the texts available about each candidate. Because of that, local information features can be used for linking each mention independently. Global information contains features that utilize multiple mentions present in a document. An example feature for a pair of entities is whether there is a link in Wikipedia leading from one entity to the other.

Both types of features have their strengths and weaknesses. Local features better encode direct similarity between a mention and its candidates, but do not take coherence into consideration. Global features on the other hand make use of interlinking information between entities, but can introduce noise, for example when the document consists of multiple topics.

In general, the named entity recognition and linking process can be split into the following cascading steps:

1. Named entity mention recognition - find mentions in the given document,

2. Candidate generation - create a list of candidates for each mention using local information,
3. Mention disambiguation - select the best candidate for each mention using both local and global information,
4. NIL selection and clustering - decide if the KB contains the entity of each mention and cluster together mentions referring to the same NIL.

In most algorithms every step is performed separately, with each step utilizing information from the document and from the output of the previous steps. This allows using distinct techniques and external information like gazetteers (lists of named entities). However, there is no way to adjust the results of previous steps, and once a step does not produce a correct result (e.g. misses a mention or does not generate the correct entity candidate), later steps cannot do anything about it. In reaction to the described error propagation, there have recently been a few attempts to do both the named entity recognition and linking tasks jointly as a single step, as can be seen in Luo et al. [2015] and Nguyen et al. [2016]. This allows, for example, using global information from other mentions to decide whether a given text span is a mention or not.

3.1.1 Named entity recognition

The task of the named entity recognition step is to find named entity mentions in the given text. In some implementations, when solving the NER task on its own, the output can also contain a type for each found mention. For example, the following classes are used for named entity classification in the CoNLL shared tasks 2002 and 2003:

PER - person,
LOC - location,
ORG - organization,
MISC - miscellaneous.

Other examples are the 7 categories used in MUC or the two-level hierarchy of 46 classes used in the Czech Named Entity Corpus.

Algorithms for the NER step mostly use information from gazetteers such as GeoNames and knowledge bases, orthographic features such as whether words start with capital letters, sentence structure, or whether a word is a noun. The traditional approach to solving the named entity recognition task is to predict for each word the named entity mention type and its position within the

mention. Positions can be described for example by the BIO scheme [Ratinov and Roth, 2009]: B for mention Beginning, I for Inside of a mention and O for text Outside of any mention. There are also other schemes such as BILOU, where L marks the last word of a mention and U marks a single-word mention. HMM (Hidden Markov Model) or CRF (Conditional Random Field) models are often used for this classification, as can be seen for example in the works of Zhou and Su [2002] or Finkel et al. [2005]. Systems using the above approach generally rely on hand-crafted features and domain-specific knowledge.

Recent advancements in neural networks have also found their use in this task. Neural networks were used for example in Lample et al. [2016] and Yang et al. [2016], utilizing bidirectional LSTM or GRU networks, precomputed word embeddings, character-level word embeddings and a CRF output layer.

Another approach is to detect spans of mentions by modelling the distribution of segmentation boundaries directly. This was used for example in Luo et al. [2015] with a Semi-CRF model as a part of a joint NERL algorithm.

The last approach mentioned here is to extract a list of all possible mention texts from a knowledge base and look for matching parts of the document. The list of possible mention texts can be generated for example from anchor texts of hyperlinks in Wikipedia, as was used for example in DBpedia Spotlight by Mendes et al. [2011] or in Wikifier.

Many works focused on entity linking skip the named entity recognition step altogether and rely on the output of existing algorithms. One of the more prominent NER algorithms is the Stanford NER tagger [Finkel et al., 2005], which is used for example by Hoffart et al. [2011]. For the Czech language there exists a NER algorithm by Straková et al. [2014, 2016].

3.1.2 Candidate generation

In general, it is not computationally feasible to try matching each mention against every entity in the KB. Heuristics pruning the space of possible candidate assignments are often used to make this problem tractable. The goal of the candidate generation step is to create a limited set of best candidates for each mention. How many candidates are generated depends on what algorithms are used in the following steps.

Formally, we can say that for a given document $D$, a list of mentions $M = (m_1, \ldots, m_n)$ and a maximum number of candidates $k$, we want to obtain a list of candidate sets $C = (C_1, \ldots, C_n)$ where $C_i$ contains up to $k$ best candidate entities for mention $m_i$ in relation to the document $D$.

There are several options to use when creating the set of candidates for each mention. A direct approach is to have an index that lists all entities that are likely to appear with the given mention text. This index can be constructed for example from Wikipedia anchor texts [Ratinov et al., 2011] or the YAGO means relation [Hoffart et al., 2011]. Another way to build the dictionary is to use Wikipedia titles, redirects and disambiguation pages [Pershina et al., 2015]. There are ways to enhance this direct approach, for example by using coreference resolution and using the longest mention in the coreference chain [Pershina et al., 2015]. Coreference occurs when an entity is present two or more times under the same referent. Coreference resolution is the task of finding coreferences in text and clustering/chaining them together.

Once a set of candidates for a mention is constructed, there are several ways to use the local information from the document to prune the candidate set. In general, candidate generation algorithms contain a scoring function that computes a similarity score between a mention and each of its candidate entities. The similarity score is commonly computed using features generated for each candidate of every mention. One set of features is textual similarity between the document and the candidate entity. This can be for example cosine similarity of TF-IDF vectors of the document and entity text [Ratinov et al., 2011], keyphrase-based similarity [Hoffart et al., 2011] or topical domain similarity of the document [Nguyen et al., 2016]. More novel approaches use convolutional neural networks to capture semantic similarity in different contexts [Francis-Landau et al., 2016].

Other features can be based on a likelihood prior for the candidate entities. There are several ways to compute the priors. One option is to base the prior purely on the candidate entity without using any information from the document itself, for example using the number of incoming links to the entity in Wikipedia. Another option is to use some local information, for example the ratio of hyperlinks with the same anchor text as the mention text linking to a given entity over all hyperlinks with this anchor text. The number of generated candidates differs from article to article, but it can range from 5 [Hoffart et al., 2011] up to 20 [Ratinov et al., 2011, Nguyen et al., 2016].
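To make the direct dictionary approach with a link-count prior concrete, the following toy sketch shows candidate generation from an anchor-text index. The index contents are invented for illustration; a real index would be built from Wikipedia anchor texts, titles and redirects, or the YAGO means relation.

```python
from collections import Counter, defaultdict

# A toy anchor-text index mapping mention texts to entities, with a prior
# p(entity | mention text) estimated from link counts. The counts below
# are invented for illustration.
anchor_index = defaultdict(Counter)
anchor_index["Washington"].update({
    "Washington,_D.C.": 500,
    "George_Washington": 300,
    "Washington_(state)": 200,
})

def generate_candidates(mention_text, k=5):
    """Return up to k (entity, prior) pairs, most likely first."""
    counts = anchor_index.get(mention_text)
    if not counts:
        return []  # no known entity: the mention may be a NIL
    total = sum(counts.values())
    return [(entity, count / total) for entity, count in counts.most_common(k)]

# generate_candidates("Washington", k=2)
# -> [('Washington,_D.C.', 0.5), ('George_Washington', 0.3)]
```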

3.1.3 Mention disambiguation

The mention disambiguation step takes place when a set of candidates is available for each mention (from the previous step). The goal of mention disambiguation is to disambiguate among the candidates and select the best candidate for each mention. All mentions are disambiguated together, with the idea that the entities appearing in a single document are generally related to each other. Because of this, global information can be used in addition to the local information used in previous steps.

Formally, we can say that for a given document $D$, a list of mentions $M = (m_1, \ldots, m_n)$ and candidate sets $C = (C_1, \ldots, C_n)$, we want to find the best candidate assignment $A = (a_1, \ldots, a_n)$, where $a_i \in C_i$ for all $i \in \{1, \ldots, n\}$, with regard to the mentions $M$ and the document $D$.

Mention disambiguation can be approached as a maximization problem where the goal is to find the best assignment of candidates using an optimization criterion that considers both local and global features [Ratinov et al., 2011]. Another way to approach mention disambiguation is to create a coherence graph of mentions and candidates and to look for the most dense subgraph [Hoffart et al., 2011]. This task is generally NP-hard, therefore some heuristics must be applied to solve it in reasonable time. Yet another way is to create a graph with candidates as vertices and edges reflecting the presence of a Wikipedia link between two candidates. This approach was used alongside the Personalized PageRank algorithm in Pershina et al. [2015]. An example of this graph can be seen in Figure 3.1.

Figure 3.1: Example document graph taken from Pershina et al. [2015]
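To make the graph-based idea concrete, the following is a toy personalized PageRank over a small candidate coherence graph. It is only a sketch in the spirit of Pershina et al. [2015], not the exact PPRSim algorithm, which additionally combines the PPR score with local similarity under further constraints; the graph and teleport distribution below are invented for illustration.

```python
# Vertices are candidate entities, edges reflect e.g. a Wikipedia link
# between two entities, and the teleport distribution concentrates the
# restart mass on the document's candidates.

def personalized_pagerank(graph, teleport, damping=0.85, iterations=50):
    """graph: dict vertex -> list of neighbours; teleport: dict vertex ->
    restart probability (should sum to 1). Returns a score per vertex."""
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {v: (1 - damping) * teleport.get(v, 0.0) for v in graph}
        for v, neighbours in graph.items():
            if not neighbours:
                continue  # dangling vertex: its mass is simply dropped here
            share = damping * rank[v] / len(neighbours)
            for u in neighbours:
                new_rank[u] += share
        rank = new_rank
    return rank

# Candidates coherent with the rest of the document score higher than
# isolated ones, which a disambiguation step can then prune.
graph = {
    "Lincoln_United_F.C.": ["Lincolnshire"],
    "Boston_United_F.C.": ["Lincolnshire"],
    "Lincolnshire": ["Lincoln_United_F.C.", "Boston_United_F.C."],
    "Lincoln,_Nebraska": [],  # an invented, incoherent candidate
}
teleport = {v: 1 / len(graph) for v in graph}
scores = personalized_pagerank(graph, teleport)
```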

3.1.4 NIL selection and clustering

In the NIL selection step the entity linking algorithm decides whether the mention has a corresponding entity present in the knowledge base or whether it is a NIL. This can be approached for example by always adding a NIL candidate to the candidate set of each mention and resolving the NIL selection within the mention disambiguation step [Luo et al., 2015]. Another approach is to create additional features based on the output of the previous step, for example the relative score between the first two candidates, and to create a separate model that decides whether the mention should be a NIL or not [Ratinov et al., 2011].

NIL clustering is the task of grouping NILs that probably refer to the same entity. This can be performed for example using mention text similarity or coreference resolution [Radford et al., 2011]. In general, this task can be approached using (unsupervised) clustering methods.

This part of entity linking is mostly skipped in the referenced papers. The TAC-KBP contest contains NIL clustering as a task, but the datasets are not publicly available. The Named Entity recognition and Linking (NEEL) challenge dataset contains NIL clustering annotations; however, we encountered issues when fetching the corresponding tweet contents. The CoNLL dataset does not have NIL clustering annotations, and the NIL selection step is skipped in many works by focusing only on mentions with corresponding entities present in the knowledge base, for example in Hoffart et al. [2011], Pershina et al. [2015] or Yamada et al. [2016].

3.2 Related work

Named entity recognition and linking is a challenging problem and is well studied. We describe several inspiring papers concerning the entity linking task.

Local and Global Algorithms for Disambiguation to Wikipedia [Ratinov et al., 2011]

Ratinov et al. [2011] analyse approaches utilizing additional information from the Wikipedia link structure. The authors introduce a formulation of the Disambiguation to Wikipedia (entity linking) task as an optimization problem with local and global variants. Using this formulation they present a new global entity linking system called GLOW. The system consists of mention candidate generation, ranking the

candidates using local and global features, and linking, which decides whether the top-ranked candidate is the correct disambiguation or whether it should be classified as a NIL. The authors use cosine similarity between TF-IDF summaries of the mention and the entity for local features, and relatedness of the sets of Wikipedia incoming and outgoing links for global features. This work is used as a basis for a publicly available NERL system called Wikifier made by the authors.

Robust Disambiguation of Named Entities in Text [Hoffart et al., 2011]

In the work of Hoffart et al. [2011] the authors present their system AIDA for collective disambiguation, harnessing context from knowledge bases and using a new form of a coherence graph. AIDA consists of candidate generation using the YAGO means relation and mention disambiguation, where each candidate gets a score based on its prior probability, its similarity to the mention and its coherence within the coherence graph. This work also introduces the CoNLL dataset, which is described in Subsection 2.3.1. The authors report that AIDA achieves 81.91% micro-accuracy (precision) and 81.82% macro-accuracy on the CoNLL testb dataset. AIDA outperformed prior methods in terms of accuracy. The system is fully implemented and accessible online.

Improving efficiency and accuracy in multilingual entity extraction [Daiber et al., 2013]

This work describes enhancements of DBpedia Spotlight, an open source project developing a system for automatic annotation of DBpedia entities in natural language text [Daiber et al., 2013]. DBpedia Spotlight works by finding mentions in a given document by looking for known anchor texts (or using NLP methods when available for the given language), selecting the best candidates using the anchor text, and using a generative probabilistic model (composed of $P(e)$, $P(\text{mention text} \mid e)$ and $P(\text{context} \mid e)$) to disambiguate between the candidates for each mention independently. DBpedia Spotlight provides programmatic interfaces and is available for

download. It was evaluated by Nguyen et al. [2014] on the CoNLL dataset with a resulting precision of 75%.

AIDA-light: High-throughput named-entity disambiguation [Nguyen et al., 2014]

AIDA-light is a successor of the entity linking system AIDA [Hoffart et al., 2011]. The main goal of AIDA-light is to be a named entity linking system that is both fast and robust. Like AIDA, AIDA-light uses both local mention-candidate similarities and global candidate-candidate similarities, and uses YAGO as a knowledge base. On top of that, AIDA-light uses a two-stage algorithm for mention disambiguation: it first disambiguates easy and low-cost mentions, extracts the domain of the document, and then uses this context domain information for disambiguating the rest of the mentions. AIDA-light achieves a precision of 84.8% on the CoNLL dataset and a top-5 precision of 95.2%, while also being orders of magnitude faster than AIDA. AIDA-light is publicly available for download.

Personalized Page Rank for Named Entity Disambiguation [Pershina et al., 2015]

This work introduces a graph-based disambiguation approach, called PPRSim, based on Personalized PageRank (PPR), that combines both local and global information features. For candidate generation the authors use Wikipedia titles and coreference resolution to create the candidate sets. For mention disambiguation the authors construct a coherence graph between candidates and combine PPR with two constraints to reduce noise from incorrect candidates, computing a score based on local similarity and global PPR similarity. The authors have publicly released their candidate sets for the CoNLL dataset, called PPRForNED, as described in Subsection 2.3.2. This work outperformed the then state of the art on CoNLL by achieving 91.7% micro-accuracy and 89.9% macro-accuracy. However, these results are not exactly comparable, because they were measured on the whole dataset and not just the testb part of it. Furthermore, these results were achieved utilizing gold annotations of the testing data during the candidate generation step.
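As a small illustration of the NIL selection heuristic from Subsection 3.1.4 (a separate decision based on the score of the best candidate and the relative score of the first two candidates, in the spirit of Ratinov et al. [2011]), consider the following sketch. The thresholds and the exact decision rule are our own assumptions; real systems typically learn such a rule from features rather than hard-coding it.

```python
def select_or_nil(scored_candidates, min_score=0.3, min_margin=0.1):
    """scored_candidates: list of (entity, score) sorted by score, best first.

    Returns the winning entity, or None to signal a NIL. Both thresholds
    are illustrative assumptions, not values from any of the cited systems.
    """
    if not scored_candidates:
        return None  # no candidate was generated at all
    best_entity, best_score = scored_candidates[0]
    if best_score < min_score:
        return None  # even the best candidate is a poor match
    if len(scored_candidates) > 1:
        margin = best_score - scored_candidates[1][1]
        if margin < min_margin:
            return None  # too ambiguous to commit to a single entity
    return best_entity

# select_or_nil([("Washington,_D.C.", 0.8), ("George_Washington", 0.2)])
# -> 'Washington,_D.C.'
```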


More information

Final Project Discussion. Adam Meyers Montclair State University

Final Project Discussion. Adam Meyers Montclair State University Final Project Discussion Adam Meyers Montclair State University Summary Project Timeline Project Format Details/Examples for Different Project Types Linguistic Resource Projects: Annotation, Lexicons,...

More information

Token Gazetteer and Character Gazetteer for Named Entity Recognition

Token Gazetteer and Character Gazetteer for Named Entity Recognition Token Gazetteer and Character Gazetteer for Named Entity Recognition Giang Nguyen, Štefan Dlugolinský, Michal Laclavík, Martin Šeleng Institute of Informatics, Slovak Academy of Sciences Dúbravská cesta

More information

WebSAIL Wikifier at ERD 2014

WebSAIL Wikifier at ERD 2014 WebSAIL Wikifier at ERD 2014 Thanapon Noraset, Chandra Sekhar Bhagavatula, Doug Downey Department of Electrical Engineering & Computer Science, Northwestern University {nor.thanapon, csbhagav}@u.northwestern.edu,ddowney@eecs.northwestern.edu

More information

Mining Wikipedia s Snippets Graph: First Step to Build A New Knowledge Base

Mining Wikipedia s Snippets Graph: First Step to Build A New Knowledge Base Mining Wikipedia s Snippets Graph: First Step to Build A New Knowledge Base Andias Wira-Alam and Brigitte Mathiak GESIS - Leibniz-Institute for the Social Sciences Unter Sachsenhausen 6-8, 50667 Köln,

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,

More information

An Adaptive Framework for Named Entity Combination

An Adaptive Framework for Named Entity Combination An Adaptive Framework for Named Entity Combination Bogdan Sacaleanu 1, Günter Neumann 2 1 IMC AG, 2 DFKI GmbH 1 New Business Department, 2 Language Technology Department Saarbrücken, Germany E-mail: Bogdan.Sacaleanu@im-c.de,

More information

What is this Song About?: Identification of Keywords in Bollywood Lyrics

What is this Song About?: Identification of Keywords in Bollywood Lyrics What is this Song About?: Identification of Keywords in Bollywood Lyrics by Drushti Apoorva G, Kritik Mathur, Priyansh Agrawal, Radhika Mamidi in 19th International Conference on Computational Linguistics

More information

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala Master Project Various Aspects of Recommender Systems May 2nd, 2017 Master project SS17 Albert-Ludwigs-Universität Freiburg Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue

More information

THE GETTY VOCABULARIES TECHNICAL UPDATE

THE GETTY VOCABULARIES TECHNICAL UPDATE AAT TGN ULAN CONA THE GETTY VOCABULARIES TECHNICAL UPDATE International Working Group Meetings January 7-10, 2013 Joan Cobb Gregg Garcia Information Technology Services J. Paul Getty Trust International

More information

Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP09] Running Example. Mention Pair Model. Mention Pair Example

Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP09] Running Example. Mention Pair Model. Mention Pair Example Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP09] Many machine learning models for coreference resolution have been created, using not only different feature sets but also fundamentally

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Building a Large-Scale Cross-Lingual Knowledge Base from Heterogeneous Online Wikis

Building a Large-Scale Cross-Lingual Knowledge Base from Heterogeneous Online Wikis Building a Large-Scale Cross-Lingual Knowledge Base from Heterogeneous Online Wikis Mingyang Li (B), Yao Shi, Zhigang Wang, and Yongbin Liu Department of Computer Science and Technology, Tsinghua University,

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Observe novel applicability of DL techniques in Big Data Analytics. Applications of DL techniques for common Big Data Analytics problems. Semantic indexing

More information

Semi-Supervised Learning of Named Entity Substructure

Semi-Supervised Learning of Named Entity Substructure Semi-Supervised Learning of Named Entity Substructure Alden Timme aotimme@stanford.edu CS229 Final Project Advisor: Richard Socher richard@socher.org Abstract The goal of this project was two-fold: (1)

More information

Question Answering Systems

Question Answering Systems Question Answering Systems An Introduction Potsdam, Germany, 14 July 2011 Saeedeh Momtazi Information Systems Group Outline 2 1 Introduction Outline 2 1 Introduction 2 History Outline 2 1 Introduction

More information

Stanford-UBC at TAC-KBP

Stanford-UBC at TAC-KBP Stanford-UBC at TAC-KBP Eneko Agirre, Angel Chang, Dan Jurafsky, Christopher Manning, Valentin Spitkovsky, Eric Yeh Ixa NLP group, University of the Basque Country NLP group, Stanford University Outline

More information

Text Mining for Software Engineering

Text Mining for Software Engineering Text Mining for Software Engineering Faculty of Informatics Institute for Program Structures and Data Organization (IPD) Universität Karlsruhe (TH), Germany Department of Computer Science and Software

More information

CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools

CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools Wahed Hemati, Alexander Mehler, and Tolga Uslu Text Technology Lab, Goethe Universitt

More information

Computer-assisted Ontology Construction System: Focus on Bootstrapping Capabilities

Computer-assisted Ontology Construction System: Focus on Bootstrapping Capabilities Computer-assisted Ontology Construction System: Focus on Bootstrapping Capabilities Omar Qawasmeh 1, Maxime Lefranois 2, Antoine Zimmermann 2, Pierre Maret 1 1 Univ. Lyon, CNRS, Lab. Hubert Curien UMR

More information

LODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.1 October 7, 2016

LODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.1 October 7, 2016 LODtagger Guide for Users and Developers Bahar Sateli René Witte Release 1.1 October 7, 2016 Semantic Software Lab Concordia University Montréal, Canada http://www.semanticsoftware.info Contents 1 LODtagger

More information

Making Sense Out of the Web

Making Sense Out of the Web Making Sense Out of the Web Rada Mihalcea University of North Texas Department of Computer Science rada@cs.unt.edu Abstract. In the past few years, we have witnessed a tremendous growth of the World Wide

More information

DBPedia (dbpedia.org)

DBPedia (dbpedia.org) Matt Harbers Databases and the Web April 22 nd, 2011 DBPedia (dbpedia.org) What is it? DBpedia is a community whose goal is to provide a web based open source data set of RDF triples based on Wikipedia

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Named Entity Detection and Entity Linking in the Context of Semantic Web

Named Entity Detection and Entity Linking in the Context of Semantic Web [1/52] Concordia Seminar - December 2012 Named Entity Detection and in the Context of Semantic Web Exploring the ambiguity question. Eric Charton, Ph.D. [2/52] Concordia Seminar - December 2012 Challenge

More information

Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track

Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track Jeffrey Dalton University of Massachusetts, Amherst jdalton@cs.umass.edu Laura

More information

Enhanced retrieval using semantic technologies:

Enhanced retrieval using semantic technologies: Enhanced retrieval using semantic technologies: Ontology based retrieval as a new search paradigm? - Considerations based on new projects at the Bavarian State Library Dr. Berthold Gillitzer 28. Mai 2008

More information

LODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.0 July 24, 2015

LODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.0 July 24, 2015 LODtagger Guide for Users and Developers Bahar Sateli René Witte Release 1.0 July 24, 2015 Semantic Software Lab Concordia University Montréal, Canada http://www.semanticsoftware.info Contents 1 LODtagger

More information

Technische Universität Dresden Fakultät Informatik. Wikidata. A Free Collaborative Knowledge Base. Markus Krötzsch TU Dresden.

Technische Universität Dresden Fakultät Informatik. Wikidata. A Free Collaborative Knowledge Base. Markus Krötzsch TU Dresden. Technische Universität Dresden Fakultät Informatik Wikidata A Free Collaborative Knowledge Base Markus Krötzsch TU Dresden IBM June 2015 Where is Wikipedia Going? Wikipedia in 2015: A project that has

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Markov Chains for Robust Graph-based Commonsense Information Extraction

Markov Chains for Robust Graph-based Commonsense Information Extraction Markov Chains for Robust Graph-based Commonsense Information Extraction N iket Tandon 1,4 Dheera j Ra jagopal 2,4 Gerard de M elo 3 (1) Max Planck Institute for Informatics, Germany (2) NUS, Singapore

More information

Natural Language Processing. SoSe Question Answering

Natural Language Processing. SoSe Question Answering Natural Language Processing SoSe 2017 Question Answering Dr. Mariana Neves July 5th, 2017 Motivation Find small segments of text which answer users questions (http://start.csail.mit.edu/) 2 3 Motivation

More information

Finding Related Entities by Retrieving Relations: UIUC at TREC 2009 Entity Track

Finding Related Entities by Retrieving Relations: UIUC at TREC 2009 Entity Track Finding Related Entities by Retrieving Relations: UIUC at TREC 2009 Entity Track V.G.Vinod Vydiswaran, Kavita Ganesan, Yuanhua Lv, Jing He, ChengXiang Zhai Department of Computer Science University of

More information

Natural Language Processing with PoolParty

Natural Language Processing with PoolParty Natural Language Processing with PoolParty Table of Content Introduction to PoolParty 2 Resolving Language Problems 4 Key Features 5 Entity Extraction and Term Extraction 5 Shadow Concepts 6 Word Sense

More information

Let s get parsing! Each component processes the Doc object, then passes it on. doc.is_parsed attribute checks whether a Doc object has been parsed

Let s get parsing! Each component processes the Doc object, then passes it on. doc.is_parsed attribute checks whether a Doc object has been parsed Let s get parsing! SpaCy default model includes tagger, parser and entity recognizer nlp = spacy.load('en ) tells spacy to use "en" with ["tagger", "parser", "ner"] Each component processes the Doc object,

More information

A cocktail approach to the VideoCLEF 09 linking task

A cocktail approach to the VideoCLEF 09 linking task A cocktail approach to the VideoCLEF 09 linking task Stephan Raaijmakers Corné Versloot Joost de Wit TNO Information and Communication Technology Delft, The Netherlands {stephan.raaijmakers,corne.versloot,

More information

YAGO: a Multilingual Knowledge Base from Wikipedia, Wordnet, and Geonames

YAGO: a Multilingual Knowledge Base from Wikipedia, Wordnet, and Geonames 1/24 a from Wikipedia, Wordnet, and Geonames 1, Fabian Suchanek 1, Johannes Hoffart 2, Joanna Biega 2, Erdal Kuzey 2, Gerhard Weikum 2 Who uses 1 Télécom ParisTech 2 Max Planck Institute for Informatics

More information

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ???

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ??? @ INSIDE DEEPQA Managing complex unstructured data with UIMA Simon Ellis INTRODUCTION 22 nd November, 2013 WAT SON TECHNOLOGIES AND OPEN ARCHIT ECT URE QUEST ION ANSWERING PROFESSOR JIM HENDLER S IMON

More information

A Korean Knowledge Extraction System for Enriching a KBox

A Korean Knowledge Extraction System for Enriching a KBox A Korean Knowledge Extraction System for Enriching a KBox Sangha Nam, Eun-kyung Kim, Jiho Kim, Yoosung Jung, Kijong Han, Key-Sun Choi KAIST / The Republic of Korea {nam.sangha, kekeeo, hogajiho, wjd1004109,

More information

Introduction to Text Mining. Hongning Wang

Introduction to Text Mining. Hongning Wang Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:

More information

Influence of Word Normalization on Text Classification

Influence of Word Normalization on Text Classification Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we

More information

QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK

QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK NG, Jun Ping National University of Singapore ngjp@nus.edu.sg 30 November 2009 The latest version of QANUS and this documentation can always be downloaded from

More information

Handling Place References in Text

Handling Place References in Text Handling Place References in Text Introduction Most (geographic) information is available in the form of textual documents Place reference resolution involves two-subtasks: Recognition : Delimiting occurrences

More information

Text, Knowledge, and Information Extraction. Lizhen Qu

Text, Knowledge, and Information Extraction. Lizhen Qu Text, Knowledge, and Information Extraction Lizhen Qu A bit about Myself PhD: Databases and Information Systems Group (MPII) Advisors: Prof. Gerhard Weikum and Prof. Rainer Gemulla Thesis: Sentiment Analysis

More information

Techreport for GERBIL V1

Techreport for GERBIL V1 Techreport for GERBIL 1.2.2 - V1 Michael Röder, Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo February 21, 2016 Current Development of GERBIL Recently, we released the latest version 1.2.2 of GERBIL [16] 1.

More information

Generalizing from Freebase and Patterns using Cluster-Based Distant Supervision for KBP Slot-Filling

Generalizing from Freebase and Patterns using Cluster-Based Distant Supervision for KBP Slot-Filling Generalizing from Freebase and Patterns using Cluster-Based Distant Supervision for KBP Slot-Filling Benjamin Roth Grzegorz Chrupała Michael Wiegand Mittul Singh Dietrich Klakow Spoken Language Systems

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators

ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators Pablo Ruiz, Thierry Poibeau and Frédérique Mélanie Laboratoire LATTICE CNRS, École Normale Supérieure, U Paris 3 Sorbonne Nouvelle

More information

Semantic Annotation of Web Resources Using IdentityRank and Wikipedia

Semantic Annotation of Web Resources Using IdentityRank and Wikipedia Semantic Annotation of Web Resources Using IdentityRank and Wikipedia Norberto Fernández, José M.Blázquez, Luis Sánchez, and Vicente Luque Telematic Engineering Department. Carlos III University of Madrid

More information

Robust and Collective Entity Disambiguation through Semantic Embeddings

Robust and Collective Entity Disambiguation through Semantic Embeddings Robust and Collective Entity Disambiguation through Semantic Embeddings Stefan Zwicklbauer University of Passau Passau, Germany szwicklbauer@acm.org Christin Seifert University of Passau Passau, Germany

More information

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & 6367(Print), ISSN 0976 6375(Online) Volume 3, Issue 1, January- June (2012), TECHNOLOGY (IJCET) IAEME ISSN 0976 6367(Print) ISSN 0976 6375(Online) Volume

More information

Building Multilingual Resources and Neural Models for Word Sense Disambiguation. Alessandro Raganato March 15th, 2018

Building Multilingual Resources and Neural Models for Word Sense Disambiguation. Alessandro Raganato March 15th, 2018 Building Multilingual Resources and Neural Models for Word Sense Disambiguation Alessandro Raganato March 15th, 2018 About me alessandro.raganato@helsinki.fi http://wwwusers.di.uniroma1.it/~raganato ERC

More information

Language Resources and Linked Data

Language Resources and Linked Data Integrating NLP with Linked Data: the NIF Format Milan Dojchinovski @EKAW 2014 November 24-28, 2014, Linkoping, Sweden milan.dojchinovski@fit.cvut.cz - @m1ci - http://dojchinovski.mk Web Intelligence Research

More information

Provided by the author(s) and NUI Galway in accordance with publisher policies. Please cite the published version when available. Title Entity Linking with Multiple Knowledge Bases: An Ontology Modularization

More information

Orchestrating Music Queries via the Semantic Web

Orchestrating Music Queries via the Semantic Web Orchestrating Music Queries via the Semantic Web Milos Vukicevic, John Galletly American University in Bulgaria Blagoevgrad 2700 Bulgaria +359 73 888 466 milossmi@gmail.com, jgalletly@aubg.bg Abstract

More information

Entity and Knowledge Base-oriented Information Retrieval

Entity and Knowledge Base-oriented Information Retrieval Entity and Knowledge Base-oriented Information Retrieval Presenter: Liuqing Li liuqing@vt.edu Digital Library Research Laboratory Virginia Polytechnic Institute and State University Blacksburg, VA 24061

More information

Exam Marco Kuhlmann. This exam consists of three parts:

Exam Marco Kuhlmann. This exam consists of three parts: TDDE09, 729A27 Natural Language Processing (2017) Exam 2017-03-13 Marco Kuhlmann This exam consists of three parts: 1. Part A consists of 5 items, each worth 3 points. These items test your understanding

More information

Mutual Disambiguation for Entity Linking

Mutual Disambiguation for Entity Linking Mutual Disambiguation for Entity Linking Eric Charton Polytechnique Montréal Montréal, QC, Canada eric.charton@polymtl.ca Marie-Jean Meurs Concordia University Montréal, QC, Canada marie-jean.meurs@concordia.ca

More information

NATURAL LANGUAGE PROCESSING

NATURAL LANGUAGE PROCESSING NATURAL LANGUAGE PROCESSING LESSON 9 : SEMANTIC SIMILARITY OUTLINE Semantic Relations Semantic Similarity Levels Sense Level Word Level Text Level WordNet-based Similarity Methods Hybrid Methods Similarity

More information

Linked Data Evolving the Web into a Global Data Space

Linked Data Evolving the Web into a Global Data Space Linked Data Evolving the Web into a Global Data Space Anja Jentzsch, Freie Universität Berlin 05 October 2011 EuropeanaTech 2011, Vienna 1 Architecture of the classic Web Single global document space Web

More information

Tools and Infrastructure for Supporting Enterprise Knowledge Graphs

Tools and Infrastructure for Supporting Enterprise Knowledge Graphs Tools and Infrastructure for Supporting Enterprise Knowledge Graphs Sumit Bhatia, Nidhi Rajshree, Anshu Jain, and Nitish Aggarwal IBM Research sumitbhatia@in.ibm.com, {nidhi.rajshree,anshu.n.jain}@us.ibm.com,nitish.aggarwal@ibm.com

More information

LIDER Survey. Overview. Number of participants: 24. Participant profile (organisation type, industry sector) Relevant use-cases

LIDER Survey. Overview. Number of participants: 24. Participant profile (organisation type, industry sector) Relevant use-cases LIDER Survey Overview Participant profile (organisation type, industry sector) Relevant use-cases Discovering and extracting information Understanding opinion Content and data (Data Management) Monitoring

More information

A Joint Model for Discovering and Linking Entities

A Joint Model for Discovering and Linking Entities A Joint Model for Discovering and Linking Entities Michael Wick Sameer Singh Harshal Pandya Andrew McCallum School of Computer Science University of Massachusetts Amherst MA {mwick, sameer, harshal, mccallum}@cs.umass.edu

More information

automatic digitization. In the context of ever increasing population worldwide and thereby

automatic digitization. In the context of ever increasing population worldwide and thereby Chapter 1 Introduction In the recent time, many researchers had thrust upon developing various improvised methods of automatic digitization. In the context of ever increasing population worldwide and thereby

More information

A service based on Linked Data to classify Web resources using a Knowledge Organisation System

A service based on Linked Data to classify Web resources using a Knowledge Organisation System A service based on Linked Data to classify Web resources using a Knowledge Organisation System A proof of concept in the Open Educational Resources domain Abstract One of the reasons why Web resources

More information