
MASTER THESIS

Bc. Pavel Taufer

Named Entity Recognition and Linking

Institute of Formal and Applied Linguistics

Supervisor of the master thesis: RNDr. Milan Straka, Ph.D.
Study programme: Informatics
Study branch: Artificial Intelligence

Prague 2017

I declare that I carried out this master thesis independently, and only with the cited sources, literature and other professional sources. I understand that my work relates to the rights and obligations under the Act No. 121/2000 Sb., the Copyright Act, as amended, in particular the fact that the Charles University has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 subsection 1 of the Copyright Act.

In... date... signature of the author

Title: Named Entity Recognition and Linking

Author: Bc. Pavel Taufer

Institute: Institute of Formal and Applied Linguistics

Supervisor: RNDr. Milan Straka, Ph.D., Institute of Formal and Applied Linguistics

Abstract: The goal of this master thesis is to design and implement a named entity recognition and linking algorithm. A part of this goal is to propose and create a knowledge base that will be used by the algorithm. Because of the limited amount of data for languages other than English, we want to be able to train our method on one language and then transfer the learned parameters to other languages (that do not have enough training data). The thesis consists of a description of available knowledge bases and existing methods, and of the design and implementation of our own knowledge base and entity linking method. Our method achieves state-of-the-art results on a few variants of the AIDA CoNLL-YAGO dataset. The method also obtains comparable results on a sample of Czech annotated data from the PDT dataset using the parameters trained on the English CoNLL dataset.

Keywords: named entities, named entity recognition, named entity linking

I would like to thank my thesis supervisor RNDr. Milan Straka, Ph.D. for many interesting insights and comments and for his help with writing this thesis.

Contents

1 Introduction
    Our goals
    Structure of the thesis
2 Existing Resources
    Knowledge bases
        Wikipedia
        DBpedia
        Wikidata
        Other knowledge bases
    Glossary
    Datasets
        AIDA CoNLL-YAGO Dataset
        PPRForNED
        PDT
        TAC
3 Named entity recognition and linking task
    NERL steps
        Named entity recognition
        Candidate generation
        Mention disambiguation
        NIL selection and clustering
    Related work
4 Proposed knowledge base
    Design goals
    Used knowledge bases
    Proposed knowledge base contents
    Construction
5 Proposed entity linking method
    τsim
    Candidate generation
        Features
        Weights learning algorithm
    Mention disambiguation
        PPRSim
        Our extensions of PPRSim
        Hyperparameter search
6 Experiments and Results
    Datasets preparation
    Candidate generation
        Evaluation
        Baselines
        Features
        Experiments
        Results
    Mention disambiguation
        Baselines
        Evaluation methods
        Mention disambiguation hyperparameters
        Baseline comparison results
        Comparison with other works
        PDT results
7 User documentation and attachment content
8 Conclusion
Bibliography
List of Figures
List of Tables

1. Introduction

Named entity recognition (NER) is the task of identifying so-called named entities, which are prominent words or sequences of words denoting names, locations, organizations, etc. In general, any title of a Wikipedia page can be considered a named entity. Furthermore, the task also consists of classifying each recognized named entity into one of the preselected categories, like personal names, organizations or locations. Entity linking (EL) is the task of linking named entities to a knowledge base (KB). A knowledge base is a list of named entities and information about them. For instance, Wikipedia can be considered one of the largest publicly well-known knowledge bases. Entity linking is also known as named entity disambiguation (NED). Both of these tasks together are called the named entity recognition and linking (NERL) task.

There are many raw texts that contain named entities but do not contain their annotations, for example news articles, web pages or books. Named entity recognition and linking algorithms can be used to extract named entity annotations and link them to a knowledge base. This in turn allows for smarter document searching, for example by distance using geo-coordinates. It is also possible to obtain all documents referring to a specific entity, for example to a person or a place. Another use case of entity linking is machine translation: when a KB contains multilingual data, we can translate named entities separately from the rest of the text and merge them after translation.

The main difficulty in EL is the ambiguity of named entities. In general, a single named entity can correspond to multiple entities in a KB. An example could be the named entity Washington, which could represent a city, a person or a state. In these ambiguous cases EL can use contextual information as well as coherence of the present named entities to try to disambiguate among the possible candidates.

Named entity recognition has been studied for a long time; the NER task was presented at the MUC-6 conference, which was held in November 1995. Since then, the NER task has been further studied, for example, in the CoNLL 2002

and the CoNLL 2003 shared tasks, or the NEEL challenge. The most studied language is English, but several other languages are also represented: for example, the CoNLL 2002 task contains texts in Spanish and Dutch, the CoNLL 2003 task contains texts in English and German, and TAC KBP 2016 focuses on English, Chinese and Spanish. The datasets published in the above challenges are still extensively studied today.

Methods for entity linking go back to the works of Dill et al. [2003], Cucerzan [2007] and Milne and Witten [2008b] and have been well studied since. The entity linking task has been suggested under several names during the years, for example Wikification [Ratinov et al., 2011], Grounding [Leidner et al., 2003] or Named-entity disambiguation [Hoffart et al., 2011]. There have been several contests in recent years aimed at performing both NER and EL, for example the TAC (Text Analysis Conference) KBP tasks. A few other datasets are publicly available for training and evaluation of EL algorithms, but most of them are in English. One of the most studied datasets is the AIDA CoNLL-YAGO dataset introduced in Hoffart et al. [2011], consisting of annotated English articles.

A good knowledge base is a necessity for an entity linking algorithm. The KB is not used only as a target for linking; some features present in the KB can help with the entity linking task itself by providing, for example, relations between entities, multilingual text data or ontology classification. There are several knowledge bases publicly available for download and machine processing.

1.1 Our goals

One of the goals of this thesis is to design and create a KB for our EL algorithm. Another goal is to create a named entity linking algorithm that will work with only a minimum of language-dependent knowledge or will be able to utilize supervised data from another language. The reason behind this is the availability of English supervised data and the lack of supervised data in other languages.

Our contributions

We have designed and performed the following tasks to accomplish our goals:

design and create a knowledge base from available sources,
acquire datasets for testing and modify them to link to our knowledge base,
design and implement an entity linking algorithm,
evaluate our method on the acquired datasets.

All the described tasks were performed exclusively by the thesis author. On several variants of the AIDA CoNLL-YAGO dataset our implemented algorithm achieves state-of-the-art results. Furthermore, on a sample of Czech annotated data from the PDT dataset we achieved satisfying results using parameters trained on the English CoNLL dataset.

1.2 Structure of the thesis

In Chapter 2, we explore existing available resources for entity linking. We examine knowledge bases, datasets and existing systems. In Chapter 3, we describe the related work on named entity recognition and linking. Our knowledge base, constructed from preexisting knowledge bases, is proposed in Chapter 4. The main part of this thesis is in Chapter 5 and contains the description of our entity linking method. Chapter 6 contains several experiments we performed on the available datasets and our results. The description of attached files and the user documentation is in Chapter 7. Finally, Chapter 8 is the conclusion of this thesis and contains a summary of the achieved outcomes.

2. Existing Resources

In this chapter, we describe available resources for entity linking. First we examine available knowledge bases and the data they contain. Then we inspect what datasets are available for training and evaluation of entity linking algorithms. Additionally, we establish a glossary of terms that will be used for the rest of this work.

2.1 Knowledge bases

A knowledge base in general consists of a list of entities, information about each of the present entities and data about relations among entities. The most commonly known knowledge base is probably Wikipedia. In the entity linking task we are mostly interested in entities that can occur in the texts that we want to process. We want the following properties from a KB:

Public availability - it can be used and verified (or even contributed to) by others,
Machine readability - algorithms can access information easily,
Persistent identifiers - entities should have an identifier that does not change over time and is language independent,
Credibility - the KB is either human created and verified/sourced or auto-created from some other knowledge base that already satisfies this property,
Multilingual text availability - we can use it for different languages without having to construct a new KB for each language in a different way,
Relations between entities - the knowledge base can tell us how related two entities are (for example using common incoming links).

We now examine several publicly available knowledge bases, starting with the Wikipedia, DBpedia and Wikidata knowledge bases and then mentioning a few other knowledge bases like YAGO or Freebase.

2.1.1 Wikipedia

Wikipedia was launched on January 15, 2001. It is owned by the nonprofit Wikimedia Foundation. Wikipedia contains 284 language editions with over 42 million pages in total (over 5 million articles for the English language) as of November 2016.

It consists of pages that are uniquely identified by their page title. Wikipedia uses a system of references and citations to try to make sure that the contents of articles are attributable to a reliable published source. Pages in Wikipedia are mainly of the following types:

Entity - pages about traditional encyclopedia topics, people, places, media, companies, events and more.
Disambiguation - used when a page title is ambiguous because it has multiple meanings. Contains links to different page titles for each meaning.
Redirect - points from one page title to another and is used when more than one possible page title exists. It may also point to a specific part of a page.
Administrative - user pages, internal Wikipedia information, templates.

As stated before, Wikipedia pages are uniquely identified by their page title. A single entity can have multiple distinct page titles over time and therefore does not have a persistent identifier. This is most prominent for people. For example, consider that there exists a famous sportsman John Doe. A page about this person is created under the page title John Doe. A URL link to this page can be used by anyone to refer to the sportsman John Doe. But later someone else named John Doe becomes notable as a singer and it is decided that they should have their page too. The page name John Doe is already taken, so for the second person the page name John Doe (singer) is chosen. At this moment an editor will probably rename the page John Doe to John Doe (sport) and create a disambiguation page John Doe with links to the other two pages. Now any old link to the page John Doe leads to a disambiguation page instead of the first John Doe. Page titles can also be changed for other reasons, or pages can even be deleted. Therefore, a Wikipedia page name is not a long-term reliable entity identifier.

Wikimedia provides a way to download a dump of Wikipedia for each language edition as an XML file. The contents of the XML dumps are mostly raw pages containing text and infoboxes, so the dumps are hard to use directly without preprocessing. The dumps are released twice a month. Table 2.1 summarizes page type counts for the Czech and English language editions of Wikipedia.
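To give an idea of the preprocessing involved, the following is a minimal Python sketch of streaming pages out of such an XML dump. It is only an illustration under stated assumptions: the dump file name is hypothetical, and the MediaWiki export namespace version may differ between dump releases.

```python
import bz2
import xml.etree.ElementTree as ET

# Namespace of the MediaWiki export schema; the exact version suffix is
# an assumption and should be checked against the dump at hand.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(dump_path):
    """Stream (title, redirect_target_or_None) pairs from a pages-articles
    dump without loading the whole multi-gigabyte file into memory."""
    with bz2.open(dump_path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                redirect = elem.find(NS + "redirect")
                target = redirect.get("title") if redirect is not None else None
                yield title, target
                elem.clear()  # free the processed subtree

# Example: count redirect pages in a (hypothetical) Czech dump file.
# redirects = sum(1 for _t, target in
#                 iter_pages("cswiki-pages-articles.xml.bz2") if target)
```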

Language          English       Czech
Pages total       12,737,…      …,507
Entities          4,841,…       …,457
Redirects         7,628,…       …,400
Disambiguations   267,491       8,647
Links             181,980,575   13,050,035

Table 2.1: Wikipedia overview as of September 2016

2.1.2 DBpedia

As described directly on the DBpedia about page:

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. We hope that this work will make it easier for the huge amount of information in Wikipedia to be used in some new interesting ways. Furthermore, it might inspire new mechanisms for navigating, linking, and improving the encyclopedia itself.

The project was started by the Free University of Berlin and Leipzig University, in collaboration with OpenLink Software, and the first publicly available dataset was published in 2007. DBpedia provides an extraction framework and a set of extractors to process Wikipedia dumps into a more machine-readable format. The maintainers of DBpedia keep the extractors up to date for multiple language editions. The extraction framework also allows other developers to create their own custom extractors. Extractors use a set of rules and heuristics to extract the wanted knowledge from Wikipedia pages. Because Wikipedia pages are not strictly structured, the output of the extractors can contain errors, either from imprecise heuristics or from malformed data in the page itself. The available extractors provide for example a list of texts present in hyperlinks to each page (also called anchor texts), a list of redirects to each page, disambiguation links from each disambiguation page, occurrences of geographic coordinates or infobox properties. DBpedia uses a resource URL derived from the page title as an entity identifier, so it has the same problems as Wikipedia regarding entity identifier persistence.
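As an illustration of consuming the extractor output, the following sketch loads a redirects dump serialized as N-Triples into a Python dictionary. The file name is a hypothetical example; the sketch assumes the dbo:wikiPageRedirects predicate that DBpedia uses for redirects, and the naive whitespace split relies on resource IRIs containing no spaces.

```python
import bz2

RESOURCE_PREFIX = "http://dbpedia.org/resource/"

def load_redirects(path):
    """Parse N-Triples lines of the form
    <.../resource/A> <.../ontology/wikiPageRedirects> <.../resource/B> .
    into a {source_title: target_title} dictionary. Requires Python 3.9+
    for str.removeprefix."""
    redirects = {}
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue  # comment line
            parts = line.split()
            if len(parts) < 4 or "wikiPageRedirects" not in parts[1]:
                continue  # not a redirect triple (or a malformed line)
            source = parts[0].strip("<>").removeprefix(RESOURCE_PREFIX)
            target = parts[2].strip("<>").removeprefix(RESOURCE_PREFIX)
            redirects[source] = target
    return redirects

# redirects = load_redirects("redirects_en.ttl.bz2")  # hypothetical file name
```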

The maintainers of DBpedia also maintain an online instance that is updated live from a continuous stream of updates from Wikipedia. Several official dataset dumps were made available online, the last one being from October 2016. Another way to obtain the DBpedia dataset is to download a Wikipedia dump and use the DBpedia extraction framework to extract the DBpedia dataset locally. There are also infrequently published full dumps of the live DBpedia instance and its incremental updates to keep up to date with live Wikipedia.

2.1.3 Wikidata

Wikidata is a collaboratively edited knowledge base operated by the Wikimedia Foundation. It was launched on October 29, 2012. It is intended to provide a common source of data to be used in other Wikimedia projects, and by anyone else, under a public domain licence. The content of Wikidata is made both by humans and by automatic scripts that, for example, take care of newly created pages in Wikipedia. As stated at the Wikidata introduction page:

Unlike Wikimedia Commons, which collects media files, and the Wikipedias, which produce encyclopedic articles, Wikidata will collect data, in a structured form. This will allow easy reuse of that data by Wikimedia projects and third parties, and will enable computers to easily process and understand it.

Wikidata is composed of entities of three types: items, properties and queries. The following descriptions of Wikidata entities, items and properties were retrieved from the Wikidata glossary page:

Entity is the data content of a Wikidata page, that may be either an item, a property or a query. Every entity is uniquely identified by an entity ID, which is a prefixed number, for example starting with the prefix Q for an item and P for a property. An entity is also identified by a unique combination of label and description in each language. Each entity also has a dereferenceable URI that follows the pattern http://www.wikidata.org/entity/ID where ID is its entity ID.

Item (in some languages translated to words for subject, object or element in the user interface) refers to a real-world object, concept, or event that is given an

identifier (an equivalent of a name) in Wikidata together with information about it. Each item has a corresponding Wikipage in the Wikidata main namespace. Items are identified by a prefixed id (like Q5), or by a sitelink to an external page, or by a unique combination of multilingual label and description. Items may also have aliases to ease lookup. The main data part of an item is the list of statements about the item. An item can be viewed as the subject part of a triplet in linked data.

Property (in some languages translated to attribute) is the descriptor for a data value, or some other relation or composite or possibly missing value, but not the data value or values themselves. Each statement at an item page links to a property, and assigns the property one or several values, or some other relation or composite or possibly missing value. The property is stored on a page in the Property namespace, and includes a declaration of the datatype for the property values. Compared to linked data, the property represents a triplet's predicate.

Each item consists of the following parts:

identifier - a unique number prefixed with the letter Q,
labels - a list of pairs of a language and a main name,
aliases - alternative names, also multilingual,
descriptions - more verbose descriptions, also multilingual,
statements - a list of statements about the item,
sitelinks - links to Wikipedia and Wikimedia pages about the given item in all occurring language editions.

Wikidata decomposes knowledge into statements. Statements in Wikidata can be more than a single triplet of (item, property, value) (as is common for example in linked data) and can contain information about how or when the value was recorded or measured, with optional references. An example of a Wikidata item is shown in Figure 2.1.

Wikidata is connected with Wikipedia pages through sitelinks that are present in items. It also contains references to identifiers from other knowledge bases through specific properties, for example to Freebase MIDs through statements with the property P646. Also, since February 2013, Wikidata is used for interlanguage link storage in Wikipedia.

As of September 2016, Wikidata contains over 23 million items. An overview of item count statistics for Wikidata by language is available in Table 2.2. Rows show statistics based on whether the item has a label in the given language or a link to the corresponding Wikipedia language edition. The coverage refers to the portion of Wikipedia pages that are covered by Wikidata.

Figure 2.1: Graphic representing the data model in Wikidata. Taken from https://commons.wikimedia.org/wiki/File:Datamodel_in_Wikidata.svg.

Language   Link to Wikipedia   Any language-specific information   Wikipedia coverage
English    4,975,671           16,060,…                            …%
Czech      345,709             4,068,…                             …%

Table 2.2: Wikidata overview by language (items containing the given kind of information) as of September 2016

Wikidata dumps are available for download as a single compressed file containing a JSON representation of all entities contained in Wikidata. The dumps are released once a week.

2.1.4 Other knowledge bases

There are also many other knowledge bases available. Here are a few notable mentions.

Freebase

Freebase used to be a collaborative knowledge base consisting of data composed mainly by its community members. It was developed by the American software company Metaweb and ran publicly since March 2007. The company was acquired by Google in 2010 and Freebase was one of the data sources used to power Google's Knowledge Graph. Its site was shut down on 2 May 2016 and most of its data were moved to Wikidata. Freebase used a unique identifier (MID) for its entities, and the MIDs are occasionally still used in annotated datasets and linked data. Freebase used to provide a dump for public download. Google provides a way to access Freebase data by MID through URLs in a fixed format, where $MID stands for the entity MID.

Knowledge Graph

Knowledge Graph is an online knowledge base provided by Google. However, it is not available for download, which is why we cannot use Knowledge Graph in this work.

YAGO

YAGO (Yet Another Great Ontology) is a knowledge base developed at the Max Planck Institute for Computer Science in Saarbrücken. As stated in the YAGO overview page, it is derived from Wikipedia, WordNet and GeoNames. Currently, YAGO has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities. YAGO is special in several ways:

1. The accuracy of YAGO has been manually evaluated, proving a confirmed accuracy of 95%. Every relation is annotated with its confidence value.
2. YAGO combines the clean taxonomy of WordNet with the richness of the Wikipedia category system, assigning the entities to more than 350,000 classes.
3. YAGO is an ontology that is anchored in time and space. YAGO attaches a temporal dimension and a spatial dimension to many of its facts and

entities.
4. In addition to a taxonomy, YAGO has thematic domains such as music or science from WordNet Domains.
5. YAGO extracts and combines entities and facts from 10 Wikipedias in different languages.

However, because YAGO only provides data in 10 main languages, and not in Czech, we decided not to use it.

BabelNet

BabelNet is both a multilingual encyclopedic dictionary and a semantic network which connects concepts and named entities in a very large network of semantic relations. It is developed at the Linguistic Computing Laboratory in the Department of Computer Science of the Sapienza University of Rome. BabelNet was presented in 2012 in Navigli and Ponzetto [2012]. The current version, BabelNet 3.7, covers 271 languages and is obtained from the automatic integration of many knowledge bases, such as WordNet, Wikipedia or Wikidata. BabelNet is provided as a stand-alone resource with its Java API, a SPARQL endpoint and a Linked Data interface. However, there is no way to download the full BabelNet for offline processing, therefore we decided not to use it.

2.2 Glossary

From now on we need to distinguish between an entity in a knowledge base and a named entity present in text. To do so, we introduce the following definitions:

named entity - a real-world object that can be denoted with a proper name, and which may or may not have a corresponding entity present in a knowledge base,
knowledge base entity - an entity present in a knowledge base, also referred to as entity,
entity mention - an occurrence of a named entity in a document, which consists of the text of the mention and the entity in the knowledge base it refers to, also referred to as mention,
candidate - an entity from a knowledge base that is considered by an entity linking algorithm as a corresponding entity for a given mention,

NIL - a special (virtual) entity used as a placeholder when the corresponding entity for a mention is not present in the knowledge base, also referred to as NME or null.

2.3 Datasets

We need some data to be able to train and evaluate named entity recognition and linking algorithms. A named entity recognition and linking dataset should consist of several documents containing tagged named entities with entity identifiers from a selected knowledge base. There are several such datasets available and we describe the most prominent ones.

2.3.1 AIDA CoNLL-YAGO Dataset

The AIDA CoNLL-YAGO dataset was created by Hoffart et al. [2011] because, as the authors state, there were no other established benchmarks at the time. The dataset is based on the CoNLL 2003 dataset [Tjong Kim Sang and De Meulder, 2003], which has become a de-facto standard of NER evaluation. We refer to this dataset as the CoNLL dataset in the rest of the thesis. This dataset has been used in many works since its creation, for example in Pershina et al. [2015], Luo et al. [2015], Francis-Landau et al. [2016], Yamada et al. [2016] or Nguyen et al. [2016].

As described by Hoffart et al. [2011], this dataset consists of proper noun annotations for 1,393 Reuters newswire articles. The authors hand-annotated all these proper nouns with corresponding entities in YAGO 2. Each mention was disambiguated by two students and resolved by the authors in case of conflict. The dataset was later updated to add Wikipedia URLs and Freebase MIDs for almost all entities.

The CoNLL dataset is publicly available for download, but the original CoNLL 2003 dataset uses the Reuters Corpus as the underlying text data, which is only available under a restrictive licence. The dataset is split into three sets: train/testa/testb. The train and testa sets are generally used for learning and hyperparameter selection, and testb is used for final evaluation and results comparison. Table 2.3 summarizes properties of the dataset.

dataset                                 train    testa   testb   total
articles                                946      216     231     1,393
mentions total                          23,396   5,917   5,616   34,929
mentions with correct entity            18,541   4,791   4,485   27,817
mentions without correct entity         4,855    1,126   1,131   7,112
distinct mentions                       8,011    2,795   2,606   11,015
words per article (avg)                 …        …       …       …
mentions per article (avg)              …        …       …       …

Table 2.3: CoNLL dataset properties

The authors also released the set of candidates that was used in their experiments, as a mapping between mention texts and candidate entities in YAGO 2. This set of candidates is usually referred to as CoNLL YAGO means. An overview of the properties of this candidate set is in Table 2.4. The reason why some mentions that were not marked as NIL have no correct entity present is that the corresponding articles were deleted from Wikipedia.

dataset                                 train    testa   testb   total
candidates per mention (avg)            …        …       …       …
mentions with correct entity present    18,539   4,791   4,484   27,814
mentions with correct entity present %  99.9%    100%    99.9%   99.9%

Table 2.4: CoNLL YAGO means dataset properties

2.3.2 PPRForNED

The authors of Pershina et al. [2015] published the set of candidates that they used when evaluating on the CoNLL dataset. The PPRForNED dataset covers all files from the CoNLL dataset. However, not all NIL mentions from CoNLL are present, as can be seen in the summary of PPRForNED dataset properties in Table 2.5. Correct entities were added to the candidate list even if they were not generated initially; this was performed for the testb set as well. The authors also provided a set of Freebase popularity scores that they used in their algorithm. Each named entity mention consists of the text of the mention and the Wikipedia URL of the correct entity. Each candidate entity consists of a name, a list of related candidates and a corresponding Wikipedia URL. This dataset was also used in other works, for example in Luo et al. [2015].

dataset                                 train    testa   testb   total
mentions total                          20,288   5,191   4,950   30,429
mentions with correct entity            18,541   4,791   4,485   27,817
mentions without correct entity         1,…      …       …       …,611
distinct mentions                       5,936    2,261   2,079   8,050
mentions per article (avg)              …        …       …       …
candidates per mention (avg)            …        …       …       …
missing NIL mentions                    3,…      …       …       …,500
mentions with correct entity present    18,538   4,791   4,485   27,814
mentions with correct entity present %  99.9%    100%    100%    99.9%

Table 2.5: PPRForNED dataset properties

2.3.3 PDT

PDT stands for the Prague Dependency Treebank. It is a Czech language dataset. The current version is 3.0. As described by Bejček et al. [2013]:

PDT 3.0 is a new version of Prague Dependency Treebank. It contains a large amount of Czech texts with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and semantic annotation (0.8 MW); in addition, certain properties of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations are annotated at the semantic level.

The original dataset does not contain annotations for named entity recognition and linking tasks, but there is currently an ongoing process which aims to annotate a small part of the PDT testing data. We were given access to a current small sample of the annotated data. The data use Wikipedia and BabelNet URLs as identifiers. A summary of the dataset is in Table 2.6.

total annotations        115
distinct mention texts   75
distinct entities        58
missing entities         2

Table 2.6: PDT dataset properties

2.3.4 TAC

TAC stands for Text Analysis Conference. As described at the TAC about page:

TAC is a series of evaluation workshops organized to encourage research in Natural Language Processing and related applications, by providing a large test collection, common evaluation procedures, and a forum for organizations to share their results. TAC comprises sets of tasks known as tracks, each of which focuses on a particular subproblem of NLP. TAC tracks focus on end-user tasks, but also include component evaluations situated within the context of end-user tasks.

There have been several TAC tracks concerning named entity linking, starting with TAC-KBP2009 [McNamee et al., 2009], where KBP stands for Knowledge Base Population. The most recent is KBP2016, which is currently in progress as of November 2016. However, the dataset itself is only available for the duration of the track and only to the track participants; therefore, we were unable to evaluate our algorithm on this dataset.

KBP tracks have recently started to focus on NIL clustering and cross-lingual entity linking. NIL clustering is a subtask of entity linking that aims to cluster the occurrences of a single named entity that is not present in the used knowledge base. Cross-lingual entity linking is a task where the documents can occur in multiple languages and are linked into a single-language knowledge base. The KBP2014 entity linking track focused both on mono-lingual English entity linking with NIL clustering and on cross-lingual Chinese/Spanish to English entity linking.

3. Named entity recognition and linking task

In this chapter we describe the steps of the named entity recognition and linking (NERL) task and summarize how they are approached. Furthermore, we examine what work has been done in the named entity recognition and linking field.

3.1 NERL steps

The named entity linking task has been researched under several names over the years, for example Wikification [Ratinov et al., 2011], Grounding [Leidner et al., 2003] or Named-entity disambiguation [Hoffart et al., 2011]. The usual approach can be generalized into the following steps:

find named entity mentions in the given text,
generate a set of candidates for each mention,
select the best candidate for each mention,
determine if a mention corresponds to a NIL and cluster corresponding NIL mentions together.

We already established what a mention, candidate, entity and NIL refer to in the previous chapter in Section 2.2. Now we are going to define what local and global information mean within the context of this thesis. Information from a knowledge base used in named entity recognition and linking is commonly split into two types: local and global information. Local information consists of features related to a single mention, such as similarity between the document and the texts available about each candidate. Because of that, local information features can be used for linking each mention independently. Global information contains features that utilize multiple mentions present in a document. An example feature for a pair of entities is whether there is a link in Wikipedia leading from one entity to the other.

Both types of features have their strengths and weaknesses. Local features better encode direct similarity between a mention and its candidates, but do not take coherence into consideration. Global features on the other hand make use of interlinking information between entities, but can introduce noise, for example when the document consists of multiple topics.

In general, the named entity recognition and linking process can be split into the following cascading steps:

1. Named entity mention recognition - find mentions in the given document,

2. Candidate generation - create a list of candidates for each mention using local information,
3. Mention disambiguation - select the best candidate for each mention using both local and global information,
4. NIL selection and clustering - decide if the KB contains the entity of each mention and cluster together mentions referring to the same NIL.

In most algorithms every step is performed separately, with each step utilizing information from the document and from the output of the previous steps. This allows using distinct techniques and external information like gazetteers (lists of named entities). However, there is no way to adjust the results of previous steps, and once a step does not produce a correct result (e.g. misses a mention or does not generate the correct entity candidate), later steps cannot do anything about it. In reaction to the described error propagation, there have recently been a few attempts to do both the named entity recognition and linking tasks jointly as a single step, as can be seen in Luo et al. [2015] and Nguyen et al. [2016]. This allows, for example, using global information from other mentions to decide whether a given text span is a mention or not.

3.1.1 Named entity recognition

The task of the named entity recognition step is to find named entity mentions in the given text. In some implementations, when solving the NER task on its own, the output can also contain a type for each found mention. For example, the following classes are used for named entity classification in the CoNLL shared tasks 2002 and 2003:

PER - person,
LOC - location,
ORG - organization,
MISC - miscellaneous.

Other examples are the 7 categories used in MUC or the two-level hierarchy of 46 classes used in the Czech Named Entity Corpus.

Algorithms for the NER step mostly use information from gazetteers such as GeoNames and knowledge bases, orthographic features such as whether words start with capital letters, sentence structure, or whether a word is a noun. The traditional approach to solving the named entity recognition task is to predict for each word the named entity mention type and its position within the

mention. Positions can be described for example by the BIO scheme [Ratinov and Roth, 2009]: B for mention Beginning, I for Inside of a mention and O for text Outside of any mention. There are also other schemes such as BILOU, where L marks the last word of a mention and U marks a single-word mention. HMM (Hidden Markov Model) or CRF (Conditional Random Field) models are often used for this classification, as can be seen for example in the works of Zhou and Su [2002] or Finkel et al. [2005]. Systems using the above approach generally rely on hand-crafted features and domain-specific knowledge.

Recent advancements in neural networks have also found their use in this task. Neural networks were used for example in Lample et al. [2016] and Yang et al. [2016], utilizing bidirectional LSTM or GRU networks, precomputed word embeddings, character-level word embeddings and a CRF output layer.

Another approach is to detect spans of mentions by modelling the distribution of segmentation boundaries directly. This was used for example in Luo et al. [2015] with a Semi-CRF model as a part of a joint NERL algorithm.

The last approach mentioned here is to extract a list of all possible mention texts from a knowledge base and look for matching parts of the document. The list of possible mention texts can be generated for example from anchor texts of hyperlinks in Wikipedia, as was used for example in DBpedia Spotlight by Mendes et al. [2011] or in Wikifier.

Many works focused on entity linking skip the named entity recognition step altogether and rely on the output of existing algorithms. One of the more prominent NER algorithms is the Stanford NER tagger [Finkel et al., 2005], which is used for example by Hoffart et al. [2011]. For the Czech language there exists a NER algorithm by Straková et al. [2014, 2016].

3.1.2 Candidate generation

In general, it is not computationally feasible to try matching each mention against every entity in the KB. Heuristics pruning the space of possible candidate assignments are often used to make this problem tractable. The goal of the candidate generation step is to create a limited set of best candidates for each mention. How many candidates are generated depends on what algorithms are used in the following steps.

Formally, we can say that for a given document $D$, a list of mentions $M = (m_1, \ldots, m_n)$ and a maximum number of candidates $k$, we want to obtain a list of candidate sets $C = (C_1, \ldots, C_n)$ where $C_i$ contains up to $k$ best candidate entities for mention $m_i$ in relation to the document $D$.

There are several options to use when creating the set of candidates for each mention. A direct approach is to have an index that lists all entities that are likely to appear with the given mention text. This index can be constructed for example from Wikipedia anchor texts [Ratinov et al., 2011] or the YAGO means relation [Hoffart et al., 2011]. Another way to build the dictionary is to use Wikipedia titles, redirects and disambiguation pages [Pershina et al., 2015]. There are ways to enhance this direct approach, for example by using coreference resolution and using the longest mention in the coreference chain [Pershina et al., 2015]. Coreference occurs when an entity is present two or more times under the same referent. Coreference resolution is the task of finding coreferences in text and clustering/chaining them together.

Once a set of candidates for a mention is constructed, there are several ways to use the local information from the document to prune the candidate set. In general, candidate generation algorithms contain a scoring function that computes a similarity score between a mention and each of its candidate entities. The similarity score is commonly computed using features generated for each candidate of every mention. One set of features is textual similarity between the document and the candidate entity. This can be for example cosine similarity of TF-IDF vectors of the document and entity text [Ratinov et al., 2011], keyphrase-based similarity [Hoffart et al., 2011] or topical domain similarity of the document [Nguyen et al., 2016]. More novel approaches use convolutional neural networks to capture semantic similarity in different contexts [Francis-Landau et al., 2016].

Other features can be based on a likelihood prior for the candidate entities. There are several ways to compute the priors. One option is to base the prior purely on the candidate entity without using any information from the document itself, for example using the number of incoming links to the entity in Wikipedia. Another option is to use some local information, for example the ratio of hyperlinks with the same anchor text as the mention text linking to a given entity over all hyperlinks with this anchor text. The number of generated candidates differs from article to article, but it can range from 5 [Hoffart et al., 2011] up to 20 [Ratinov et al., 2011, Nguyen et al., 2016].
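To make the direct dictionary approach with a link-count prior concrete, the following toy sketch shows candidate generation from an anchor-text index. The index contents are invented for illustration; a real index would be built from Wikipedia anchor texts, titles and redirects, or the YAGO means relation.

```python
from collections import Counter, defaultdict

# A toy anchor-text index mapping mention texts to entities, with a prior
# p(entity | mention text) estimated from link counts. The counts below
# are invented for illustration.
anchor_index = defaultdict(Counter)
anchor_index["Washington"].update({
    "Washington,_D.C.": 500,
    "George_Washington": 300,
    "Washington_(state)": 200,
})

def generate_candidates(mention_text, k=5):
    """Return up to k (entity, prior) pairs, most likely first."""
    counts = anchor_index.get(mention_text)
    if not counts:
        return []  # no known entity: the mention may be a NIL
    total = sum(counts.values())
    return [(entity, count / total) for entity, count in counts.most_common(k)]

# generate_candidates("Washington", k=2)
# -> [('Washington,_D.C.', 0.5), ('George_Washington', 0.3)]
```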

3.1.3 Mention disambiguation

The mention disambiguation step takes place when a set of candidates is available for each mention (from the previous step). The goal of mention disambiguation is to disambiguate among the candidates and select the best candidate for each mention. All mentions are disambiguated together, with the idea that the entities appearing in a single document are generally related to each other. Because of this, global information can be used in addition to the local information used in previous steps.

Formally, we can say that for a given document $D$, a list of mentions $M = (m_1, \ldots, m_n)$ and candidate sets $C = (C_1, \ldots, C_n)$, we want to find the best candidate assignment $A = (a_1, \ldots, a_n)$, where $a_i \in C_i$ for all $i \in \{1, \ldots, n\}$, with regard to the mentions $M$ and the document $D$.

Mention disambiguation can be approached as a maximization problem where the goal is to find the best assignment of candidates using an optimization criterion that considers both local and global features [Ratinov et al., 2011]. Another way to approach mention disambiguation is to create a coherence graph of mentions and candidates and to look for the most dense subgraph [Hoffart et al., 2011]. This task is generally NP-hard, therefore some heuristics must be applied to solve it in reasonable time. Yet another way is to create a graph with candidates as vertices and edges reflecting the presence of a Wikipedia link between two candidates. This approach was used alongside the Personalized PageRank algorithm in Pershina et al. [2015]. An example of this graph can be seen in Figure 3.1.

Figure 3.1: Example document graph taken from Pershina et al. [2015]
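To make the graph-based idea concrete, the following is a toy personalized PageRank over a small candidate coherence graph. It is only a sketch in the spirit of Pershina et al. [2015], not the exact PPRSim algorithm, which additionally combines the PPR score with local similarity under further constraints; the graph and teleport distribution below are invented for illustration.

```python
# Vertices are candidate entities, edges reflect e.g. a Wikipedia link
# between two entities, and the teleport distribution concentrates the
# restart mass on the document's candidates.

def personalized_pagerank(graph, teleport, damping=0.85, iterations=50):
    """graph: dict vertex -> list of neighbours; teleport: dict vertex ->
    restart probability (should sum to 1). Returns a score per vertex."""
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {v: (1 - damping) * teleport.get(v, 0.0) for v in graph}
        for v, neighbours in graph.items():
            if not neighbours:
                continue  # dangling vertex: its mass is simply dropped here
            share = damping * rank[v] / len(neighbours)
            for u in neighbours:
                new_rank[u] += share
        rank = new_rank
    return rank

# Candidates coherent with the rest of the document score higher than
# isolated ones, which a disambiguation step can then prune.
graph = {
    "Lincoln_United_F.C.": ["Lincolnshire"],
    "Boston_United_F.C.": ["Lincolnshire"],
    "Lincolnshire": ["Lincoln_United_F.C.", "Boston_United_F.C."],
    "Lincoln,_Nebraska": [],  # an invented, incoherent candidate
}
teleport = {v: 1 / len(graph) for v in graph}
scores = personalized_pagerank(graph, teleport)
```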

3.1.4 NIL selection and clustering

In the NIL selection step the entity linking algorithm decides whether the mention has a corresponding entity present in the knowledge base or whether it is a NIL. This can be approached for example by always adding a NIL candidate to the candidate set of each mention and resolving the NIL selection within the mention disambiguation step [Luo et al., 2015]. Another approach is to create additional features based on the output of the previous step, for example the relative score between the first two candidates, and to create a separate model that decides whether the mention should be a NIL or not [Ratinov et al., 2011].

NIL clustering is the task of grouping NILs that probably refer to the same entity. This can be performed for example using mention text similarity or coreference resolution [Radford et al., 2011]. In general, this task can be approached using (unsupervised) clustering methods.

This part of entity linking is mostly skipped in the referenced papers. The TAC-KBP contest contains NIL clustering as a task, but the datasets are not publicly available. The Named Entity recognition and Linking (NEEL) challenge dataset contains NIL clustering annotations; however, we encountered issues when fetching the corresponding tweet contents. The CoNLL dataset does not have NIL clustering annotations, and the NIL selection step is skipped in many works by focusing only on mentions with corresponding entities present in the knowledge base, for example in Hoffart et al. [2011], Pershina et al. [2015] or Yamada et al. [2016].

3.2 Related work

Named entity recognition and linking is a challenging problem and is well studied. We describe several inspiring papers concerning the entity linking task.

Local and Global Algorithms for Disambiguation to Wikipedia [Ratinov et al., 2011]

Ratinov et al. [2011] analyse approaches utilizing additional information from the Wikipedia link structure. The authors introduce a formulation of the Disambiguation to Wikipedia (entity linking) task as an optimization problem with local and global variants. Using this formulation they present a new global entity linking system called GLOW. The system consists of mention candidate generation, ranking the

candidates using local and global features, and linking, which decides whether the top-ranked candidate is the correct disambiguation or whether it should be classified as a NIL. The authors use cosine similarity between TF-IDF summaries of the mention and the entity for local features, and relatedness of the sets of Wikipedia incoming and outgoing links for global features. This work is used as a basis for a publicly available NERL system called Wikifier made by the authors.

Robust Disambiguation of Named Entities in Text [Hoffart et al., 2011]

In the work of Hoffart et al. [2011] the authors present their system AIDA for collective disambiguation, harnessing context from knowledge bases and using a new form of a coherence graph. AIDA consists of candidate generation using the YAGO means relation and mention disambiguation, where each candidate gets a score based on its prior probability, its similarity to the mention and its coherence within the coherence graph. This work also introduces the CoNLL dataset, which is described in Subsection 2.3.1. The authors report that AIDA achieves 81.91% micro-accuracy (precision) and 81.82% macro-accuracy on the CoNLL testb dataset. AIDA outperformed prior methods in terms of accuracy. The system is fully implemented and accessible online.

Improving efficiency and accuracy in multilingual entity extraction [Daiber et al., 2013]

This work describes enhancements of DBpedia Spotlight, an open source project developing a system for automatic annotation of DBpedia entities in natural language text [Daiber et al., 2013]. DBpedia Spotlight works by finding mentions in a given document by looking for known anchor texts (or using NLP methods when available for the given language), selecting the best candidates using the anchor text, and using a generative probabilistic model (composed of $P(e)$, $P(\text{mention text} \mid e)$ and $P(\text{context} \mid e)$) to disambiguate between the candidates for each mention independently. DBpedia Spotlight provides programmatic interfaces and is available for

download. It was evaluated by Nguyen et al. [2014] on the CoNLL dataset with a resulting precision of 75%.

AIDA-light: High-throughput named-entity disambiguation [Nguyen et al., 2014]

AIDA-light is a successor of the entity linking system AIDA [Hoffart et al., 2011]. The main goal of AIDA-light is to be a named entity linking system that is both fast and robust. Like AIDA, AIDA-light uses both local mention-candidate similarities and global candidate-candidate similarities, and uses YAGO as a knowledge base. On top of that, AIDA-light uses a two-stage algorithm for mention disambiguation: it first disambiguates easy and low-cost mentions, extracts the domain of the document, and then uses this context domain information for disambiguating the rest of the mentions. AIDA-light achieves a precision of 84.8% on the CoNLL dataset and a top-5 precision of 95.2%, while also being orders of magnitude faster than AIDA. AIDA-light is publicly available for download.

Personalized Page Rank for Named Entity Disambiguation [Pershina et al., 2015]

This work introduces a graph-based disambiguation approach, called PPRSim, based on Personalized PageRank (PPR), that combines both local and global information features. For candidate generation the authors use Wikipedia titles and coreference resolution to create the candidate sets. For mention disambiguation the authors construct a coherence graph between candidates and combine PPR with two constraints to reduce noise from incorrect candidates, computing a score based on local similarity and global PPR similarity. The authors have publicly released their candidate sets for the CoNLL dataset, called PPRForNED, as described in Subsection 2.3.2. This work outperformed the then state of the art on CoNLL by achieving 91.7% micro-accuracy and 89.9% macro-accuracy. However, these results are not exactly comparable, because they were measured on the whole dataset and not just the testb part of it. Furthermore, these results were achieved utilizing gold annotations of the testing data during the candidate generation step.
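As a small illustration of the NIL selection heuristic from Subsection 3.1.4 (a separate decision based on the score of the best candidate and the relative score of the first two candidates, in the spirit of Ratinov et al. [2011]), consider the following sketch. The thresholds and the exact decision rule are our own assumptions; real systems typically learn such a rule from features rather than hard-coding it.

```python
def select_or_nil(scored_candidates, min_score=0.3, min_margin=0.1):
    """scored_candidates: list of (entity, score) sorted by score, best first.

    Returns the winning entity, or None to signal a NIL. Both thresholds
    are illustrative assumptions, not values from any of the cited systems.
    """
    if not scored_candidates:
        return None  # no candidate was generated at all
    best_entity, best_score = scored_candidates[0]
    if best_score < min_score:
        return None  # even the best candidate is a poor match
    if len(scored_candidates) > 1:
        margin = best_score - scored_candidates[1][1]
        if margin < min_margin:
            return None  # too ambiguous to commit to a single entity
    return best_entity

# select_or_nil([("Washington,_D.C.", 0.8), ("George_Washington", 0.2)])
# -> 'Washington,_D.C.'
```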


More information

Final Project Discussion. Adam Meyers Montclair State University

Final Project Discussion. Adam Meyers Montclair State University Final Project Discussion Adam Meyers Montclair State University Summary Project Timeline Project Format Details/Examples for Different Project Types Linguistic Resource Projects: Annotation, Lexicons,...

More information

Token Gazetteer and Character Gazetteer for Named Entity Recognition

Token Gazetteer and Character Gazetteer for Named Entity Recognition Token Gazetteer and Character Gazetteer for Named Entity Recognition Giang Nguyen, Štefan Dlugolinský, Michal Laclavík, Martin Šeleng Institute of Informatics, Slovak Academy of Sciences Dúbravská cesta

More information

WebSAIL Wikifier at ERD 2014

WebSAIL Wikifier at ERD 2014 WebSAIL Wikifier at ERD 2014 Thanapon Noraset, Chandra Sekhar Bhagavatula, Doug Downey Department of Electrical Engineering & Computer Science, Northwestern University {nor.thanapon, csbhagav}@u.northwestern.edu,ddowney@eecs.northwestern.edu

More information

Mining Wikipedia s Snippets Graph: First Step to Build A New Knowledge Base

Mining Wikipedia s Snippets Graph: First Step to Build A New Knowledge Base Mining Wikipedia s Snippets Graph: First Step to Build A New Knowledge Base Andias Wira-Alam and Brigitte Mathiak GESIS - Leibniz-Institute for the Social Sciences Unter Sachsenhausen 6-8, 50667 Köln,

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,

More information

An Adaptive Framework for Named Entity Combination

An Adaptive Framework for Named Entity Combination An Adaptive Framework for Named Entity Combination Bogdan Sacaleanu 1, Günter Neumann 2 1 IMC AG, 2 DFKI GmbH 1 New Business Department, 2 Language Technology Department Saarbrücken, Germany E-mail: Bogdan.Sacaleanu@im-c.de,

More information

What is this Song About?: Identification of Keywords in Bollywood Lyrics

What is this Song About?: Identification of Keywords in Bollywood Lyrics What is this Song About?: Identification of Keywords in Bollywood Lyrics by Drushti Apoorva G, Kritik Mathur, Priyansh Agrawal, Radhika Mamidi in 19th International Conference on Computational Linguistics

More information

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala Master Project Various Aspects of Recommender Systems May 2nd, 2017 Master project SS17 Albert-Ludwigs-Universität Freiburg Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue

More information

THE GETTY VOCABULARIES TECHNICAL UPDATE

THE GETTY VOCABULARIES TECHNICAL UPDATE AAT TGN ULAN CONA THE GETTY VOCABULARIES TECHNICAL UPDATE International Working Group Meetings January 7-10, 2013 Joan Cobb Gregg Garcia Information Technology Services J. Paul Getty Trust International

More information

Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP09] Running Example. Mention Pair Model. Mention Pair Example

Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP09] Running Example. Mention Pair Model. Mention Pair Example Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP09] Many machine learning models for coreference resolution have been created, using not only different feature sets but also fundamentally

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Building a Large-Scale Cross-Lingual Knowledge Base from Heterogeneous Online Wikis

Building a Large-Scale Cross-Lingual Knowledge Base from Heterogeneous Online Wikis Building a Large-Scale Cross-Lingual Knowledge Base from Heterogeneous Online Wikis Mingyang Li (B), Yao Shi, Zhigang Wang, and Yongbin Liu Department of Computer Science and Technology, Tsinghua University,

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Observe novel applicability of DL techniques in Big Data Analytics. Applications of DL techniques for common Big Data Analytics problems. Semantic indexing

More information

Semi-Supervised Learning of Named Entity Substructure

Semi-Supervised Learning of Named Entity Substructure Semi-Supervised Learning of Named Entity Substructure Alden Timme aotimme@stanford.edu CS229 Final Project Advisor: Richard Socher richard@socher.org Abstract The goal of this project was two-fold: (1)

More information

Question Answering Systems

Question Answering Systems Question Answering Systems An Introduction Potsdam, Germany, 14 July 2011 Saeedeh Momtazi Information Systems Group Outline 2 1 Introduction Outline 2 1 Introduction 2 History Outline 2 1 Introduction

More information

Stanford-UBC at TAC-KBP

Stanford-UBC at TAC-KBP Stanford-UBC at TAC-KBP Eneko Agirre, Angel Chang, Dan Jurafsky, Christopher Manning, Valentin Spitkovsky, Eric Yeh Ixa NLP group, University of the Basque Country NLP group, Stanford University Outline

More information

Text Mining for Software Engineering

Text Mining for Software Engineering Text Mining for Software Engineering Faculty of Informatics Institute for Program Structures and Data Organization (IPD) Universität Karlsruhe (TH), Germany Department of Computer Science and Software

More information

CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools

CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools Wahed Hemati, Alexander Mehler, and Tolga Uslu Text Technology Lab, Goethe Universitt

More information

Computer-assisted Ontology Construction System: Focus on Bootstrapping Capabilities

Computer-assisted Ontology Construction System: Focus on Bootstrapping Capabilities Computer-assisted Ontology Construction System: Focus on Bootstrapping Capabilities Omar Qawasmeh 1, Maxime Lefranois 2, Antoine Zimmermann 2, Pierre Maret 1 1 Univ. Lyon, CNRS, Lab. Hubert Curien UMR

More information

LODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.1 October 7, 2016

LODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.1 October 7, 2016 LODtagger Guide for Users and Developers Bahar Sateli René Witte Release 1.1 October 7, 2016 Semantic Software Lab Concordia University Montréal, Canada http://www.semanticsoftware.info Contents 1 LODtagger

More information

Making Sense Out of the Web

Making Sense Out of the Web Making Sense Out of the Web Rada Mihalcea University of North Texas Department of Computer Science rada@cs.unt.edu Abstract. In the past few years, we have witnessed a tremendous growth of the World Wide

More information

DBPedia (dbpedia.org)

DBPedia (dbpedia.org) Matt Harbers Databases and the Web April 22 nd, 2011 DBPedia (dbpedia.org) What is it? DBpedia is a community whose goal is to provide a web based open source data set of RDF triples based on Wikipedia

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Named Entity Detection and Entity Linking in the Context of Semantic Web

Named Entity Detection and Entity Linking in the Context of Semantic Web [1/52] Concordia Seminar - December 2012 Named Entity Detection and in the Context of Semantic Web Exploring the ambiguity question. Eric Charton, Ph.D. [2/52] Concordia Seminar - December 2012 Challenge

More information

Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track

Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track Jeffrey Dalton University of Massachusetts, Amherst jdalton@cs.umass.edu Laura

More information

Enhanced retrieval using semantic technologies:

Enhanced retrieval using semantic technologies: Enhanced retrieval using semantic technologies: Ontology based retrieval as a new search paradigm? - Considerations based on new projects at the Bavarian State Library Dr. Berthold Gillitzer 28. Mai 2008

More information

LODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.0 July 24, 2015

LODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.0 July 24, 2015 LODtagger Guide for Users and Developers Bahar Sateli René Witte Release 1.0 July 24, 2015 Semantic Software Lab Concordia University Montréal, Canada http://www.semanticsoftware.info Contents 1 LODtagger

More information

Technische Universität Dresden Fakultät Informatik. Wikidata. A Free Collaborative Knowledge Base. Markus Krötzsch TU Dresden.

Technische Universität Dresden Fakultät Informatik. Wikidata. A Free Collaborative Knowledge Base. Markus Krötzsch TU Dresden. Technische Universität Dresden Fakultät Informatik Wikidata A Free Collaborative Knowledge Base Markus Krötzsch TU Dresden IBM June 2015 Where is Wikipedia Going? Wikipedia in 2015: A project that has

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Markov Chains for Robust Graph-based Commonsense Information Extraction

Markov Chains for Robust Graph-based Commonsense Information Extraction Markov Chains for Robust Graph-based Commonsense Information Extraction N iket Tandon 1,4 Dheera j Ra jagopal 2,4 Gerard de M elo 3 (1) Max Planck Institute for Informatics, Germany (2) NUS, Singapore

More information

Natural Language Processing. SoSe Question Answering

Natural Language Processing. SoSe Question Answering Natural Language Processing SoSe 2017 Question Answering Dr. Mariana Neves July 5th, 2017 Motivation Find small segments of text which answer users questions (http://start.csail.mit.edu/) 2 3 Motivation

More information

Finding Related Entities by Retrieving Relations: UIUC at TREC 2009 Entity Track

Finding Related Entities by Retrieving Relations: UIUC at TREC 2009 Entity Track Finding Related Entities by Retrieving Relations: UIUC at TREC 2009 Entity Track V.G.Vinod Vydiswaran, Kavita Ganesan, Yuanhua Lv, Jing He, ChengXiang Zhai Department of Computer Science University of

More information

Natural Language Processing with PoolParty

Natural Language Processing with PoolParty Natural Language Processing with PoolParty Table of Content Introduction to PoolParty 2 Resolving Language Problems 4 Key Features 5 Entity Extraction and Term Extraction 5 Shadow Concepts 6 Word Sense

More information

Let s get parsing! Each component processes the Doc object, then passes it on. doc.is_parsed attribute checks whether a Doc object has been parsed

Let s get parsing! Each component processes the Doc object, then passes it on. doc.is_parsed attribute checks whether a Doc object has been parsed Let s get parsing! SpaCy default model includes tagger, parser and entity recognizer nlp = spacy.load('en ) tells spacy to use "en" with ["tagger", "parser", "ner"] Each component processes the Doc object,

More information

A cocktail approach to the VideoCLEF 09 linking task

A cocktail approach to the VideoCLEF 09 linking task A cocktail approach to the VideoCLEF 09 linking task Stephan Raaijmakers Corné Versloot Joost de Wit TNO Information and Communication Technology Delft, The Netherlands {stephan.raaijmakers,corne.versloot,

More information

YAGO: a Multilingual Knowledge Base from Wikipedia, Wordnet, and Geonames

YAGO: a Multilingual Knowledge Base from Wikipedia, Wordnet, and Geonames 1/24 a from Wikipedia, Wordnet, and Geonames 1, Fabian Suchanek 1, Johannes Hoffart 2, Joanna Biega 2, Erdal Kuzey 2, Gerhard Weikum 2 Who uses 1 Télécom ParisTech 2 Max Planck Institute for Informatics

More information

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ???

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ??? @ INSIDE DEEPQA Managing complex unstructured data with UIMA Simon Ellis INTRODUCTION 22 nd November, 2013 WAT SON TECHNOLOGIES AND OPEN ARCHIT ECT URE QUEST ION ANSWERING PROFESSOR JIM HENDLER S IMON

More information

A Korean Knowledge Extraction System for Enriching a KBox

A Korean Knowledge Extraction System for Enriching a KBox A Korean Knowledge Extraction System for Enriching a KBox Sangha Nam, Eun-kyung Kim, Jiho Kim, Yoosung Jung, Kijong Han, Key-Sun Choi KAIST / The Republic of Korea {nam.sangha, kekeeo, hogajiho, wjd1004109,

More information

Introduction to Text Mining. Hongning Wang

Introduction to Text Mining. Hongning Wang Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:

More information

Influence of Word Normalization on Text Classification

Influence of Word Normalization on Text Classification Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we

More information

QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK

QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK NG, Jun Ping National University of Singapore ngjp@nus.edu.sg 30 November 2009 The latest version of QANUS and this documentation can always be downloaded from

More information

Handling Place References in Text

Handling Place References in Text Handling Place References in Text Introduction Most (geographic) information is available in the form of textual documents Place reference resolution involves two-subtasks: Recognition : Delimiting occurrences

More information

Text, Knowledge, and Information Extraction. Lizhen Qu

Text, Knowledge, and Information Extraction. Lizhen Qu Text, Knowledge, and Information Extraction Lizhen Qu A bit about Myself PhD: Databases and Information Systems Group (MPII) Advisors: Prof. Gerhard Weikum and Prof. Rainer Gemulla Thesis: Sentiment Analysis

More information

Techreport for GERBIL V1

Techreport for GERBIL V1 Techreport for GERBIL 1.2.2 - V1 Michael Röder, Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo February 21, 2016 Current Development of GERBIL Recently, we released the latest version 1.2.2 of GERBIL [16] 1.

More information

Generalizing from Freebase and Patterns using Cluster-Based Distant Supervision for KBP Slot-Filling

Generalizing from Freebase and Patterns using Cluster-Based Distant Supervision for KBP Slot-Filling Generalizing from Freebase and Patterns using Cluster-Based Distant Supervision for KBP Slot-Filling Benjamin Roth Grzegorz Chrupała Michael Wiegand Mittul Singh Dietrich Klakow Spoken Language Systems

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators

ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators Pablo Ruiz, Thierry Poibeau and Frédérique Mélanie Laboratoire LATTICE CNRS, École Normale Supérieure, U Paris 3 Sorbonne Nouvelle

More information

Semantic Annotation of Web Resources Using IdentityRank and Wikipedia

Semantic Annotation of Web Resources Using IdentityRank and Wikipedia Semantic Annotation of Web Resources Using IdentityRank and Wikipedia Norberto Fernández, José M.Blázquez, Luis Sánchez, and Vicente Luque Telematic Engineering Department. Carlos III University of Madrid

More information

Robust and Collective Entity Disambiguation through Semantic Embeddings

Robust and Collective Entity Disambiguation through Semantic Embeddings Robust and Collective Entity Disambiguation through Semantic Embeddings Stefan Zwicklbauer University of Passau Passau, Germany szwicklbauer@acm.org Christin Seifert University of Passau Passau, Germany

More information

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & 6367(Print), ISSN 0976 6375(Online) Volume 3, Issue 1, January- June (2012), TECHNOLOGY (IJCET) IAEME ISSN 0976 6367(Print) ISSN 0976 6375(Online) Volume

More information

Building Multilingual Resources and Neural Models for Word Sense Disambiguation. Alessandro Raganato March 15th, 2018

Building Multilingual Resources and Neural Models for Word Sense Disambiguation. Alessandro Raganato March 15th, 2018 Building Multilingual Resources and Neural Models for Word Sense Disambiguation Alessandro Raganato March 15th, 2018 About me alessandro.raganato@helsinki.fi http://wwwusers.di.uniroma1.it/~raganato ERC

More information

Language Resources and Linked Data

Language Resources and Linked Data Integrating NLP with Linked Data: the NIF Format Milan Dojchinovski @EKAW 2014 November 24-28, 2014, Linkoping, Sweden milan.dojchinovski@fit.cvut.cz - @m1ci - http://dojchinovski.mk Web Intelligence Research

More information

Provided by the author(s) and NUI Galway in accordance with publisher policies. Please cite the published version when available. Title Entity Linking with Multiple Knowledge Bases: An Ontology Modularization

More information

Orchestrating Music Queries via the Semantic Web

Orchestrating Music Queries via the Semantic Web Orchestrating Music Queries via the Semantic Web Milos Vukicevic, John Galletly American University in Bulgaria Blagoevgrad 2700 Bulgaria +359 73 888 466 milossmi@gmail.com, jgalletly@aubg.bg Abstract

More information

Entity and Knowledge Base-oriented Information Retrieval

Entity and Knowledge Base-oriented Information Retrieval Entity and Knowledge Base-oriented Information Retrieval Presenter: Liuqing Li liuqing@vt.edu Digital Library Research Laboratory Virginia Polytechnic Institute and State University Blacksburg, VA 24061

More information

Exam Marco Kuhlmann. This exam consists of three parts:

Exam Marco Kuhlmann. This exam consists of three parts: TDDE09, 729A27 Natural Language Processing (2017) Exam 2017-03-13 Marco Kuhlmann This exam consists of three parts: 1. Part A consists of 5 items, each worth 3 points. These items test your understanding

More information

Mutual Disambiguation for Entity Linking

Mutual Disambiguation for Entity Linking Mutual Disambiguation for Entity Linking Eric Charton Polytechnique Montréal Montréal, QC, Canada eric.charton@polymtl.ca Marie-Jean Meurs Concordia University Montréal, QC, Canada marie-jean.meurs@concordia.ca

More information

NATURAL LANGUAGE PROCESSING

NATURAL LANGUAGE PROCESSING NATURAL LANGUAGE PROCESSING LESSON 9 : SEMANTIC SIMILARITY OUTLINE Semantic Relations Semantic Similarity Levels Sense Level Word Level Text Level WordNet-based Similarity Methods Hybrid Methods Similarity

More information

Linked Data Evolving the Web into a Global Data Space

Linked Data Evolving the Web into a Global Data Space Linked Data Evolving the Web into a Global Data Space Anja Jentzsch, Freie Universität Berlin 05 October 2011 EuropeanaTech 2011, Vienna 1 Architecture of the classic Web Single global document space Web

More information

Tools and Infrastructure for Supporting Enterprise Knowledge Graphs

Tools and Infrastructure for Supporting Enterprise Knowledge Graphs Tools and Infrastructure for Supporting Enterprise Knowledge Graphs Sumit Bhatia, Nidhi Rajshree, Anshu Jain, and Nitish Aggarwal IBM Research sumitbhatia@in.ibm.com, {nidhi.rajshree,anshu.n.jain}@us.ibm.com,nitish.aggarwal@ibm.com

More information

LIDER Survey. Overview. Number of participants: 24. Participant profile (organisation type, industry sector) Relevant use-cases

LIDER Survey. Overview. Number of participants: 24. Participant profile (organisation type, industry sector) Relevant use-cases LIDER Survey Overview Participant profile (organisation type, industry sector) Relevant use-cases Discovering and extracting information Understanding opinion Content and data (Data Management) Monitoring

More information

A Joint Model for Discovering and Linking Entities

A Joint Model for Discovering and Linking Entities A Joint Model for Discovering and Linking Entities Michael Wick Sameer Singh Harshal Pandya Andrew McCallum School of Computer Science University of Massachusetts Amherst MA {mwick, sameer, harshal, mccallum}@cs.umass.edu

More information

automatic digitization. In the context of ever increasing population worldwide and thereby

automatic digitization. In the context of ever increasing population worldwide and thereby Chapter 1 Introduction In the recent time, many researchers had thrust upon developing various improvised methods of automatic digitization. In the context of ever increasing population worldwide and thereby

More information

A service based on Linked Data to classify Web resources using a Knowledge Organisation System

A service based on Linked Data to classify Web resources using a Knowledge Organisation System A service based on Linked Data to classify Web resources using a Knowledge Organisation System A proof of concept in the Open Educational Resources domain Abstract One of the reasons why Web resources

More information