Named Entity Detection and Entity Linking in the Context of Semantic Web

Transcription:

[1/52] Concordia Seminar - December 2012 Named Entity Detection and Entity Linking in the Context of Semantic Web. Exploring the ambiguity question. Eric Charton, Ph.D.

[2/52] Concordia Seminar - December 2012 Challenge on semantic annotation Understanding language with computers?

[3/52] Concordia Seminar - December 2012 Challenge on semantic annotation Understanding language with computers! To build a high-level natural language understanding component, you need to know the exact identity and meaning of a word sequence. Example: analyzing a sentence with a hotel and its description. I want to book a room in a hotel located in the heart of Paris, just a stone's throw from the Eiffel Tower. Finding a set of valid hotel references will depend on the identification of the exact city reference.

[4/52] Concordia Seminar - December 2012 Challenge on semantic annotation The varied nature of textual objects. Definition: Semantic annotation is the identification of a textual entity with its meaning through a link to a graph description. The Semantic Annotation task consists of establishing a URI link between the text mention and a graph on the Semantic Web: this is entity linking.

[5/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question

[6/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question The essential problem of semantic annotation is related to the ambiguous nature of natural language.

[7/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question We can use various levels of information to understand a sentence and to disambiguate it. From the letter sequence to the meaning of the word:
1 Typographical and lexical information: a capital letter inside a sentence defines a proper name (ex: Montreal), punctuation separates units.
2 Part-of-speech tagging: grammatical information. Is the word a verb, an adjective, etc.?
3 Boundary and surface form: a group of words defines a lexical unit (ex: RMS Titanic).
4 Named Entity (NE): the semantic class of a lexical unit. Is it a PERSon? A PRODuct? Another class?
5 Entity meaning: the exact and unique ontological knowledge related to the word or the lexical unit.
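Level 1 above can be illustrated with a tiny heuristic. This is a hedged toy sketch, not a component of any system discussed in this talk: it simply flags capitalized words that do not start the sentence as proper-name candidates.

```python
def proper_name_candidates(sentence):
    """Toy level-1 heuristic: a capitalized word that is not the first
    token of the sentence is flagged as a proper-name candidate."""
    tokens = sentence.split()
    return [t.strip(".,") for i, t in enumerate(tokens)
            if i > 0 and t[:1].isupper()]

print(proper_name_candidates(
    "I want to book a room near the Eiffel Tower in Paris"))
# → ['Eiffel', 'Tower', 'Paris']
```

Typographical cues like this are only a first signal; they misfire on sentence-initial names and lowercase entities, which is why the further levels are needed.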

[8/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question Named Entities can be seen as a first level of semantic annotation. Specific nature of Named Entities: the NE detection task consists of assigning a class label to a lexical unit: pers.hum, loc.fac, org.com. The class label is unique and cannot be used to define semantic attributes of the NE. The set of Named Entity classes is not restricted: biological entities, object categories...

[9/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question Example The Novotel Paris Tour Eiffel has 764 rooms and is located in the heart of Paris, overlooking the Seine and just a stone's throw from the Eiffel Tower Named entity detection

[10/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question Example: Class disambiguation The Novotel Paris Tour Eiffel has 764 rooms and is located in the heart of Paris, overlooking the Seine and just a stone's throw from the Eiffel Tower Named entity detection Class selection? An asteroid (3317 Paris), a town, a ship, a movie?

[11/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question Example: Final class choice The Novotel Paris Tour Eiffel has 764 rooms and is located in the heart of Paris, overlooking the Seine and just a stone's throw from the Eiffel Tower A city -> LOC.ADMI class

[12/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question Example: But there is still an ambiguity... The Novotel Paris Tour Eiffel has 764 rooms and is located in the heart of Paris, overlooking the Seine and just a stone's throw from the Eiffel Tower Named entity detection Class attribution: a city LOC.ADMI Is there only one Paris?

[13/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question... inside the class. The Novotel Paris Tour Eiffel has 764 rooms and is located in the heart of Paris, overlooking the Seine and just a stone's throw from the Eiffel Tower Paris (Idaho) Paris (Tennessee) Paris (Ontario) Paris Paris (Kentucky) Paris (Maine) Remaining ambiguity is the main limitation of named entities for use in high-level text understanding components.

[14/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question The Named Entity class label is limited by the tree taxonomy. Named Entity taxonomy limitation, taxonomic sample: Paris -> loc -> loc.admi -> loc.admi.xxx. Semantic annotation is outside the scope of the named entity taxonomy: population? founders?... and which Paris? We need an external knowledge graph to introduce such identity-related information.

[15/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question In modern systems, semantic knowledge is provided by the Semantic Web content through its standards.

[16/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question The principle of entity linking is to define a link between a lexical unit and its representation on the Semantic Web. I want to book a room in a hotel located in the heart of Paris, just a stone's throw from the Eiffel Tower Paris LOC.ADMI http://dbpedia.org/resource/paris, dbpedia-owl:country, dbpedia:france dbpprop:decprecipitationdays, 11 (xsd:integer) dbpprop:urbanpop, 10354675 (xsd:integer)...

[17/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question Entity linking can solve the problem of remaining ambiguity. I want to book a room in a hotel located in the heart of Paris, just a stone's throw from the Eiffel Tower Named entity: Paris LOC.ADMI Paris representation in linked metadata: Paris (France), Population 2,203,817, Mayor B. Delanoë, Motto Fluctuat nec mergitur, Region Île-de-France, Area 2723 km... But we need complementary disambiguation techniques to choose the right link.

[18/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task The ambiguity question I want to book a room in a hotel located in the heart of Paris just a stone's throw from the Eiffel Tower. How many Eiffel Towers are located in the center of Paris with a hotel nearby?

[19/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task How many Eiffel towers in Paris cities?

[20/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task In Paris, Texas?

[21/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task In Paris, Tennessee?

[22/52] Concordia Seminar - December 2012 Difficulty of the semantic annotation task And why not in Paris, Las Vegas?

[23/52] Concordia Seminar - December 2012 Language is highly ambiguous Measuring ambiguity Intuition is never a friend in NLP. There is always a possibility of ambiguity: the Semantic Annotation task is impossible to solve globally with simple finite solutions (automata, rules). This can be shown by experiment.

[24/52] Concordia Seminar - December 2012 Language is highly ambiguous Measuring ambiguity Intuition is never a friend in NLP. Experiment on ambiguity: evaluation of ambiguity on a reference corpus, the TREC QA Corpus 2004. Generic questions like: How many members were in the crew of the Challenger? What kind of ship is the Liberty Bell? When was James Dean born? Annotation of all the identifiable concepts using lexical units matched by surface forms derived from Wikipedia.

[25/52] Concordia Seminar - December 2012 Language is highly ambiguous Experiment Intuition is never a friend in NLP. Evaluation of ambiguity on the TREC QA Corpus 2004:
350 queries
1126 lexical units (a unique concept defined by n words)
1076 lexical units with one or more potential matches in Wikipedia
Over 5000 propositions for 1076 candidates: a mean of 5.22 propositions for each lexical unit

[26/52] Concordia Seminar - December 2012 Language is highly ambiguous Experiment Intuition is never a friend in NLP.

[27/52] Concordia Seminar - December 2012 Architectures for entity linking systems

[28/52] Concordia Seminar - December 2012 Available methods The two steps of the process
1) Mention detection process: which word or lexical unit in the sentence has to be associated with a semantic link?
2) Semantic disambiguation process: choose from a knowledge base an instance with the right meaning according to a mention, and associate a link.

[29/52] Concordia Seminar - December 2012 The two steps of the process First step, mention detection (and class disambiguation). Mention detection techniques:
Inference: choosing some classes of entities to target (Wikimeta system, Stanford NE system)
Rules: using a syntactic schema to detect mentions (Xerox NE system)
Gazetteer: using a knowledge base of potential surface forms (DBpedia Spotlight system)
Hybridization of all these techniques gives the best results to prepare entity linking (experiments of the NERD platform: http://nerd.eurecom.fr).

[30/52] Concordia Seminar - December 2012 The two steps of the process Second step, semantic disambiguation: using a resource of potential context to compare with the entity to annotate. Wikipedia provides unique descriptions for millions of semantic concepts. This knowledge can be used to apply Information Retrieval (IR) algorithms. IR algorithms used for disambiguation:
Vector Space Model: Wikimeta system (cosine similarity), DBpedia Spotlight (Maximum Entropy).
Conditional probability: Kim system.
Accuracy of results can vary according to the task, mostly because of context availability.

[31/52] Concordia Seminar - December 2012 The two steps of the process Second step, semantic disambiguation The Vector Space Model: [figure: in a space of words 1..n, the mention's context vector is compared against the bag-of-words weight vectors of candidates 1..n; the angle θ between vectors measures their similarity]
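The vector space comparison described above can be written in a few lines of Python. This is a minimal sketch with made-up context words and weights, not the Wikimeta implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors
    represented as {word: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Context observed around the mention vs. two hypothetical candidate vectors
mention_ctx = {"seine": 1.0, "eiffel": 1.0, "tower": 1.0}
candidates = {
    "Paris": {"france": 342.0, "seine": 210.0, "eiffel": 53.0},
    "Paris,_Kentucky": {"kentucky": 140.0, "varden": 53.0, "bourbon": 37.0},
}
best = max(candidates, key=lambda c: cosine(mention_ctx, candidates[c]))
print(best)  # → Paris
```

The candidate whose weighted context points in the direction closest to the mention's context (smallest angle θ, largest cosine) wins; a candidate sharing no words with the context scores zero.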

[32/52] Concordia Seminar - December 2012 The two steps of the process Advantages of the two possible architectures
Named Entity Recognition (NER) prior to semantic disambiguation:
Robustness of detection.
Unknown or emergent concepts can receive a first level of semantic information (the NE class label).
All classes of concepts not covered by the NER system are unseen.
Simple mention detection prior to semantic disambiguation:
Virtually any concept can be annotated with a semantic link.
Increased ambiguity, which can reduce robustness.
Unknown or emergent concepts not available in the gazetteer list used for mention detection are unseen.

[33/52] Concordia Seminar - December 2012 The two steps of the process Example of a full system architecture - Wikimeta -

[34/52] Concordia Seminar - December 2012 Sample architecture
Mention detection: Named Entity Recognition
A classifier (CRF, SVM) is trained to make an inference regarding the context to extract, and to label named entities. Such a classifier can select labels from a limited number of classes (4 to 250) and only with a generic context (ex: a company -> ORG.COM; generic contextual words: Nyse, CAC40, Bilan, Effectifs, Shares, Revenue, etc.).
Disambiguation: Semantic labeling
Unlimited quantity of potential graphs: equal to the number of different concepts to identify. No universal context: a unique potential context corresponds to each different concept (ex: Paris (France) -> personalized contextual words: Seine, Tour Eiffel, etc.).

[35/52] Concordia Seminar - December 2012 Sample architecture Knowledge resource used for disambiguation
Linked Data Interface (LDI): statistical and conceptual knowledge.
For each unique concept, it stores all words of the potential context with their TF.IDF weights.
For each unique concept, it includes one or more links to the Linked Data network.
Built from resources extracted from the Web: Wikipedia provides 3.9M concepts with their word context. Each concept is associated with a set of surface forms matching lexical units. Correspondence tables between Wikipedia and DBpedia are used to collect links.
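How such TF.IDF-weighted context vectors might be produced can be sketched as follows. The function and the two miniature concept descriptions are illustrative assumptions, not the actual LDI build process:

```python
import math
from collections import Counter

def build_contexts(concept_docs):
    """For each concept, weight its context words by TF.IDF computed over
    the whole collection of concept descriptions.
    concept_docs: {concept: list of context words}."""
    n = len(concept_docs)
    df = Counter()  # in how many descriptions each word appears
    for words in concept_docs.values():
        df.update(set(words))
    return {
        concept: {w: tf * math.log(n / df[w]) for w, tf in Counter(words).items()}
        for concept, words in concept_docs.items()
    }

ldi = build_contexts({
    "Paris": ["seine", "seine", "eiffel", "city"],
    "Paris,_Kentucky": ["kentucky", "bourbon", "city"],
})
# "city" occurs in every description, so its idf (and weight) is 0;
# words specific to one concept keep a positive weight.
```

The effect is the one the slide relies on: generic words shared by many concepts are down-weighted, while discriminative context words dominate the comparison.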

Sample architecture Linked Data Interface building [36/52] Concordia Seminar - December 2012

[37/52] Concordia Seminar - December 2012 Sample architecture Algorithm: Named Entity detection and linking
A NER tool detects NEs inside the text.
The surface form of the NE is used to locate SE candidates in the Linked Data Interface.
A cosine similarity measure is computed between the context of the NE and those of the SE candidates.
If more than one candidate exists (ex: Paris (France), Paris (Ontario)...), the best cosine score selects the best SE instance.
A threshold value is used to reject low-scored candidates (presumed wrong identifications).
The finally retained LDI instance gives the semantic link between the NE and Linked Data.
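The linking steps above (candidate lookup by surface form, cosine scoring, threshold rejection) can be sketched end to end. The miniature LDI dictionary and the threshold value are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse {word: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def link(surface_form, mention_ctx, ldi, threshold=0.1):
    """Look up SE candidates by surface form, score them against the
    mention context, and return the best link, or None when the best
    score falls below the threshold (presumed wrong identification)."""
    scored = [(cosine(mention_ctx, ctx), uri)
              for uri, ctx in ldi.get(surface_form, {}).items()]
    if not scored:
        return None
    best_score, best_uri = max(scored)
    return best_uri if best_score >= threshold else None

# Hypothetical miniature LDI: surface form -> {URI: context word weights}
ldi = {"Paris": {
    "http://dbpedia.org/data/Paris.rdf": {"seine": 210.0, "eiffel": 53.0},
    "http://dbpedia.org/data/Paris,_Kentucky.rdf": {"kentucky": 140.0},
}}
print(link("Paris", {"seine": 1.0, "eiffel": 1.0, "tower": 1.0}, ldi))
# → http://dbpedia.org/data/Paris.rdf
```

With a context mentioning the Seine and the Eiffel Tower, the French Paris wins; with a context sharing no words with any candidate, the threshold rejects the link entirely.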

[38/52] Concordia Seminar - December 2012 Sample architecture Linked Data Interface (LDI) Mention detection using a CRF classifier I want to book a room in a hotel located in the heart of Paris, overlooking the Seine and just a stone's throw from the Eiffel Tower

[39/52] Concordia Seminar - December 2012 Sample architecture Linked Data Interface (LDI) I want to book a room in a hotel located in the heart of Paris, overlooking the Seine and just a stone's throw from the Eiffel Tower
Linked Data Interface (LDI) metadata containers (E):
Surface forms (E.r) | Words:TF.IDF (E.c) | LinkedData (E.rdf)
Paris, Paris New York | York:69, Cassvile:58, Oneida:52... | http://dbpedia.org/data/Paris,_New_York.rdf
Paris, Paname, Lutece | France:342, Seine:210, Eiffel:53... | http://dbpedia.org/data/Paris.rdf
Paris | Kentucky:140, Varden:53, Bourbon:37 | http://dbpedia.org/data/Paris,_Kentucky.rdf

[40/52] Concordia Seminar - December 2012 Sample architecture Linked Data Interface (LDI) I want to book a room in a hotel located in the heart of Paris, overlooking the Seine and just a stone's throw from the Eiffel Tower
Semantic Disambiguation Algorithm (SDA): Best CosineScore = cosine similarity measure (Words:TF.IDF, {Seine, Eiffel, Tower})
Linked Data Interface (LDI) metadata containers (E):
Surface forms (E.r) | Words:TF.IDF (E.c) | LinkedData (E.rdf)
Paris, Paris New York | York:69, Cassvile:58, Oneida:52... | http://dbpedia.org/data/Paris,_New_York.rdf
Paris, Paname, Lutece | France:342, Seine:210, Eiffel:53... | http://dbpedia.org/data/Paris.rdf
Paris | Kentucky:140, Varden:53, Bourbon:37 | http://dbpedia.org/data/Paris,_Kentucky.rdf

[41/52] Concordia Seminar - December 2012 Sample architecture Linked Data Interface (LDI) I want to book a room in a hotel located in the heart of Paris, overlooking the Seine and just a stone's throw from the Eiffel Tower
Semantic Disambiguation Algorithm (SDA): Best CosineScore = cosine similarity measure (Words:TF.IDF, {Seine, Eiffel, Tower})
Linked Data Interface (LDI) metadata containers (E):
Surface forms (E.r) | Words:TF.IDF (E.c) | LinkedData (E.rdf)
Paris, Paris New York | York:69, Cassvile:58, Oneida:52... | http://dbpedia.org/data/Paris,_New_York.rdf
Paris, Paname, Lutece | France:342, Seine:210, Eiffel:53... | http://dbpedia.org/data/Paris.rdf
Paris | Kentucky:140, Varden:53, Bourbon:37 | http://dbpedia.org/data/Paris,_Kentucky.rdf
Semantic Link -> Linked Data

[42/52] Concordia Seminar - December 2012 Experiments Experiments and results

[43/52] Concordia Seminar - December 2012 Experiments Evaluation plan and method There is no standard evaluation schema for applications like the one described here. We evaluated our system with an improved standard NER test corpus. The corpus is made of two corpora from French and English evaluation campaigns (ESTER 2 and CoNLL 2008). To each NE in the corpus, we associate a standard Linked Data URI coming from DBpedia.

[44/52] Experiments: test corpora

Word         | POS      | NE       | Semantic Link
il           | PRO:PER  | UNK      |
est          | VER:pres | UNK      |
20           | NUM      | TIME     |
heures       | NOM      | TIME     |
a            | PRP      | UNK      |
Johannesburg | NAM      | LOC.ADMI | http://dbpedia.org/data/Johannesburg.rdf

Table: Sample annotation of the French ESTER 2 NE test corpus.

Word  | POS | NE       | Semantic Link
Laura | NNP | PERS.HUM | NORDF
Colby | NNP | PERS.HUM |
in    | IN  | UNK      |
Milan | NNP | LOC.ADMI | http://dbpedia.org/data/milan.rdf

Table: Sample annotation of the English CoNLL 2008 test corpus.

[45/52] Experiments: coverage of the Linked Data Interface
Not every NE in a text document has a corresponding representation in the LDI. The following table shows the coverage of the metadata built in the LDI with respect to the NEs contained in the test corpora.

ESTER 2 2009 (French):
Labels | Entities in test corpus | Equivalent entities in LDI | Coverage (%)
PERS   | 1096 | 483  | 44%
ORG    | 1204 | 764  | 63%
LOC    | 1218 | 1017 | 83%
PROD   | 59   | 23   | 39%
Total  | 3577 | 2287 | 64%

WSJ CoNLL 2008 (English):
Labels | Entities in test corpus | Equivalent entities in LDI | Coverage (%)
PERS   | 612  | 380  | 62%
ORG    | 1698 | 1129 | 66%
LOC    | 739  | 709  | 96%
PROD   | 61   | 60   | 98%
Total  | 3110 | 2278 | 73%

[46/52] Experiments: test protocol
The LDI is applied to link NEs to the Linked Data network in two configurations:
- "no α" test mode: only the NEs covered by the LDI are used.
- "α" test mode: all the NEs of the test corpora are used, together with a threshold value α.
Recall is calculated on the NE / semantic-link pairs.
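The two configurations can be sketched as a single linking decision. The function name and the concrete α value are assumptions for illustration; the slides do not give the threshold used.

```python
ALPHA = 0.2  # hypothetical threshold value; the actual α is not given in the slides

def link_ne(best_uri, best_score, covered, alpha=None):
    """Return a semantic link for an NE, or None when no link is produced.

    alpha=None  -> "no α" mode: only NEs covered by the LDI are attempted.
    alpha=value -> "α" mode: every NE is scored, and the best link is kept
                   only when its cosine score reaches the threshold.
    """
    if alpha is None:
        return best_uri if covered else None
    return best_uri if best_score >= alpha else None
```

In "α" mode the system can thus refuse to link an NE whose best candidate scores poorly, which is how all NEs (covered or not) can be included in the evaluation.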

[47/52] Experiments: results

French tests:
NE    | [no α] n | Recall | [α] n | Recall
PERS  | 483  | 0.96 | 1096 | 0.91
ORG   | 764  | 0.91 | 1204 | 0.90
LOC   | 1017 | 0.94 | 1218 | 0.92
PROD  | 23   | 0.60 | 59   | 0.50
Total | 2287 | 0.93 | 3577 | 0.90

English tests:
NE    | [no α] n | Recall | [α] n | Recall
PERS  | 380  | 0.93 | 612  | 0.94
ORG   | 1129 | 0.85 | 1608 | 0.86
LOC   | 709  | 0.84 | 739  | 0.82
PROD  | 60   | 0.85 | 61   | 0.85
Total | 2278 | 0.86 | 3020 | 0.86

Table: Results of the semantic labeler applied on the ESTER 2 and WSJ CoNLL 2008 test corpora.
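As a sanity check on the table (my own arithmetic, not from the slides), the "Total" recall should equal the entity-weighted average of the per-class recalls in each column:

```python
# (entity count, recall) per NE class, from the "no α" columns of the table.
FRENCH_NO_ALPHA = {
    "PERS": (483, 0.96), "ORG": (764, 0.91),
    "LOC": (1017, 0.94), "PROD": (23, 0.60),
}
ENGLISH_NO_ALPHA = {
    "PERS": (380, 0.93), "ORG": (1129, 0.85),
    "LOC": (709, 0.84), "PROD": (60, 0.85),
}

def weighted_recall(rows):
    """Entity-weighted average of per-class recalls."""
    total = sum(n for n, _ in rows.values())
    return sum(n * r for n, r in rows.values()) / total
```

Both totals round to the values reported in the table (0.93 for French, 0.86 for English), so the per-class and aggregate figures are mutually consistent.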

[48/52] Conclusions

[49/52] Conclusions: about the evolution of the semantic annotation task
The Named Entity Recognition task could soon be replaced by the Entity Linking task:
- The NE class label can be found in the knowledge base through the link.
- One remaining advantage of NER is its ability to discover emergent concepts.
- Entity linking based on surface-form detection offers more detection possibilities (common words, specific word classes such as animals or organisms).
- NER systems still offer better accuracy than simple Entity Linking systems: there is work to do to improve robustness.

[50/52] Conclusions: about the evolution of the semantic annotation task
The emergence of open and free structured resources such as Wikipedia and Semantic Web repositories now defines the nature of the semantic annotation task:
- A Semantic Web URI is the de facto standard for annotation.
- Wikipedia is the standard resource for building disambiguation resources.
- Wikipedia is the standard resource for building mention-detection resources.

[51/52] Conclusions: about the evolution of the semantic annotation task
There are still issues to solve:
- Evaluating semantic annotation tools with standard resources is still an open research topic.
- Without enough context, semantic annotation systems still have problems: the Siri (and friends) problem.
- Semantic disambiguation in low-context settings using reasoning is a promising new research perspective.

[52/52] Conclusions: make your own experiments!
- www.wikimeta.com: semantic annotation tool with NER mention detection (free for students).
- dbpedia.org/spotlight: semantic annotation tool with surface-form mention detection (free).
- www.nlgbase.org: semantic disambiguation resource (free CC license).
- nerd.eurecom.fr: easy-to-use tool for comparing semantic annotation systems.
Thank you