LODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.0 July 24, 2015

Similar documents
LODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.1 October 7, 2016

OwlExporter. Guide for Users and Developers. René Witte Ninus Khamis. Release 1.0-beta2 May 16, 2010

Advanced JAPE. Module 1. June 2017

OwlExporter. Guide for Users and Developers. René Witte Ninus Khamis. Release 2.1 December 26, 2010

Building the Multilingual Web of Data. Integrating NLP with Linked Data and RDF using the NLP Interchange Format

Semantic MediaWiki (SMW) for Scientific Literature Management

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper.

CSC 5930/9010: Text Mining GATE Developer Overview

August 2012 Daejeon, South Korea

Module 1: Information Extraction

Language Resources and Linked Data

DBpedia Data Processing and Integration Tasks in UnifiedViews

Tutorial on Text Mining for the Going Digital initiative. Natural Language Processing (NLP), University of Essex

The GATE Embedded API

Creating new Resource Types

Natural Language Processing. SoSe Question Answering

Implementing a Variety of Linguistic Annotations

Advanced JAPE. Mark A. Greenwood

The Luxembourg BabelNet Workshop

Advanced GATE Applications

Natural Language Interfaces to Ontologies. Danica Damljanović

Outline. 1 Introduction. 2 Semantic Assistants: NLP Web Services. 3 NLP for the Masses: Desktop Plug-Ins. 4 Conclusions. Why?

D4.6 Data Value Chain Database v2

Assessing the quality factors found in in-line documentation written in natural language: The JavadocMiner

Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids. Marek Lipczak Arash Koushkestani Evangelos Milios

Interactive Knowledge Stack A Software Architecture for Semantic Content Management Systems

Module 3: GATE and Social Media. Part 4. Named entities

University of Sheffield NLP. Exercise I

OpenTrace Workbench. Guide for Users and Developers. René Witte Elian Angius. Release

Linked Open Data Cloud. John P. McCrae, Thierry Declerck

introduction to using Watson Services with Java on Bluemix

Question Answering Systems

A tool for Cross-Language Pair Annotations: CLPA

What s in this paper? Combining Rhetorical Entities with Linked Open Data for Semantic Literature Querying

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi)

On a Java based implementation of ontology evolution processes based on Natural Language Processing

Named Entity Detection and Entity Linking in the Context of Semantic Web

Linked Data in Translation-Kits

Preserving Linked Data on the Semantic Web by the application of Link Integrity techniques from Hypermedia

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ???

@Note2 tutorial. Hugo Costa Ruben Rodrigues Miguel Rocha

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi) )

University of Sheffield, NLP. Chunking Practical Exercise

Re-using Cool URIs: Entity Reconciliation Against LOD Hubs

D4.1 PRELIMINARY VERSION OF THE TEXT ANALYSIS COMPONENT, INCLUDING: NER, EVENT DETECTION AND SENTIMENT ANALYSIS

Utilizing, creating and publishing Linked Open Data with the Thesaurus Management Tool PoolParty

Text Mining for Software Engineering

Utilizing Open Data for interactive knowledge transfer

3 Publishing Technique

Linked Data Evolving the Web into a Global Data Space

Incorporating ontological background knowledge into Information Extraction

A FRAMEWORK FOR MULTILINGUAL AND SEMANTIC ENRICHMENT OF DIGITAL CONTENT (NEW L10N BUSINESS OPPORTUNITIES) FREME WEBINAR HELD FOR GALA, 28 APRIL 2016

Semantic Web Fundamentals

Machine Learning in GATE

Exploring and Using the Semantic Web

University of Rome Tor Vergata DBpedia Manuel Fiorelli

Aim behind client server architecture Characteristics of client and server Types of architectures

Introduction to IE and ANNIE

Annotating Spatio-Temporal Information in Documents

BD003: Introduction to NLP Part 2 Information Extraction

Natural Language Processing for MediaWiki: The Semantic Assistants Approach

DBpedia Spotlight at the MSM2013 Challenge

The GATE Embedded API

DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

Automatically Annotating Text with Linked Open Data

Pig for Natural Language Processing. Max Jakob

Property graphs vs Semantic Graph Databases. July 2014

Fusing Corporate Thesaurus Management with Linked Data using PoolParty

1 Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Using Linked Data and taxonomies to create a quick-start smart thesaurus

University of Sheffield, NLP. Chunking Practical Exercise

Semantic Integration with Apache Jena and Apache Stanbol

Large Scale Semantic Annotation, Indexing, and Search at The National Archives Diana Maynard Mark Greenwood

GATE APIs. Track II, Module 6. Third GATE Training Course August September 2010

Knox Manage manages the following application types: Internal applications: Applications for internal use

JVA-563. Developing RESTful Services in Java

Installing and Running a Data Quality Loqate Standard Address Application

ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators

DBpedia-An Advancement Towards Content Extraction From Wikipedia

Let s get parsing! Each component processes the Doc object, then passes it on. doc.is_parsed attribute checks whether a Doc object has been parsed

Semantic Search at Bloomberg

Microsoft Dynamics AX This document describes the concept of events and how they can be used in Microsoft Dynamics AX.

COLLABORATIVE EUROPEAN DIGITAL ARCHIVE INFRASTRUCTURE

A Short Introduction to CATMA

Introduction to Information Extraction (IE) and ANNIE

MarkLogic Server. REST Application Developer s Guide. MarkLogic 9 May, Copyright 2017 MarkLogic Corporation. All rights reserved.

An Adaptive Framework for Named Entity Combination

NERD workshop. Luca ALMAnaCH - Inria Paris. Berlin, 18/09/2017

ALIADA, an Open Source Solution to Easily Publish Linked Data of Libraries and Museums

Stanbol Enhancer. Use Custom Vocabularies with the. Rupert Westenthaler, Salzburg Research, Austria. 07.

Informatica PowerExchange for Microsoft Azure Blob Storage 10.2 HotFix 1. User Guide

Information Extraction with GATE

Vision. Speech. Language. Knowledge. Search. Content Moderator. Computer Vision. Emotion. Face. Video. Speaker Recognition. Custom Recognition

A Korean Knowledge Extraction System for Enriching a KBox

A bit of theory: Algorithms

NLP Final Project Fall 2015, Due Friday, December 18

CHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING

FREME WEBINAR SLIDES CREATED FEBRUARY Presented. FREME Webinar February 2016

University of Sheffield, NLP Machine Learning

Addoro Local 3.0. Installation and Configuration

Transcription:

LODtagger Guide for Users and Developers Bahar Sateli René Witte Release 1.0 July 24, 2015 Semantic Software Lab Concordia University Montréal, Canada http://www.semanticsoftware.info

Contents 1 LODtagger 1 1.1 DBpediaTagger...................................... 1 1.1.1 Installation.................................... 2 1.1.2 Usage....................................... 3 1.1.3 Runtime parameters............................... 3 1.2 Implementation notes.................................. 4 1.2.1 DBpediaTagger.................................. 4 1.3 Troubleshooting..................................... 4 1.4 Changelog......................................... 6 About this document This document contains the documentation for the GATE LODtagger component. You can obtain the latest version from http://www.semanticsoftware.info/lodtagger. License The LODtagger component and resources are published under the GNU Lesser General Public License Version 3 (LGPL3). 1 Support For question or general comments, please use the online forum at http://www.semanticsoftware. info/forums/tools-resources-forum/durm-corpus-wiki-tools 1 LGPL3, https://www.gnu.org/licenses/lgpl-3.0.en.html iii

Chapter 1 LODtagger The LODtagger is a GATE component that provides entity linking from a document to their corresponding resource on the Linked Open Data (LOD) cloud. LODtagger relies on external tools to perform the actual content tagging and hides the complexity of communicating with LOD taggers, such as DBpedia Spotlight, from the perspective of pipeline developers. The current version of the LODtagger supports DBpedia Spotlight (see Section 1.1). We plan to integrate additional existing LOD tagging tools in the future. 1.1 DBpediaTagger The DBpediaTagger component sends the content of a GATE document to a pre-defined DBpedia Spotlight 1 RESTful endpoint and translates the web service results from JSON objects to GATE annotations. Figure 1.1: Example annotations generated by LODtagger using DBpedia Spotlight DBpedia Spotlight is an open-source Named Entity Recognition (NER) tool that matches the surface form of words in a text to their entry in the DBpedia knowledge base. Each annotated entity is linked to the LOD cloud through a URI, where additional, machine-readable information exists to describe the entity under investigation. Spotlight provides multiple web 1 DBpedia Spotlight, http://spotlight.dbpedia.org 1

Chapter 1 LODtagger services, such as Spotting (i.e., recognition of entities in text) and Disambiguate (i.e., linking of entities to DBpedia knowledge base URIs). In the current release of DBpediaTagger, we support the Annotate service, which combines the spotting and disambiguation services. For more information please check the Spotlight web services page. 2 The DBpediaTagger component allows the user to define the type of the annotations generated from the results through a run-time parameter (see Section 1.1.3). Along with the URI, other metadata, such as the confidence of the tool in linking the word to its correct sense is returned to the client. Therefore, for each result received from Spotlight, an annotation is added to the document, which includes several features (Figure 1.1): URI the URI of the entity in DBpedia knowledge base; similarityscore a numerical value between 0 to 1, showing the semantic similarity of the entity s in the document form to the corresponding entity described in the knowledge base. 1.1.1 Installation Prerequisites. To run DBpediaTagger, you must have GATE installed. Please refer to the GATE User Guide 3 for instructions on how to download and install GATE. The automated install method described below requires GATE version 8.1 or better. You also need access to a DBpedia Spotlight endpoint (see Section 1.1.2 for details). The examples and results shown in this document use a local installation of DBpedia Spotlight version 0.7 with a statistical model for English (en 2+2). Figure 1.2: Loading the DBpediaTagger Demo Application in GATE PR and Demo Pipeline. The distribution package creole.zip comes with the PR and a demo pipeline that you can install directly from within GATE: 2 Spotlight Web Services, https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/web-service 3 GATE User Guide, https://gate.ac.uk/sale/tao/split.html 2

1.1 DBpediaTagger 1. Start GATE, open the Plugin Manager (File Manage CREOLE Plugins...). 2. On the fourth tab (Configuration), select a writable directory for the local plugin installations and enable the Semantic Software Lab repository. 3. On the third tab (Available to install), you will now find DBpediaTagger. Select it and click on Apply All. 4. On the first tab (Installed Plugins), you will now find the DBpediaTagger plugin. Select it (either Load Now or Load Always ) and click on Apply All. To load the example pipeline (Figure 1.2), go to Applications Ready Made Applications Semantic Software Lab DBpediaTagger Demo Application. 4 1.1.2 Usage The DBpediaTagger component does not rely on any other processing resources in GATE, nor does it require pre-processing of the text. Simply add the DBpediaTagger processing resource to a corpus pipeline and it will annotate the document. However, in the demo pipeline, we added several other processing resources that help to find noun phrases in text and filter the Spotlight results to named entities, such as serviceoriented architecture. The noun phrases in text are detected by the MuNPEx 5 component. A JAPE transducer then filters out any tagged entity that does not fall within the boundary of a noun phrase. Note that in order to use the DBpediaTagger, you must have an instance of DBpedia Spotlight web service running on a local or remote connection and properly point the DBpediaTagger component to the web service s RESTful endpoint. For more information on how to install and run Spotlight, please refer to the DBpedia Spotlight user guide. 6 Note that the demo pipeline uses a public Spotlight web service by default, but for production use you should change that to a custom endpoint. 1.1.3 Runtime parameters In order to run the DBpediaTagger component, you can set the following run-time parameters: confidence the required confidence of the Spotlight tool in disambiguating a term; endpoint the URL of the annotate service on the server where the DBpedia Spotlight web service is running (the default is http://localhost:2222/rest/annotate, i.e., a local Spotlight installation); outputasname output annotation set; outputannotationname name (type) of the annotations generated from the results; and support is the minimum number of inlinks a DBpedia resource must have to be annotated. The higher the number, the more likely that popular terms are annotated and rare terms are missed in the text. 4 Please refer to the GATE user s guide, http://gate.ac.uk/sale/tao for further details on working with pipelines in GATE. 5 Multi-lingual Noun Phrase Extractor (MuNPEx), http://www.semanticsoftware.info/munpex 6 DBpedia Spotlight User Guide, https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/installation 3

Chapter 1 LODtagger 1.2 Implementation notes 1.2.1 DBpediaTagger Technically, DBpediaTagger is a wrapper program written in Java based on the GATE framework, which provides a simplified interface to interact with a DBpedia Spotlight web service from within the GATE environment. The component sends the entire UTF-8 formatted text of a document as a RESTful POST request to Spotlight. It accepts the results in JSON format and translates each JSON object in the response body to a new annotation in GATE. Each entity detected by Spotlight also bears the start and end offsets of its surface form in the text that directly matches its offsets in the GATE document. Since users can choose to define the name of the generated annotations, the DBpediaTagger creates a new annotation of the specified type within the spans defined by the offsets. Figure 1.3 shows an example JSON result that corresponds to the annotations shown in Figure 1.1. In our current implementation, in order save memory DBpediaTagger only preserves the URI and similarityscore features of each annotation and discards the rest. If you would like to store additional features in the generated annotations, you can extend the component s main class located in DBpediaTagger/src/info/semanticsoftware/lodtagger/ DBpediaTagger.java file and modify the following lines: @Override public void execute() throws ExecutionException { String doccontent = document.getcontent().tostring(); final SpotlightResult result = callspotlight(doccontent); final List<SpotlightResource> list = result.getresources(); AnnotationSet outputas = document.getannotations(outputasname); for (SpotlightResource rsrc : list) { FeatureMap feats = Factory.newFeatureMap(); feats.put("uri", rsrc.geturi()); feats.put("similarityscore", rsrc.getsimilarityscore()); // copy the line below for each feature and replace FEATURE NAME with the proper name, e.g. surfaceform feat.put(feature NAME, rsc.getfeature NAME()); if (rsrc.getsurfaceform() == null) { rsrc.setsurfaceform("null"); final long endoffset = rsrc.getoffset() + rsrc.getsurfaceform().length(); try { outputas.add(rsrc.getoffset(), endoffset, getoutputannotationname(), feats); catch (InvalidOffsetException e) { e.printstacktrace(); 1.3 Troubleshooting How do I know if my local Spotlight is running properly? If you are using a local installation of Spotlight, you can test its endpoint using your command line. Assuming that the Spotlight server is running on your http://localhost:2222, copy and paste the following command in your command line environment: curl -H "Accept: application/json" http://localhost:2222/rest/annotate \ --data-urlencode "text=albert Einstein was born in Elm, Germany." \ --data "confidence=0.2" \ --data "support=20" 4

1.3 Troubleshooting { "@text": "Albert Einstein was born in Elm, Germany.", "@confidence": "0.2", "@support": "20", "@types": "", "@sparql": "", "@policy": "whitelist", "Resources": [ { "@URI": "http://dbpedia.org/resource/albert_einstein", "@support": "3948", "@types": "DBpedia:Agent,Schema:Person,Http://xmlns.com/foaf/0.1/Person,DBpedia:Person, DBpedia:Scientist", "@surfaceform": "Albert Einstein", "@offset": "0", "@similarityscore": "0.9999999999298979", "@percentageofsecondrank": "5.292709205174733E-11", { "@URI": "http://dbpedia.org/resource/name_at_birth", "@support": "1688", "@types": "", "@surfaceform": "born", "@offset": "20", "@similarityscore": "0.6264053011004237", "@percentageofsecondrank": "0.4805726283431058", { "@URI": "http://dbpedia.org/resource/elm", "@support": "1142", "@types": "DBpedia:Species,DBpedia:Eukaryote,DBpedia:Plant", "@surfaceform": "Elm", "@offset": "28", "@similarityscore": "0.5151568678750613", "@percentageofsecondrank": "0.9006601466130938", { "@URI": "http://dbpedia.org/resource/germany", "@support": "182228", "@types": "Schema:Place,DBpedia:Place,DBpedia:PopulatedPlace,Schema:Country,DBpedia: Country", "@surfaceform": "Germany", "@offset": "33", "@similarityscore": "0.9986445869266131", "@percentageofsecondrank": "8.525875346410026E-4" ] Figure 1.3: JSON response from Spotlight for the document shown in Figure 1.1 5

Chapter 1 LODtagger The result from the endpoint should match the code in Figure 1.3. Note that depending on your Spotlight model, the results might vary in their confidence or similarity scores. I get java.net.connectexception: Connection refused when running the pipeline. This exception is thrown when the DBpediaTagger component cannot access the configured Spotlight RESTful endpoint. This means either the endpoint is offline or it in unaccessible (e.g., behind a firewall). 1.4 Changelog Version 1.0 (24.07.2015) Initial public release. 6