University of Sheffield, NLP GATE: Bridging the Gap between Terminology and Linguistics

1 GATE: Bridging the Gap between Terminology and Linguistics Diana Maynard University of Sheffield, UK

2 Why do terminologists need GATE?
- Terminologists face a lack of suitable tools to process their data.
- Lots of in-house tools for doing individual things
- Lack of common tools that can be used collaboratively and across different systems and domains.
- Tools must be flexible, robust and able to adapt to different processing tasks and languages
- GATE and its components are a key tool in today's world of information and data overload
- They enable users to perform tasks such as document management, business intelligence, information retrieval, question answering, and knowledge indexing, modelling and conceptualisation.

3 GATE can help terminologists:
- Save time and money on management of text and data from multiple sources
- Find hidden links scattered across huge volumes of diverse information
- Integrate structured data from a variety of sources
- Interlink text and data
- Collect information and extract new facts

4 A vision for text mining
- It is difficult to access unstructured information efficiently
- IE automates extraction of facts from text at reasonable accuracy and cost, increasing the value and utility of unstructured content
- Interlinking of text and data enables more efficient search, navigation and querying
- Text analysis is a matter of engineering: GATE offers practical solutions able to match specific requirements

5 Threat tracking application

6 Text mining and semantic annotation
- Extract structured data from text by
  - Linking references to entities
  - Linking entities to their semantic descriptions
- Automatic semantic annotation based on IE technology
- Attaches metadata to documents, which can be used for searching and hyperlinking
- Adds value to content of libraries, enabling user interaction with content
- Enhanced capability for cross-referencing and dynamic document classification

7 Semantic Annotation

8 Semantic Annotation of Entities
- Recognition of the type of the entities in the text from a rich taxonomy of classes
- Reference to their semantic description.
- Traditional NE recognition approach results in:
  <Person>Lama Ole Nydahl</Person>
- Semantic Annotation of NEs results in:
  <ReligiousPerson ID= > Lama Ole Nydahl </ReligiousPerson>

9 GATE: the Swiss Army Knife of NLP
- Has an attachment for almost every eventuality
- Some are hard to prise open
- Some are useful, but you might have to put up with a bit of clunkiness in practice
- Some will only be useful once in a lifetime, but you're glad to have them just in case.
- There are many imitations, but nothing like the real thing.

10 History of GATE
- early 1990s: you want me to write that all over again?
- first GATE (and "large-scale IE") project
- 1996: GATE 1: Tcl/Tk, Perl, C++
- release of completely rewritten version 2, 100% Java
- 2009: mature ecosystem with established community
  - Tens of thousands of research users
  - 25,000 downloads per year
  - commercial users getting serious

11 GATE is very eco-friendly!

12 GATE commercial users
Typical commercial uses:
- dynamic search and indexing of repositories
- finding relations between elements in distributed repositories
- aggregating information from different text sources
- populating repositories
- fact finding from distributed knowledge sources
Typical users: Pharmaceutics, news, intelligence (business, competitor, government, etc.), manufacturing, telecommunications

21 So what exactly is GATE?
- An architecture: a macro-level organisational picture for HLT software systems.
- A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
- A development environment: for language engineers, computational linguists et al., a graphical development environment.
- A community of users and contributors

22 Architectural principles
- Non-prescriptive, theory neutral (strength and weakness)
- Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Yale...)
- (Almost) everything is a component, and component sets are user-extendable
- (Almost) all operations are available both from API and GUI

23 In short
GATE includes:
- components for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages...
- tools for visualising and manipulating text, annotations, ontologies, parse trees, etc.
- various information extraction tools
- evaluation and benchmarking tools

24 Algorithms + Data + GUI = Applications
GATE components are one of three types:
- Language Resources (LRs), e.g. lexicons, corpora, ontologies
- Processing Resources (PRs), e.g. parsers, generators, taggers
- Visual Resources (VRs), i.e. visualisation and editing components
Algorithms are separated from the data, which means:
- the two can be developed independently by users with different expertise.
- alternative resources of one type can be used without affecting the other, e.g. a different visual resource can be used with the same language resource
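The separation of algorithms from data can be illustrated with a small Python sketch. This is not GATE's Java API: the class names `Document`, `WhitespaceTokeniser`, `CapitalisedWordTagger` and `Pipeline` are hypothetical stand-ins for a language resource, two processing resources and an application. The point is only that processing resources share one narrow interface and can be swapped without touching the data class.

```python
import re


class Document:
    """A toy language resource: text plus stand-off annotations."""

    def __init__(self, text):
        self.text = text
        self.annotations = []

    def add(self, ann_type, start, end):
        self.annotations.append({"type": ann_type, "start": start, "end": end})

    def spans(self, ann_type):
        """Return the text covered by every annotation of the given type."""
        return [self.text[a["start"]:a["end"]]
                for a in self.annotations if a["type"] == ann_type]


class WhitespaceTokeniser:
    """A toy processing resource: one Token per run of non-space characters."""

    def execute(self, doc):
        for match in re.finditer(r"\S+", doc.text):
            doc.add("Token", match.start(), match.end())


class CapitalisedWordTagger:
    """A second processing resource that builds on existing Token annotations."""

    def execute(self, doc):
        tokens = [a for a in doc.annotations if a["type"] == "Token"]
        for token in tokens:
            if doc.text[token["start"]].isupper():
                doc.add("Capitalised", token["start"], token["end"])


class Pipeline:
    """Runs processing resources in order over every document in a corpus."""

    def __init__(self, processing_resources):
        self.processing_resources = list(processing_resources)

    def run(self, corpus):
        for doc in corpus:
            for pr in self.processing_resources:
                pr.execute(doc)


doc = Document("GATE runs in Sheffield")
Pipeline([WhitespaceTokeniser(), CapitalisedWordTagger()]).run([doc])
print(doc.spans("Capitalised"))
```

Replacing `WhitespaceTokeniser` with another object that has an `execute(doc)` method would leave `Document` and `Pipeline` unchanged, which is the property the slide describes.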

25 But isn't GATE just about IE?
- Many people think of GATE as an IE tool
- IE is its primary function, but it also does a lot more
- Pretty much any kind of linguistic processing can be done in GATE
- The only field we really don't cover is Machine Translation, but you could easily add components for that if you wanted
- More about the other functionality later, but now back to IE...

26 Two Approaches to IE
Knowledge Engineering
- rule based
- developed by experienced language engineers
- make use of human intuition
- obtain marginally better performance
- development could be very time consuming
- some changes may be hard to accommodate
Learning Systems
- use statistics or other machine learning
- developers do not need LE expertise
- requires large amounts of annotated training data
- some changes may require re-annotation of the entire training corpus

27 Named Entity Recognition
- Named Entity recognition is the cornerstone of IE
- Identification of proper names in texts, and classification into a set of predefined categories of interest.
- Three universally accepted categories: person, location and organisation
- Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc.), addresses etc.
- Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.

28 ANNIE
- ANNIE is GATE's rule-based IE system
- It uses the language engineering approach (though we also have tools in GATE for ML)
- Distributed as part of GATE
- Uses a finite-state pattern-action rule language, JAPE
- More on JAPE later...
- ANNIE contains a reusable and easily extendable set of components:
  - generic preprocessing components for tokenisation, sentence splitting etc.
  - components for performing NE on general open domain text

29 ANNIE Modules

30 Unicode Tokeniser
- Bases tokenisation on Unicode character classes
- Language-independent tokenisation
- Declarative token specification language, e.g.:
  "UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token; orthography=upperinitial; kind=word
- Identifies words, numbers, spaces, different classes of punctuation, orthography
- Recognition is deliberately basic, so that more powerful tools (JAPE) can be used for finer distinctions and the tokeniser has greater reuse possibilities
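Tokenisation by Unicode character class can be sketched in a few lines of Python with the standard `unicodedata` module. This is an illustration of the idea, not GATE's tokeniser or its rule language: the function names, the four coarse classes and the orthography labels are assumptions made for the example.

```python
import unicodedata
from itertools import groupby


def char_class(ch):
    """Map a character to a coarse class using its Unicode general category."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "word"
    if category.startswith("N"):
        return "number"
    if category.startswith("Z") or ch.isspace():
        return "space"
    return "punctuation"


def orthography(word):
    """Describe the capitalisation pattern of a non-empty word token."""
    if word.isupper():
        return "allCaps"
    if word[0].isupper() and word[1:].islower():
        return "upperInitial"
    if word.islower():
        return "lowercase"
    return "mixedCaps"  # also the fallback for scripts without case


def tokenise(text):
    """Split text into maximal runs of characters that share one class."""
    tokens = []
    offset = 0
    for kind, run in groupby(text, key=char_class):
        string = "".join(run)
        token = {"start": offset, "end": offset + len(string),
                 "string": string, "kind": kind}
        if kind == "word":
            token["orthography"] = orthography(string)
        tokens.append(token)
        offset += len(string)
    return tokens


for token in tokenise("Sheffield is 50 miles away."):
    print(token)
```

Because the classes come from Unicode categories rather than a hand-written alphabet, the same function splits text in other scripts without change. Finer distinctions are left to later stages, in the spirit of the slide.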

31 Gazetteer
- Set of lists compiled into Finite State Machines
- 60k entries in 80 types
- List entries are matched in the text as Lookup annotations
- Each list has some pre-defined features, which enable different kinds of matches to be identified
- Additional arbitrary features and values can be added to individual list entries
- Entries can be matched according to root forms, or more flexibly based on e.g. edit distance
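A minimal Python sketch of list lookup follows. It stands in a compiled regular expression for the finite state machines the slide mentions, and the list contents, the `build_index` and `lookup` names, and the exact feature values are invented for the example. Only the `Lookup` annotation type and the `majorType` and `minorType` feature names are taken from the JAPE examples in this deck.

```python
import re

# Hypothetical gazetteer lists, keyed by (majorType, minorType).
GAZETTEER_LISTS = {
    ("location", "city"): ["Sheffield", "New York", "York"],
    ("organization", "company"): ["Whitbread"],
}


def build_index(lists):
    """Flatten the lists into a mapping from entry string to its features."""
    index = {}
    for (major, minor), entries in lists.items():
        for entry in entries:
            index[entry] = {"majorType": major, "minorType": minor}
    return index


def lookup(text, index):
    """Return Lookup annotations for every list entry found in the text."""
    # Longest entries first, so that "New York" is preferred over "York".
    entries = sorted(index, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, entries)) + r")\b")
    annotations = []
    for match in pattern.finditer(text):
        annotation = {"type": "Lookup", "start": match.start(),
                      "end": match.end(), "string": match.group(0)}
        annotation.update(index[match.group(0)])
        annotations.append(annotation)
    return annotations


index = build_index(GAZETTEER_LISTS)
for annotation in lookup("Whitbread opened offices in New York and Sheffield.", index):
    print(annotation)
```

The sketch matches exact strings only. Matching on root forms or edit distance, as the slide describes, would need a different matching step.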

33 Limitations of gazetteers
- Gazetteer lists are designed for annotating simple, regular features
- Some flexibility is provided, but this is not enough for most tasks
- Recognising addresses using just a gazetteer would be impossible
- But combined with other linguistic pre-processing results, we have a whole lot of annotations and features
- POS tags, capitalisation, punctuation, lookup features, etc. can all be combined to form patterns suggesting more complex information
- Luckily, we have JAPE to take care of this.

34 What is JAPE?
- a Jolly and Pleasant Experience
- Specially developed pattern matching language for GATE
- Each JAPE rule consists of
  - LHS which contains patterns to match
  - RHS which details the annotations (and optionally features) to be created
- JAPE rules combine to create a phase
- Rule priority based on pattern length, rule status and rule ordering
- Phases combine to create a grammar

35 Named Entity Grammars
- Hand-coded rules written in JAPE applied to annotations to identify NEs
- Phases run sequentially and constitute a cascade of FSTs over annotations
- Annotations from format analysis, tokeniser, splitter, POS tagger, morphological analysis, gazetteer etc.
- Because phases are sequential, annotations can be built up over a period of phases, as new information is gleaned
- Standard named entities: persons, locations, organisations, dates, addresses, money
- Basic NE grammars can be adapted for new applications, domains and languages

36 JAPE example University of Sheffield

  Rule: nameduniversity
  (
    {Token.string == "University"}
    {Token.string == "of"}
    (
      {Lookup.minorType == city} |
      ({Token.category == NNP})+
    )
  ):orgname
  -->
  :orgname.organisation = {kind = "university", rule = "nameduniversity"}

Looks for the specific words "University of" followed by:
- a city name from the gazetteer, or
- one or more proper nouns
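To make the matching logic of this rule concrete, here is a rough Python equivalent that works on a list of token dictionaries. It is a sketch of the pattern only, not of how JAPE executes rules. The helper `tok`, the dictionary layout, the assumption that a city Lookup covers exactly one token, and the choice to take the longer of the two alternatives are all assumptions made for the example.

```python
def tok(string, category, lookup_minor=None):
    """Build a toy token: surface string, POS category, optional Lookup minorType."""
    return {"string": string, "category": category, "lookup_minor": lookup_minor}


def match_named_university(tokens):
    """Find 'University of' followed by a city Lookup or one or more NNP tokens."""
    found = []
    i = 0
    while i + 2 < len(tokens):
        if tokens[i]["string"] == "University" and tokens[i + 1]["string"] == "of":
            first = i + 2
            # Alternative 1: a single token that carries a city Lookup.
            city_end = first + 1 if tokens[first]["lookup_minor"] == "city" else first
            # Alternative 2: one or more proper nouns.
            nnp_end = first
            while nnp_end < len(tokens) and tokens[nnp_end]["category"] == "NNP":
                nnp_end += 1
            end = max(city_end, nnp_end)
            if end > first:
                # Token index span [start, end) plus the features set by the rule's RHS.
                found.append({"type": "organisation", "start": i, "end": end,
                              "kind": "university", "rule": "nameduniversity"})
                i = end
                continue
        i += 1
    return found


sentence = [tok("She", "PRP"), tok("studied", "VBD"), tok("at", "IN"),
            tok("the", "DT"), tok("University", "NNP"), tok("of", "IN"),
            tok("Sheffield", "NNP", "city"), tok(".", ".")]
for annotation in match_named_university(sentence):
    words = [t["string"] for t in sentence[annotation["start"]:annotation["end"]]]
    print(" ".join(words), annotation["kind"])
```

The second alternative is what lets the pattern cover names such as "University of East Anglia", where no gazetteer entry is present but the trailing tokens are tagged as proper nouns.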

37 Combining existing annotations
Associate a company with a share price, e.g. "Whitbread shares closed up 2p at 645p."

  Phase: Shares
  Input: Token Organization Lookup Money Percent
  Options: control = appelt

  Rule: ShareChange
  (
    {Organization}
    ({Token})[0,3]
    {Lookup.majorType == "change"}
    ({Token})[0,3]
    ({Money} | {Percent})
  ):change
  -->
  :change.sharechange = {rule = "ShareChange"}

38 Orthomatcher
- Orthographic coreference between annotations in the same document, e.g. Mr Brown, James Brown
- Matching rules are invoked between annotations of the same type, or between an existing annotation and an Unknown annotation
- The latter is the only case where an annotation type can be changed
- Lookup tables of aliases and exceptions (i.e. overriding of matching rules)
- Also PRs for pronominal and nominal coreference
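One simple orthographic matching rule, together with an alias table that overrides it, might look like the Python sketch below. The rule, the title list and the alias entry are illustrative assumptions. The real Orthomatcher has its own rule set, which this does not attempt to reproduce.

```python
# Hypothetical tables for the example.
TITLES = {"Mr", "Mrs", "Ms", "Dr", "Prof"}
ALIASES = {frozenset({"IBM", "International Business Machines"})}


def normalise(name):
    """Split a name into tokens, dropping full stops and leading titles."""
    return [t for t in name.replace(".", "").split() if t not in TITLES]


def ortho_match(a, b):
    """Decide whether two names of the same type plausibly corefer."""
    # The alias table takes precedence over the orthographic rule.
    if frozenset({a, b}) in ALIASES:
        return True
    tokens_a, tokens_b = normalise(a), normalise(b)
    if not tokens_a or not tokens_b:
        return False
    if tokens_a == tokens_b:
        return True
    shorter, longer = sorted((tokens_a, tokens_b), key=len)
    # A bare surname matches a longer name that ends with the same surname.
    return len(shorter) == 1 and shorter[0] == longer[-1]


print(ortho_match("Mr Brown", "James Brown"))
print(ortho_match("James Brown", "James Smith"))
```

An exceptions table could be added in the same way as `ALIASES`, checked first and returning `False` for pairs that must never be linked.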

39 What about other languages?
- Since we're based in Sheffield, you can't blame us for developing GATE primarily for English
- But contrary to popular belief about the British, we don't hate all foreigners!
- And we have lots of capabilities for processing in other languages
- Currently systems for English, French, German, Romanian, Bulgarian, Russian, Cebuano, Hindi, Chinese, Arabic
- You have a POS tagger for Swahili? Just add it as a plugin and combine it with existing tokeniser etc.

40 It's all Chinese to me...

41 Processing multiple languages
- If you have a language identifier PR, you can combine processing of texts in different languages in a single application
- The system will choose the right PRs for each document or document section
- Conditional application fires a PR if some condition is met
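The idea of a conditional application can be sketched in Python as a list of (condition, processing resource) pairs. Everything here is a toy: the stopword-based language identifier, the `lang` feature name and the PR names are assumptions for the example, not GATE components.

```python
# Tiny stopword lists for a toy language identifier.
STOPWORDS = {
    "en": {"the", "and", "of", "is"},
    "de": {"der", "die", "und", "ist"},
}


def new_doc(text):
    return {"text": text, "features": {}, "applied": []}


def identify_language(doc):
    """Set a 'lang' feature by counting stopword hits (ties go to the first language)."""
    words = set(doc["text"].lower().split())
    doc["features"]["lang"] = max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))


def make_pr(name):
    """Create a dummy processing resource that just records that it ran."""
    def pr(doc):
        doc["applied"].append(name)
    return pr


def lang_is(code):
    return lambda doc: doc["features"].get("lang") == code


def always(doc):
    return True


PIPELINE = [
    (always, identify_language),
    (lang_is("en"), make_pr("english_pos_tagger")),
    (lang_is("de"), make_pr("german_pos_tagger")),
    (always, make_pr("gazetteer")),
]


def run(doc, pipeline=PIPELINE):
    """Fire each processing resource only if its condition holds for the document."""
    for condition, pr in pipeline:
        if condition(doc):
            pr(doc)
    return doc


print(run(new_doc("The University of Sheffield is in the north"))["applied"])
print(run(new_doc("Die Stadt ist der Sitz und die Heimat"))["applied"])
```

Conditions are evaluated at run time, after the language identifier has set its feature, so one application definition serves documents in either language.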

42 Other plugins
- Parsers (Stanford, MiniPar, RASP, SUPPLE)
- More flexible gazetteers
- Specialised NE (Chemistry, Biomedicine, etc.)
- PRs for other languages, Alignment
- Lemmatisers, morphological analyser, NP and VP chunkers
- Machine Learning
- Evaluation toolkit including IAA
- IR, Google and Yahoo search engines, web crawlers
- WordNet
- Whole host of ontology-based tools

43 Alignment plugin

44 GATE in use We have dozens of applications, not all just research projects! A few examples...

45 Semantic Annotation Adding information to documents that is usable by machines to enable better presentation, navigation or searching, e.g. Perseus:

47 Indexing news at the BBC
- BBC Archives: 'Newsnight' archiving time is 8 hours per hour
- Automatic transcription to extract some potential indexing terms
  - Result: temporally precise, but very noisy data
- Partial solution: search the web, intranet, digital library for related pages, and process with IE/SA
  - Result: less noisy but temporally imprecise
- So we merge this information with the speech signal data
  - Result: works well for easy stuff (high precision, low recall)

49 Ontology linking at FAO
- FAO have sets of fisheries-related ontologies, e.g. gear, species, fishing areas
- No way to link between them using ontology alignment techniques, because we require information external to the ontology (fish lives in a particular area)
- NLP techniques make use of information from documents which provide this missing link
- Not always an exact match between text and the ontology elements, e.g. Mummichogs vs. fundulus heteroclitus
- Use techniques such as headword matching, noun phrase chunking, synonym and acronym finding, etc.
- Find relations in the text to link the entities together

50 Ontology linking at FAO
Diagram: Species linked to Fishing Gear (caught_by), Fishing Area (found_in) and Commodities (basis_of)

51 Matching text descriptions
- Find NPs and terms; use OntoRootGazetteer to find morphological variants of ontology elements, perform headword and synonym matching etc.
  Pelagic species, mainly fish and cephalopods, northern shrimp (also small crustaceans, krill
- Match text span to ontology instance, retaining URIs
- Create annotations and features, e.g. caught_by = {gear_type = midwater otter trawls, target_species = cephalopods}
- Convert to RDF triples
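The last step, turning an annotation and its features into an RDF triple, can be sketched as follows. The namespace `http://example.org/fisheries#` is a placeholder, and minting URIs from labels is a shortcut for the example. In the approach described here the URIs come from the matched ontology instances. The triple direction (species, caught_by, gear) is an assumption based on the relation names in this deck.

```python
BASE = "http://example.org/fisheries#"  # hypothetical namespace


def to_uri(label):
    """Mint a URI from a label (a stand-in for the retained ontology URI)."""
    return BASE + label.strip().replace(" ", "_")


def annotation_to_triple(annotation):
    """Turn a caught_by annotation into (species, relation, gear)."""
    features = annotation["features"]
    return (to_uri(features["target_species"]),
            to_uri(annotation["type"]),
            to_uri(features["gear_type"]))


def to_ntriples(triples):
    """Serialise triples whose terms are all URIs in N-Triples syntax."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


annotation = {
    "type": "caught_by",
    "features": {"gear_type": "midwater otter trawls",
                 "target_species": "cephalopods"},
}
print(to_ntriples([annotation_to_triple(annotation)]))
```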

53 Using ANNIC to view results

54 Outsmarting our competitors

55 If you can't beat 'em, join 'em
- UIMA
- OpenCalais
- LingPipe
All integrated into GATE as plugins

56 UIMA
- UIMA is an NL engineering platform developed by IBM
- Shares some functionality with GATE, but is complementary in most respects.
- Interoperability layer has been developed to allow UIMA applications to be run within GATE, and vice versa, in order to combine elements of both.
- Emphasis is on architectural support, including asynchronous scaleout (deploying many copies of an application in parallel)
- Much narrower range of resources provided than GATE

57 OpenCalais Web service for semantic annotation of text. The user submits a document to the web service, which returns entity and relations annotations in RDF, JSON or some other format. Typically, users integrate OpenCalais annotation of their web pages to provide additional links and semantic functionality. OpenCalais annotates both relations and entities, although the GATE plugin only supports entities.

58 LingPipe
- Provides a set of IE and data mining tools, largely ML-based.
- Has a set of models trained for particular tasks/corpora.
- Limited ontology support: can connect entities found to databases and ontologies
- Advantage: ML models can suggest more than one output, ranked by confidence. The user can choose the number of suggestions generated.
- Disadvantage: ML models only apply to specific tasks and domains.

59 In summary...
- We like to think GATE is the best thing since sliced bread for most NLP and terminology tasks
- You can use it for plenty of other things too, don't let us stop you being creative!
- Incorporates a huge number of plugins, is easily extendable and highly customisable
- The only limit is your imagination...
- So if you're now convinced you can't live without GATE, there are two possibilities:
  - ask us to get involved with a project
  - try GATE yourself

60 Get your own hands dirty
- We run 3x yearly training courses in Sheffield and other selected locations
- Different tracks available
- GATE certification available

61 More info, contact details, demos, publications:
Now it's time to nudge your neighbour if they are asleep...
Or ask that burning question about GATE.


More information

Advanced JAPE. Module 1. June 2017

Advanced JAPE. Module 1. June 2017 Advanced JAPE Module 1 June 2017 c 2017 The University of Sheffield This material is licenced under the Creative Commons Attribution-NonCommercial-ShareAlike Licence (http://creativecommons.org/licenses/by-nc-sa/3.0/)

More information

ANC2Go: A Web Application for Customized Corpus Creation

ANC2Go: A Web Application for Customized Corpus Creation ANC2Go: A Web Application for Customized Corpus Creation Nancy Ide, Keith Suderman, Brian Simms Department of Computer Science, Vassar College Poughkeepsie, New York 12604 USA {ide, suderman, brsimms}@cs.vassar.edu

More information

JENA: A Java API for Ontology Management

JENA: A Java API for Ontology Management JENA: A Java API for Ontology Management Hari Rajagopal IBM Corporation Page Agenda Background Intro to JENA Case study Tools and methods Questions Page The State of the Web Today The web is more Syntactic

More information

Natural Language Interfaces to Ontologies. Danica Damljanović

Natural Language Interfaces to Ontologies. Danica Damljanović Natural Language Interfaces to Ontologies Danica Damljanović danica@dcs.shef.ac.uk Sponsored by Transitioning Applications to Ontologies: www.tao-project.eu GATE case study in TAO project collect software

More information

LIDER Survey. Overview. Number of participants: 24. Participant profile (organisation type, industry sector) Relevant use-cases

LIDER Survey. Overview. Number of participants: 24. Participant profile (organisation type, industry sector) Relevant use-cases LIDER Survey Overview Participant profile (organisation type, industry sector) Relevant use-cases Discovering and extracting information Understanding opinion Content and data (Data Management) Monitoring

More information

Historical Text Mining:

Historical Text Mining: Historical Text Mining Historical Text Mining, and Historical Text Mining: Challenges and Opportunities Dr. Robert Sanderson Dept. of Computer Science University of Liverpool azaroth@liv.ac.uk http://www.csc.liv.ac.uk/~azaroth/

More information

Semantic Annotation, Search and Analysis

Semantic Annotation, Search and Analysis Semantic Annotation, Search and Analysis Borislav Popov, Ontotext Ontology A machine readable conceptual model a common vocabulary for sharing information machine-interpretable definitions of concepts in

More information

Automatic Ontology-Based Document Annotation for Arabic Information Retrieval

Automatic Ontology-Based Document Annotation for Arabic Information Retrieval Islamic University-Gaza Deanery of Graduate Studies Faculty of Information Technology الجامعة اإلسالمية- غزة عمادة الدزاسات العليا كلية تكىىلىجيا المعلىمات بسم اهلل الرحمن الرحيم Automatic Ontology-Based

More information

Teamware: A Collaborative, Web-based Annotation Environment. Kalina Bontcheva, Milan Agatonovic University of Sheffield

Teamware: A Collaborative, Web-based Annotation Environment. Kalina Bontcheva, Milan Agatonovic University of Sheffield Teamware: A Collaborative, Web-based Annotation Environment Kalina Bontcheva, Milan Agatonovic University of Sheffield Outline Why Teamware? What s Teamware? Teamware for annotation Teamware for quality

More information

University of Sheffield, NLP Annotation and Evaluation

University of Sheffield, NLP Annotation and Evaluation Annotation and Evaluation Diana Maynard, Niraj Aswani University of Sheffield Topics covered Defining annotation guidelines Manual annotation using the GATE GUI Annotation schemas and how they change the

More information

Statistical Parsing for Text Mining from Scientific Articles

Statistical Parsing for Text Mining from Scientific Articles Statistical Parsing for Text Mining from Scientific Articles Ted Briscoe Computer Laboratory University of Cambridge November 30, 2004 Contents 1 Text Mining 2 Statistical Parsing 3 The RASP System 4 The

More information

Language Resources and Linked Data

Language Resources and Linked Data Integrating NLP with Linked Data: the NIF Format Milan Dojchinovski @EKAW 2014 November 24-28, 2014, Linkoping, Sweden milan.dojchinovski@fit.cvut.cz - @m1ci - http://dojchinovski.mk Web Intelligence Research

More information

CHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING

CHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING 94 CHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING 5.1 INTRODUCTION Expert locator addresses the task of identifying the right person with the appropriate skills and knowledge. In large organizations, it

More information

NLP Chain. Giuseppe Castellucci Web Mining & Retrieval a.a. 2013/2014

NLP Chain. Giuseppe Castellucci Web Mining & Retrieval a.a. 2013/2014 NLP Chain Giuseppe Castellucci castellucci@ing.uniroma2.it Web Mining & Retrieval a.a. 2013/2014 Outline NLP chains RevNLT Exercise NLP chain Automatic analysis of texts At different levels Token Morphological

More information

Semantics Isn t Easy Thoughts on the Way Forward

Semantics Isn t Easy Thoughts on the Way Forward Semantics Isn t Easy Thoughts on the Way Forward NANCY IDE, VASSAR COLLEGE REBECCA PASSONNEAU, COLUMBIA UNIVERSITY COLLIN BAKER, ICSI/UC BERKELEY CHRISTIANE FELLBAUM, PRINCETON UNIVERSITY New York University

More information

Knowledge Engineering with Semantic Web Technologies

Knowledge Engineering with Semantic Web Technologies This file is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) Knowledge Engineering with Semantic Web Technologies Lecture 5: Ontological Engineering 5.3 Ontology Learning

More information

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper.

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper. Semantic Web Company PoolParty - Server PoolParty - Technical White Paper http://www.poolparty.biz Table of Contents Introduction... 3 PoolParty Technical Overview... 3 PoolParty Components Overview...

More information

<is web> Information Systems & Semantic Web University of Koblenz Landau, Germany

<is web> Information Systems & Semantic Web University of Koblenz Landau, Germany Information Systems & University of Koblenz Landau, Germany Semantic Search examples: Swoogle and Watson Steffen Staad credit: Tim Finin (swoogle), Mathieu d Aquin (watson) and their groups 2009-07-17

More information

Large-scale, Parallel Automatic Patent Annotation

Large-scale, Parallel Automatic Patent Annotation Overview Large-scale, Parallel Automatic Patent Annotation Thomas Heitz & GATE Team Computer Science Dept. - NLP Group - Sheffield University Patent Information Retrieval 2008 30 October 2008 T. Heitz

More information

OwlExporter. Guide for Users and Developers. René Witte Ninus Khamis. Release 2.1 December 26, 2010

OwlExporter. Guide for Users and Developers. René Witte Ninus Khamis. Release 2.1 December 26, 2010 OwlExporter Guide for Users and Developers René Witte Ninus Khamis Release 2.1 December 26, 2010 Semantic Software Lab Concordia University Montréal, Canada http://www.semanticsoftware.info Contents 1

More information

Question Answering Using XML-Tagged Documents

Question Answering Using XML-Tagged Documents Question Answering Using XML-Tagged Documents Ken Litkowski ken@clres.com http://www.clres.com http://www.clres.com/trec11/index.html XML QA System P Full text processing of TREC top 20 documents Sentence

More information

UIMA-based Annotation Type System for a Text Mining Architecture

UIMA-based Annotation Type System for a Text Mining Architecture UIMA-based Annotation Type System for a Text Mining Architecture Udo Hahn, Ekaterina Buyko, Katrin Tomanek, Scott Piao, Yoshimasa Tsuruoka, John McNaught, Sophia Ananiadou Jena University Language and

More information

An Approach To Web Content Mining

An Approach To Web Content Mining An Approach To Web Content Mining Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol Department of Computer Engg. Datta Meghe College of Engineering, Airoli, Navi Mumbai Abstract-With the research

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

Toward a Knowledge-Based Solution for Information Discovery in Complex and Dynamic Domains

Toward a Knowledge-Based Solution for Information Discovery in Complex and Dynamic Domains Toward a Knowledge-Based Solution for Information Discovery in Complex and Dynamic Domains Eloise Currie and Mary Parmelee SAS Institute, Cary NC About SAS: The Power to Know SAS: The Market Leader in

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

0.1 Knowledge Organization Systems for Semantic Web

0.1 Knowledge Organization Systems for Semantic Web 0.1 Knowledge Organization Systems for Semantic Web 0.1 Knowledge Organization Systems for Semantic Web 0.1.1 Knowledge Organization Systems Why do we need to organize knowledge? Indexing Retrieval Organization

More information

Background and Context for CLASP. Nancy Ide, Vassar College

Background and Context for CLASP. Nancy Ide, Vassar College Background and Context for CLASP Nancy Ide, Vassar College The Situation Standards efforts have been on-going for over 20 years Interest and activity mainly in Europe in 90 s and early 2000 s Text Encoding

More information

Semantic annotation toolkit (first version)

Semantic annotation toolkit (first version) www.kconnect.eu Semantic annotation toolkit (first version) Deliverable number D1.1 Dissemination level Public Delivery date 31 October 2015 Status Author(s) Final Ian Roberts, Fredrik Axelsson, Zoltan

More information

A cocktail approach to the VideoCLEF 09 linking task

A cocktail approach to the VideoCLEF 09 linking task A cocktail approach to the VideoCLEF 09 linking task Stephan Raaijmakers Corné Versloot Joost de Wit TNO Information and Communication Technology Delft, The Netherlands {stephan.raaijmakers,corne.versloot,

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

Text Mining: A Burgeoning technology for knowledge extraction

Text Mining: A Burgeoning technology for knowledge extraction Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.

More information

Precise Medication Extraction using Agile Text Mining

Precise Medication Extraction using Agile Text Mining Precise Medication Extraction using Agile Text Mining Chaitanya Shivade *, James Cormack, David Milward * The Ohio State University, Columbus, Ohio, USA Linguamatics Ltd, Cambridge, UK shivade@cse.ohio-state.edu,

More information

A Linguistic Approach for Semantic Web Service Discovery

A Linguistic Approach for Semantic Web Service Discovery A Linguistic Approach for Semantic Web Service Discovery Jordy Sangers 307370js jordysangers@hotmail.com Bachelor Thesis Economics and Informatics Erasmus School of Economics Erasmus University Rotterdam

More information

Linked Data and cultural heritage data: an overview of the approaches from Europeana and The European Library

Linked Data and cultural heritage data: an overview of the approaches from Europeana and The European Library Linked Data and cultural heritage data: an overview of the approaches from Europeana and The European Library Nuno Freire Chief data officer The European Library Pacific Neighbourhood Consortium 2014 Annual

More information

Module 9: Ontologies and Semantic Annotation

Module 9: Ontologies and Semantic Annotation Module 9: Ontologies and Semantic Annotation The University of Sheffield, 1995-2012 This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike Licence About this tutorial This

More information

Syntax and Grammars 1 / 21

Syntax and Grammars 1 / 21 Syntax and Grammars 1 / 21 Outline What is a language? Abstract syntax and grammars Abstract syntax vs. concrete syntax Encoding grammars as Haskell data types What is a language? 2 / 21 What is a language?

More information

New Media Analysis Using Focused Crawl and Natural Language Processing: Case of Lithuanian News Websites

New Media Analysis Using Focused Crawl and Natural Language Processing: Case of Lithuanian News Websites New Media Analysis Using Focused Crawl and Natural Language Processing: Case of Lithuanian News Websites Tomas Krilavičius Žygimantas Medelis Jurgita Kapočiūtė-Dzikienė Tomas Žalandauskas Problem How to

More information

Maca a configurable tool to integrate Polish morphological data. Adam Radziszewski Tomasz Śniatowski Wrocław University of Technology

Maca a configurable tool to integrate Polish morphological data. Adam Radziszewski Tomasz Śniatowski Wrocław University of Technology Maca a configurable tool to integrate Polish morphological data Adam Radziszewski Tomasz Śniatowski Wrocław University of Technology Outline Morphological resources for Polish Tagset and segmentation differences

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Deliverable D1.4 Report Describing Integration Strategies and Experiments

Deliverable D1.4 Report Describing Integration Strategies and Experiments DEEPTHOUGHT Hybrid Deep and Shallow Methods for Knowledge-Intensive Information Extraction Deliverable D1.4 Report Describing Integration Strategies and Experiments The Consortium October 2004 Report Describing

More information

A Multilingual Social Media Linguistic Corpus

A Multilingual Social Media Linguistic Corpus A Multilingual Social Media Linguistic Corpus Luis Rei 1,2 Dunja Mladenić 1,2 Simon Krek 1 1 Artificial Intelligence Laboratory Jožef Stefan Institute 2 Jožef Stefan International Postgraduate School 4th

More information

Deliverable D Adapted tools for the QTLaunchPad infrastructure

Deliverable D Adapted tools for the QTLaunchPad infrastructure This document is part of the Coordination and Support Action Preparation and Launch of a Large-scale Action for Quality Translation Technology (QTLaunchPad). This project has received funding from the

More information

Jumpstarting the Semantic Web

Jumpstarting the Semantic Web Jumpstarting the Semantic Web Mark Watson. Copyright 2003, 2004 Version 0.3 January 14, 2005 This work is licensed under the Creative Commons Attribution-NoDerivs-NonCommercial License. To view a copy

More information

Getting Lost in Semantics Selecting the Right Search Engine

Getting Lost in Semantics Selecting the Right Search Engine Getting Lost in Semantics Selecting the Right Search Engine Steve Mann VP Sales Concept Searching stevem@conceptsearching.com Robert Piddocke VP Channel and Business Development Concept Searching mikep@conceptsearching.com

More information

Overview of Web Mining Techniques and its Application towards Web

Overview of Web Mining Techniques and its Application towards Web Overview of Web Mining Techniques and its Application towards Web *Prof.Pooja Mehta Abstract The World Wide Web (WWW) acts as an interactive and popular way to transfer information. Due to the enormous

More information

Data for linguistics ALEXIS DIMITRIADIS. Contents First Last Prev Next Back Close Quit

Data for linguistics ALEXIS DIMITRIADIS. Contents First Last Prev Next Back Close Quit Data for linguistics ALEXIS DIMITRIADIS Text, corpora, and data in the wild 1. Where does language data come from? The usual: Introspection, questionnaires, etc. Corpora, suited to the domain of study:

More information

TIC: A Topic-based Intelligent Crawler

TIC: A Topic-based Intelligent Crawler 2011 International Conference on Information and Intelligent Computing IPCSIT vol.18 (2011) (2011) IACSIT Press, Singapore TIC: A Topic-based Intelligent Crawler Hossein Shahsavand Baghdadi and Bali Ranaivo-Malançon

More information