Tutorial on Text Mining for the Going Digital initiative. Natural Language Processing (NLP), University of Essex

Similar documents
Module 1: Information Extraction

CSC 5930/9010: Text Mining GATE Developer Overview

Introduction to IE and ANNIE

University of Sheffield NLP. Exercise I

Module 3: GATE and Social Media. Part 4. Named entities

Implementing a Variety of Linguistic Annotations

OneNote. Using OneNote on the Desktop. Starting screen. The OneNote interface the Ribbon

University of Sheffield, NLP. Chunking Practical Exercise


Tutorial 8 Sharing, Integrating and Analyzing Data

Introduction to Information Extraction (IE) and ANNIE

Creating new Resource Types

Experiences with UIMA in NLP teaching and research. Manuela Kunze, Dietmar Rösner

Configure Eclipse - Part 2 - Settings and XML

BD003: Introduction to NLP Part 2 Information Extraction

feel free to poke around and change things. It's hard to break anything in a Moodle course, and even if you do it's usually easy to fix it.

@Note2 tutorial. Hugo Costa Ruben Rodrigues Miguel Rocha

FrontPage Help Center. Topic: FrontPage Basics

User Guide. Kronodoc Kronodoc Oy. Intelligent methods for process improvement and project execution

Quick Start Instructions for Using Postage $aver for Parcels for Mac with Filemaker Pro files

Krames On-Demand (KOD) v6 Content Export Module

Module 10: Advanced GATE Applications

Eclipse/Websphere. Page 1 Copyright 2004 GPL License. All rights reserved.

Dreamweaver is a full-featured Web application

University of Sheffield, NLP. Chunking Practical Exercise

Information Extraction with GATE

A bit of theory: Algorithms

ESRI stylesheet selects a subset of the entire body of the metadata and presents it as if it was in a tabbed dialog.

Sage CRM 2018 R3 Release Notes. Updated: August 2018

Module 3: Introduction to JAPE

My Site. Introduction

Getting Started with Web Services

Using the VMware vcenter Orchestrator Client. vrealize Orchestrator 5.5.1

ACCOUNT INFORMATION... 1

Research Tools: DIY Text Tools

StarTeam File Compare/Merge StarTeam File Compare/Merge Help

Module 2: Introduction to IE and ANNIE

Manage Files. Accessing Manage Files

Natural Language Interfaces to Ontologies. Danica Damljanović

Blackboard Learn 9.1 Reference Terminology elearning Blackboard Learn 9.1 for Faculty

25Live Custom Report Integration

Icons2go WordPress plugin manual

CAT User Manual for the NewsReader EU Project

Anchovy User Guide. Copyright Maxprograms

Once file and folders are added to your Module Content area you will need to link to them using the Item tool.

7.1. Browsing CHAPTER. The Internet. Web Browser software

Contents. List of Figures. List of Tables. Acknowledgements

University of Sheffield, NLP Annotation and Evaluation

How to add content to your course

TourMaker Reference Manual. Intro

with TestComplete 12 Desktop, Web, and Mobile Testing Tutorials

CEDMS User Guide

Job Aid. Remote Access BAIRS Printing and Saving a Report. Table of Contents

Lexis for Microsoft Office User Guide

The Start menu (overview)

EMC Documentum My Documentum Desktop (Windows)

Gender Info 2007 Getting Started

WA L KT H R O U G H 1

Manage and Generate Reports

Extracting and Storing PDF Form Data Into a Repository

Integrating Sintelix and ANB. Learn how to access and explore Sintelix networks in IBM i2 Analyst s Notebook

University of Sheffield, NLP Machine Learning

Getting Started with Web Services

Changing Languages (Localization)

Extended Search Administration

Story Workbench Quickstart Guide Version 1.2.0

Uploading a File in the Desire2Learn Content Area

Dreamweaver is a full-featured Web application

Advanced GATE Applications

HL7Spy 1.1 Getting Started

Introduction Add Item Add Folder Add External Link Add Course Link Add Test Add Selection Text Editing...

Getting Started with Mail Advanced Web Client. Mail is a full-featured messaging and collaboration application that offers reliable highperformance

Brainware Intelligent Capture

The Domino Designer QuickStart Tutorial

Editing and adding content to the deposit page

ecopy PaperWorks Connector for Microsoft SharePoint Administrator s Guide

OwlExporter. Guide for Users and Developers. René Witte Ninus Khamis. Release 2.1 December 26, 2010

DiscoveryGate Installation and Configuration

Report Commander 2 User Guide

Orckestra, Europe Nygårdsvej 16 DK-2100 Copenhagen Phone

GATE Teamware User Guide

Joomla Website User Guide

ACA-1095 Reporting Help Pro-Ware, LLC

InSite Prepress Portal Quick Start Guide IPP 9.0

The following issues and enhancements have been addressed in this release:

IBM Rational Rhapsody Gateway Add On. Customization Guide

Including Dynamic Images in Your Report

Desire2Learn eportfolio

Visualizing Venice Historic Environment Record (Geospatial Database)

Getting Started in CAMS Enterprise

Using the VMware vrealize Orchestrator Client

Vendio Stores WebDAV Setup & Access

ishipdocs User Guide

!!! !!!!!!!!!!! Help Documentation. Copyright V1.7. Copyright 2014, FormConnections, Inc. All rights reserved.

Password Memory 7 User s Guide

PISCES Installation and Getting Started 1

Sitecore guide building a blog

Archi - ArchiMate Modelling. What s New in Archi 4.x

Accessing Diagnostic Service Documentation for Non-Beckman Coulter Users

AVS4YOU Programs Help

Transcription:

Tutorial on Text Mining for the Going Digital initiative Natural Language Processing (NLP), University of Essex 6 February, 2013

Topics of This Tutorial o Information Extraction (IE) o Examples of IE systems o GATE (General Architecture for Text Engineering) Introduction to GATE developer Guided tour of the GATE GUI Language Resources, Datastores, Applications and Processing Resources Annotations ANNIE Tool, GATE's default IE system 2

Information Retrieval vs. Information Extraction Information Retrieval large text collections (Web) Documents 3

Information Retrieval vs. Information Extraction Getting Facts can be hard and slow. 4

Information Retrieval vs. Information Extraction Information Extraction Facts large text collections Structured Information 5

Named Entity Recognition o Cornerstone of IE o Identification of proper names in texts, o Classification them into a set of predefined categories Person Organisation (companies, government organisations, committees, etc) Location (cities, countries, rivers, etc) Date and time expressions o other types are frequently added, as appropriate to the Application. Emperors Historical events. Ships, etc. 6

Named Entity Recognition Person Location Organisation Year Mark lives in London. He has worked at University of London since 1990. 7

Co-reference Person Location Organisation Year Mark lives in London. He has worked at University of London since 1990. 8

Relations Person Location Organisation Year Mark lives in London. He has worked at University of London since 1990. Live_in 9

Relations Person Location Organisation Year Mark lives in London. He has worked at University of London since 1990. Employee_of 10

Relations Person Location Organisation Year Mark lives in London. He has worked at University of London since 1990. Based_in 11

Examples of IE systems 12

Health and Safety Information Extraction (HaSIE) 13

Obstetrics records 14

Guided tour to GATE GUI How to navigate the GATE GUI How to set up the different options Introduction to resources and parameters 15

ANNIE tool You can try ANNIE tool online at: http://services.gate.ac.uk/annie/ 16

How we will work on tutorial Each section in this tutorial will be explained first. After that enough time will be given to you to try yourself. red texts indicates your time to try experimenting with GATE. 17

Guided tour to GATE GUI Language resources (LRs) : documents Corpus: a collection of documents Data stores: specialised files where documents are kept for future use Resources Pane Processing resources (PRs) : annotation tools that operate on text within the documents Applications: groups of processes that run on one or more documents 18

19

Parameters Applications, LRs, and PRs all have various parameters. Parameters enable different settings to be used, e.g. case sensitivity Initialisation Parameters (set at load time): cannot be changed without reloading. Run time Parameters: can be changed between each application run 20

Display Pane 21

Display Pane Example of LRs 22

Setting up GATE options Enables you to adjust settings such as saving your options, and Change saving the the look session and so that when feel of you GATE, reopen such GATE, as it will remember menu and and text reload fonts the applications you had open at the end of your previous session 23

Try out GATE Open GATE Start All Programs GATE developer 7.0 Try setting different options in GATE Click Options Configuration Appearance Download the presentation file. Download the Hands-on-materials.zip file from https://sites.google.com/site/nlelab2013/gate Save the zipped file on desktop and unzip it. The folder Hands-on-materials contains all files required for the next experiments. 24

Loading and Viewing Documents Loading a document and setting its parameters. Creating a corpus. Populating a corpus of documents in different ways Removing documents. 25

Loading and Viewing Documents GATE can process documents in all kinds of formats: plain text, HTML, XML, PDF, Word etc. When GATE loads a document, it converts it into a special format for processing. Documents can be exported in various formats or saved in a datastore for future processing within GATE. 26

Loading Document 27

Document Initialisation parameters The sourceurl parameter enables you to specify the document to be loaded. You can type the filename or URL, or click the file browser icon to navigate to the correct document. 28

Document Initialisation parameters You can also just type a string of text into the box by selecting stringcontent rather than sourceurl. 29

Document Initialisation parameters Set to true to ensure GATE will process any existing annotations such as HTML tags and present them as annotations rather than leaving them in the text. 30

Opening and closing the document 31

Viewing the document 32

Experiment 2 Opening the document in GATE load the document Anna Comnena Alexiad.htm from handson-materials folder. right click on Language Resources and select New GATE Document or File menu New Language Resource GATE Document A dialogue box will appear. Leave the name input box empty. Click the file browser icon to navigate to the correct document. To view a document, double click on the document name in the Resources pane To view the annotations, you first need click Annotation Sets, and then select the relevant set and annotation(s) on the right hand side of the GUI To see a list of annotations at the bottom, click on Annotations List 33

Creating a corpus A corpus is a collection of documents. For most GATE applications, it is easier to work with a corpus rather than an individual document, even if that corpus only contains one document. 34

Corpus Initialisation Parameters Click the edit button [add icon] and add your document to the corpus 35

Another way to add documents to a corpus You can also create an empty corpus and then add documents to it, if these documents are already loaded in GATE ((2)) Press on add icon ((1)) Double click on the corpus 36

Removing documents from a corpus (3)) Press on delete icon ((1)) Double click on the corpus ((2)) Select the document you want to delete 37

Removing documents from a corpus To remove documents from a corpus, use the X button in the corpus editor Note that this does not remove the document from GATE, just from the corpus The document is available to be added to other corpora. Indeed a document can belong to several corpora If you do remove the document from GATE, it will also remove it from the corpus But if you remove the corpus, it doesn't remove the document! 38

Populate the corpus in one go Using the populate function means you don't have to preload the documents in GATE first, and allows you to load all the documents into the corpus in one go. 39

Populate the corpus in one go Extensions parameter: lets you select only documents of a certain type e.g., xml or htm files. Encoding: lets you choose the right encoding for the documents. The wrong encoding can cause characters to be incorrectly displayed e.g., UTF-8 Recurse directories : load documents in any subdirectories 40

Experiment 3: creating a corpus and populate it Create a corpus Right click Language Resources New GATE Corpus. A dialogue box will appear. Leave a name input box empty and press OK. Right click on the created corpus in the Resources pane and select Populate. A dialogue box will appear Use the file browser icon to select the name of the directory in which your documents are stored In Extensions parameter,type htm in the box (without the quotes) In Encoding parameter, enter UTF-8. Press OK all the documents will be loaded in one go Double click the corpus to view it. 41

Preprocessing Resources Loading processing resources. Managing plugins. Loading and running ANNIE and pre-existing applications Creating a new application 42

Preprocessing Resources Processing resources (PRs) are the tools that creates or modifies annotations on the text.they implement algorithms. An application consists of any number of PRs, run sequentially over a corpus of documents A plugin is a collection of one or more PRs, bundled together. For example, all the PRs needed for IE in Arabic are found in the Lang_Arabic plugin. An application can contain PRs from one or more different plugins. In order to access new PRs, you need to load the relevant plugin 43

Plugins To open Plugin Manager: 44

Plugins 45

Plugins 46

Applications Loading and running ANNIE and pre-existing applications Creating a new application 47

ANNIE Tool ANNIE is a readymade collection of PRs that performs IE on unstructured text. To use ANNIE tool: 48

Running ANNIE Tool Double click on the ANNIE application to view it. PRs in order of their execution Corpus on which the application is executed Runtime parameters of the selected PR Execute the application 49

ANNIE Tool 50

Viewing the results ((1)) Double click on the ((3)) document to view it ((2)) click on any Annotation Select Annotation Lists types in the Default and Annotation Sets (unnamed) set 51

Viewing the results 52

change the name of the annotation set Now we're going to change the name of the annotation set, so that all ANNIE annotations appear in a new set called ANNIEresult The annotation set where the results are stored is one of the runtime parameters of the PRs 53

change the name of the annotation set For each PR listed, click on it and check whether it has any parameters labelled annotationsetname, inputasname or outputasname Edit all of these by typing ANNIEresult in the box. 54

change the name of the annotation set 55

Add a new Processing Resources Load a plugin called Tools using Plugins Manager. Right click on Processing Resources in the Resources Pane and select New ANNIE VP Chunker Double click on ANNIE. You'll see the VP chunker in the list of loaded PRs. This means it's available in GATE, but isn't yet contained in the application. Add it to the application by selecting it and using the right arrow to transfer it. Now use the up arrow to move it to the right place in the application. It should go after (below) the POS tagger but before (above) the NE transducer. Change the inputasname and outputasname parameters to ANNIEresult. Run the application and view the results on the document. You should see a new annotation type VG. 56

Saving documents Using datastores Saving documents for use outside GATE 57

Saving documents There are 2 types of datastore Serial datastores store data directly in a directory Lucene datastores provide a searchable repository with Lucene-based indexing 58

Create a new serial datastore 59

Create a new serial datastore Right click Datastores from the Resources pane and select Create Datastore Select Serial Datastore Create a new empty directory by clicking the Create New Folder icon and give your new directory a name Select this directory and click Open Now your datastore is ready to store your documents 60

Save documents to the datastore Right click on your corpus and select Save to Datastore Select the datastore that you just created Now close the corpus and document Double click on the name of the datastore in the Resources pane You should see the corpus and document Double click on them to load them back into GATE and view them 61

remove things from the datastore ((1)) Double click on the name of the datastore ((2)) Double click on the name of corpus or documents and then select delete 62

Saving documents outside GATE If you want to use your documents outside GATE, you can save them in 2 ways: as standoff markup, in a special GATE representation as inline annotations (preserving the original format) Both formats are XML-based. However save as xml refers to the first option, while save preserving format refers to the second option. 63

Saving documents outside GATE <Person> Ravenna </person> 64

Saving as XML 65

Save preserving format 66

References http://gate.ac.uk/sale/tao/split.html Introduction to GATE Developer [PDF document]. Retrieved Web site: http://gate.ac.uk You can download the program from: http://gate.ac.uk/download/ 67