Experiences with UIMA in NLP teaching and research. Manuela Kunze, Dietmar Rösner

1 Experiences with UIMA in NLP teaching and research
Manuela Kunze, Dietmar Rösner
University of Magdeburg, Knowledge Based Systems and Document Processing

2 Overview
- What is UIMA?
- First Experiments
- NLP Teaching
- Conclusion

3 UIMA: Unstructured Information Management Architecture
- a software architecture for developing and deploying unstructured information management (UIM) applications
- UIM application: a software system that analyses large volumes of unstructured information to discover, organize, and deliver relevant knowledge to the end user
- the software architecture specifies component interfaces, data representations, ...

4 UIMA: Unstructured Information Management Architecture
- Collection Reader: interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.
- CAS Initializer: may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from <P> tags in the original HTML) into the CAS.
- Analysis Engine: takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.
- CAS Consumers: consume the enriched CAS that was produced by the sequence of Analysis Engines before it, and produce an application-specific data structure, such as a search engine index or database.
- CAS: Common Analysis Structure
- CPE: Collection Processing Engine
[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
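
The flow of a CAS through reader, analysis engine, and consumer can be pictured with a small sketch. This is plain Python, not the UIMA API: the `Cas` class, the function names, and the file name `egypt.txt` are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Cas:
    """Toy stand-in for a CAS: document text plus annotations and metadata."""
    text: str
    annotations: list = field(default_factory=list)  # (type, begin, end) tuples
    metadata: dict = field(default_factory=dict)


def collection_reader(documents):
    """Yield one Cas per (name, text) pair, in the role of a Collection Reader."""
    for name, text in documents:
        yield Cas(text=text, metadata={"source": name})


def museum_engine(cas):
    """Analysis step: annotate every occurrence of the word 'Museum'."""
    start = cas.text.find("Museum")
    while start != -1:
        cas.annotations.append(("MuseumKeyword", start, start + len("Museum")))
        start = cas.text.find("Museum", start + 1)
    return cas


def index_consumer(cases):
    """Consumer step: build an application-specific index (source -> annotated strings)."""
    index = {}
    for cas in cases:
        index[cas.metadata["source"]] = [cas.text[b:e] for _, b, e in cas.annotations]
    return index


docs = [("egypt.txt", "The Egyptian Museum is open the hours: 9am-5pm daily")]
index = index_consumer(museum_engine(cas) for cas in collection_reader(docs))
print(index)
```

The consumer sees only the enriched CAS objects, which is the separation of roles the slide describes.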

5 UIMA: Unstructured Information Management Architecture
Analysis Engine (AE):
- a component that analyzes artifacts (e.g. documents) and infers information about them
- consists of two parts:
  - Java classes (typically packaged as one or more JAR files)
  - AE descriptors (one or more XML files): the configuration settings for the Analysis Engine as well as a description of the AE's input and output requirements

6 UIMA: Unstructured Information Management Architecture
[diagram: how the parts of an analysis engine relate]
- XML analysis engine descriptor: describes the analysis engine (annotator class, input parameters, output annotations, external resources); linked to a type system; defines an annotator
- Java annotator: uses processing resources and, through an interface, external resources
- type system: defines each annotation type (name; features such as begin, end, ...); used to create the annotation interface
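
As a rough picture of what an annotation type with `begin` and `end` features amounts to, here is a minimal sketch. It is plain Python rather than a UIMA type system descriptor, and the class name `Annotation` and the method `covered_text` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """An annotation type: a name plus begin/end offsets into the document."""
    type_name: str
    begin: int
    end: int

    def covered_text(self, document: str) -> str:
        """Return the stretch of the document that this annotation covers."""
        return document[self.begin:self.end]


doc = "The Military Museum is open"
begin = doc.index("Military Museum")
museum = Annotation("Museum", begin, begin + len("Military Museum"))
print(museum.covered_text(doc))
```

Storing offsets instead of copied strings lets several annotators mark the same text without altering it.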

7 UIMA: Unstructured Information Management Architecture
Aggregate Analysis Engine: combines different analysis engines within one Analysis Engine
[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
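
The idea of an aggregate can be sketched as function composition: an aggregate is itself an engine that runs its delegates in order. This is an illustrative Python sketch, not the UIMA mechanism (which uses XML descriptors); `make_aggregate` and both engines are hypothetical names.

```python
import re


def make_aggregate(*engines):
    """Combine several engines into one engine that runs them in sequence."""
    def aggregate(cas):
        for engine in engines:
            cas = engine(cas)
        return cas
    return aggregate


def word_engine(cas):
    """Annotate every word with its begin/end offsets."""
    for m in re.finditer(r"\w+", cas["text"]):
        cas["annotations"].append(("Word", m.start(), m.end()))
    return cas


def capitalized_engine(cas):
    """Depends on Word annotations: mark the words that start with a capital letter."""
    for kind, b, e in list(cas["annotations"]):
        if kind == "Word" and cas["text"][b].isupper():
            cas["annotations"].append(("Capitalized", b, e))
    return cas


pipeline = make_aggregate(word_engine, capitalized_engine)
result = pipeline({"text": "Palace Museum is open", "annotations": []})
caps = [result["text"][b:e] for kind, b, e in result["annotations"] if kind == "Capitalized"]
print(caps)
```

Because the aggregate has the same call shape as a single engine, aggregates can be nested inside other aggregates.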

8 Overview
- Introduction
- First Experiments
- NLP Teaching
- Conclusion

9 First Experiments: UIMA vs. GATE
- baseline: 2 persons, 2 systems, 1 corpus and 1 extraction task
- skills/experiences of the persons:
  [table: rows Person 1, Person 2; columns UIMA, GATE, Eclipse/Java]

10 Task of the Experiment
- process a corpus of web sites to detect and extract information relevant for tourists: opening times of museums, prices of hotels, ...
- corpus:
  - 30 tourism web sites about Egypt
  - an additional 20 web sites about Washington, New York, London
- output: Prolog facts for a reasoner
- questions: Which museum is open now?
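
The kind of question the reasoner answers can be illustrated with a small stand-in. The slides pass the facts to a Prolog reasoner; the Python below is only a sketch of the same check, the function name `open_museums` is an assumption, and day restrictions are ignored. The opening hours are taken from excerpts quoted elsewhere in this deck.

```python
from datetime import time

# (museum, opens, closes), mirroring the shape of the extracted facts
opening_hours = [
    ("Egyptian Museum", time(9, 0), time(17, 0)),
    ("Fraunces Tavern Museum", time(12, 0), time(17, 0)),
]


def open_museums(facts, now):
    """Return the museums whose opening interval contains the given time."""
    return [name for name, opens, closes in facts if opens <= now < closes]


print(open_museums(opening_hours, time(10, 30)))
print(open_museums(opening_hours, time(13, 0)))
```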

11 Evaluation Topics/Points
- ease of getting acquainted with the system?
  - quality of documentation: completeness, clarity, up-to-date?
  - tutorials, use cases?
- processing and linguistic resources?
  - lexica, gazetteer lists, tools
- tools for resource maintenance and extension?
  - quality: self-explanatory, robust, comfortable
- speed of processing?
- single document vs. large corpora?
- limitations, suggestions for improvement?
- support for im-/export of a variety of document formats?

12 Excerpts from the Corpus
- The Egyptian Museum is open the hours: 9am-5pm daily
- The Military Museum is open the hours: Summer: 8am-5:30pm; winter: 8am-4:30pm
- Palace Museum is open the hours: 8am-5:30pm (summer) 8am-4:30pm (winter)
- 10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri

13 UIMA Application
- several annotators (like a pipeline), based on regular expressions
- input excerpt: *Fraunces Tavern Museum* 54 Pearl St. - 1-212-425-1778 Tuesday-Friday, 12pm-5pm;
- annotation steps:
  - museum pattern, time pattern, restrictions (regular expressions)
  - interval of times: window covering two time intervals and a restriction
  - museum information: window covering a museum and opening hours
- Prolog facts:
  museumopen('fraunces Tavern Museum ', ' T12:00:00',' T17:00:00').
  museumopen('fraunces Tavern Museum ', ' T12:00:00',' T17:00:00').
  museumopen('fraunces Tavern Museum ', ' T12:00:00',' T17:00:00').
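
The step from a time pattern to a Prolog fact can be sketched with two regular expressions. This is not the authors' annotator code: the patterns, the helper names `to_24h` and `prolog_facts`, and the decision to emit times without a date part are assumptions for illustration.

```python
import re

# A museum name is assumed to be marked with asterisks, as in the excerpt.
MUSEUM = re.compile(r"\*([^*]+)\*")
# Two clock times joined by a hyphen, e.g. "12pm-5pm" or "8am- 5:30pm".
INTERVAL = re.compile(
    r"(\d{1,2})(?::(\d{2}))?\s*(am|pm)\s*-\s*(\d{1,2})(?::(\d{2}))?\s*(am|pm)",
    re.IGNORECASE,
)


def to_24h(hour, minute, meridiem):
    """Convert a 12-hour clock time to a 'Thh:mm:ss' string."""
    hour = int(hour) % 12 + (12 if meridiem.lower() == "pm" else 0)
    return f"T{hour:02d}:{int(minute or 0):02d}:00"


def prolog_facts(text):
    """Emit one museumopen/3 fact per time interval found in the text."""
    museum = MUSEUM.search(text)
    if museum is None:
        return []
    facts = []
    for m in INTERVAL.finditer(text):
        opens = to_24h(m.group(1), m.group(2), m.group(3))
        closes = to_24h(m.group(4), m.group(5), m.group(6))
        facts.append(f"museumopen('{museum.group(1)}', '{opens}', '{closes}').")
    return facts


excerpt = "*Fraunces Tavern Museum* 54 Pearl St. Tuesday-Friday, 12pm-5pm;"
for fact in prolog_facts(excerpt):
    print(fact)
```

The restriction ("Tuesday-Friday") is left out here; a further pattern over weekday names would be needed to capture it.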

14 UIMA: Results
- information annotated in the documents:
  - names of museums, hotels
  - times, time intervals
  - time restrictions
  - prices, intervals of prices (hotel prices)
  - keywords for museum category
  - names of pharaohs (annotated with a correction of misspellings)
- information about hotels and museums is exported into Prolog facts and into a short textual summary
- templates filled with the detected information:
  - hotels: Price information about Cosmopolitan Hotel : $157
  - museums: *** *Fraunces Tavern Museum* *** Open from 12:00:00 to 17:00:00; Restriction: Tuesday-Friday
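
Filling the summary templates can be sketched as simple string formatting. The two function names are hypothetical; the output wording follows the two examples on the slide.

```python
def hotel_summary(name, price):
    """Fill the hotel template with a detected name and price."""
    return f"Price information about {name} : {price}"


def museum_summary(name, opens, closes, restriction=None):
    """Fill the museum template; the restriction is appended only if one was detected."""
    line = f"*** *{name}* *** Open from {opens} to {closes}"
    if restriction:
        line += f"; Restriction: {restriction}"
    return line


print(hotel_summary("Cosmopolitan Hotel", "$157"))
print(museum_summary("Fraunces Tavern Museum", "12:00:00", "17:00:00", "Tuesday-Friday"))
```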

15 UIMA vs. GATE: Conclusion
- no final judgement on whether to use GATE or UIMA
- the choice depends on your task:
  - task description
  - expected results
  - which processing resources are necessary
- and on your preferences for the interface:
  - prefer the Eclipse environment (or other Java editors)
  - prefer a comfortable GUI

16 UIMA vs. GATE: Conclusion
- GATE:
  - tools available
  - comfortable GUI
- UIMA:
  - plain framework
  - simplified definition of (complex) result structures
  - simplified pre- and postprocessing of annotations
- both are extensible, e.g. for processing German documents

17 'German' Extension of Processing Resources
- XDOC document suite:
  - tools for processing German documents
  - tools implemented in Common Lisp
- for UIMA:
  - Java reimplementation of the tools
  - several analysis engines

18 XDOC in UIMA
annotation of:
- part-of-speech (Morphix, heuristics)
- semantic categories
- named entities (vehicles, cities, ...)
- a coarse approach for classification of PPs using a maxent library

19 UIMA: Evaluation
- documentation? good
  - illustrative examples (tutorial)
  - completeness: some parts are described only very briefly
  - experience with Eclipse and Java programming is an advantage
- processing and linguistic resources?
- tools for resource maintenance and extension?
- speed of processing?
- single docs vs. large corpora?
- limitations, suggestions for improvement?
- im-/export of document formats?

20 UIMA: Evaluation
- documentation?
- processing and linguistic resources?
  - annotators only from the tutorial:
    - sentence annotation
    - word annotation
    - date/time annotators
    - examples for using regular expressions etc.
  - external resources can be integrated:
    - lexical resources as external resources (text files)
    - existing processing resources: implementation of an interface is necessary
- tools for resource maintenance and extension?
- speed of processing?
- single docs vs. large corpora?
- limitations, suggestions for improvement?
- im-/export of document formats?
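
Using a text file as a lexical resource can be sketched as a gazetteer lookup. This is plain Python rather than UIMA's external resource mechanism; the in-memory `io.StringIO` object stands in for a text file with one entry per line, and `load_gazetteer` and `annotate` are hypothetical names.

```python
import io
import re


def load_gazetteer(lines):
    """Read one entry per line, skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]


def annotate(text, entries, type_name):
    """Return (type, begin, end) for every occurrence of a gazetteer entry."""
    spans = []
    for entry in entries:
        for m in re.finditer(re.escape(entry), text):
            spans.append((type_name, m.start(), m.end()))
    return sorted(spans, key=lambda span: span[1])


resource = io.StringIO("Egyptian Museum\n\nMilitary Museum\n")  # stand-in for a text file
entries = load_gazetteer(resource)
text = "The Egyptian Museum is open the hours: 9am-5pm daily"
found = [text[b:e] for _, b, e in annotate(text, entries, "Museum")]
print(found)
```

Keeping the list outside the code means the lexicon can be extended with a simple text editor.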

21 UIMA: Evaluation
- documentation?
- processing and linguistic resources?
- tools for resource maintenance and extension?
  - specific Eclipse component editors, or
  - simple text editors
- speed of processing?
- single docs vs. large corpora?
- limitations, suggestions for improvement?
- im-/export of document formats?

22 UIMA: Evaluation
- documentation
- processing and linguistic resources
- tools for resource maintenance and extension?
- speed of processing?
  - faster than GATE?
  - the CPE gives detailed information about the processing time of each module
- single docs vs. large corpora?
- limitations, suggestions for improvement?
- im-/export of document formats?
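
Per-module timing of the kind the CPE reports can be approximated in a few lines. This sketch does not reproduce the CPE's report; `run_timed` and the two toy stages are assumptions for illustration.

```python
import time


def run_timed(stages, document):
    """Run named stages in order and record the wall-clock time of each."""
    timings = {}
    data = document
    for name, stage in stages:
        started = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - started
    return data, timings


stages = [
    ("tokenize", str.split),
    ("lowercase", lambda tokens: [token.lower() for token in tokens]),
]
tokens, timings = run_timed(stages, "Open Daily")
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```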

23 UIMA: Evaluation
- documentation
- processing and linguistic resources
- tools for resource maintenance and extension?
- speed of processing?
- single docs vs. large corpora?
  - Collection Reader: document(s) from a directory
- limitations, suggestions for improvement?
- im-/export of document formats?

24 UIMA: Evaluation
- documentation
- processing and linguistic resources
- tools for resource maintenance and extension?
- speed of processing?
- single docs vs. large corpora?
- limitations, suggestions for improvement?
  - no limitations: everything is possible, but the user has to implement it or write the interface
  - wish: more processing and linguistic resources within the distribution
- im-/export of document formats?

25 UIMA: Evaluation
- documentation
- processing and linguistic resources
- tools for resource maintenance and extension?
- speed of processing?
- single docs vs. large corpora?
- limitations, suggestions for improvement?
- im-/export of document formats?
  - import: CAS Initializer
  - export: CAS Consumer
    - transform annotations into any other format
    - export of document + annotations, or of annotations only
    - required: Java application
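
The two export variants (document plus annotations, or annotations only) can be sketched as a small consumer that writes JSON. JSON is chosen here only as an example target format; the function name and the field names are assumptions.

```python
import json


def export_annotations(text, annotations, include_document=True):
    """Serialize annotations, optionally together with the document text."""
    payload = {
        "annotations": [
            {"type": kind, "begin": b, "end": e, "text": text[b:e]}
            for kind, b, e in annotations
        ]
    }
    if include_document:
        payload["document"] = text
    return json.dumps(payload, sort_keys=True)


text = "The Egyptian Museum is open"
annotations = [("Museum", 4, 19)]
print(export_annotations(text, annotations))
print(export_annotations(text, annotations, include_document=False))
```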

26 Overview
- Introduction
- First Experiments
- NLP Teaching
- Conclusion

27 NLP Teaching
- course: Information Extraction
- aim of the course: to acquaint our students with information extraction as a basic NLP technology
- systems: UIMA, GATE
- students: computer science, data-knowledge engineering
- skills of the students: programming in Java

28 NLP Teaching
- different corpora: news about the FIFA World Cup 2006 in Germany, descriptions of drugs, announcements of new books, ...
- tasks for students: to develop different analysis engines and combine them
  - for annotation of URLs, addresses, names of players, results of games, ...
  - using regular expressions, external resources, maximum entropy models
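
A student-style annotator for two of the listed targets, URLs and results of games, might look like the sketch below. The two patterns are deliberately simple assumptions, not the course's reference solution, and the sample sentence with its `example.org` address is made up for illustration.

```python
import re

URL = re.compile(r"https?://[^\s<>\"']+")
SCORE = re.compile(r"\b(\d{1,2}):(\d{1,2})\b")  # also matches clock times such as 12:00


def annotate(text):
    """Return (type, begin, end) spans for URLs and game results, in text order."""
    spans = [("URL", m.start(), m.end()) for m in URL.finditer(text)]
    spans += [("Result", m.start(), m.end()) for m in SCORE.finditer(text)]
    return sorted(spans, key=lambda span: span[1])


sample = "Germany beat Costa Rica 4:2, see http://www.example.org/match1 for details."
found = [(kind, sample[b:e]) for kind, b, e in annotate(sample)]
print(found)
```

Telling results apart from clock times would need context, for instance the team names on either side, which is where external resources come in.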

29 NLP Teaching

30 UIMA: A Student's View
- easy to handle
- Java programming (environment)
- problems of students: understanding the dependencies between the several descriptors
- helpful for teaching (future work): a 'comparator' of the students' different solutions
  - which solution is the best, relative to a 'master' solution
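
One way such a 'comparator' could work is to score each student's annotations against the master solution with precision, recall, and F1 over exact span matches. The slides only name the idea as future work, so the scoring scheme, the function name `compare`, and the sample spans are all assumptions.

```python
def compare(student, master):
    """Score a student's (type, begin, end) spans against the master solution."""
    student, master = set(student), set(master)
    hits = len(student & master)
    precision = hits / len(student) if student else 0.0
    recall = hits / len(master) if master else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


master = [("Museum", 4, 19), ("Time", 39, 46)]
solutions = {
    "student_a": [("Museum", 4, 19), ("Time", 39, 42)],  # one span boundary is wrong
    "student_b": [("Museum", 4, 19), ("Time", 39, 46)],
}
ranking = sorted(solutions, key=lambda name: compare(solutions[name], master)[2], reverse=True)
print(ranking)
```

Exact matching is strict; a more lenient comparator could give partial credit for overlapping spans.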

31 Overview
- Introduction
- First Experiments
- NLP Teaching
- Conclusion

32 Conclusion
UIMA:
- easy to learn and to handle
- supports the management of
  - different annotations
  - different processing resources
- integration of external resources (processing resources as well as lexical resources)
- splitting of 'processing steps': reader, initializer, analysis engine, consumer
'wish-list':
- a kind of JAPE transducer (an interface to GATE's processing resources is available)
- 'comparator' for evaluation of solutions
