Unstructured Information Management Architecture (UIMA)
Graham Wilcock
University of Helsinki


Overview
- What is UIMA?
  - A framework for NLP tasks and tools
- Part-of-Speech Tagging
- Full Parsing
- Shallow Parsing
- More about UIMA

What is UIMA?
- An acronym: Unstructured Information Management Architecture
- A framework for NLP tasks and tools
- Originally from IBM (IBM UIMA)
- Now open source (Apache UIMA)
- What is a framework?

A Quick Finnish Lesson
- Finnish "uima" = English "swim"
- Uimahalli: swimming hall
- Uimakurssi: swimming course
- Uimamestari: swimming master

Overview
- What is UIMA?
- Part-of-Speech Tagging
  - With OpenNLP
  - With OpenNLP and UIMA
- Full Parsing
- Shallow Parsing
- More about UIMA

What is OpenNLP?
- A set of NLP tools
- Open source Java
- University of Pennsylvania (Tom Morton)
- English, Spanish, German, Thai
- What is a toolkit?

Annotation Tools
- OpenNLP sentence detector
- OpenNLP tokenizer
- OpenNLP POS tagger
- OpenNLP chunker
- OpenNLP parser
- OpenNLP name finder
- OpenNLP coreferencer

Annotation Models
- OpenNLP English sentence detector model
- OpenNLP English tokenizer model
- OpenNLP English POS tagger model
- OpenNLP English chunker model
- OpenNLP English parser models
- OpenNLP English named entity models
- OpenNLP English coreferencer models

POS Tagging with OpenNLP

    export OPENNLP_HOME=~gwilcock/Tools/opennlp
    export CLASSPATH=.:\
    $OPENNLP_HOME/lib/opennlp-tools jar:\
    $OPENNLP_HOME/lib/maxent jar:\
    $OPENNLP_HOME/lib/trove.jar

    java opennlp.tools.lang.english.SentenceDetector \
    $OPENNLP_HOME/models/english/sentdetect/EnglishSD.bin.gz

    java opennlp.tools.lang.english.Tokenizer \
    $OPENNLP_HOME/models/english/tokenize/EnglishTok.bin.gz

    java opennlp.tools.lang.english.PosTagger -d \
    $OPENNLP_HOME/models/english/parser/tagdict \
    $OPENNLP_HOME/models/english/parser/tag.bin.gz

OpenNLP POS Tagger Format

    My/PRP$ mistress/NN '/POS eyes/NNS are/VBP nothing/NN like/IN the/DT sun/NN ,/,
    Coral/NNP is/VBZ far/RB more/RBR red/JJ than/IN her/PRP$ lips/NNS red/JJ ./.
    If/IN snow/NN be/VB white/JJ ,/, why/WRB then/RB her/PRP$ breasts/NNS are/VBP dun/VBN ,/,
    If/IN hairs/NNS be/VB wires/NNS ,/, black/JJ wires/NNS grow/VB on/IN her/PRP$ head/NN ./.
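The word/TAG format above is easy to read back into a program. The following is a minimal sketch, not part of OpenNLP: the class name `TaggedTokens` and its nested `TaggedToken` type are hypothetical, and the code assumes only that each whitespace-separated item ends in `/TAG`. It splits on the last slash so that a token that itself contains a slash keeps its text intact.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits "word/TAG" tagger output into (word, tag) pairs. */
final class TaggedTokens {

    /** One token together with its part-of-speech tag. */
    static final class TaggedToken {
        final String word;
        final String tag;

        TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Parses one line of whitespace-separated word/TAG items. */
    static List<TaggedToken> parse(String line) {
        List<TaggedToken> result = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // an empty line yields no tokens
            }
            // Split on the LAST slash: the tag never contains one, the word might.
            int slash = item.lastIndexOf('/');
            if (slash < 0) {
                throw new IllegalArgumentException("No tag found in: " + item);
            }
            result.add(new TaggedToken(item.substring(0, slash), item.substring(slash + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (TaggedToken t : parse("My/PRP$ mistress/NN '/POS eyes/NNS")) {
            System.out.println(t.word + "\t" + t.tag);
        }
    }
}
```

Punctuation works the same way: the item `,/,` becomes the word `,` with the tag `,`.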

POS Tagging with UIMA
- Run OpenNLP POS tagger in UIMA
- Also needs OpenNLP sentence detector and OpenNLP tokenizer
- Each tool needs a UIMA wrapper
  - These wrappers are provided by UIMA
- Tools run in primitive analysis engines
- Combined in an aggregate analysis engine

OpenNLP Script (to compare)

    export OPENNLP_HOME=~gwilcock/Tools/opennlp
    export CLASSPATH=.:\
    $OPENNLP_HOME/lib/opennlp-tools jar:\
    $OPENNLP_HOME/lib/maxent jar:\
    $OPENNLP_HOME/lib/trove.jar

    java opennlp.tools.lang.english.SentenceDetector \
    $OPENNLP_HOME/models/english/sentdetect/EnglishSD.bin.gz

    java opennlp.tools.lang.english.Tokenizer \
    $OPENNLP_HOME/models/english/tokenize/EnglishTok.bin.gz

    java opennlp.tools.lang.english.PosTagger -d \
    $OPENNLP_HOME/models/english/parser/tagdict \
    $OPENNLP_HOME/models/english/parser/tag.bin.gz

Overview
- What is UIMA?
- Part-of-Speech Tagging
- Full Parsing
  - With OpenNLP
  - With OpenNLP and UIMA
- Shallow Parsing
- More about UIMA

Full Parsing with OpenNLP

    export OPENNLP_HOME=~gwilcock/Tools/opennlp
    export CLASSPATH=.:\
    $OPENNLP_HOME/lib/opennlp-tools jar:\
    $OPENNLP_HOME/lib/maxent jar:\
    $OPENNLP_HOME/lib/trove.jar

    java opennlp.tools.lang.english.SentenceDetector \
    $OPENNLP_HOME/models/english/sentdetect/EnglishSD.bin.gz

    java opennlp.tools.lang.english.Tokenizer \
    $OPENNLP_HOME/models/english/tokenize/EnglishTok.bin.gz

    java opennlp.tools.lang.english.TreebankParser -d \
    $OPENNLP_HOME/models/english/parser

OpenNLP Parser Format

    (TOP (S (S (NP (NP (PRP$ My) (NN mistress) (POS ')) (NNS eyes)) (VP (VBP are) (NP (NP (NN nothing)) (PP (IN like) (NP (DT the) (NN sun)))))) (, ,) (NP (NNP Coral)) (VP (VBZ is) (ADJP (ADJP (ADVP (RB far) (RBR more)) (JJ red)) (PP (IN than) (NP (NP (PRP$ her) (NNS lips)) (ADJP (JJ red)))))) (. .)))

    (TOP (SBARQ (SBAR (IN If) (S (NP (NN snow)) (VP (VB be) (ADJP (JJ white))))) (, ,) (WHADVP (WRB why)) (SQ (SBAR (RB then) (S (S (NP (PRP$ her) (NNS breasts)) (VP (VBP are) (VP (VBN dun)))) (, ,) (S (SBAR (IN If) (S (NP (NNS hairs)) (VP (VB be) (NP (NNS wires))))) (, ,) (NP (JJ black) (NNS wires)) (VP (VBP grow) (PP (IN on) (NP (PRP$ her) (NN head)))))))) (. .)))

Full Parsing with UIMA
- Run OpenNLP parser in UIMA
- Also needs OpenNLP sentence detector and OpenNLP tokenizer
- Wrappers provided by UIMA
- Add parser to aggregate analysis engine

Toolkits vs. Frameworks
- Toolkit
  - The tools support a defined API
  - Your application calls the tools
- Framework
  - Your components must support the API defined by the framework
  - The framework calls your components
- Hollywood? "Don't call us, we'll call you"
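The difference in who calls whom can be illustrated with a few lines of plain Java. This is only a sketch under stated assumptions: `UpperCaseTool`, `Component` and `MiniFramework` are hypothetical names invented for the illustration and have nothing to do with the real OpenNLP or UIMA APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of toolkit-style versus framework-style control flow. */
final class ToolkitVsFramework {

    /** Toolkit style: a tool with its own API, called directly by the application. */
    static final class UpperCaseTool {
        String process(String text) {
            return text.toUpperCase(Locale.ROOT);
        }
    }

    /** Framework style: the framework defines the API that components must implement. */
    interface Component {
        String process(String text);
    }

    /** The framework owns the control flow and calls each registered component in turn. */
    static final class MiniFramework {
        private final List<Component> components = new ArrayList<>();

        MiniFramework register(Component component) {
            components.add(component);
            return this;
        }

        String run(String text) {
            String current = text;
            for (Component component : components) {
                current = component.process(current); // "we'll call you"
            }
            return current;
        }
    }

    public static void main(String[] args) {
        // Toolkit: the application decides when and how to call the tool.
        System.out.println(new UpperCaseTool().process("uima"));

        // Framework: the application only registers components; the framework calls them.
        MiniFramework framework = new MiniFramework()
                .register(String::trim)
                .register(text -> text + "!");
        System.out.println(framework.run("  uima "));
    }
}
```

In the framework case the application never invokes `process` itself, which is the point of the Hollywood slogan.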

Overview
- What is UIMA?
- Part-of-Speech Tagging
- Full Parsing
- Shallow Parsing
  - Chunking with OpenNLP
  - Chunking with OpenNLP and UIMA
- More about UIMA

Chunking with OpenNLP
(Same CLASSPATH as tagging and parsing)

    java opennlp.tools.lang.english.SentenceDetector \
    $OPENNLP_HOME/models/english/sentdetect/EnglishSD.bin.gz

    java opennlp.tools.lang.english.Tokenizer \
    $OPENNLP_HOME/models/english/tokenize/EnglishTok.bin.gz

    java opennlp.tools.lang.english.PosTagger -d \
    $OPENNLP_HOME/models/english/parser/tagdict \
    $OPENNLP_HOME/models/english/parser/tag.bin.gz

    java opennlp.tools.lang.english.TreebankChunker \
    $OPENNLP_HOME/models/english/chunker/EnglishChunk.bin.gz

OpenNLP Chunker Format
- CoNLL-2000 IOB format
  - Inside chunk (I-NP, I-VP, I-PP)
  - Outside chunk (O)
  - Begin chunk (B-NP, B-VP, B-PP)
- Chunk labels attached to tokens

OpenNLP Chunker Format

    Token     postag  chunklabel
    a         DT      B-NP
    far       RB      I-NP
    more      RBR     I-NP
    pleasing  JJ      I-NP
    sound     NN      I-NP
    .         .       O
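To make the IOB labels concrete, the rows above can be grouped back into chunks. The sketch below assumes three whitespace-separated columns per line (token, postag, chunklabel), as in the example; the class `IobChunks` and its nested `Chunk` type are hypothetical names, not part of OpenNLP or UIMA. A `B-` label always opens a new chunk, an `I-` label extends the open chunk of the same type, and `O` closes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: groups "token postag chunklabel" lines into chunks. */
final class IobChunks {

    /** A chunk type (NP, VP, PP, ...) together with the tokens it covers. */
    static final class Chunk {
        final String type;
        final List<String> tokens = new ArrayList<>();

        Chunk(String type) {
            this.type = type;
        }

        @Override
        public String toString() {
            return "[" + type + " " + String.join(" ", tokens) + "]";
        }
    }

    static List<Chunk> group(List<String> lines) {
        List<Chunk> chunks = new ArrayList<>();
        Chunk open = null; // the chunk currently being extended, if any
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                open = null; // a blank line ends any open chunk
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length != 3) {
                throw new IllegalArgumentException("Expected 3 columns: " + line);
            }
            String token = cols[0];
            String label = cols[2];
            if (label.equals("O")) {
                open = null; // outside any chunk
                continue;
            }
            if (!label.startsWith("B-") && !label.startsWith("I-")) {
                throw new IllegalArgumentException("Unknown chunk label: " + label);
            }
            String type = label.substring(2);
            boolean continuesOpenChunk =
                    label.startsWith("I-") && open != null && open.type.equals(type);
            if (!continuesOpenChunk) {
                open = new Chunk(type); // B- label, or an I- label with nothing to extend
                chunks.add(open);
            }
            open.tokens.add(token);
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "a DT B-NP", "far RB I-NP", "more RBR I-NP",
                "pleasing JJ I-NP", "sound NN I-NP", ". . O");
        for (Chunk chunk : group(lines)) {
            System.out.println(chunk); // prints [NP a far more pleasing sound]
        }
    }
}
```

Treating a stray `I-` label as the start of a new chunk is a design choice of this sketch; a stricter reader could reject such input instead.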

Chunking with UIMA
- Run OpenNLP chunker in UIMA
- Also needs OpenNLP sentence detector, tokenizer and POS tagger
- Write a Java wrapper
  - No wrapper provided for chunker
  - Similar to wrapper for POS tagger
- Add chunker to aggregate analysis engine

Editing the Type System
- UIMA type systems
  - Types, subtypes and inheritance
  - Features appropriate for type
- OpenNLPExampleTypes.xml
- Edit Token type
  - Has existing postag feature
  - Add new chunklabel feature

Defining Capabilities
- Capabilities of each component
  - Specify input and output types
- Supports interoperability
  - Enables check for correct types
  - Important in pipeline of components
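The kind of check that declared capabilities make possible can be sketched in plain Java. This is an illustration only and does not use the UIMA descriptor format or API: `CapabilityCheck`, `Component` and the type names such as `Token:postag` are hypothetical. The idea is simply that every input type a component declares must have been produced by some earlier component in the pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: verifies that each component's inputs are produced upstream. */
final class CapabilityCheck {

    /** A pipeline component with its declared input and output types. */
    static final class Component {
        final String name;
        final Set<String> inputs;
        final Set<String> outputs;

        Component(String name, List<String> inputs, List<String> outputs) {
            this.name = name;
            this.inputs = new TreeSet<>(inputs);   // sorted for stable messages
            this.outputs = new TreeSet<>(outputs);
        }
    }

    /** Returns one message per unsatisfied input; an empty list means the order is valid. */
    static List<String> check(List<Component> pipeline) {
        Set<String> available = new HashSet<>();
        List<String> problems = new ArrayList<>();
        for (Component component : pipeline) {
            for (String input : component.inputs) {
                if (!available.contains(input)) {
                    problems.add(component.name + " needs " + input);
                }
            }
            available.addAll(component.outputs);
        }
        return problems;
    }

    public static void main(String[] args) {
        Component detector = new Component("SentenceDetector",
                Collections.<String>emptyList(), Arrays.asList("Sentence"));
        Component tokenizer = new Component("Tokenizer",
                Arrays.asList("Sentence"), Arrays.asList("Token"));
        Component tagger = new Component("PosTagger",
                Arrays.asList("Sentence", "Token"), Arrays.asList("Token:postag"));
        Component chunker = new Component("Chunker",
                Arrays.asList("Sentence", "Token", "Token:postag"),
                Arrays.asList("Token:chunklabel"));

        // Chunker placed before the tagger: its postag input is not yet available.
        System.out.println(check(Arrays.asList(detector, tokenizer, chunker, tagger)));
    }
}
```

With the chunker moved after the tagger, the same check returns an empty list.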

Colouring the Chunks
- UIMA Annotation Viewer
  - All Token annotations are same colour
  - Need different colours for NP, VP, PP
- Write a Chunk Marker annotator
  - Read chunklabel features (B-NP, I-NP)
  - Write new NP, VP, PP annotations
  - NP, VP, PP types already defined for parser

Overview
- What is UIMA?
- Part-of-Speech Tagging
- Full Parsing
- Shallow Parsing
- More about UIMA
  - UIMA and standards
  - UIMA and community

Other UIMA Annotators
- UIMA is not just OpenNLP!
- Write your own annotators
- Regular expression annotators
  - addresses, URLs, phone numbers
- Dictionary lookup annotators
  - Use existing name lists (GATE gazetteers)
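The core of a regular expression annotator is small: find matches and record their begin and end offsets. The sketch below uses only `java.util.regex` and does not use the UIMA annotator API; `RegexAnnotatorSketch`, `Span` and the deliberately simple URL pattern are assumptions made for illustration, not patterns taken from any real annotator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of what a regular expression annotator computes. */
final class RegexAnnotatorSketch {

    /** An annotation as a typed span of character offsets, end exclusive. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Illustrative pattern only: "http://" or "https://" followed by non-space characters. */
    static final Pattern URL = Pattern.compile("https?://\\S+");

    /** Returns one span per match of the pattern in the text. */
    static List<Span> annotate(String text, String type, Pattern pattern) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            spans.add(new Span(type, matcher.start(), matcher.end()));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "See http://example.org/doc for details";
        for (Span span : annotate(text, "URL", URL)) {
            System.out.println(span.type + " " + span.begin + "-" + span.end
                    + " " + text.substring(span.begin, span.end));
        }
    }
}
```

A real annotator would write these spans into the document's annotation store instead of printing them.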

UIMA and Standards
- OASIS
  - Organization for the Advancement of Structured Information Standards
- OASIS UIMA Technical Committee

UIMA and Standards: GUI
- GATE uses its own GUI
- WordFreak uses its own GUI
- UIMA uses Eclipse
  - An existing, widely-used, open source GUI
  - Eclipse Modeling Framework (EMF)

UIMA and Standards: XML
- GATE uses its own XML format
- WordFreak uses its own XML format
- UIMA uses XMI
  - XML Metadata Interchange
  - OMG standard
  - Modelled by EMF in Eclipse

XML Metadata Interchange

    <tcas:DocumentAnnotation xmi:id="999998" sofa="1" begin="0" end="673" language="en"/>
    <examples:SourceDocumentInformation xmi:id="999999" sofa="1" begin="0" end="0" uri="file:/c:\annotations\sonnet130.txt" offsetInSource="0" documentSize="673" lastSegment="true"/>
    <opennlp:Sentence xmi:id="2" sofa="1" begin="0" end="147" componentId="opennlp Sentence Detector"/>
    <opennlp:Sentence xmi:id="31" sofa="1" begin="148" end="245" componentId="opennlp Sentence Detector"/>
    <opennlp:Sentence xmi:id="56" sofa="1" begin="246" end="327" componentId="opennlp Sentence Detector"/>

UIMA and Community
- UIMA open source software
  - Apache Software Foundation
- UIMA component repositories
  - CMU (Carnegie Mellon University)
  - JULIE Lab (Jena University)
- PEAR format
  - Processing Engine ARchive

UIMA and IBM
- IBM supports Apache UIMA
  - UIMA Innovation Awards
- IBM LRWB
  - LanguageWare Resource Workbench
  - Free download for evaluation
  - Non-programmers create annotators
  - Install annotators in UIMA with PEAR

UIMA and Community: RASP
- RASP Parser
  - Robust Accurate Statistical Parsing
  - Sussex and Cambridge universities
- RASP4UIMA
  - UIMA wrappers for RASP tools
  - DigitalPebble
  - Install RASP tools in UIMA with PEAR

UIMA and Community: LuCas
- Apache Lucene
  - Widely-used high-performance full-text indexing and search library
- LuCas - Lucene CAS Indexer
  - Stores UIMA CAS data in Lucene index
  - Developed at JULIE Lab (Jena)
  - Currently in UIMA sandbox
  - Presentation at UIMA Workshop today

Overview
- What is UIMA?
- Part-of-Speech Tagging
- Full Parsing
- Shallow Parsing
- More about UIMA
