Treex: Modular NLP Framework

Similar documents
TectoMT: Modular NLP Framework

Machine Translation Zoo

Machine Translation and Discriminative Models

Machine Learning for Deep-syntactic MT

Corpus Linguistics: corpus annotation

Implementing a Variety of Linguistic Annotations

Grammar Knowledge Transfer for Building RMRSs over Dependency Parses in Bulgarian

Morpho-syntactic Analysis with the Stanford CoreNLP

Final Project Discussion. Adam Meyers Montclair State University

Annotation Quality Checking and Its Implications for Design of Treebank (in Building the Prague Czech-English Dependency Treebank)

TectoMT: Modular NLP Framework

Depfix: Automatic post-editing of phrase-based machine translation outputs

Let s get parsing! Each component processes the Doc object, then passes it on. doc.is_parsed attribute checks whether a Doc object has been parsed

The Prague Dependency Treebank: Annotation Structure and Support

Annotation Procedure in Building the Prague Czech-English Dependency Treebank

Activity Report at SYSTRAN S.A.

Apache UIMA and Mayo ctakes

Maca a configurable tool to integrate Polish morphological data. Adam Radziszewski Tomasz Śniatowski Wrocław University of Technology

Searching through Prague Dependency Treebank

An UIMA based Tool Suite for Semantic Text Processing

A Multilingual Social Media Linguistic Corpus

Package corenlp. June 3, 2015

DepPattern User Manual beta version. December 2008

PML: Prague Markup Language

Text Mining for Software Engineering

Managing a Multilingual Treebank Project

NLP in practice, an example: Semantic Role Labeling

Natural Language Processing Pipelines to Annotate BioC Collections with an Application to the NCBI Disease Corpus

Ortolang Tools : MarsaTag

Diagnostic evaluation of MT with DELiC4MT

Automatic Reader. Multi Lingual OCR System.

Dependency Parsing. Ganesh Bhosale Neelamadhav G Nilesh Bhosale Pranav Jawale under the guidance of

INF5820/INF9820 LANGUAGE TECHNOLOGICAL APPLICATIONS. Jan Tore Lønning, Lecture 8, 12 Oct

LAB 3: Text processing + Apache OpenNLP

Corpus Linguistics. Seminar Resources for Computational Linguists SS Magdalena Wolska & Michaela Regneri

A Flexible Distributed Architecture for Natural Language Analyzers

Post-annotation Checking of Prague Dependency Treebank 2.0 Data

Fully Delexicalized Contexts for Syntax-Based Word Embeddings

ANC2Go: A Web Application for Customized Corpus Creation

A tool for Cross-Language Pair Annotations: CLPA

Syntax and Grammars 1 / 21

Agenda for today. Homework questions, issues? Non-projective dependencies Spanning tree algorithm for non-projective parsing

STS Infrastructural considerations. Christian Chiarcos

Better translations with user collaboration - Integrated MT at Microsoft

The Multilingual Language Library

Information Extraction with GATE

A Quick Guide to MaltParser Optimization

PBML. logo. The Prague Bulletin of Mathematical Linguistics NUMBER??? NOVEMBER CzEng 0.9. Large Parallel Treebank with Rich Annotation

Importing MASC into the ANNIS linguistic database: A case study of mapping GrAF

Introducing XAIRA. Lou Burnard Tony Dodd. An XML aware tool for corpus indexing and searching. Research Technology Services, OUCS

Design and Realization of the EXCITEMENT Open Platform for Textual Entailment. Günter Neumann, DFKI Sebastian Pado, Universität Stuttgart

Contents. List of Figures. List of Tables. Acknowledgements

Representing Layered and Structured Data in the CoNLL-ST Format

Unstructured Information Management Architecture (UIMA) Graham Wilcock University of Helsinki

PROMT Flexible and Efficient Management of Translation Quality

Hidden Markov Models. Natural Language Processing: Jordan Boyd-Graber. University of Colorado Boulder LECTURE 20. Adapted from material by Ray Mooney

Hybrid Combination of Constituency and Dependency Trees into an Ensemble Dependency Parser

Constraints for corpora development and validation 1

Large-scale Semantic Networks: Annotation and Evaluation

Propbank Instance Annotation Guidelines Using a Dedicated Editor, Jubilee

TectoMT: Machine Translation System

Using the Ellogon Natural Language Engineering Infrastructure

Langforia: Language Pipelines for Annotating Large Collections of Documents

How to.. What is the point of it?

Vision Plan. For KDD- Service based Numerical Entity Searcher (KSNES) Version 2.0

<Insert Picture Here> Oracle Policy Automation 10.0 Features and Benefits

Learning Latent Linguistic Structure to Optimize End Tasks. David A. Smith with Jason Naradowsky and Xiaoye Tiger Wu

Recent Developments in the Czech National Corpus

CSC401 Natural Language Computing

A CASE STUDY: Structure learning for Part-of-Speech Tagging. Danilo Croce WMR 2011/2012

A Bambara Tonalization System for Word Sense Disambiguation Using Differential Coding, Segmentation and Edit Operation Filtering

CSC 5930/9010: Text Mining GATE Developer Overview

Experiences with UIMA in NLP teaching and research. Manuela Kunze, Dietmar Rösner

Formats and standards for metadata, coding and tagging. Paul Meurer

Tokenization and Sentence Segmentation. Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017

The CKY algorithm part 1: Recognition

Corpus Linguistics. Automatic POS and Syntactic Annotation. POS tagging. Available POS Taggers. Stanford tagger. Parsing.

Jubilee: Propbank Instance Editor Guideline (Version 2.1)

Large-Scale Syntactic Processing: Parsing the Web. JHU 2009 Summer Research Workshop

Statistical parsing. Fei Xia Feb 27, 2009 CSE 590A

Machine Learning in GATE

Incremental Integer Linear Programming for Non-projective Dependency Parsing

Dependency grammar and dependency parsing

Using UIMA to Structure an Open Platform for Textual Entailment. Tae-Gil Noh, Sebastian Padó Dept. of Computational Linguistics Heidelberg University

ABBYY Recognition Server 4 Release 5 (w. Japanese Helpfiles) Patch 1 Release Notes

SURVEY PAPER ON WEB PAGE CONTENT VISUALIZATION

The Web Enabling Company

Processing XML and JSON in Python

Whole World OLIF. Version 3.0 of the Versatile Language Data Format

Army Research Laboratory

Question Answering Using XML-Tagged Documents

UNIVERSITY OF EDINBURGH COLLEGE OF SCIENCE AND ENGINEERING SCHOOL OF INFORMATICS INFR08008 INFORMATICS 2A: PROCESSING FORMAL AND NATURAL LANGUAGES

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

An Adaptive Framework for Named Entity Combination

Online Service for Polish Dependency Parsing and Results Visualisation

of the word usage in the available corpora. The constraints incorporated in the system are used for writing a grammar of the sub-languages of the defi

Machine Translation Research in META-NET

KH Coder 3 Reference Manual

Data and Information Integration: Information Extraction

The Goal of this Document. Where to Start?

Transcription:

: Modular NLP Framework Martin Popel ÚFAL (Institute of Formal and Applied Linguistics) Charles University in Prague September 2015, Prague, MT Marathon

Outline Motivation, vs. architecture internals Future plans Conclusion and examples 2

Motivation Goals of elegant integration of in-house and third-party NLP tools modularity, reusability, cooperation ability to easily modify and add code in a full-fledged programming language (Perl) 3

vs. 2005 (Zdeněk Žabokrtský) NLP framework MT system lemmatization tagging parsing 4

vs. 2005 NLP framework MT system lemmatization tagging parsing 2011 multi-purpose NLP framework MT system coreference lemmatization tagging parsing PEDT preprocessing CzEng analysis treebank conversions named entity r. alignment (word,tree) SMT preproc. etc. 5

vs. 2005 NLP framework MT system lemmatization tagging parsing Now not only tectogrammatics and not only MT renamed 2011 multi-purpose NLP framework MT system coreference lemmatization tagging parsing PEDT preprocessing CzEng analysis treebank conversions named entity r. alignment (word,tree) SMT preproc. etc. 6

vs. 2005 NLP framework MT system lemmatization tagging parsing redesigned and reimplemented easier to use more flexible 2011 multi-purpose NLP framework MT system coreference lemmatization tagging parsing PEDT preprocessing CzEng analysis treebank conversions named entity r. alignment (word,tree) SMT preproc. etc. 7

vs. 2005 English Czech NLP framework lemmatization MT system tagging parsing redesigned and reimplemented easier to use more flexible more langs 2011 Czech English multi-purpose Russian Tamil NLP framework Polish Esperanto Spanish MT system German coreference lemmatization French tagging parsing Arabic PEDT preprocessing CzEng analysis treebank conversions Vietnamese Hindi named entity r. Urdu SMT preproc. alignment (word,tree) Finish etc. 8 *) Most of the listed languages are only drafts of anlysis made by students, not converted to yet. The entire risk as to the quality and performance of the program is with you.

vs. 2005 English Czech NLP framework lemmatization MT system tagging parsing redesigned and reimplemented easier to use more flexible more langs 2011 Czech English multi-purpose Russian Tamil NLP framework Polish Esperanto Spanish MT system German coreference lemmatization French tagging parsing Arabic PEDT preprocessing CzEng analysis treebank conversions Vietnamese Hindi named entity r. Urdu SMT preproc. alignment (word,tree) Finish etc. 9 *) Most of the listed languages are only drafts of anlysis made by students, not converted to yet. The entire risk as to the quality and performance of the program is with you.

linguistically motivated MT system (English to Czech pilot) deep syntactic (tectogrammatical) transfer translation process divided to more than 90 blocks combining statistical and rule based blocks ANALYSIS tectogramatical layer TRANSFER SYNTHESIS t-layer fill morphological categories grammatemes query use impose agreement build t-tree dictionary HMTM add functional words mark edges to contract analytical layer a-layer analytical functions generate wordforms parser (McDonald's MST) morphological layer m-layer tagger (Morce) concatenate lemmatization source language (English) target language (Czech) w-layer 10 tokenization segmentation fill formems

4 layers of language description implemented in Prague Dependency Treebank (PDT) tectogrammatical layer deep-syntactic dependency trees analytical layer surface-syntactic dependency trees, labeled edges morphological layer lemma & POS tag for each word word layer raw (tokenized) text 11

4 layers of language description implemented in Prague Dependency Treebank (PDT) tectogrammatical layer deep-syntactic dependency trees - abstraction from many language-specific phenomena - autosemantic (meaningful) words ~ nodes - functional words (prepositions, auxiliaries) ~ attributes - syntactic-semantic relations (dependecies) ~ edges - added nodes (e.g. becasue of pro-drop) -... 12

layers of language description implemented in Mostly backward compatible adaptations (adding attributes) formeme (n:2, n:k+3, v:že+vfin, v:rc, adj:attr) attributes for clauses, is_passive ( diathesis),... is_member (for conjuncts on a-layer) is stored with prepositions All layers stored in one file A-layer and m-layer merged into one Two more layers: P-layer phrase-structure trees N-layer named entities 13

architecture input files In-memory document representation (OOP API) document reader block 1... block n document writer output files14

architecture parallelization (using SGE cluster) document reader input files block In-memory document... block writer... document reader block output files In-memory document... block writer 15

architecture processing units block elementary processing unit in corresponding to a given NLP subtask one Perl class (::Block::*), saved in one file scenario a sequence of blocks saved in plain text files or a ::Scen::* Perl class just a list of the blocks' names and their parameters application represents an end-to-end NLP task described by a scenario that starts with a reader (input conversion) ends with a writer (output conversion) Readers can split the input file into more in-memory docs. There are readers&writers for a number of popular formats: plain text, CoNLL, PDT PML, Penn MRG, Tiger... 16 *.treex.gz

architecture processing units Blocks can be easily substituted with an alternative solution. Scenario 1: Scenario 2: Scenario 3: 17

architecture processing units Blocks can be easily substituted with an alternative solution. Scenario A Scenario B W2A::EN::Segment W2A::SegmentOnNewlines W2A::EN::Tokenize W2A::EN::TagMorce W2A::EN::TagLinguaEn W2A::EN::Lemmatize W2A::EN::Lemmatize W2A::EN::ParseMST W2A::EN::ParseMalt 18

architecture data units Document corresponds to one sentence Zone stored in one file sequence of sentences Bundle ( bundle of trees ) one for each language (Arabic, Czech, English,...) and optionally a variant ( selectors src, trg, ref,...) Tree layer of language description: A, T (plus P, N) m-layer is stored with the a-layer in one tree 19

architecture data units sentence 1 DOCUMENT M-layer love Mary NNP VBZ RB VBD NNP Petr milovat A-layer love Mary Pred Obj Petr Sb T-layer Peter ACT love PRED Marie NNMS1 VB-S 3P-NA NNFS4 A-layer Sb AuxV Neg BUNDLE W-layer Petr nemiluje Marii. M-layer Peter do not sentence N Zone cs_src W-layer Peter does not love Mary. Peter do not... BUNDLE BUNDLE Zone en_src sentence 2 milovat Pred... Marie Obj T-layer Mary PAT Petr ACT milovat PRED Marie PAT 20

architecture data units sentence 1 DOCUMENT M-layer love Mary NNP VBZ RB VBD NNP Petr milovat A-layer love Mary Pred Obj Petr Sb T-layer Peter ACT love PRED Marie NNMS1 VB-S 3P-NA NNFS4 A-layer Sb AuxV Neg BUNDLE W-layer Petr nemiluje Marii. M-layer Peter do not sentence N Zone cs_src W-layer Peter does not love Mary. Peter do not... BUNDLE BUNDLE Zone en_src sentence 2 milovat Pred... Marie Obj T-layer Mary PAT Petr ACT milovat PRED Marie PAT 21

architecture data units sentence 1 DOCUMENT M-layer love Mary NNP VBZ RB VBD NNP Petr milovat A-layer love Mary Pred Obj Petr Sb T-layer Peter ACT love PRED Marie NNMS1 VB-S 3P-NA NNFS4 A-layer Sb AuxV Neg BUNDLE W-layer Petr nemiluje Marii. M-layer Peter do not sentence N Zone cs_trg W-layer Peter does not love Mary. Peter do not... BUNDLE BUNDLE Zone en_src sentence 2 milovat Pred... Marie Obj T-layer Mary PAT Petr ACT milovat PRED Marie PAT 22

Internals Design decisions Perl (wrappers for binaries, Java,...) Linux (some applications platform-independent) OOP (Moose) Open source (GNU GPL for the versioned part) Neutral w.r.t. methodology (statistical, rule-based) Multilingual Open standards (Unicode, XML) 23

Internals Components Core Classes ::Core::Document ::Core::Node ::Core::Scenario ::Core::Block Blocks TagTNT ParseMST MarkPassives Visualization TrEd (Tree Editor with SVG and PDF export options) Applications scenarios + format conversions Readers and Writers plain text HTML & various XML corpora PDT, PennTB, EMILLE, PADT, CoNLL, vertical download SVN versioned part In-house Tools taggers, parsers NE recognizers language models API machine learning tools Third-party Tools Malt parser TreeTagger fntbl, CRF++ shared part Data models for stochastic tools translation dictionaries special-purpose lexical databases 24

Internals Statistics Developed since 2005, over ten developers Over 400 blocks (140 English, 120 Czech, 60 English-to-Czech, 30 other languages, 50 language independent) Taggers (5 English, 3 Czech, 1 German and Russian, Tamil) Parsers (Dep. 2 English, 3 Czech, 2 German; Const. 2 English) Named Entity Recognizers (2 Czech,1 English) Speed example: Best version of English-to-Czech MT 1.2 seconds per sentence plus 90 seconds loading, with 20 computers in cluster: 2000 sentences in 4 min 25

Conclusion main properties emphasized efficient development, modular design and reusability stratificational approach to the language unified object-oriented interface for accessing data structures comfortable development 26

TrEd visualization translation 27

TrEd visualization word alignment on the morphological layer 28

TrEd visualization word alignment on the tectogrammatical layer 29

TrEd visualization named entities 30

Block example SVO to SOV code package Tutorial::Svo2SovSolution; use Moose; use ::Core::Common; extends '::Core::Block'; core convention Perl keyword/convention sub process_anode { my ( $self, $a_node ) = @_; if ( $a_node >tag =~ /^V/ ) { # verb found foreach my $child ( $a_node >get_echildren() ) { if ( $child >afun eq 'Obj' ) { # object found # Move the object and its subtree so it precedes the verb $child >shift_before_node($a_node); } } } return; } 1; 31 20

Thank you Cooperation is welcomed. http://ufal.mff.cuni.cz/treex 32

Thank you is growing! http://ufal.mff.cuni.cz/treex 33