TectoMT: Modular NLP Framework

Similar documents
Treex: Modular NLP Framework

TectoMT: Modular NLP Framework

Morpho-syntactic Analysis with the Stanford CoreNLP

Implementing a Variety of Linguistic Annotations

Corpus Linguistics: corpus annotation

PML: Prague Markup Language

Annotation Quality Checking and Its Implications for Design of Treebank (in Building the Prague Czech-English Dependency Treebank)

Machine Translation Zoo

Depfix: Automatic post-editing of phrase-based machine translation outputs

Searching through Prague Dependency Treebank

The Prague Dependency Treebank: Annotation Structure and Support

Ortolang Tools : MarsaTag

Maca a configurable tool to integrate Polish morphological data. Adam Radziszewski Tomasz Śniatowski Wrocław University of Technology

Final Project Discussion. Adam Meyers Montclair State University

NLP in practice, an example: Semantic Role Labeling

A tool for Cross-Language Pair Annotations: CLPA

Let s get parsing! Each component processes the Doc object, then passes it on. doc.is_parsed attribute checks whether a Doc object has been parsed

Grammar Knowledge Transfer for Building RMRSs over Dependency Parses in Bulgarian

Natural Language Processing Pipelines to Annotate BioC Collections with an Application to the NCBI Disease Corpus

Text Mining for Software Engineering

Activity Report at SYSTRAN S.A.

Machine Learning for Deep-syntactic MT

Annotation Procedure in Building the Prague Czech-English Dependency Treebank

Package corenlp. June 3, 2015

Fully Delexicalized Contexts for Syntax-Based Word Embeddings

Managing a Multilingual Treebank Project

Apache UIMA and Mayo ctakes

Dependency Parsing. Ganesh Bhosale Neelamadhav G Nilesh Bhosale Pranav Jawale under the guidance of

DepPattern User Manual beta version. December 2008

Machine Translation and Discriminative Models

Contents. List of Figures. List of Tables. Acknowledgements

An UIMA based Tool Suite for Semantic Text Processing

ANC2Go: A Web Application for Customized Corpus Creation

Using the Ellogon Natural Language Engineering Infrastructure

A Multilingual Social Media Linguistic Corpus

The Multilingual Language Library

A Flexible Distributed Architecture for Natural Language Analyzers

Post-annotation Checking of Prague Dependency Treebank 2.0 Data

English Understanding: From Annotations to AMRs

How to.. What is the point of it?

An Adaptive Framework for Named Entity Combination

PBML. logo. The Prague Bulletin of Mathematical Linguistics NUMBER??? NOVEMBER CzEng 0.9. Large Parallel Treebank with Rich Annotation

Syntax and Grammars 1 / 21

Design and Realization of the EXCITEMENT Open Platform for Textual Entailment. Günter Neumann, DFKI Sebastian Pado, Universität Stuttgart

Online Service for Polish Dependency Parsing and Results Visualisation

Hidden Markov Models. Natural Language Processing: Jordan Boyd-Graber. University of Colorado Boulder LECTURE 20. Adapted from material by Ray Mooney

Machine Learning in GATE

Hybrid Combination of Constituency and Dependency Trees into an Ensemble Dependency Parser

Automatic Lemmatizer Construction with Focus on OOV Words Lemmatization

A CASE STUDY: Structure learning for Part-of-Speech Tagging. Danilo Croce WMR 2011/2012

The CKY algorithm part 2: Probabilistic parsing

Representing Layered and Structured Data in the CoNLL-ST Format

Learning Latent Linguistic Structure to Optimize End Tasks. David A. Smith with Jason Naradowsky and Xiaoye Tiger Wu

NLP Chain. Giuseppe Castellucci Web Mining & Retrieval a.a. 2013/2014

UIMA-based Annotation Type System for a Text Mining Architecture

Tokenization and Sentence Segmentation. Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017

Recent Developments in the Czech National Corpus

Data and Information Integration: Information Extraction

INF5820/INF9820 LANGUAGE TECHNOLOGICAL APPLICATIONS. Jan Tore Lønning, Lecture 8, 12 Oct

SURVEY PAPER ON WEB PAGE CONTENT VISUALIZATION

Vision Plan. For KDD- Service based Numerical Entity Searcher (KSNES) Version 2.0

Some good development practices (not only in NLP)

Unstructured Information Management Architecture (UIMA) Graham Wilcock University of Helsinki

Corpus Linguistics. Automatic POS and Syntactic Annotation. POS tagging. Available POS Taggers. Stanford tagger. Parsing.

Importing MASC into the ANNIS linguistic database: A case study of mapping GrAF

Constraints for corpora development and validation 1

Formats and standards for metadata, coding and tagging. Paul Meurer

The Application of Constraint Rules to Data-driven Parsing

Domain Based Named Entity Recognition using Naive Bayes

Langforia: Language Pipelines for Annotating Large Collections of Documents

Language Resources and Linked Data

Large-scale Semantic Networks: Annotation and Evaluation

Information Extraction with GATE

Sustainability of Text-Technological Resources

Introduction to Text Mining. Aris Xanthos - University of Lausanne

The CKY algorithm part 1: Recognition

Text Analytics Introduction (Part 1)

Projektgruppe. Michael Meier. Named-Entity-Recognition Pipeline

STS Infrastructural considerations. Christian Chiarcos

LAB 3: Text processing + Apache OpenNLP

Semantic sentence structure search engine

CSC 5930/9010: Text Mining GATE Developer Overview

KH Coder 3 Reference Manual

Question Answering Using XML-Tagged Documents

The Prague Bulletin of Mathematical Linguistics NUMBER 92 DECEMBER CzEng 0.9. Large Parallel Treebank with Rich Annotation

CS395T Project 2: Shift-Reduce Parsing

The CKY algorithm part 1: Recognition

The Goal of this Document. Where to Start?

Agenda for today. Homework questions, issues? Non-projective dependencies Spanning tree algorithm for non-projective parsing

Migrating LINA Laboratory to Apache UIMA

The KNIME Text Processing Plugin

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS

Building the Multilingual Web of Data. Integrating NLP with Linked Data and RDF using the NLP Interchange Format

Parmenides. Semi-automatic. Ontology. construction and maintenance. Ontology. Document convertor/basic processing. Linguistic. Background knowledge

Maximum Entropy based Natural Language Interface for Relational Database

TechWatchTool: Innovation and Trend Monitoring

A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet

Web Product Ranking Using Opinion Mining

CS 224N Assignment 2 Writeup

A Quick Guide to MaltParser Optimization

Transcription:

: Modular NLP Framework Martin Popel, Zdeněk Žabokrtský ÚFAL, Charles University in Prague IceTAL, 7th International Conference on Natural Language Processing August 17, 2010, Reykjavik

Outline Motivation Layers of language description in PDT architecture internals Future plans Conclusion 2 2

Motivation for creating Originally a framework for a linguistically motivated MT system deep syntactic (tectogrammatical) transfer started with English to Czech direction translation process divided to ~ 90 blocks combining statistical and rule-based blocks Goals: elegant integration of in-house and third-party NLP tools modularity, reusability, cooperation Analysis-Transfer-Synthesis Diagram interlingua tectogram. surf.synt. morpho. raw text. source language target language ability to easily modify and add code in full-fledged programming language (Perl) 3 3

Motivation for creating Now used for many other projects, not limited to MT nor tectogrammatics: automatic alignment & annotation of a parallel treebank (CzEng) support for manual annotations (PEDT) lemmatization, tagging, parsing named entity recognition, information retrieval, coreference preprocessing for phrase-base MT change word order, append determiners to nouns,... add deep-syntactic features as an input for factored translation conversions, evaluations, etc. 4 4

4 layers of language description implemented in Prague Dependency Treebank (PDT) word layer raw (tokenized) text 5 5

4 layers of language description implemented in Prague Dependency Treebank (PDT) morphological layer lemma & POS tag for each word word layer raw (tokenized) text 6 5

4 layers of language description implemented in Prague Dependency Treebank (PDT) analytical layer surface-syntactic dependency trees, labeled edges morphological layer lemma & POS tag for each word word layer raw (tokenized) text 7 5

4 layers of language description implemented in Prague Dependency Treebank (PDT) tectogrammatical layer deep-syntactic dependency trees analytical layer surface-syntactic dependency trees, labeled edges morphological layer lemma & POS tag for each word word layer raw (tokenized) text 8 5

4 layers of language description implemented in Prague Dependency Treebank (PDT) tectogrammatical layer deep-syntactic dependency trees - abstraction from many language-specific phenomena - autosemantic (meaningful) words ~ nodes - functional words (prepositions, auxiliaries) ~ attributes - syntactic-semantic relations (dependecies) ~ edges - added nodes (e.g. becasue of pro-drop) -... 9 6

architecture processing units block elementary processing unit in corresponding to a given NLP subtask one Perl class, saved in one file scenario a sequence of blocks saved in plain text files just a list of the blocks' names and their parameters application represents an end-to-end NLP task conversion of the input to internal format (XML) possibly split into more files applying a scenario to the files (loaded in memory) conversion to the desired output format 10 7

architecture processing units Blocks can be easily substituted with an alternative solution. Scenario 1: Scenario 2: Scenario 3: 11 8

architecture processing units Blocks can be easily substituted with an alternative solution. Scenario A Sentence_segmentation_simple Penn_style_tokenization TagMxPost Lemmatize_mtree McD_parser Scenario B Each_line_as_sentence Tokenize_and_tag Lemmatize_mtree Malt_parser 12 9

architecture data units Document stored in one file sequence of sentences Bundle corresponds to one sentence bundle of trees Tree direction (S=source, T=target) Language (Arabic, Czech, English, German,...) layer of language description (M, A, T) 13 10

architecture data units English layers SEnglishW Peter does not love Mary. SEnglishM DOCUMENT sentence 1 sentence 2... sentence N BUNDLE Czech layers SCzechW Petr nemiluje Marii. SCzechM BUNDLE BUNDLE Peter do not love Mary NNP VBZ RB VBD NNP SEnglishA Petr milovat Marie NNMS1 VB-S 3P-NA NNFS4 SCzechA... Peter do not love Mary Sb AuxV Neg Pred Obj SEnglishT Petr milovat Marie Sb Pred Obj SCzechT Peter love Mary ACT PRED PAT Petr milovat Marie ACT PRED PAT 14 11

architecture data units English layers SEnglishW Peter does not love Mary. SEnglishM DOCUMENT sentence 1 sentence 2... sentence N BUNDLE Czech layers SCzechW Petr nemiluje Marii. SCzechM BUNDLE BUNDLE Peter do not love Mary NNP VBZ RB VBD NNP SEnglishA Petr milovat Marie NNMS1 VB-S 3P-NA NNFS4 SCzechA... Peter do not love Mary Sb AuxV Neg Pred Obj SEnglishT Petr milovat Marie Sb Pred Obj SCzechT Peter love Mary ACT PRED PAT Petr milovat Marie ACT PRED PAT 15 11

architecture data units English layers SEnglishW Peter does not love Mary. SEnglishM DOCUMENT sentence 1 sentence 2... sentence N BUNDLE Czech layers TCzechW Petr nemiluje Marii. TCzechM BUNDLE BUNDLE Peter do not love Mary NNP VBZ RB VBD NNP SEnglishA Petr milovat Marie NNMS1 VB-S 3P-NA NNFS4 TCzechA... Peter do not love Mary Sb AuxV Neg Pred Obj SEnglishT Petr milovat Marie Sb Pred Obj TCzechT Peter love Mary ACT PRED PAT Petr milovat Marie ACT PRED PAT 16 11

architecture data units English layers SEnglishW Peter does not love Mary. SEnglishM DOCUMENT sentence 1 sentence 2... sentence N BUNDLE Czech layers TCzechW Petr nemiluje Marii. TCzechM BUNDLE BUNDLE Peter do not love Mary NNP VBZ RB VBD NNP SEnglishA Petr milovat Marie NNMS1 VB-S 3P-NA NNFS4 TCzechA... Peter do not love Mary Sb AuxV Neg Pred Obj SEnglishT Petr milovat Marie Sb Pred Obj TCzechT Peter love Mary ACT PRED PAT Petr milovat Marie ACT PRED PAT 17 11

Internals Design decisions Perl (wrappers for binaries, Java,...) Linux (some applications platform-independent) OOP (ClassStd, Moose) Open source (GNU GPL for the versioned part) Neutral w.r.t. methodology (statistical, rule-based) Multilingual Open standards (Unicode, XML) 18 12

Internals Components Data models for stochastic tools translation dictionaries special-purpose lexical databases Third-party Tools Malt parser TreeTagger fntbl, CRF++ Visualization TrEd (Tree Editor with SVG and PDF export options) download shared part In-house Tools taggers, parsers NE recognizers language models API machine learning tools Core Classes ::Document ::Bundle ::Node ::Scenario ::Block Blocks Tree_tagger McD_parser Mark_passives Applications scenarios + format conversions Format Convertors (to & from tmt) plain text HTML & various XML corpora PDT, PennTB, EMILLE, PADT, CoNLL, vertical SVN versioned part 19 13

Internals Statistics Developed since 2005, over ten developers Over 400 blocks (140 English, 120 Czech, 60 English-to-Czech, 30 other languages, 50 language independent) Taggers (5 English, 3 Czech, 1 German and Russian) Parsers (Dep. 2 English, 3 Czech, 2 German; Const. 2 English) Named Entity Recognizers (2 Czech,1 English) Speed example: Best version of English-to-Czech MT 1.2 seconds per sentence plus 90 seconds loading, with 20 computers in cluster: 2000 sentences in 4 min 20 14

Internals Statistics Developed since 2005, over ten developers Over 400 blocks (140 English, 120 Czech, 60 English-to-Czech, 30 other languages, 50 language independent) Taggers (5 English, 3 Czech, 1 German and Russian, Tamil) Parsers (Dep. 2 English, 3 Czech, 2 German; Const. 2 English) Named Entity Recognizers (2 Czech,1 English) Speed example: Best version of English-to-Czech MT 1.2 seconds per sentence plus 90 seconds loading, with 20 computers in cluster: 2000 sentences in 4 min 21 14

Future plans ( Treex) Reimplementation of core components CPAN release Adding new languages more easily Improved parallelization support Faster code, smaller files,... 22 15

Conclusion main properties emphasized efficient development, modular design and reusability stratificational approach to the language unified object-oriented interface for accessing data structures comfortable development 23 16

TrEd visualization translation 24 17

TrEd visualization word alignment on the morphological layer 25 18

TrEd visualization word alignment on the tectogrammatical layer 26 19

Block example SVO to SOV code package Tutorial::SVO_to_SOV_solution; use strict; use warnings; use base qw(::block); sub process_bundle { my ( $self, $bundle ) = @_; my $a_root = $bundle >get_tree('senglisha'); core convention Perl keyword/convention foreach my $a_node ( $a_root >get_descendants() ) { if ( $a_node >get_attr('m/tag') =~ /^V/ ) { # verb found foreach my $child ( $a_node >get_eff_children() ) { if ( $child >get_attr('afun') eq 'Obj' ) { # object found # Move the object and its subtree so it precedes the verb $child >shift_before_node($a_node); } } } } } 27 1; 20

Thank you Cooperation is welcomed. http://ufal.mff.cuni.cz/tectomt 28