Ortolang Tools : MarsaTag


1 Ortolang Tools : MarsaTag
Stéphane Rauzy, Philippe Blache, Grégoire de Montcheuil
SECOND VARIAMU WORKSHOP, LPL, Aix-en-Provence, August 20th & 21st, 2014
ORTOLANG received State aid under the «Investissements d'avenir» program (ANR-11-EQPX-0032)

2 ORTOLANG - Open Resources and TOols for LANGuage
- 6 partners
- SLDR : Speech and Language Data Repository
- SPPAS, Phonedit : speech annotation tools
- MarsaTag : a tagger for French written text and speech transcription

3 Outline
- MarsaTag components : Tokenizer, Morphosyntax, POS, Parser
- MarsaTag packaging : Architecture, Installer, GUI, CLI, NLP frameworks integration, WebService

4 MarsaTag
- Input, French productions :
    Written texts
    Speech transcriptions
- Output, the associated syntactic information :
    Part-Of-Speech tagging, e.g. Det, Noun,...
    Syntactic tree structure, e.g. NP, VP,...
    Functional relations, e.g. SUB, OBJ,...
- The processing chain :
    Pre-lexical treatment, the tokenizer
    Lexical access (morphosyntax) : MarsaLex lexicon : entries, frequency computed from a 140 Megawords corpus (mainly newspapers)
    Stochastic POS tagger : trained on the 700,000 manually corrected LPL-Grace corpus. Tag set of 51 categories, score of (F-Measure) on written texts.
    Stochastic deep parser : trained on the words LPL-FT corpus, a grammar of 15 constituents (NP, VP, VN,...) and 9 relations (SUB, OBJ, COORD,...). Evaluation in progress.


5 Tokenizer
Split the text into lexical units :
    Le commis change la vis à l'intérieur du meuble. M. Ebène veille sur lui en même temps qu'il discute.
    The assistant changes the screw inside the wardrobe. Mr. Ebony watches over him while he discusses.
- Mainly based on separators (space, newline,...) and typography marks (punctuation, apostrophe,...), but :
    Compound tokens (locution, Multi-Word Expression,...)
    Punctuation/apostrophe that is part of a token (abbreviation,...)
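The two exceptions above can be illustrated with a minimal sketch. It is not MarsaTag's tokenizer: the abbreviation list, the compound list, the underscore joining and the function names are all hypothetical, chosen only to cover the slide's example sentences.

```python
import re

# Hypothetical toy resources; MarsaTag's real lists are not given in the slides.
ABBREVIATIONS = {"M.", "etc."}
COMPOUNDS = [("à", "l'", "intérieur"), ("en", "même", "temps")]

# An elided form keeps its apostrophe (l', qu'); otherwise a word or one punctuation mark.
TOKEN_RE = re.compile(r"\w+['’]|\w+|[^\w\s]")


def split_units(text):
    """Split on whitespace, then on typography marks, except inside abbreviations."""
    units = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            units.append(chunk)  # the period is part of the token
        else:
            units.extend(TOKEN_RE.findall(chunk))
    return units


def merge_compounds(units):
    """Join consecutive units that match a known multi-word expression."""
    merged, i = [], 0
    while i < len(units):
        for compound in COMPOUNDS:
            n = len(compound)
            if tuple(u.lower() for u in units[i:i + n]) == compound:
                merged.append("_".join(units[i:i + n]))
                i += n
                break
        else:  # no compound starts at position i
            merged.append(units[i])
            i += 1
    return merged


def tokenize(text):
    return merge_compounds(split_units(text))


print(tokenize("M. Ebène veille sur lui en même temps qu'il discute."))
# ['M.', 'Ebène', 'veille', 'sur', 'lui', 'en_même_temps', "qu'", 'il', 'discute', '.']
```

Splitting first and merging afterwards keeps the separator rules and the lexical exceptions in separate steps, so either list can grow without touching the regular expression.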

6 Morphosyntax
Associate with each token all possible morphosyntactic tags :
    grammatical category : Noun, Verb, Adjective, Preposition,...
    syntactic features : Masculine/Feminine, Singular/Plural,...
Le commis change la vis à l'intérieur du meuble.
The assistant changes the screw inside the wardrobe.

7 Morphosyntax (2)
M. Ebène veille sur lui en même temps qu'il discute.
Mr. Ebony watches over him while he discusses.
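The lexical access step can be pictured as a dictionary lookup that returns every candidate analysis of a token. The sketch below uses a tiny hand-written lexicon; its entries, feature strings and function names are assumptions made for illustration and are not taken from MarsaLex.

```python
from typing import List, Tuple

Tag = Tuple[str, str]  # (grammatical category, syntactic features)

# Hypothetical toy lexicon, not MarsaLex: each form lists all its candidate tags.
TOY_LEXICON = {
    "le": [("Det", "masc sing"), ("Pron", "masc sing")],
    "commis": [("Noun", "masc"), ("Verb", "past participle")],
    "change": [("Verb", "sing"), ("Noun", "masc sing")],
    "la": [("Det", "fem sing"), ("Pron", "fem sing"), ("Noun", "masc sing")],
    "vis": [("Noun", "fem"), ("Verb", "sing")],
}


def lookup(token: str) -> List[Tag]:
    """Return all candidate tags for a token, ignoring case."""
    return TOY_LEXICON.get(token.lower(), [("Unknown", "")])


def analyse(tokens: List[str]) -> List[Tuple[str, List[Tag]]]:
    """Pair every token with its full list of candidate tags (no disambiguation)."""
    return [(token, lookup(token)) for token in tokens]


for token, tags in analyse(["Le", "commis", "change", "la", "vis"]):
    print(token, [category for category, _ in tags])
```

Every word of the example is ambiguous at this stage, which is what the POS tagger has to resolve next.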


8 POS tagger
Disambiguation of the morphosyntactic information, looking for the optimal tag sequence
Le commis change la vis à l'intérieur du meuble.
The assistant changes the screw inside the wardrobe.
M. Ebène veille sur lui en même temps qu'il discute.
Mr. Ebony watches over him while he discusses.
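One standard way to search for an optimal tag sequence is the Viterbi algorithm over a first-order model. The slides do not say which stochastic model MarsaTag uses, so the sketch below is only a generic illustration, and every probability in it is an invented toy value.

```python
# Toy lexical probabilities P(tag | word); invented values, not from LPL-Grace.
LEXICAL = {
    "le": {"Det": 0.7, "Pron": 0.3},
    "commis": {"Noun": 0.4, "Verb": 0.6},
    "change": {"Verb": 0.8, "Noun": 0.2},
    "la": {"Det": 0.7, "Pron": 0.3},
    "vis": {"Noun": 0.5, "Verb": 0.5},
}

# Toy transition probabilities P(tag | previous tag); pairs not listed use `floor`.
TRANSITION = {
    ("START", "Det"): 0.6, ("START", "Pron"): 0.3,
    ("Det", "Noun"): 0.8, ("Det", "Verb"): 0.01,
    ("Pron", "Verb"): 0.7,
    ("Noun", "Verb"): 0.5, ("Noun", "Noun"): 0.1, ("Noun", "Det"): 0.1,
    ("Verb", "Det"): 0.5, ("Verb", "Noun"): 0.2,
    ("Verb", "Pron"): 0.1, ("Verb", "Verb"): 0.1,
}


def viterbi(tokens, lexical, transition, start="START", floor=0.05):
    """Return the most probable tag sequence under the toy model."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {start: (1.0, [])}
    for token in tokens:
        step = {}
        for tag, p_lex in lexical[token].items():
            # Pick the best predecessor for this candidate tag.
            prob, path = max(
                ((p * transition.get((prev, tag), floor), prev_path)
                 for prev, (p, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (prob * p_lex, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


print(viterbi(["le", "commis", "change", "la", "vis"], LEXICAL, TRANSITION))
# ['Det', 'Noun', 'Verb', 'Det', 'Noun']
```

Although "commis" is more likely a verb in isolation under these toy numbers, the sequence context makes the noun reading win, which is the point of sequence-level disambiguation.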

9 Parser
Build a syntactic parse tree with functional relations
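A parse tree with functional relations can be represented as nested nodes in which a constituent optionally carries a relation label. The sketch below builds one by hand for part of the example sentence; the tree shape, the root label S and the class and function names are assumptions for illustration, not actual MarsaTag output.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                                  # constituent or POS label
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                  # set on leaves only
    relation: Optional[str] = None              # functional relation, e.g. SUB, OBJ


def bracket(node: Node) -> str:
    """Render the tree in bracketed notation, with relations as LABEL:REL."""
    head = node.label + (":" + node.relation if node.relation else "")
    if node.word is not None:
        return f"({head} {node.word})"
    return f"({head} " + " ".join(bracket(child) for child in node.children) + ")"


def words(node: Node) -> List[str]:
    """Collect the words covered by a node, left to right."""
    if node.word is not None:
        return [node.word]
    return [w for child in node.children for w in words(child)]


def relations(node: Node) -> List[tuple]:
    """List (relation, covered text) pairs found anywhere in the tree."""
    found = [(node.relation, " ".join(words(node)))] if node.relation else []
    for child in node.children:
        found.extend(relations(child))
    return found


# Hand-built illustrative tree for "Le commis change la vis".
tree = Node("S", [
    Node("NP", [Node("Det", word="Le"), Node("Noun", word="commis")], relation="SUB"),
    Node("VP", [
        Node("Verb", word="change"),
        Node("NP", [Node("Det", word="la"), Node("Noun", word="vis")], relation="OBJ"),
    ]),
])

print(bracket(tree))
print(relations(tree))
```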

10 Outline
- MarsaTag components : Tokenizer, Morphosyntax, POS, Parser
- MarsaTag packaging : Architecture, Installer, GUI, CLI, NLP frameworks integration, WebService

11 Architecture
- Analysis steps are independent modules sharing the same API
- Separate input/output stages :
    easy addition of new formats
    integration with other frameworks
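The design described on this slide can be sketched as a pipeline in which every analysis step exposes the same method, while reading and writing are plain functions that can be swapped independently. MarsaTag itself is a Java tool and its real API is not shown in the slides; all class, function and layer names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical shared container passed from one module to the next."""
    text: str
    layers: Dict[str, List[str]] = field(default_factory=dict)


class WhitespaceTokenizer:
    name = "tokens"

    def process(self, doc: Document) -> Document:
        doc.layers["tokens"] = doc.text.split()
        return doc


class PlaceholderTagger:
    """Stand-in for a later stage; it only shows that the interface is shared."""
    name = "pos"

    def process(self, doc: Document) -> Document:
        doc.layers["pos"] = ["?" for _ in doc.layers["tokens"]]
        return doc


def plain_text_reader(source: str) -> Document:
    return Document(text=source)


def tab_writer(doc: Document) -> str:
    """Write one line per annotation layer: layer name, tab, values."""
    return "\n".join(f"{name}\t{' '.join(values)}" for name, values in doc.layers.items())


def run_pipeline(source: str, reader: Callable, modules: list, writer: Callable, level: str) -> str:
    """Read, apply modules in order up to the requested level, then write."""
    doc = reader(source)
    for module in modules:
        doc = module.process(doc)
        if module.name == level:
            break
    return writer(doc)


modules = [WhitespaceTokenizer(), PlaceholderTagger()]
print(run_pipeline("la vis", plain_text_reader, modules, tab_writer, level="pos"))
```

Because the reader and writer are passed in as arguments, supporting a new format in this sketch means adding one function without touching the analysis modules.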

12 Installer
- Download from the SLDR
- Install in just a few clicks
- Java : Windows, Mac OS X & Linux

13 GUI
- Select input file(s)
- Choose input/output options
- Set the level of analysis and run...

14 CLI
- Batch mode : process files without GUI
    USAGE : MarsaTag UI [ options] <file>* ([ options] <file>*)*
    For each <file>, only the reader/writer options defined before it apply, so reader/writer options can change between two files.
    OPTIONS :
        g gui gui    Start GUI
        cli cli      Run the CLI (i.e. do not start the GUI, deactivate a previous ' g')
- Preconfigure the GUI, example of the SPPAS plugin :

15 NLP frameworks
- GATE : tokenisation, morphosyntax, POS and Parser
(Future : )

16 WebService (Work in progress)
- WebLicht Web Applications
    Combine tools (tokenizers, POS taggers, parsers) encapsulated as web services
    A common XML data exchange format (Text Corpus Format, TCF)
        - annotation layers : text, tokens, sentences, lemmas, part-of-speech, constituent parsing, dependency parsing, morphology, named entities, references, lexical-semantic annotations, matches, word splittings, geographical locations, discourse connectives, phonetics, text structure, orthography.
        - compatible with the Linguistic Annotation Format (LAF) and Graph-based Format for Linguistic Annotations (GrAF) developed in the ISO/TC37/SC4 technical committee
    CLARIN (pan-European collaborative effort)
- ...
