Multiword deconstruction in AnCora dependencies and final release data
TECHNICAL REPORT GLICOM

Benjamin Kolz, Toni Badia, Roser Saurí
Universitat Pompeu Fabra
{benjamin.kolz, toni.badia,
June 2014

I. Introduction

The AnCora Surface Syntax Dependency Corpus (AnCora SSD) is a new resource available to the research community. It offers a surface syntax-oriented annotation of the AnCora dependencies (Taulé, Martí and Recasens, 2008). The annotation was produced by automatically converting the AnCora constituents into the new dependency format. The annotation process is covered in detail in the article From constituents to syntax-oriented dependencies (Kolz, Badia and Saurí, 2014), which describes the linguistic decisions taken for this annotation and presents the new syntactic function tagset that was applied. That article did not, however, discuss the deconstruction of AnCora multiwords. The main goal of this technical report is therefore to describe the deconstruction process and to present some new data concerning the AnCora Surface Syntax Dependencies in their final version.

II. Multiword Theory and AnCora Treatment

II.I. What is a multiword?

Multiwords can be described as "idiosyncratic interpretations that cross word boundaries (or spaces)" (Sag et al., 2002: 2). AnCora groups them together into one single token (by means of underscore characters), and they occur in almost all parts of speech (table 1).

Universitat_de_Barcelona    proper noun
página_web                  common noun
querer_decir                verb
on_line                     adjective
de_nuevo                    adverb
a_pesar_del                 preposition
cien_mil                    number

Table 1. Multiword part-of-speech examples.

Furthermore, time expressions (e.g. 2_de_marzo_de_1995) fall within the applied multiword concept, and multiwords can even contain complex structures such as coordinations (e.g. Industria_y_Comercio) or nouns modified by adjectives (e.g. el_bucle_melancólico).
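Since AnCora joins the components of a multiword with underscores, recovering the individual tokens is a plain string split. The following is a minimal sketch of that step; the function names `split_multiword` and `is_multiword` are hypothetical and are not taken from the program described in this report.

```python
from typing import List


def split_multiword(token: str) -> List[str]:
    """Split an AnCora multiword token into its component tokens."""
    return token.split("_")


def is_multiword(token: str) -> bool:
    """Treat a token as a multiword if underscores separate two or more non-empty parts."""
    parts = token.split("_")
    return len(parts) >= 2 and all(parts)


if __name__ == "__main__":
    # Examples taken from table 1 and the surrounding text, plus one ordinary token.
    for tok in ["Universitat_de_Barcelona", "2_de_marzo_de_1995", "página_web", "ministro"]:
        print(tok, is_multiword(tok), split_multiword(tok))
```

The split only recovers the word forms. The head and the syntactic function of each component still have to be determined, which is the subject of the following sections.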
II.II. Internal structure of multiwords

A multiword contains at least two tokens, with theoretically no upper limit on the number of tokens. AnCora joins tokens into a multiword by means of the underscore character (e.g. Universitat_de_Barcelona). As a multiword occupies only one token in the AnCora annotation, it also has only one syntactic function assigned (see table 2).

17 presentada presentado 15 S
18 en en 17 cc
19 la el 20 spec
20 Universitat_de_Barcelona Universitat_de_Barcelona n 18 sn

Table 2. AnCora multiword example. 5th column: syntactic function.

The internal head of the multiword is dependent on a node outside of the multiword. In the example of table 2, Universitat would be the internal head of the multiword structure and it would be dependent on en. The head-dependent relations among the tokens within the multiword are not expressed. This information has to be calculated in a deconstruction process, as in principle any token within the multiword is a candidate for being the head of a head-dependent pair. This is the case because multiwords can contain complex structures such as coordinations or specifiers. These structures inside the multiword have to be analyzed according to the criteria set up for the annotation, and each token needs to be attached to a head within the multiword range. The part-of-speech of the tokens can give valuable information for identifying their head and setting the syntactic function.

crearon: upper head
se a_partir_del: internal head a
acuerdo: lower dependent

Table 3. Example for upper head, internal head and lower dependent.

II.III. External relations

On the one hand, a multiword has a head (upper head) of which it is a dependent; on the other hand, it can itself be the head of further nodes (lower dependency relations). Table 3 exemplifies this constellation.
Identifying the head of the internal multiword head in a deconstruction process is straightforward, as it keeps the relation that the multiword had (attachment to the upper head). Setting the lower dependents of the former multiword correctly is more complicated, as there is no simple default solution. Table 4 shows different examples of multiwords and their dependents.
a) usan el 95_por_ciento de los ordenadores
b) el Tribunal_Supremo_de_Justicia (TSJ) decidió suspender
c) creadas a_partir_del acuerdo suscrito por Argentina y la Unión_Europea
d) a_pesar_de_que el precio sigue al alza

Table 4. Examples of multiwords and their variety of dependents.

Again, the part-of-speech gives important information on how to set up the relation to lower dependents. Normally a lower dependent connects to the head of the multiword structure (see a and b in table 4). A preposition or conjunction in the last position of the multiword, however, leads to a different treatment, as it works as the head for lower dependents (see c and d in table 4).

III. Multiword Deconstruction

III.I. Why the deconstruction?

The deconstruction of AnCora multiwords was necessary because the corpus showed certain inconsistencies in their treatment. A few examples can be found in table 5.

Token use in AnCora    Multiword use in AnCora
aún así                aún_así
mayo pasado            mayo_pasado
de hecho               de_hecho
mientras que           mientras_que

Table 5. Example of inconsistencies in AnCora multiword treatment.

A parser would have problems with this data, as it would have to expect the same expression both as a single multiword token and as a sequence of tokens. The same happens with searches over the corpus: someone interested in gathering all temporal expressions, for example, has to consider that they can be found within multiword tokens but also outside of them. Furthermore, writing a group of words together as if it were one word does not occur in natural language and introduces a source of artificiality into the word forms of the corpus. Considering these points, we decided to deconstruct the multiwords into individual tokens.

III.II. Multiword statistics

AnCora contains a total of 9,113 types of multiwords, which make up 18,953 instances in the whole corpus.
The multiwords have a wide range of lengths: the majority are two-token multiwords, but the longest entry is an 18-token multiword. The following table shows the distribution of multiword instances according to their token length. The upper row indicates the length of the multiword in tokens and the lower one shows the number of instances found with the corresponding length.
Table 6. Multiword lengths statistics.

III.III. Algorithm

An overview of the multiword deconstruction algorithm is presented in table 7; the explanations below refer to the indicated line numbers. The program starts the deconstruction process by first reading all AnCora multiwords and storing their types (line 2). All tokens found within multiwords are then labeled with their part-of-speech (line 3), and afterwards a part-of-speech sequence table is created which gathers possible deconstruction settings for multiwords based on their part-of-speech combination (line 4). These possible solutions come from multiwords which were also treated as separate tokens in AnCora, and from further token combinations which were created to cover the needed part-of-speech sequences and whose tokens are connected to each other in head-dependent relations.

1 Function Multiword_Deconstruction(dependencies):
2     multiwords=get_multiwords(dependencies)
3     add_pos_to_multiwords(multiwords)
4     pos_sequence_table=create_pos_sequence_table(dependencies)
5     classifier=classify_multiwords(multiwords,pos_sequence_table)
6     for sentence in dependencies:
7         deconstruct_multiwords(sentence,classifier)

Table 7. Multiword deconstruction algorithm.

Afterwards a classifier takes all multiwords and sets all head-dependent pairs within the multiword and their respective syntactic functions according to the solution proposed by the part-of-speech sequence table.

Boca_Júniors :[0, 1],['dobj', 'appos']
El_Noticiero_Universal :[2, 0, 2],['det', 'coord', 'amod']
17_de_octubre :[0, 1, 2],['dobj', 'prep', 'pobj']

Table 8. Multiword classifier output.

As one can see in table 8, each token gets a head within the multiword (except the one with the value 0, which is in this way identified as the internal head of the multiword structure) and a syntactic function label according to its dependent-head relation.
While the syntactic function labels of the internal multiword tokens generally stay the same in all kinds of contexts, it is worth noting that the label assigned to the relation between the internal head of the multiword and its upper head varies with the context, as it depends on the syntactic configuration of each case. The first entry in the label list of Boca_Júniors could thus just as well be nsubj if the multiword were used as a subject in a given sentence. This label is therefore set in the deconstruction process. In case the classifier could not find a solution for the part-of-speech combination of the multiword, a default rule was set up which connects all tokens in the multiword from right to left according to their position, only taking into account the treatment of determiners and adjectives modifying nouns. This means that in the deconstruction of a multiword like Jurado_Nacional_de_Elecciones the preposition de would still be attached to Jurado and not to Nacional, even if the classifier uses the default rule.
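To make the classifier output of table 8 more concrete, the sketch below applies one entry to a multiword token. It is only an illustration under stated assumptions: the function `deconstruct` and its parameters are hypothetical, the internal head values are read as 1-based positions inside the multiword (with 0 marking the internal head), and the renumbering of the rest of the sentence is left out.

```python
from typing import Dict, List, Tuple

# Entries copied from table 8: internal heads (assumed 1-based, 0 = internal head)
# and one syntactic function label per component token.
CLASSIFIER: Dict[str, Tuple[List[int], List[str]]] = {
    "Boca_Júniors": ([0, 1], ["dobj", "appos"]),
    "El_Noticiero_Universal": ([2, 0, 2], ["det", "coord", "amod"]),
    "17_de_octubre": ([0, 1, 2], ["dobj", "prep", "pobj"]),
}

Row = Tuple[int, str, int, str]  # (token id, form, head id, label)


def deconstruct(token: str, first_id: int, upper_head: int, upper_label: str) -> List[Row]:
    """Expand one multiword token into rows with sentence-level head ids.

    first_id    -- sentence id that the first component token will receive
    upper_head  -- sentence id of the head outside the multiword
    upper_label -- context-dependent label for the relation to the upper head
    """
    heads, labels = CLASSIFIER[token]
    rows: List[Row] = []
    for offset, (form, head, label) in enumerate(zip(token.split("_"), heads, labels)):
        if head == 0:
            # Internal head: keeps the attachment to the upper head; the label
            # stored in the classifier entry is replaced by the contextual one.
            rows.append((first_id + offset, form, upper_head, upper_label))
        else:
            # Other tokens: shift the multiword-internal position to a sentence id.
            rows.append((first_id + offset, form, first_id + head - 1, label))
    return rows


if __name__ == "__main__":
    # Hypothetical context: the multiword starts at position 5 and depends on token 2.
    for row in deconstruct("17_de_octubre", first_id=5, upper_head=2, upper_label="tmod"):
        print(row)
```

Passing `upper_label` as an argument mirrors the point made above: the label of the internal head is not fixed by the multiword type but depends on the sentence in which it occurs.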
Finally, each sentence of the corpus is passed through the program, which uses the classifier to deconstruct all multiwords, setting the head of each multiword token and its respective syntactic function. The deconstruction process handles the treatment of lower dependents of multiwords by rules. The AnCora Surface Syntax Dependencies are then annotated.

1 El        3  det   da0ms0  el
2 ex        3  amod  aq0cn0  ex
3 ministro  12 nsubj ncms000 ministro
4 español   3  amod  aq0ms0  español
5 de        3  prepn sps00   de
6 Industria 7  coord np00000 Industria
7 y         5  pobj  cc      y
8 Energía   7  coord np00000 Energía

Table 9. Example of a deconstructed multiword with a coordination inside.

III.IV. Evaluation of Multiword Deconstruction

AnCora contains 18,953 multiword instances which correspond to 9,113 multiword types. As multiwords were classified by type in this approach, the evaluation was based on 500 multiword types, which make up around 5.5 % of the total. As the classifier of the program does not include solutions for all part-of-speech combinations found in multiwords, the evaluation has to take into account a corresponding share of multiwords which were deconstructed by the default rule. 459 of the 9,113 multiword types were classified in this way, so we decided to include 5 % (25 of 500) of those default solutions in the evaluation in order to get meaningful results. The evaluation then consisted of a manual revision of the 500 selected multiword types, checking all their individual heads and their syntactic function labels. Those 500 multiwords contained a total of 1,374 tokens.
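The three scores used in this evaluation compare, token by token, the head and the label produced by the system against the manually revised ones. The sketch below shows one common way to compute them; the function name `attachment_scores` and the toy data are hypothetical and do not reproduce the evaluation set of this report.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, str]  # (head id, syntactic function label) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Dict[str, float]:
    """Compute label accuracy (LA), unlabeled (UAS) and labeled (LAS) attachment score."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    n = len(gold)
    la = sum(g[1] == p[1] for g, p in zip(gold, predicted)) / n   # label correct
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n  # head correct
    las = sum(g == p for g, p in zip(gold, predicted)) / n        # head and label correct
    return {"LA": la, "UAS": uas, "LAS": las}


if __name__ == "__main__":
    # Toy example with four tokens: one wrong head and one wrong label.
    gold = [(3, "det"), (3, "amod"), (12, "nsubj"), (3, "amod")]
    predicted = [(3, "det"), (4, "amod"), (12, "nsubj"), (3, "nn")]
    print(attachment_scores(gold, predicted))
```

By construction LAS can never exceed LA or UAS, since a token only counts for LAS when both its head and its label are correct.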
The results obtained are highly satisfactory: label accuracy (LA) reached 0.92, the unlabeled attachment score (UAS) 0.96 and the labeled attachment score (LAS) a value of 0.92. LA and LAS show the same result because the setting of the syntactic function label depends heavily on the prior correct identification of its head, and because accuracy is high overall.

            LA      UAS     LAS
Accuracy    0.92    0.96    0.92

Table 10. Multiword deconstruction evaluation.

IV. Final release version of AnCora Surface-Syntax Dependencies

The final version of AnCora Surface-Syntax Dependencies contains 547,724 tokens. The change in token count compared to the former AnCora dependency annotation (517,269 tokens) results from the deletion of elliptic subjects and the deconstruction of multiwords.
Our tagset is presented in Table 11. It contains 43 function tags (including underspecified ones), which makes it fully adequate for automatic annotation. In the table, indentation shows the tagset hierarchical structure, conveying that general tags like obj or mod include more specific subclasses. In the annotation, the goal is obviously to be as specific as possible, as this leads to more informative data. Therefore the generic tags like dep, comp, obj, mod and prep are not expected to be in common use, but only to be applied where a more specific tag cannot be used.

Tag       Full name
root      root
dep       dependent
arg       argument
comp      complement
attr      attributive
cpred     predicative complement
obj       object
cobj      complementizer object
dobj      direct object
iobj      indirect object
oobj      oblique object
pobj      object of a preposition
vobj      object of verb
crobj     object of comparative
subj      subject
nsubj     nominal subject
csubj     clausal subject
coord     coordination
conj      conjunct
agent     agent
reflec    reflexive (se)
te        textual element
mod       modifier
abbrev    abbreviation modifier
amod      adjectival modifier
appos     appositional modifier
advcl     adverbial clause modifier
det       determiner
infmod    infinitival modifier
partmod   participial modifier
advmod    adverbial modifier
neg       negation modifier
rcmod     relative clause modifier
nn        noun compound modifier
tmod      temporal modifier
num       numeric modifier
prep      prepositional modifier
prepv     prep. mod. of a verb
prepn     prep. mod. of a noun
prepa     prep. mod. of adjective
poss      possession modifier
punct     punctuation
voc       vocative

Table 11. Tagset in hierarchical view.

The following table shows the distribution of the tagset in AnCora Surface-Syntax Dependencies. The label dep was set if the system could not identify a more detailed label, and this was the case in only around 1.5 % of the corpus. It was also checked that each sentence had a root node and not more than one root, as this is a requirement for a correctly parsed sentence.

Tag Name   Frequency
pobj       87,409
det        71,332
punct      65,280
prepn      42,021
dobj       32,533
coord      30,305
amod       29,713
nsubj      29,316
prepv      22,405
root       17,364
cobj       16,848
advmod     16,049
appos      15,631
rcmod      7,820
dep        7,421
attr       6,814
poss       5,501
reflec     5,313
oobj       5,133
vobj       4,363
advcl      3,944
neg        3,923
iobj       3,535
prep       3,463
prepa      3,453
num        3,352
cpred      2,236
conj       1,497
agent      1,411
tmod       814
te         783
csubj      395
crobj      329
voc        13
mod        3
partmod    2

Table 12. Tagset sorted by frequency.
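The single-root check and the label counts behind table 12 can be reproduced with very little code. The sketch below is an illustration only: the row layout (id, form, head, label), the use of head 0 for the root and the function names are assumptions, not the format or code of the released corpus.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[int, str, int, str]  # (token id, form, head id, label); layout is assumed


def has_single_root(sentence: List[Row]) -> bool:
    """Return True if exactly one token of the sentence carries the label 'root'."""
    return sum(1 for _, _, _, label in sentence if label == "root") == 1


def label_frequencies(sentences: Iterable[List[Row]]) -> Counter:
    """Count how often each syntactic function label occurs across all sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(label for _, _, _, label in sentence)
    return counts


if __name__ == "__main__":
    # Toy sentence; head 0 for the root is an assumption of this sketch.
    toy = [(1, "El", 2, "det"), (2, "ministro", 3, "nsubj"), (3, "habló", 0, "root")]
    print(has_single_root(toy))
    for label, freq in label_frequencies([toy]).most_common():
        print(label, freq)
```

Sorting the counter with `most_common()` yields the labels in descending frequency, which is the ordering used in table 12.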
The AnCora Surface-Syntax Dependencies are now available at

Bibliography:

Kolz, B., Badia, T. and Saurí, R. (2014). From constituents to syntax-oriented dependencies. Procesamiento del Lenguaje Natural, [S.l.], v. 52, p , mar ISSN Available at: < Last access: 8th June

Sag, I., Baldwin, T., Bond, F., Copestake, A. and Flickinger, D. (2002). Multiword Expressions: A Pain in the Neck for NLP, in: Lecture Notes in Computer Science, Vol. 2276, pp

Taulé, M., Martí, M. A., and Recasens, M. (2008). AnCora: Multilevel Annotated Corpora for Catalan and Spanish. In ELRA (Ed.), LREC, Marrakech, Morocco, p
More informationThe Dictionary Parsing Project: Steps Toward a Lexicographer s Workstation
The Dictionary Parsing Project: Steps Toward a Lexicographer s Workstation Ken Litkowski ken@clres.com http://www.clres.com http://www.clres.com/dppdemo/index.html Dictionary Parsing Project Purpose: to
More informationLing/CSE 472: Introduction to Computational Linguistics. 5/9/17 Feature structures and unification
Ling/CSE 472: Introduction to Computational Linguistics 5/9/17 Feature structures and unification Overview Problems with CFG Feature structures Unification Agreement Subcategorization Long-distance Dependencies
More informationAlphabetical Index referenced by section numbers for PUNCTUATION FOR FICTION WRITERS by Rick Taubold, PhD and Scott Gamboe
Alphabetical Index referenced by section numbers for PUNCTUATION FOR FICTION WRITERS by Rick Taubold, PhD and Scott Gamboe?! 4.7 Abbreviations 4.1.2, 4.1.3 Abbreviations, plurals of 7.8.1 Accented letters
More informationMaximum Entropy based Natural Language Interface for Relational Database
International Journal of Engineering Research and Technology. ISSN 0974-3154 Volume 7, Number 1 (2014), pp. 69-77 International Research Publication House http://www.irphouse.com Maximum Entropy based
More informationI Know Your Name: Named Entity Recognition and Structural Parsing
I Know Your Name: Named Entity Recognition and Structural Parsing David Philipson and Nikil Viswanathan {pdavid2, nikil}@stanford.edu CS224N Fall 2011 Introduction In this project, we explore a Maximum
More informationThe CKY algorithm part 1: Recognition
The CKY algorithm part 1: Recognition Syntactic analysis (5LN455) 2014-11-17 Sara Stymne Department of Linguistics and Philology Mostly based on slides from Marco Kuhlmann Recap: Parsing Parsing The automatic
More informationTHE knowledge needed by software developers
SUBMITTED TO IEEE TRANSACTIONS ON SOFTWARE ENGINEERING 1 Extracting Development Tasks to Navigate Software Documentation Christoph Treude, Martin P. Robillard and Barthélémy Dagenais Abstract Knowledge
More informationScalable Semantic Querying of Text
Scalable Semantic Querying of Text Xiaolan Wang Aaron Feng Behzad Golshan Alon Halevy George Mihaila Hidekazu Oiwa Wang-Chiew Tan University of Massachusetts Megagon Labs xlwang@umass.cs.edu {aaron,behzad,alon,george,oiwa,wangchiew}@megagon.ai
More informationA Flexible Distributed Architecture for Natural Language Analyzers
A Flexible Distributed Architecture for Natural Language Analyzers Xavier Carreras & Lluís Padró TALP Research Center Departament de Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya
More informationVision Plan. For KDD- Service based Numerical Entity Searcher (KSNES) Version 2.0
Vision Plan For KDD- Service based Numerical Entity Searcher (KSNES) Version 2.0 Submitted in partial fulfillment of the Masters of Software Engineering Degree. Naga Sowjanya Karumuri CIS 895 MSE Project
More informationWBJS Grammar Glossary Punctuation Section
WBJS Grammar Glossary Punctuation Section Punctuation refers to the marks used in writing that help readers understand what they are reading. Sometimes words alone are not enough to convey a writer s message
More informationContent Based Key-Word Recommender
Content Based Key-Word Recommender Mona Amarnani Student, Computer Science and Engg. Shri Ramdeobaba College of Engineering and Management (SRCOEM), Nagpur, India Dr. C. S. Warnekar Former Principal,Cummins
More informationMaca a configurable tool to integrate Polish morphological data. Adam Radziszewski Tomasz Śniatowski Wrocław University of Technology
Maca a configurable tool to integrate Polish morphological data Adam Radziszewski Tomasz Śniatowski Wrocław University of Technology Outline Morphological resources for Polish Tagset and segmentation differences
More informationAdvanced Topics in Information Retrieval Natural Language Processing for IR & IR Evaluation. ATIR April 28, 2016
Advanced Topics in Information Retrieval Natural Language Processing for IR & IR Evaluation Vinay Setty vsetty@mpi-inf.mpg.de Jannik Strötgen jannik.stroetgen@mpi-inf.mpg.de ATIR April 28, 2016 Organizational
More informationSemantic and Multimodal Annotation. CLARA University of Copenhagen August 2011 Susan Windisch Brown
Semantic and Multimodal Annotation CLARA University of Copenhagen 15-26 August 2011 Susan Windisch Brown 2 Program: Monday Big picture Coffee break Lexical ambiguity and word sense annotation Lunch break
More informationDesign First ITS Instructor Tool
Design First ITS Instructor Tool The Instructor Tool allows instructors to enter problems into Design First ITS through a process that creates a solution for a textual problem description and allows for
More informationDocumentation and analysis of an. endangered language: aspects of. the grammar of Griko
Documentation and analysis of an endangered language: aspects of the grammar of Griko Database and Website manual Antonis Anastasopoulos Marika Lekakou NTUA UOI December 12, 2013 Contents Introduction...............................
More informationTransforming Requirements into MDA from User Stories to CIM
, pp.15-22 http://dx.doi.org/10.14257/ijseia.2017.11.8.03 Transing Requirements into MDA from User Stories to CIM Meryem Elallaoui 1, Khalid Nafil 2 and Raja Touahni 1 1 Faculty of Sciences, Ibn Tofail
More informationA NOVAL HINDI LANGUAGE INTERFACE FOR DATABASES
Mahesh Singh et al, International Journal of Computer Science and Mobile Computing, Vol.3 Issue.4, April- 2014, pg. 1179-1189 Available Online at www.ijcsmc.com International Journal of Computer Science
More informationA Natural Language Interface for Querying General and Individual Knowledge (Full Version) Yael Amsterdamer, Anna Kukliansky, and Tova Milo
A Natural Language Interface for Querying General and Individual Knowledge (Full Version) Yael Amsterdamer, Anna Kukliansky, and Tova Milo Tel Aviv University {yaelamst,annaitin,milo}@post.tau.ac.il Abstract
More informationTransition-based dependency parsing
Transition-based dependency parsing Syntactic analysis (5LN455) 2014-12-18 Sara Stymne Department of Linguistics and Philology Based on slides from Marco Kuhlmann Overview Arc-factored dependency parsing
More informationSyntactic N-grams as Machine Learning. Features for Natural Language Processing. Marvin Gülzow. Basics. Approach. Results.
s Table of Contents s 1 s 2 3 4 5 6 TL;DR s Introduce n-grams Use them for authorship attribution Compare machine learning approaches s J48 (decision tree) SVM + S work well Section 1 s s Definition n-gram:
More information15 212: Principles of Programming. Some Notes on Grammars and Parsing
15 212: Principles of Programming Some Notes on Grammars and Parsing Michael Erdmann Spring 2011 1 Introduction These notes are intended as a rough and ready guide to grammars and parsing. The theoretical
More informationMIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion
MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion Sara Lana-Serrano 1,3, Julio Villena-Román 2,3, José C. González-Cristóbal 1,3 1 Universidad Politécnica de Madrid 2 Universidad
More informationSpecifying Syntax. An English Grammar. Components of a Grammar. Language Specification. Types of Grammars. 1. Terminal symbols or terminals, Σ
Specifying Syntax Language Specification Components of a Grammar 1. Terminal symbols or terminals, Σ Syntax Form of phrases Physical arrangement of symbols 2. Nonterminal symbols or syntactic categories,
More informationBy Marco V. Fabbri. Some searches in the GNT-T.Syntax require the use of the Greek Construct Window.
Greek Syntax Search Examples from the Accordance Forums By Marco V. Fabbri The following material was compiled from the Accordance Forums. Formatting has been tweaked slightly and a few typos corrected.
More informationA Collaborative Annotation between Human Annotators and a Statistical Parser
A Collaborative Annotation between Human Annotators and a Statistical Parser Shun ya Iwasawa Hiroki Hanaoka Takuya Matsuzaki University of Tokyo Tokyo, Japan {iwasawa,hkhana,matuzaki}@is.s.u-tokyo.ac.jp
More informationEvaluation of Text Analysis Core Technologies. Two successfull examples : Evaluating POS Taggers and Parsers for French
Evaluation of Text Analysis Core Technologies Two successfull examples : Evaluating POS Taggers and Parsers for French Patrick Paroubek Laboratoire pour la Mécanique et les Sciences de l Ingénieur Centre
More informationSETS: Scalable and Efficient Tree Search in Dependency Graphs
SETS: Scalable and Efficient Tree Search in Dependency Graphs Juhani Luotolahti 1, Jenna Kanerva 1,2, Sampo Pyysalo 1 and Filip Ginter 1 1 Department of Information Technology 2 University of Turku Graduate
More informationThe CKY algorithm part 1: Recognition
The CKY algorithm part 1: Recognition Syntactic analysis (5LN455) 2016-11-10 Sara Stymne Department of Linguistics and Philology Mostly based on slides from Marco Kuhlmann Phrase structure trees S root
More informationIntroduction to Scheme
How do you describe them Introduction to Scheme Gul Agha CS 421 Fall 2006 A language is described by specifying its syntax and semantics Syntax: The rules for writing programs. We will use Context Free
More informationA Framework for the Automatic Extraction of Rules from Online Text
A Framework for the Automatic Extraction of Rules from Online Text Saeed Hassanpour, Martin J. O Connor, Amar Das Stanford Center for Biomedical Informatics Research Stanford, CA, U.S.A. RuleML, Barcelona,
More information