Multiword deconstruction in AnCora dependencies and final release data


TECHNICAL REPORT GLICOM 2014-1
Benjamin Kolz, Toni Badia, Roser Saurí
Universitat Pompeu Fabra
{benjamin.kolz, toni.badia, roser.

June 2014

I. Introduction

The AnCora Surface Syntax Dependency Corpus (AnCora SSD) is a new resource available to the research community. It offers a surface-syntax-oriented annotation of the AnCora dependencies (Taulé, Martí and Recasens, 2008). The annotation was produced by automatically converting the AnCora constituents into the new dependency format. This process is covered in detail in the article From constituents to syntax-oriented dependencies (Kolz, Badia and Saurí, 2014), which describes the linguistic decisions taken for the annotation and presents the new syntactic function tagset. That article did not, however, discuss the deconstruction of AnCora multiwords. The main goal of this technical report is therefore to describe the deconstruction process and to present new data on the final version of the AnCora Surface Syntax Dependencies.

II. Multiword Theory and AnCora Treatment

II.I. What is a multiword?

Multiwords can be described as "idiosyncratic interpretations that cross word boundaries (or spaces)" (Sag et al., 2002: 2). AnCora groups each of them into one single token (by means of underscore characters), and they occur with almost all parts of speech (table 1).

    Universitat_de_Barcelona    (proper noun)
    página_web                  (common noun)
    querer_decir                (verb)
    on_line                     (adjective)
    de_nuevo                    (adverb)
    a_pesar_del                 (preposition)
    cien_mil                    (number)

Table 1. Multiword part-of-speech examples.

Furthermore, time expressions (e.g. 2_de_marzo_de_1995) fall within the applied multiword concept, and multiwords can even contain complex structures such as coordinations (e.g. Industria_y_Comercio) or nouns modified by adjectives (e.g. el_bucle_melancólico).
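Because AnCora marks a multiword purely by joining its parts with underscores, the individual tokens can be recovered by splitting on that character. The following is a minimal Python sketch of this idea; the function names `is_multiword` and `split_multiword` are illustrative assumptions and do not belong to any AnCora tooling.

```python
from __future__ import annotations


def is_multiword(form: str) -> bool:
    """Return True if a token form joins several tokens with underscores."""
    parts = form.split("_")
    # Require at least two non-empty parts, so a bare "_" does not count.
    return len(parts) >= 2 and all(parts)


def split_multiword(form: str) -> list[str]:
    """Split an underscore-joined multiword into its individual tokens."""
    return form.split("_") if is_multiword(form) else [form]


# Examples taken from the text above, plus one ordinary token.
for form in ["Universitat_de_Barcelona", "2_de_marzo_de_1995", "ministro"]:
    print(form, "->", split_multiword(form))
```

Splitting only recovers the word forms; the heads and syntactic functions of the resulting tokens still have to be determined separately.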

II.II. Internal structure of multiwords

A multiword contains at least two tokens, with in principle no upper limit on their number. AnCora joins tokens into a multiword by means of the underscore character (e.g. Universitat_de_Barcelona). As a multiword occupies only one token in the AnCora annotation, it is also assigned only one syntactic function (see table 2).

    17  presentada                presentado                15  S
    18  en                        en                        17  cc
    19  la                        el                        20  spec
    20  Universitat_de_Barcelona  Universitat_de_Barcelona  n  18  sn

Table 2. AnCora multiword example. 5th column: syntactic function.

The internal head of the multiword is dependent on a node outside of the multiword. In the example of table 2, Universitat would be the internal head of the multiword structure and it would be dependent on en. The head-dependent relations among the tokens within the multiword are not expressed. This information has to be calculated in a deconstruction process, as in principle any token within the multiword is a candidate for being the head of a head-dependent pair. This is because multiwords can contain complex structures such as coordinations or specifiers. These structures inside the multiword have to be analyzed according to the criteria set up for the annotation, and each token needs to be attached to a head within the multiword range. The part-of-speech of the tokens gives valuable information for identifying their head and setting the syntactic function.

    crearon         upper head
    se
    a_partir_del    internal head a
    acuerdo         lower dependent

Table 3. Example for upper head, internal head and lower dependent.

II.III. External relations

On the one hand, a multiword has a head (upper head) of which it is a dependent; on the other hand, it can itself be the head of further nodes (lower dependency relations). Table 3 exemplifies this constellation.
Identifying the head of the internal multiword head in a deconstruction process is straightforward: the internal head keeps the relation that the multiword had, that is, it is attached to the upper head. Setting the lower dependents of the former multiword correctly is more complicated, as there is no simple default solution. Table 4 shows different examples of multiwords and their dependents.

    a) usan el 95_por_ciento de los ordenadores
    b) el Tribunal_Supremo_de_Justicia (TSJ) decidió suspender
    c) creadas a_partir_del acuerdo suscrito por Argentina y la Unión_Europea
    d) a_pesar_de_que el precio sigue al alza

Table 4. Examples of multiwords and their variety of dependents.

Again, the part-of-speech gives important information on how to set up the relation to lower dependents. Normally a lower dependent connects to the head of the multiword structure (see a and b in table 4). A preposition or conjunction in the last position of the multiword, however, is treated differently, as it acts as the head of the lower dependents (see c and d in table 4).

III. Multiword Deconstruction

III.I. Why the deconstruction?

The deconstruction of AnCora multiwords was necessary because the corpus showed certain inconsistencies in their treatment. A few examples can be found in table 5.

    Token use in AnCora    Multiword use in AnCora
    aún así                aún_así
    mayo pasado            mayo_pasado
    de hecho               de_hecho
    mientras que           mientras_que

Table 5. Example of inconsistencies in AnCora multiword treatment.

A parser would have problems with this data, as it would have to expect the same expression both written together as one multiword token and as a sequence of separate tokens. The same holds for searches over the corpus: anyone interested in gathering all temporal expressions, for example, has to consider that they can be found within multiword tokens but also outside of them. Furthermore, writing a group of words together as if it were one word has no counterpart in natural language and introduces a source of artificiality into the word forms of the corpus. Considering these points, we decided to deconstruct the multiwords into individual tokens.

III.II. Multiword statistics

AnCora contains a total of 9,113 types of multiwords, which make up 18,953 instances in the whole corpus.
The multiwords have a wide range of lengths: the majority are two-token multiwords, but the longest entry is an 18-token multiword. The following table shows the distribution of multiword instances according to their token length. The upper row indicates the length of the multiword in tokens and the lower one shows the number of instances found with the corresponding length.

Table 6. Multiword lengths statistics.

III.III. Algorithm

An overview of the multiword deconstruction algorithm is presented in table 7; the explanations below refer to its line numbers. The program starts the deconstruction process by reading all AnCora multiwords and storing their types (line 2). All tokens found within multiwords are then labeled with their part-of-speech (line 3). Afterwards a part-of-speech sequence table is created, which gathers possible deconstruction settings for multiwords based on their part-of-speech combination (line 4). These possible solutions come from two sources: multiwords which were also treated as separate tokens in AnCora, and further token combinations created to cover the part-of-speech sequences still needed, whose tokens are connected to each other in head-dependent relations.

    1 Function Multiword_Deconstruction(dependencies):
    2     multiwords=get_multiwords(dependencies)
    3     add_pos_to_multiwords(multiwords)
    4     pos_sequence_table=create_pos_sequence_table(dependencies)
    5     classifier=classify_multiwords(multiwords,pos_sequence_table)
    6     for sentence in dependencies:
    7         deconstruct_multiwords(sentence,classifier)

Table 7. Multiword deconstruction algorithm.

Afterwards a classifier takes all multiwords and sets all head-dependent pairs within each multiword, together with their respective syntactic functions, according to the solution proposed by the part-of-speech sequence table (line 5).

    Boca_Júniors :[0, 1],['dobj', 'appos']
    El_Noticiero_Universal :[2, 0, 2],['det', 'coord', 'amod']
    17_de_octubre :[0, 1, 2],['dobj', 'prep', 'pobj']

Table 8. Multiword classifier output.

As can be seen in table 8, each token gets a head within the multiword (except for the one with the value 0, which is thereby identified as the internal head of the multiword structure) and a syntactic function label according to its dependent-head relation.
While the syntactic function labels of the internal multiword tokens generally stay the same in all kinds of contexts, it is worth noting that the label assigned to the relation between the internal head of the multiword and its upper head varies, as it depends on the syntactic configuration of each case. The first entry in the label list of Boca_Júniors could thus just as well be nsubj if the multiword were used as a subject in a given sentence. This label is therefore set in the deconstruction process. In case the classifier could not find a solution for the part-of-speech combination of a multiword, a default rule was set up which connects all tokens in the multiword from right to left according to their position, only taking into account the treatment of determiners and adjectives modifying nouns. This means that in the deconstruction of a multiword like Jurado_Nacional_de_Elecciones the preposition de would still be attached to Jurado and not to Nacional, even if the classifier uses the default rule.
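How a classifier entry such as those in table 8 could be applied to a single multiword is sketched below in Python. The function name `expand_multiword`, the row layout and the `first_id` parameter are assumptions made for illustration; the report does not describe the actual implementation. Internal heads are read as 1-based positions inside the multiword, with 0 marking the internal head, whose stored label is replaced by the label the multiword carries in its context.

```python
def expand_multiword(form, heads, labels, first_id, upper_head, upper_label):
    """Expand one multiword into (id, token, head, label) rows.

    heads: 1-based positions inside the multiword; 0 marks the internal head.
    first_id: sentence-level id given to the first token (hypothetical).
    upper_head / upper_label: external head and context-dependent label.
    """
    tokens = form.split("_")
    if not (len(tokens) == len(heads) == len(labels)):
        raise ValueError("classifier entry does not match the token count")
    rows = []
    for pos, (token, head, label) in enumerate(zip(tokens, heads, labels)):
        if head == 0:
            # The internal head inherits the multiword's external relation.
            rows.append((first_id + pos, token, upper_head, upper_label))
        else:
            # Map the multiword-internal position to a sentence-level id.
            rows.append((first_id + pos, token, first_id + head - 1, label))
    return rows


# Boca_Júniors used as a subject: the stored 'dobj' is overridden by 'nsubj'.
# The ids 1 and 3 are invented for this example.
for row in expand_multiword("Boca_Júniors", [0, 1], ["dobj", "appos"],
                            first_id=1, upper_head=3, upper_label="nsubj"):
    print(row)
```

The sketch deliberately leaves out the renumbering of the remaining tokens in the sentence and the reattachment of lower dependents, both of which a full deconstruction would also need.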

Finally, each sentence of the corpus is passed through the program, which with the help of the classifier deconstructs all multiwords, setting the head of each multiword token and its respective syntactic function. The treatment of lower dependents of multiwords is handled by rules in the deconstruction process. The AnCora Surface Syntax Dependencies are then annotated.

    1  El         3   det    da0ms0   el
    2  ex         3   amod   aq0cn0   ex
    3  ministro   12  nsubj  ncms000  ministro
    4  español    3   amod   aq0ms0   español
    5  de         3   prepn  sps00    de
    6  Industria  7   coord  np00000  Industria
    7  y          5   pobj   cc       y
    8  Energía    7   coord  np00000  Energía

Table 9. Example of a deconstructed multiword with a coordination inside.

III.IV. Evaluation of Multiword Deconstruction

AnCora contains 18,953 multiword instances which correspond to 9,113 multiword types. Since multiwords were classified by type in this approach, the evaluation was based on 500 multiword types, which make up around 5.5 % of the total. As the classifier does not include solutions for all part-of-speech combinations found in multiwords, the evaluation has to take into account a corresponding share of multiwords that were deconstructed by the default rule. 459 of the 9,113 multiword types were classified in this way, so we decided to include 5 % (25 of 500) of those default solutions in the evaluation as well in order to get meaningful results. The evaluation then consisted of a manual revision of the 500 selected multiword types, checking the head and the syntactic function label of each of their tokens. Those 500 multiwords contained a total of 1,374 tokens.
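A manual revision of this kind yields, for every token, a judgement on its head and on its label. The standard per-token scores derived from such judgements are label accuracy (LA), unlabeled attachment score (UAS) and labeled attachment score (LAS). A minimal Python sketch of their computation follows; the function name `attachment_scores` and the toy data are invented for illustration and are not the evaluation data of this report.

```python
def attachment_scores(gold, predicted):
    """Compute LA, UAS and LAS over parallel lists of (head, label) pairs."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    n = len(gold)
    pairs = list(zip(gold, predicted))
    la = sum(g[1] == p[1] for g, p in pairs) / n   # label correct
    uas = sum(g[0] == p[0] for g, p in pairs) / n  # head correct
    las = sum(g == p for g, p in pairs) / n        # head and label correct
    return {"LA": la, "UAS": uas, "LAS": las}


# Toy example with four tokens (invented data).
gold = [(0, "dobj"), (1, "prep"), (2, "pobj"), (1, "amod")]
pred = [(0, "dobj"), (1, "prep"), (1, "pobj"), (1, "appos")]
print(attachment_scores(gold, pred))
```

By construction LAS can never exceed either LA or UAS, since a token only counts for LAS when both its head and its label are correct.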
The results obtained are highly satisfactory: label accuracy (LA) reached 0.92, the unlabeled attachment score (UAS) 0.96 and the labeled attachment score (LAS) a value of 0.92. That LA and LAS show the same result can be explained by the fact that setting the syntactic function label depends heavily on a prior correct identification of the head, together with the overall high accuracy.

                LA      UAS     LAS
    Accuracy    0.92    0.96    0.92

Table 10. Multiword deconstruction evaluation.

IV. Final release version of AnCora Surface-Syntax Dependencies

The final version of AnCora Surface-Syntax Dependencies contains 547,724 tokens. The change in token count compared to the former AnCora dependency annotation (517,269 tokens) results from the deletion of elliptic subjects and the deconstruction of multiwords.

Our tagset is presented in Table 11. It contains 43 function tags (including underspecified ones), which makes it fully adequate for automatic annotation. In the table, indentation shows the tagset hierarchical structure, conveying that general tags like obj or mod include more specific subclasses. In the annotation, the goal is obviously to be as specific as possible, as this leads to more informative data. Therefore the generic tags like dep, comp, obj, mod and prep are not expected to be in common use but are reserved for cases where a more specific tag cannot be applied.

    Tag       Full name
    root      root
    dep       dependent
    arg       argument
    comp      complement
    attr      attributive
    cpred     predicative complement
    obj       object
    cobj      complementizer object
    dobj      direct object
    iobj      indirect object
    oobj      oblique object
    pobj      object of a preposition
    vobj      object of verb
    crobj     object of comparative
    subj      subject
    nsubj     nominal subject
    csubj     clausal subject
    coord     coordination
    conj      conjunct
    agent     agent
    reflec    reflexive (se)
    te        textual element
    mod       modifier
    abbrev    abbreviation modifier
    amod      adjectival modifier
    appos     appositional modifier
    advcl     adverbial clause modifier
    det       determiner
    infmod    infinitival modifier
    partmod   participial modifier
    advmod    adverbial modifier
    neg       negation modifier
    rcmod     relative clause modifier
    nn        noun compound modifier
    tmod      temporal modifier
    num       numeric modifier
    prep      prepositional modifier
    prepv     prep. mod. of a verb
    prepn     prep. mod. of a noun
    prepa     prep. mod. of adjective
    poss      possession modifier
    punct     punctuation
    voc       vocative

Table 11. Tagset in hierarchical view.

The following table shows the distribution of the tagset in AnCora Surface-Syntax Dependencies. The label dep was set if the system could not identify a more detailed label, and this was the case in only around 1.5 % of the corpus. It was also checked that each sentence had a root node and not more than one root, as this is a requirement for a correctly parsed sentence.

    Tag Name    Frequency
    pobj        87,409
    det         71,332
    punct       65,280
    prepn       42,021
    dobj        32,533
    coord       30,305
    amod        29,713
    nsubj       29,316
    prepv       22,405
    root        17,364
    cobj        16,848
    advmod      16,049
    appos       15,631
    rcmod       7,820
    dep         7,421
    attr        6,814
    poss        5,501
    reflec      5,313
    oobj        5,133
    vobj        4,363
    advcl       3,944
    neg         3,923
    iobj        3,535
    prep        3,463
    prepa       3,453
    num         3,352
    cpred       2,236
    conj        1,497
    agent       1,411
    tmod        814
    te          783
    csubj       395
    crobj       329
    voc         13
    mod         3
    partmod     2

Table 12. Tagset sorted by frequency.
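The figures in Table 12 can be cross-checked with a few lines of Python. The sketch below copies the frequencies from the table, checks that they add up to the 547,724 tokens of the final release, and computes the relative share of a tag; the names `TAG_FREQUENCIES`, `TOTAL_TOKENS` and `tag_share` are introduced here for illustration only.

```python
# Frequencies copied from Table 12.
TAG_FREQUENCIES = {
    "pobj": 87409, "det": 71332, "punct": 65280, "prepn": 42021,
    "dobj": 32533, "coord": 30305, "amod": 29713, "nsubj": 29316,
    "prepv": 22405, "root": 17364, "cobj": 16848, "advmod": 16049,
    "appos": 15631, "rcmod": 7820, "dep": 7421, "attr": 6814,
    "poss": 5501, "reflec": 5313, "oobj": 5133, "vobj": 4363,
    "advcl": 3944, "neg": 3923, "iobj": 3535, "prep": 3463,
    "prepa": 3453, "num": 3352, "cpred": 2236, "conj": 1497,
    "agent": 1411, "tmod": 814, "te": 783, "csubj": 395,
    "crobj": 329, "voc": 13, "mod": 3, "partmod": 2,
}

# Token count of the final release as stated in section IV.
TOTAL_TOKENS = 547724


def tag_share(tag):
    """Return the fraction of all tokens that carry the given tag."""
    return TAG_FREQUENCIES[tag] / TOTAL_TOKENS


# Every token carries exactly one label, so the counts must sum to the total.
assert sum(TAG_FREQUENCIES.values()) == TOTAL_TOKENS

for tag in ("pobj", "dep", "root"):
    print(f"{tag}: {tag_share(tag):.2%}")
```

Since every sentence has exactly one root, the frequency of root in this table also corresponds to the number of sentences in the corpus.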

The AnCora Surface-Syntax Dependencies are now available at

Bibliography:

Kolz, B., Badia, T. and Saurí, R. (2014). From constituents to syntax-oriented dependencies. Procesamiento del Lenguaje Natural, v. 52, mar.

Sag, I., Baldwin, T., Bond, F., Copestake, A. and Flickinger, D. (2002). Multiword Expressions: A Pain in the Neck for NLP. In: Lecture Notes in Computer Science, Vol. 2276.

Taulé, M., Martí, M. A. and Recasens, M. (2008). AnCora: Multilevel Annotated Corpora for Catalan and Spanish. In ELRA (Ed.), LREC, Marrakech, Morocco.
