Tools for Annotating and Searching Corpora Practical Session 1: Annotating
1 Tools for Annotating and Searching Corpora. Practical Session 1: Annotating. Stefanie Dipper, Institute of Linguistics, Ruhr-University Bochum. Corpus Linguistics Fest (CLiF), June 6-10, 2016, Indiana University, Bloomington. Stefanie Dipper Tools for annotating and searching 1 / 28
2 Today's session: (1) We take a closer look at a particular POS tagset, the Penn Treebank tagset. (2) We annotate some text (manually), using the tool WebAnno. (3) We evaluate our agreement, using different measures.
3 Outline
4 Penn Treebank. The Penn Treebank Project is one of the earliest and most influential annotation projects: 4.5 million words of American English, including the Brown Corpus and the Wall Street Journal Corpus, manually annotated with POS tags and syntactic structures. We only look at the POS tags today.
5 Tag labels. Labels that mark related parts of speech start with the same letter: NN, NNS, NNP, NNPS are subtypes of nouns. The final parts of some labels mirror inflectional endings: JJ, JJR, JJS are positive, comparative, and superlative adjectives; VB, VBD, VBG, VBN, VBP, VBZ are base form, past tense, gerund, past participle, non-3rd person singular present tense, and 3rd person singular present tense verbs.
6 Documentation. Alphabetical list of labels with examples: comp.leeds.ac.uk/ccalas/tagsets/upenn.html (the punctuation labels have been added later; these are not available in the full guidelines). Full guidelines: cgi/viewcontent.cgi?article=1603&context=cis_reports
7 Outline
8 WebAnno: a web-based annotation tool. Supports different kinds of annotations: spans and links/pointers; we only use simple spans today (spans = tokens). Supports crowd annotations: annotations by multiple users, followed by curation. Provides agreement measures. (Or else: tu-darmstadt.de/webanno-testing)
9 WebAnno: Exercise. Go to this page: uni-tuebingen.de/ Annotate the text in the project Penn_Anno according to the Penn Tagset.
10 Outline
11 Manual annotations (for an overview, see Artstein and Poesio 2008). Manual annotations are provided either by experts (e.g. linguists) or by crowdsourcing. If we want to use the annotations, they should be correct. Question: how do we know how good our annotations are? Take several annotators, let them independently annotate the same text, and compare their annotations. If they assign the same tags most of the time, the tag meanings and guidelines are well defined, the annotators are well trained, and the resulting annotations are of good quality. If there is rather low agreement, this could also mean that the annotation task is very difficult. How do we rate agreement? With measures for inter-annotator agreement (IAA), also known as inter-rater or inter-coder agreement.
12 Comparing two annotators (slides adapted from Poesio and Carpenter 2010; Carpenter and Poesio 2010). Compare the annotation results of two annotators (here: annotated tags A and B):

Item  Annotator 1  Annotator 2
1     A            B
2     B            A
3     A            A
4     A            B
5     B            B
6     B            B
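The agreement on a table like this can be computed directly. A minimal Python sketch (the function name is ours, not from the slides), using the six items above:

```python
# The six items from the slide, one label sequence per annotator.
ann1 = ["A", "B", "A", "A", "B", "B"]
ann2 = ["B", "A", "A", "B", "B", "B"]

def observed_agreement(a, b):
    """Fraction of items to which both annotators assign the same tag."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(observed_agreement(ann1, ann2))  # agreement on items 3, 5, 6 -> 0.5
```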
13 Observed agreement. Assumption: 100 annotations in total. The results are displayed in a 2x2 contingency table, cross-tabulating the two annotators' labels (A vs. B) with row and column totals. Agreement? .88 = observed agreement (percent agreement): the proportion of items on which both annotators choose the same label.
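Observed agreement can also be read off a contingency table as the diagonal mass divided by the total. The cell counts below are illustrative only, chosen so that A_o = .88 (the slide's actual counts did not survive transcription):

```python
from collections import Counter

# Illustrative 100-item data set: 44 A-A pairs, 6 A-B, 6 B-A, 44 B-B.
ann1 = ["A"] * 50 + ["B"] * 50
ann2 = ["A"] * 44 + ["B"] * 6 + ["A"] * 6 + ["B"] * 44

table = Counter(zip(ann1, ann2))  # maps (label1, label2) -> cell count
n = sum(table.values())
a_o = sum(c for (x, y), c in table.items() if x == y) / n
print(a_o)  # diagonal mass 88 of 100 -> 0.88
```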
14 Chance agreement. Some agreement is to be expected simply by chance: e.g. two annotators who annotate A or B at random agree approximately half of the time. The amount of chance agreement depends on the annotation scheme and the annotated data. Sensible agreement is the amount above chance.
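The "half of the time" claim is easy to check by simulation, under the assumption that both annotators label uniformly at random over two categories (a sketch, not from the slides):

```python
import random

random.seed(0)  # make the simulation reproducible
n = 100_000
ann1 = [random.choice("AB") for _ in range(n)]
ann2 = [random.choice("AB") for _ in range(n)]
agree = sum(x == y for x, y in zip(ann1, ann2)) / n
print(agree)  # close to 0.5
```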
15 Expected agreement. Observed agreement (A_o): the amount of actual agreement. Expected agreement (A_e): the expected value of A_o under chance. Agreement above chance: A_o - A_e. Maximally possible agreement above chance: 1 - A_e. Proportion of sensible agreement: (A_o - A_e) / (1 - A_e). Question: how do we compute chance agreement A_e? There are different ways...
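All the chance-corrected coefficients that follow (S, π, κ) share this form and differ only in how A_e is estimated. As a sketch (the function name is ours):

```python
def chance_corrected(a_o, a_e):
    """Proportion of sensible agreement: (A_o - A_e) / (1 - A_e)."""
    return (a_o - a_e) / (1 - a_e)

print(chance_corrected(0.88, 0.5))  # 0.76
print(chance_corrected(0.5, 0.5))   # 0.0: no agreement above chance
```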
16 Measure S: considers the number of categories. S assumes the same chance for all annotators and categories. Number of category labels: q. Probability that an annotator picks a particular category q_a: 1/q. Probability that both annotators pick a particular category q_a: (1/q)^2. Probability that both annotators pick the same category: A_e^S = sum over the q categories of (1/q)^2 = q * (1/q)^2 = 1/q.
17 Are the categories equally likely? With two categories (A, B): A_o = .88, A_e = 1/2 = .5, S = (.88 - .5) / (1 - .5) = .76. With four categories (A, B, C, D) and the same observed agreement: A_o = .88, A_e = 1/4 = .25, S = (.88 - .25) / (1 - .25) = .84.
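Both computations follow directly from A_e^S = 1/q. A sketch reproducing the slide's two values (function name ours):

```python
def s_measure(a_o, num_categories):
    """Measure S: chance agreement is uniform over the label set."""
    a_e = 1 / num_categories
    return (a_o - a_e) / (1 - a_e)

print(round(s_measure(0.88, 2), 2))  # 0.76
print(round(s_measure(0.88, 4), 2))  # 0.84
```

Note that S grows with the number of categories even though the observed agreement is unchanged.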
18 π: different chance for different categories. Scott's π assumes a different chance for each category (Scott 1955). Number of annotations: N. Number of annotations with category q_a: n_{q_a}. Probability that an annotator picks a particular category q_a: n_{q_a}/N. Probability that both annotators pick a particular category q_a: (n_{q_a}/N)^2. Probability that both annotators pick the same category: A_e^π = sum over categories q of (n_q/N)^2 = (1/N^2) * sum over q of n_q^2.
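A sketch of π over two label sequences (function name ours); as in the formula above, A_e pools the marginals of both annotators. On the six-item example from earlier, observed agreement (.5) is just below pooled chance agreement (about .514), so π comes out slightly negative:

```python
from collections import Counter

def scotts_pi(a, b):
    """Scott's pi for two annotators over parallel label sequences."""
    n = len(a)
    a_o = sum(x == y for x, y in zip(a, b)) / n
    pooled = Counter(a) + Counter(b)  # category counts over all 2n annotations
    a_e = sum((c / (2 * n)) ** 2 for c in pooled.values())
    return (a_o - a_e) / (1 - a_e)

ann1 = ["A", "B", "A", "A", "B", "B"]
ann2 = ["B", "A", "A", "B", "B", "B"]
print(round(scotts_pi(ann1, ann2), 3))  # -0.029
```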
19 Comparison of S and π. First 3x3 table (categories A, B, C): A_o = .88, S = (.88 - 1/3) / (1 - 1/3) = .82, π = .76. Second 3x3 table, same A_o but with a more skewed category distribution: A_o = .88, S = (.88 - 1/3) / (1 - 1/3) = .82, π = .647. S is insensitive to the category distribution; π is not.
20 Prevalence. Imagine: two annotators disambiguate 1000 instances of love: emotion vs. zero (as in tennis). Each annotator found 995 instances of emotion and 5 of zero, but in different cases. How useful are these annotations? A_o = .99, S = (.99 - .5) / (1 - .5) = .98, but A_e^π = .995^2 + .005^2 = .99005, so π = (.99 - .99005) / (1 - .99005) ≈ -.005: virtually no agreement above chance.
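The love example can be reproduced end-to-end; the label sequences below realize "each annotator marks 5 instances as zero, but different ones" (the exact positions are our choice and do not affect the result). Despite .99 raw agreement, π is essentially zero:

```python
from collections import Counter

# 1000 instances of "love"; the annotators' 5 "zero" cases do not overlap.
ann1 = ["emotion"] * 995 + ["zero"] * 5
ann2 = ["emotion"] * 990 + ["zero"] * 5 + ["emotion"] * 5

n = len(ann1)
a_o = sum(x == y for x, y in zip(ann1, ann2)) / n          # 0.99
pooled = Counter(ann1) + Counter(ann2)
a_e = sum((c / (2 * n)) ** 2 for c in pooled.values())     # 0.99005
pi = (a_o - a_e) / (1 - a_e)
print(round(a_o, 2), round(pi, 3))  # 0.99 -0.005
```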
21 Kappa: considers individual bias. Cohen's κ assumes that different annotators have different interpretations of the guidelines (bias/prejudice) (Cohen 1960; Carletta 1996). Total number of items/markables: i. Probability that annotator c_x picks a particular category q_a: n_{c_x q_a}/i. Probability that both annotators pick a particular category q_a: (n_{c_1 q_a}/i) * (n_{c_2 q_a}/i). Probability that both annotators pick the same category: A_e^κ = sum over categories q of (n_{c_1 q}/i) * (n_{c_2 q}/i) = (1/i^2) * sum over q of n_{c_1 q} * n_{c_2 q}.
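The only change from π is that A_e uses each annotator's own marginals instead of the pooled ones. A sketch (function name ours), again on the six-item example; here κ happens to be 0, since with these marginals all of the observed agreement is expected by chance:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over parallel label sequences."""
    n = len(a)
    a_o = sum(x == y for x, y in zip(a, b)) / n
    m1, m2 = Counter(a), Counter(b)  # each annotator's own marginals
    a_e = sum((m1[q] / n) * (m2[q] / n) for q in m1.keys() | m2.keys())
    return (a_o - a_e) / (1 - a_e)

ann1 = ["A", "B", "A", "A", "B", "B"]
ann2 = ["B", "A", "A", "B", "B", "B"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.0
```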
22 Comparison of S, π and κ. First table: A_o = .88, S = (.88 - .5) / (1 - .5) = .76, π = .759, κ = .76. Second table: A_o = .3, S = (.3 - .5) / (1 - .5) = -.4, π = -.414, κ = -.129.
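The first comparison can be reconstructed. The cell counts below are hypothetical (the slide's own table was lost in transcription) but chosen so that they reproduce the reported scores A_o = .88, S = .76, π = .759, κ = .76:

```python
from collections import Counter

# Hypothetical contingency table: (A,A)=47, (A,B)=9, (B,A)=3, (B,B)=41.
ann1 = ["A"] * 56 + ["B"] * 44
ann2 = ["A"] * 47 + ["B"] * 9 + ["A"] * 3 + ["B"] * 41

n = len(ann1)
a_o = sum(x == y for x, y in zip(ann1, ann2)) / n
m1, m2 = Counter(ann1), Counter(ann2)
cats = m1.keys() | m2.keys()

def corrected(a_e):
    return (a_o - a_e) / (1 - a_e)

s = corrected(1 / len(cats))
pi = corrected(sum(((m1[q] + m2[q]) / (2 * n)) ** 2 for q in cats))
kappa = corrected(sum((m1[q] / n) * (m2[q] / n) for q in cats))
print(round(s, 3), round(pi, 3), round(kappa, 3))  # 0.76 0.759 0.76
```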
23 Comparison of π and κ. It can be proven that for any sample: π ≤ κ. If annotators interpret the guidelines differently, this is bad; it is reflected by π (but not by κ). With many annotators, the difference between π and κ is small.
24 Interpreting agreement scores. κ = 0: no agreement above chance; κ = 1: perfect agreement. κ < .7 is often considered bad agreement (this threshold is controversial). According to Landis and Koch (1977): κ < 0: poor agreement; 0 to .20: slight; .21 to .40: fair; .41 to .60: moderate; .61 to .80: substantial; .81 to 1: almost perfect.
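The Landis and Koch bands translate directly into a small lookup (a sketch; the band boundaries follow the scale above, with each band closed at its upper end):

```python
def landis_koch(kappa):
    """Verbal interpretation of a kappa score after Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label

print(landis_koch(0.76))  # substantial
print(landis_koch(-0.1))  # poor
```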
25 Multiple annotators (> 2). Either: average the pairwise agreement scores. Or: use dedicated measures, such as Fleiss' κ.
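For more than two annotators, Fleiss' κ works on per-item category counts rather than paired label sequences. A self-contained sketch (not from the slides; input format: one row per item, one column per category, each row summing to the number of raters):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa. ratings[i][j] = number of raters assigning category j
    to item i; every item must receive the same number of ratings."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # overall proportion of assignments per category
    p = [sum(row[j] for row in ratings) / (n_items * n_raters)
         for j in range(n_cats)]
    # mean per-item agreement
    p_bar = sum((sum(c * c for c in row) - n_raters)
                / (n_raters * (n_raters - 1)) for row in ratings) / n_items
    p_e = sum(x * x for x in p)
    return (p_bar - p_e) / (1 - p_e)

# 2 items, 3 raters each, perfect agreement on both items:
print(fleiss_kappa([[3, 0], [0, 3]]))  # 1.0
```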
26 Online tools
27 References I
Artstein, R. and M. Poesio (2008). Inter-coder agreement for computational linguistics (survey article). Computational Linguistics 34(4).
Carletta, J. (1996). Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22(2).
Carpenter, B. and M. Poesio (2010). Models of data annotation. malta-2010-slides.pdf. Slides from the LREC 2010 tutorial (part II).
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1).
28 References II
Landis, J. R. and G. G. Koch (1977). The measurement of observer agreement for categorical data. Biometrics 33(1).
Poesio, M. and B. Carpenter (2010). Statistical models of the annotation process. Part I: lrec-sli.pdf. Slides from the LREC 2010 tutorial (part I).
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly 19(3).
More informationA bit of theory: Algorithms
A bit of theory: Algorithms There are different kinds of algorithms Vector space models. e.g. support vector machines Decision trees, e.g. C45 Probabilistic models, e.g. Naive Bayes Neural networks, e.g.
More informationProjektgruppe. Michael Meier. Named-Entity-Recognition Pipeline
Projektgruppe Michael Meier Named-Entity-Recognition Pipeline What is Named-Entitiy-Recognition? Named-Entity Nameable objects in the world, e.g.: Person: Albert Einstein Organization: Deutsche Bank Location:
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationIntroduction to Text Mining. Aris Xanthos - University of Lausanne
Introduction to Text Mining Aris Xanthos - University of Lausanne Preliminary notes Presentation designed for a novice audience Text mining = text analysis = text analytics: using computational and quantitative
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationIntroducing XAIRA. Lou Burnard Tony Dodd. An XML aware tool for corpus indexing and searching. Research Technology Services, OUCS
Introducing XAIRA An XML aware tool for corpus indexing and searching Lou Burnard Tony Dodd Research Technology Services, OUCS What is XAIRA? XML Aware Indexing and Retrieval Architecture Developed from
More informationSAPIENT Automation project
Dr Maria Liakata Leverhulme Trust Early Career fellow Department of Computer Science, Aberystwyth University Visitor at EBI, Cambridge mal@aber.ac.uk 25 May 2010, London Motivation SAPIENT Automation Project
More informationUnsupervised Keyword Extraction from Single Document. Swagata Duari Aditya Gupta Vasudha Bhatnagar
Unsupervised Keyword Extraction from Single Document Swagata Duari Aditya Gupta Vasudha Bhatnagar Presentation Outline Introduction and Motivation Statistical Methods for Automatic Keyword Extraction Graph-based
More informationBD003: Introduction to NLP Part 2 Information Extraction
BD003: Introduction to NLP Part 2 Information Extraction The University of Sheffield, 1995-2017 This work is licenced under the Creative Commons Attribution-NonCommercial-ShareAlike Licence. Contents This
More informationSemantics Isn t Easy Thoughts on the Way Forward
Semantics Isn t Easy Thoughts on the Way Forward NANCY IDE, VASSAR COLLEGE REBECCA PASSONNEAU, COLUMBIA UNIVERSITY COLLIN BAKER, ICSI/UC BERKELEY CHRISTIANE FELLBAUM, PRINCETON UNIVERSITY New York University
More informationContents. List of Figures. List of Tables. Acknowledgements
Contents List of Figures List of Tables Acknowledgements xiii xv xvii 1 Introduction 1 1.1 Linguistic Data Analysis 3 1.1.1 What's data? 3 1.1.2 Forms of data 3 1.1.3 Collecting and analysing data 7 1.2
More informationJubilee: Propbank Instance Editor Guideline (Version 2.1)
Jubilee: Propbank Instance Editor Guideline (Version 2.1) Jinho D. Choi choijd@colorado.edu Claire Bonial bonial@colorado.edu Martha Palmer mpalmer@colorado.edu Center for Computational Language and EducAtion
More informationService Control EasyApp Measuring Quality of Experience
Service Control EasyApp Measuring Quality of Experience Abstract This Cisco Service Control Engine (SCE) EasyApp memo explains the concept of quality of experience (QoE), an approach to measure network
More informationCorpus Linguistics for NLP APLN550. Adam Meyers Montclair State University 9/22/2014 and 9/29/2014
Corpus Linguistics for NLP APLN550 Adam Meyers Montclair State University 9/22/ and 9/29/ Text Corpora in NLP Corpus Selection Corpus Annotation: Purpose Representation Issues Linguistic Methods Measuring
More informationIBM Watson Application Developer Workshop. Watson Knowledge Studio: Building a Machine-learning Annotator with Watson Knowledge Studio.
IBM Watson Application Developer Workshop Lab02 Watson Knowledge Studio: Building a Machine-learning Annotator with Watson Knowledge Studio January 2017 Duration: 60 minutes Prepared by Víctor L. Fandiño
More informationStandardization in Assessment and Reporting of Intercoder Reliability in Content Analyses
Standardization in Assessment and Reporting of Intercoder Reliability in Content Analyses Matthew Lombard, Temple University December 5, 2008 University of Michigan Overview History of interest in topic
More informationIntroduction to Information Extraction (IE) and ANNIE
Module 1 Session 2 Introduction to Information Extraction (IE) and ANNIE The University of Sheffield, 1995-2015 This work is licenced under the Creative Commons Attribution-NonCommercial-ShareAlike Licence.
More informationThe Muc7 T Corpus. 1 Introduction. 2 Creation of Muc7 T
The Muc7 T Corpus Katrin Tomanek and Udo Hahn Jena University Language & Information Engineering (JULIE) Lab Friedrich-Schiller-Universität Jena, Germany {katrin.tomanek udo.hahn}@uni-jena.de 1 Introduction
More informationWEB HARVESTING AND SENTIMENT ANALYSIS OF CONSUMER FEEDBACK
WEB HARVESTING AND SENTIMENT ANALYSIS OF CONSUMER FEEDBACK Emil Şt. CHIFU, Tiberiu Şt. LEŢIA, Bogdan BUDIŞAN, Viorica R. CHIFU Faculty of Automation and Computer Science, Technical University of Cluj-Napoca
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationConditional Random Fields. Mike Brodie CS 778
Conditional Random Fields Mike Brodie CS 778 Motivation Part-Of-Speech Tagger 2 Motivation object 3 Motivation I object! 4 Motivation object Do you see that object? 5 Motivation Part-Of-Speech Tagger -
More informationNatural Language Processing Tutorial May 26 & 27, 2011
Cognitive Computation Group Natural Language Processing Tutorial May 26 & 27, 2011 http://cogcomp.cs.illinois.edu So why aren t words enough? Depends on the application more advanced task may require more
More informationQuery Difficulty Prediction for Contextual Image Retrieval
Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.
More informationMaca a configurable tool to integrate Polish morphological data. Adam Radziszewski Tomasz Śniatowski Wrocław University of Technology
Maca a configurable tool to integrate Polish morphological data Adam Radziszewski Tomasz Śniatowski Wrocław University of Technology Outline Morphological resources for Polish Tagset and segmentation differences
More informationStructured Prediction Basics
CS11-747 Neural Networks for NLP Structured Prediction Basics Graham Neubig Site https://phontron.com/class/nn4nlp2017/ A Prediction Problem I hate this movie I love this movie very good good neutral bad
More informationTectoMT: Modular NLP Framework
: Modular NLP Framework Martin Popel, Zdeněk Žabokrtský ÚFAL, Charles University in Prague IceTAL, 7th International Conference on Natural Language Processing August 17, 2010, Reykjavik Outline Motivation
More informationANC2Go: A Web Application for Customized Corpus Creation
ANC2Go: A Web Application for Customized Corpus Creation Nancy Ide, Keith Suderman, Brian Simms Department of Computer Science, Vassar College Poughkeepsie, New York 12604 USA {ide, suderman, brsimms}@cs.vassar.edu
More informationSemantic Pattern Classification
PFL054 Term project 2011/2012 Semantic Pattern Classification Ema Krejčová 1 Introduction The aim of the project is to construct classifiers which should assign semantic patterns to six given verbs, as
More informationBuilding Multilingual Resources and Neural Models for Word Sense Disambiguation. Alessandro Raganato March 15th, 2018
Building Multilingual Resources and Neural Models for Word Sense Disambiguation Alessandro Raganato March 15th, 2018 About me alessandro.raganato@helsinki.fi http://wwwusers.di.uniroma1.it/~raganato ERC
More informationFunctional Semantic Categories for Art History Text: Human Labeling and Preliminary Machine Learning
Functional Semantic Categories for Art History Text: Human Labeling and Preliminary Machine Learning Rebecca J. Passonneau 1, Tae Yano 2, Tom Lippincott 3, and Judith Klavans 4 1 Center for Computational
More informationInformation Retrieval. Lecture 7
Information Retrieval Lecture 7 Recap of the last lecture Vector space scoring Efficiency considerations Nearest neighbors and approximations This lecture Evaluating a search engine Benchmarks Precision
More informationA Methodology for Evaluating Aggregated Search Results
A Methodology for Evaluating Aggregated Search Results Jaime Arguello 1, Fernando Diaz 2, Jamie Callan 1, and Ben Carterette 3 1 Carnegie Mellon University 2 Yahoo! Research 3 University of Delaware Abstract.
More informationQuestion Answering Systems
Question Answering Systems An Introduction Potsdam, Germany, 14 July 2011 Saeedeh Momtazi Information Systems Group Outline 2 1 Introduction Outline 2 1 Introduction 2 History Outline 2 1 Introduction
More informationKH Coder 3 Reference Manual
KH Coder 3 Reference Manual Koichi HIGUCHI * 1 March 16, 2016 *1 Ritsumeikan University i Contents A KH Coder Reference Manual 1 A.1 Setup...............................................
More informationPrivacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras
Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 25 Tutorial 5: Analyzing text using Python NLTK Hi everyone,
More informationLING/C SC 581: Advanced Computational Linguistics. Lecture Notes Jan 23 rd
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Jan 23 rd Today's Topics Homework 2 review Homework 2 review Write a Python program to print out the number of syllables in a word (in CMUdict).
More informationPackage corenlp. June 3, 2015
Type Package Title Wrappers Around Stanford CoreNLP Tools Version 0.4-1 Author Taylor Arnold, Lauren Tilton Package corenlp June 3, 2015 Maintainer Taylor Arnold Provides a minimal
More informationEvaluation. David Kauchak cs160 Fall 2009 adapted from:
Evaluation David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture8-evaluation.ppt Administrative How are things going? Slides Points Zipf s law IR Evaluation For
More informationLexical Semantics. Regina Barzilay MIT. October, 5766
Lexical Semantics Regina Barzilay MIT October, 5766 Last Time: Vector-Based Similarity Measures man woman grape orange apple n Euclidian: x, y = x y = i=1 ( x i y i ) 2 n x y x i y i i=1 Cosine: cos( x,
More informationMaking Sense Out of the Web
Making Sense Out of the Web Rada Mihalcea University of North Texas Department of Computer Science rada@cs.unt.edu Abstract. In the past few years, we have witnessed a tremendous growth of the World Wide
More informationIntroduction to IE and ANNIE
Introduction to IE and ANNIE The University of Sheffield, 1995-2013 This work is licenced under the Creative Commons Attribution-NonCommercial-ShareAlike Licence. About this tutorial This tutorial comprises
More informationXML Support for Annotated Language Resources
XML Support for Annotated Language Resources Nancy Ide Department of Computer Science Vassar College Poughkeepsie, New York USA ide@cs.vassar.edu Laurent Romary Equipe Langue et Dialogue LORIA/CNRS Vandoeuvre-lès-Nancy,
More informationFeature Based Sentimental Analysis on Mobile Web Domain
IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 08, Issue 6 (June. 2018), V (VII) PP 01-07 www.iosrjen.org Shoieb Ahamed 1 Alit Danti 2 1. Government First Grade College,
More informationHomework 2: Parsing and Machine Learning
Homework 2: Parsing and Machine Learning COMS W4705_001: Natural Language Processing Prof. Kathleen McKeown, Fall 2017 Due: Saturday, October 14th, 2017, 2:00 PM This assignment will consist of tasks in
More information