Reducing Human Effort in Named Entity Corpus Construction Based on Ensemble Learning and Annotation Categorization


Tingming Lu 1,2, Man Zhu 3, and Zhiqiang Gao 1,2
1 Key Lab of Computer Network and Information Integration (Southeast University), Ministry of Education, China
2 School of Computer Science and Engineering, Southeast University, China
3 School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, China
lutingming@163.com, mzhu@njupt.edu.cn, zqgao@seu.edu.cn

Abstract. Annotated named entity corpora play a significant role in many natural language processing applications. However, annotation by humans is time-consuming and costly. In this paper, we propose a high-recall pre-annotator which combines multiple existing named entity taggers based on ensemble learning, to reduce the number of annotations that humans have to add. In addition, annotations are categorized into normal annotations and candidate annotations based on their estimated confidence, to reduce the number of human corrective actions as well as the total annotation time. The experimental results show that our approach outperforms the baseline methods in reduction of annotation time without loss in annotation performance (in terms of F-measure).

Keywords: Corpus Construction, Named Entity Recognition, Assisted Annotation, Ensemble Learning

1 Introduction

Named Entity Recognition (NER), one of the fundamental tasks for building Natural Language Processing (NLP) systems, detects Named Entity (NE) mentions in a given text and classifies these mentions into a predefined list of types. Machine learning (ML) based approaches can achieve good performance in NER, but they often require large amounts of annotated samples, which are time-consuming and costly to build. One common way to improve this situation is to automatically pre-annotate the corpora, so that human annotators merely need to correct errors rather than annotate from scratch.

As a result of more than two decades of research, many named entity taggers are now publicly available, so a natural question is how to utilize these existing taggers to assist named entity annotation. It is well known that multiple taggers can be combined using ensemble learning techniques to create a system that outperforms the best individual tagger within the system [1, 2]. Therefore a natural solution is to create a pre-annotator that combines multiple taggers based on ensemble learning.

However, as far as we know, no previous study leverages ensemble learning to combine multiple existing taggers to assist named entity annotation.

On the other hand, a system serving as a pre-annotator is expected to have high recall [3, 4], because in general, adding a new annotation takes an annotator more time than modifying an existing pre-annotated one. Most NE taggers are tuned to a trade-off between recall and precision, and not all taggers support setting parameters to increase recall. A high-recall pre-annotator may introduce some low-confidence annotations, which are more likely to be spurious than those with high confidence. In the extreme, too many spurious annotations may mislead annotators and therefore hurt the precision of the resulting corpus. Our intuition is that low-confidence annotations play a different role from those with high confidence. In previous work, however, all annotations are treated in the same way regardless of their confidence.

In order to address these issues, we propose an approach which combines multiple existing NE taggers based on ensemble learning to create a high-recall pre-annotator. Annotations produced by this pre-annotator are categorized into normal annotations with high confidence and candidate annotations with low confidence.

Take Fig. 1 as an example. Background color indicates the NE type (person, location, or organization) of each annotation. A general pre-annotator may produce annotations like Fig. 1(a). Annotators then need to delete the spurious annotation Washington/LOC and add the missed annotation MaliVai Washington/PER. In Fig. 1(b), annotators do not need to add MaliVai Washington/PER, thanks to the high recall of the pre-annotator, although they still need to delete Washington/LOC, and in addition they have to delete the newly introduced spurious annotation Alami/LOC. Our approach is illustrated in Fig. 1(c), where normal annotations are rendered in black font with underline, and candidate annotations in gray font. Annotators do not need to delete the candidate annotation Alami/LOC, since candidate annotations are not counted as valid annotations when annotators submit the results. All that annotators have to do is approve MaliVai Washington/PER with a simple click on it, and the annotation Washington/LOC is deleted automatically because it has an overlapping token, Washington, with the approved annotation MaliVai Washington/PER (a minimal illustrative sketch of this overlap-resolution step is given at the end of this section).

Fig. 1. Illustration of annotations by various pre-annotators.

As shown in the above example, candidate annotations improve the recall, so annotators need to add fewer annotations. Spurious candidate annotations do not need to be deleted by annotators, so the number of human corrective actions does not increase significantly.

In summary, we make the following contributions in this paper. 1) We propose an approach which combines multiple existing named entity taggers based on ensemble learning to create a high-recall pre-annotator. Our approach does not require annotators to annotate additional training data. 2) Annotations are categorized into normal annotations with high confidence and candidate annotations with low confidence, and treated in different ways to reduce the annotation time. 3) We empirically show that our approach outperforms several baseline approaches in terms of annotation time on a test dataset collected from three publicly available datasets. The related resources are freely available 4 for research purposes.

4 http://58.192.114.226/assistedner
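The overlap-resolution behavior just described can be captured in a few lines of code. The following Python sketch is purely illustrative: the Annotation record, the token-offset convention, and the helper names are assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Annotation:
    start: int          # index of the first token of the mention
    end: int            # index one past the last token
    ne_type: str        # "PER", "LOC", or "ORG"
    is_candidate: bool  # True for a low-confidence (candidate) annotation

def overlaps(a: Annotation, b: Annotation) -> bool:
    # Two annotations overlap if they share at least one token.
    return a.start < b.end and b.start < a.end

def approve(approved: Annotation, annotations: List[Annotation]) -> List[Annotation]:
    # Promote the approved candidate to a normal annotation and drop every
    # other annotation that overlaps it, e.g. Washington/LOC is removed once
    # MaliVai Washington/PER is approved.
    approved.is_candidate = False
    return [a for a in annotations if a is approved or not overlaps(a, approved)]
```

In this reading, a single approval click both validates the candidate and removes conflicting pre-annotations, which is what keeps the number of corrective actions low.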

The remainder of this paper is organized as follows: Section 2 discusses related work. Section 3 introduces definitions and the system architecture, and Section 4 details our method. Section 5 describes the experimental setup, followed by an analysis of the obtained results in Section 6. Finally, we conclude and discuss future directions in Section 7.

2 Related Work

The goal of pre-annotation is to reduce the time required to annotate a text by reducing the number of annotations an annotator must add or modify. Pre-annotation has been studied widely in NLP tasks such as NER [3, 5, 6], semantic category disambiguation [4], and part-of-speech tagging [7]. Many applications for different domains have been built to assist named entity annotation, using a single tagger [5, 6] or multiple taggers [3].

Lingren et al. [5] pre-annotate disease and symptom entities for clinical trial announcements using either an automatically extracted or a manually generated dictionary. They conclude that dictionary-based pre-annotation can reduce the cost of clinical NER without introducing bias in the annotation process.

Ogren et al. [6] use a third-party tagger to pre-annotate disease and disorder mentions in the clinical domain. However, they find little benefit to pre-annotating the corpus. In the biomedical domain, to generate potential gene mentions for semi-automated annotation, Ganchev et al. [3] run two taggers on the texts: a high-recall tagger trained on the local corpus and a high-recall tagger trained on a standalone corpus. At decode time, they take the gene mentions from the top two predictions of each of these taggers. Stenetorp et al. [4] study pre-annotation in a subtask of NER, semantic category disambiguation, which assigns the appropriate semantic category to given spans of text from a fixed set of candidate categories, for example PROTEIN to Fibrin. They consider a task setting that allows multiple semantic categories to be suggested, aiming to minimize the number of suggestions while maintaining high recall. Their system maintains an average recall of 99% while reducing the number of candidate semantic categories on average by 65% over all datasets. In the development of a Part-of-Speech (PoS) tagged corpus of Icelandic, Loftsson et al. [7] combine five individual PoS taggers to improve the tagging accuracy. Their preliminary evaluation results show that this tagger combination method is crucial with regard to the amount of hand-correction that must be carried out in future work.

Our approach, which combines multiple taggers, differs from [5, 6], which use a single tagger. It also differs from [3], which uses the union of the outputs of two taggers and requires additional training data. In addition, our study is unique in that we categorize annotations into normal annotations and candidate annotations to reduce the number of corrective actions. Unlike NER, neither semantic category disambiguation nor part-of-speech tagging deals with mention detection, so the methods in [4, 7] cannot be applied directly to assisted named entity annotation.

3 Preliminaries

3.1 Definitions

The NER task can be split into an identification phase, where NE mentions are identified in text, and a classification phase, where the identified NE mentions are classified into predefined types. Only three types are considered in this paper, namely person (PER), location (LOC), and organization (ORG). To construct an NE corpus, texts in the corpus are often pre-annotated so that humans do not have to annotate the texts from scratch. Further, we categorize the annotations into normal annotations and candidate annotations based on their estimated confidence. In the following, formal definitions for annotation, normal annotation, and candidate annotation are presented.

Definition 1. (Annotation). An annotation is a tuple $A = \langle B, T \rangle$, where $B$ is the boundary, consisting of a start position and an end position indicating the sequence of words that makes up the mention of $A$, and $T$ is the type of $A$, $T \in \mathcal{T} = \{\text{PER, ORG, LOC}\}$. Given two boundaries $B$ and $B'$, we say $B = B'$ if they have identical start positions and identical end positions, and $B \neq B'$ otherwise.

Definition 2. (Normal Annotation). A normal annotation is an annotation with relatively high confidence. If a normal annotation is spurious, annotators shall delete it.

Definition 3. (Candidate Annotation). A candidate annotation is an annotation with relatively low confidence. If a candidate annotation is correct, annotators shall approve it. If a candidate annotation is spurious, annotators do not need to delete it.

After the texts in a corpus have been pre-annotated, the texts and annotations are displayed to annotators in a user interface (UI). In our task setting, annotators shall add missed annotations, re-select the type of annotations with incorrect types, delete spurious normal annotations, and approve correct candidate annotations. Notice that annotators do not need to delete spurious candidate annotations, and they also do not need to approve correct normal annotations. Since re-selecting, deleting, and approving are all actions performed on pre-annotated annotations, they are collectively referred to as modifying actions. The definitions of adding action and modifying action are given below.

Definition 4. (Adding Action). An adding action is an action of selecting a span of text and selecting a type for it to create an annotation.

Definition 5. (Modifying Action). A modifying action is an action of re-selecting the type of an annotation, deleting a normal annotation, or approving a candidate annotation.

3.2 System Architecture

The system architecture is presented in Fig. 2. Input text is first annotated by several individual taggers. Then the outputs of the taggers are fed to a combiner, which produces normal annotations and candidate annotations based on ensemble learning techniques. Via a Web-based UI, annotators can add new annotations and modify pre-annotated annotations, while statistical information, including the total time, the number of adding actions, and the number of modifying actions, is recorded automatically by a background program. Finally, the annotations and statistical information are submitted and stored in a database.

Fig. 2. Overview of our approach.

At the beginning of the annotation process, Majority Voting (MV) is used as the combination strategy. After some texts have been manually annotated, these data can be utilized to train a classifier, so the combination strategy is switched to stacking. During the whole annotation process, newly annotated data is continually added to the training data to retrain the classifier, and annotators do not need to annotate additional training instances.

Annotators can add new annotations by mouse drags. When the mouse is over an annotation, a menu pops up, and annotators can re-select the NE type of the annotation. If annotators need to delete a normal annotation or approve a candidate annotation, a simple mouse click on the annotation is enough. The UI is implemented in Java Server Pages (JSP) and JavaScript, and all of the data is stored in a MySQL database. The UI runs in the annotator's browser, and no additional software or plug-ins are required.
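The combination-strategy switch described above can be pictured as a small control loop: use majority voting while little corrected data exists, then train and keep retraining a stacking classifier on the accumulated annotations. The sketch below is a minimal illustration under several assumptions: scikit-learn's SVC (which wraps LIBSVM) stands in for the LIBSVM setup used in the experiments, the inputs are the per-boundary tagger-label vectors defined in Section 4, and the threshold MIN_TRAINING_SENTENCES is invented for the example.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import OneHotEncoder

# Hypothetical pool of sentences already corrected by annotators: each item is
# (tagger_label_vectors, gold_labels) for the boundaries in one sentence.
training_pool = []

MIN_TRAINING_SENTENCES = 10  # illustrative threshold, not taken from the paper

def majority_vote(label_vector):
    # Return the most frequent non-NONE label proposed by the K taggers.
    votes = [label for label in label_vector if label != "NONE"]
    return max(set(votes), key=votes.count) if votes else "NONE"

def build_stacking_classifier(pool):
    # Train the stacking classifier on all manually corrected sentences so far.
    X = [vec for vectors, labels in pool for vec in vectors]
    y = [lab for vectors, labels in pool for lab in labels]
    encoder = OneHotEncoder(handle_unknown="ignore")
    clf = SVC().fit(encoder.fit_transform(X), y)
    return encoder, clf

def combine(label_vectors):
    # Cold start: majority voting. Once enough corrected sentences exist,
    # retrain the SVM-based stacking classifier and use it instead.
    if len(training_pool) < MIN_TRAINING_SENTENCES:
        return [majority_vote(v) for v in label_vectors]
    encoder, clf = build_stacking_classifier(training_pool)
    return list(clf.predict(encoder.transform(label_vectors)))
```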

4 Method

Suppose there are $K$ taggers. Given a text, the $k$-th tagger outputs $M_k$ annotations $A^k_1, \ldots, A^k_{M_k}$, with $A^k_m = \langle B^k_m, T^k_m \rangle$. We denote the set of $M_k$ boundaries produced by the $k$-th tagger as $\mathcal{B}^k$. Then the set of distinct boundaries produced by the $K$ taggers is

$\mathcal{B} = \bigcup_{k=1}^{K} \mathcal{B}^k$    (1)

For each boundary $B$ in $\mathcal{B}$, we create a vector

$x = (L_1, \ldots, L_K)$    (2)

where

$L_k = \begin{cases} T^k_m & \text{if } \exists B^k_m \in \mathcal{B}^k \text{ such that } B^k_m = B, \\ \text{NONE} & \text{otherwise,} \end{cases}$    (3)

and $L_k \in \mathcal{L} = \mathcal{T} \cup \{\text{NONE}\} = \{\text{PER, ORG, LOC, NONE}\}$.

Now we can utilize $x$ to estimate the confidence score $S$ of $B$ being labeled with a label $L$, based on a voter or a classifier $f$:

$S(B, L) = f(x)$    (4)

For an annotation $A = \langle B, T \rangle$, if $T = \arg\max_{L \in \mathcal{L}} S(B, L)$, then it is a normal annotation; otherwise it is a candidate annotation. In this way, the produced annotations are categorized into normal annotations with relatively high confidence and candidate annotations with relatively low confidence.
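To make Eqs. (1)-(4) concrete, the following Python sketch builds the boundary union, the per-boundary label vector x, and the normal/candidate split. It is an illustration only: the type aliases, the vote-proportion scorer standing in for f, and the choice to emit one annotation per (boundary, proposed type) pair are assumptions of this sketch, not details fixed by the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

Boundary = Tuple[int, int]            # (start position, end position)
TaggerOutput = Dict[Boundary, str]    # boundary -> NE type proposed by one tagger

LABELS = ["PER", "ORG", "LOC", "NONE"]

def label_vectors(outputs: List[TaggerOutput]) -> Dict[Boundary, List[str]]:
    # Eq. (1): the union of the boundaries proposed by the K taggers.
    boundaries = set().union(*(o.keys() for o in outputs))
    # Eqs. (2)-(3): for each boundary, the label each tagger assigns
    # (NONE if the tagger proposed no annotation with exactly this boundary).
    return {b: [o.get(b, "NONE") for o in outputs] for b in boundaries}

def vote_score(x: List[str], label: str) -> float:
    # A simple stand-in for the scorer f in Eq. (4): the fraction of taggers voting for `label`.
    return Counter(x)[label] / len(x)

def categorize(outputs: List[TaggerOutput]):
    # Emit one annotation per (boundary, proposed type) pair; a type is "normal"
    # if it is the top-scoring label for its boundary, "candidate" otherwise.
    normal, candidate = [], []
    for boundary, x in label_vectors(outputs).items():
        best = max(LABELS, key=lambda label: vote_score(x, label))
        for ne_type in set(x) - {"NONE"}:
            (normal if ne_type == best else candidate).append((boundary, ne_type))
    return normal, candidate
```

With six taggers, for example, a mention labeled PER by five of them would be emitted as a normal annotation, while a mention proposed by only a single tagger would typically end up as a candidate, since NONE outscores its type.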

5 Experimental Setup

5.1 Datasets

All the datasets used in our experiments are publicly available. From these datasets, 64 sentences are collected as the test set to perform the actual assisted annotation experiments, and 22 sentences are used to train the annotators. An overview of the datasets is shown in Table 1. The last two columns give the average number of tokens and the average number of annotations per sentence in the test set. The three datasets are detailed below.

The AKSW-News dataset consists of 325 newspaper articles, as described in [2]. Most articles in the dataset are reports from the aerospace domain. The CoNLL-2003 shared task [8] data is widely used for the NER task. Since one of our taggers (the Stanford Named Entity Recognizer) is trained on the training part of CoNLL-2003, we only use the testing part (CoNLL 03-Test). The average number of entities per sentence in CoNLL 03-Test is much larger than in the other two datasets, because many articles in this dataset are about sports events and therefore mention many players, teams, cities, and countries. Unlike the previous two datasets, the NEs identified in the Reuters-128 dataset [9] are manually disambiguated to knowledge bases, while the entity types are not given; we therefore manually annotated the types of the entities in this dataset.

Table 1. Datasets used in the experiments.

Dataset        Total Articles  Testing Sentences  Avg. Tokens  Avg. Annotations
AKSW-News      325             28                 34.0          2.1
CoNLL 03-Test  447             28                 86.1         13.6
Reuters-128    128              8                 37.9          3.1

5.2 Taggers

Six named entity taggers are involved so far: the Stanford Named Entity Recognizer 5 (Stanford) [10], the Illinois Named Entity Tagger 6 (Illinois) [11], the Ottawa Baseline Information Extraction 7 (Balie) [12], the Apache OpenNLP Name Finder 8 (OpenNLP) [13], the General Architecture for Text Engineering 9 (GATE) [14], and LingPipe 10. For the outputs of these taggers, only three classes were considered in our experiment, namely person, location, and organization. The performance of these taggers on our test set is listed in Table 2.

5 http://nlp.stanford.edu/software/crf-ner.shtml (version 3.6.0).
6 http://cogcomp.cs.illinois.edu/page/software view/netagger (version 2.8.8).
7 http://balie.sourceforge.net (version 1.8.1).
8 http://opennlp.apache.org/index.html (version 1.6.0).
9 http://gate.ac.uk/ (version 8.1).
10 http://alias-i.com/lingpipe/ (version 4.1.0).

Table 2. Performance of the taggers on the testing sentences.

Tagger    Precision  Recall  F1
Stanford  0.931      0.901   0.916
Illinois  0.880      0.825   0.852
OpenNLP   0.817      0.806   0.811
LingPipe  0.724      0.672   0.697
GATE      0.759      0.616   0.680
Balie     0.511      0.416   0.458

5.3 Pre-annotators

The pre-annotator Ensemble+Categorization_6, which combines the six taggers based on ensemble learning and produces normal and candidate annotations, is used to evaluate our approach.

In order to measure the impact of candidate annotations, a baseline pre-annotator Ensemble_6, which combines the same six taggers but produces only normal annotations, is tested. Another baseline pre-annotator, Union_6, produces annotations that are the union of the outputs of the six taggers; no ensemble learning technique is applied to Union_6. To test the case where fewer taggers are available, a group of pre-annotators combining two taggers is tested, denoted Ensemble+Categorization_2, Ensemble_2, and Union_2. For a fair comparison, we choose Stanford and Illinois, the best two taggers in terms of F1 on the test dataset (Table 2). We also create two other baseline pre-annotators. The first one, None, does not use any tagger, so annotators have to annotate sentences from scratch. The second one, Stanford, uses a single tagger (Stanford) to produce annotations.

For the pre-annotators based on ensemble learning, four strategies are used to simulate the annotation process: Majority Voting, which does not need training data, and Support Vector Machines (SVM) [15] trained using 10%, 50%, or 90% of the articles in each dataset. The SVM implementation is provided by LIBSVM [16].

5.4 Assisted Annotation Experiments

Eight annotators (H_1 to H_8) participate in our annotation experiments. They are graduate students in our school majoring in NLP. Each annotator has to annotate all of the 64 test sentences, after first annotating 22 sentences to get familiar with the Web-based UI. The 64 testing sentences are split into 8 subsets (S_1 to S_8), each of which contains 8 sentences. Annotators are presented with the sentences in the same order (Table 3), but each sentence is pre-annotated by a different pre-annotator (P_1 to P_8) for different annotators (H_1 to H_8). We carefully design the experiments to ensure that each sentence is pre-annotated by all the pre-annotators and manually annotated by all the annotators.

Table 3. Assisted annotation experiments. Annotators are assigned to annotate sentences with various pre-annotations. P_1, ..., P_8 stand for the pre-annotators None, Stanford, Union_2, Ensemble_2, Ensemble+Categorization_2, Union_6, Ensemble_6, and Ensemble+Categorization_6, respectively.

Subset  H_1  H_2  H_3  H_4  H_5  H_6  H_7  H_8
S_1     P_1  P_8  P_3  P_6  P_5  P_4  P_7  P_2
S_2     P_2  P_7  P_4  P_5  P_6  P_3  P_8  P_1
S_3     P_3  P_6  P_5  P_4  P_7  P_2  P_1  P_8
S_4     P_4  P_5  P_6  P_3  P_8  P_1  P_2  P_7
S_5     P_5  P_4  P_7  P_2  P_1  P_8  P_3  P_6
S_6     P_6  P_3  P_8  P_1  P_2  P_7  P_4  P_5
S_7     P_7  P_2  P_1  P_8  P_3  P_6  P_5  P_4
S_8     P_8  P_1  P_2  P_7  P_4  P_5  P_6  P_3

6 Results and Analysis

6.1 Performance of Pre-annotated Annotations

In Table 4, we present the performance of the pre-annotated annotations. The pre-annotator Ensemble+Categorization_6 achieves the highest recall of 0.981, and only 0.14 annotations per sentence need to be added by annotators. Annotators do not need to modify spurious candidate annotations, and only 0.58 normal annotations need to be modified per sentence, although the precision of Ensemble+Categorization_6 is very poor.

Table 4. Performance of pre-annotated annotations. Spurious stands for the average number of spurious annotations per sentence, which annotators have to modify by either deleting them or re-selecting the entity type. Missed stands for the average number of missed annotations per sentence, which annotators need to add.

Pre-annotator              Precision  Recall  Spurious  Missed
None                       N/A        N/A     0.00      7.25
Stanford                   0.931      0.901   0.48      0.72
Union_2                    0.854      0.948   1.17      0.38
Ensemble_2                 0.908      0.871   0.64      0.94
Ensemble+Categorization_2  0.326      0.950   0.69      0.36
Union_6                    0.528      0.976   6.33      0.17
Ensemble_6                 0.928      0.922   0.52      0.56
Ensemble+Categorization_6  0.256      0.981   0.58      0.14

6.2 Performance of Annotators

The performance of the annotators is presented in Table 5. The pre-annotator Ensemble+Categorization_6, which combines six taggers and produces normal and candidate annotations, allows annotators to take the least time per sentence.

Table 5. Experimental results. The second column, Time, is the average time in seconds taken per sentence; N_add is the average number of adding actions per sentence, and N_modify is the average number of modifying actions.

Pre-annotator              Time   N_add  N_modify  Precision  Recall  F1
None                       38.05  7.39   0.00      0.920      0.916   0.918
Stanford                   19.25  0.56   0.34      0.959      0.951   0.955
Union_2                    19.58  0.38   1.08      0.970      0.959   0.965
Ensemble_2                 20.80  0.72   0.61      0.961      0.953   0.957
Ensemble+Categorization_2  19.23  0.34   0.81      0.959      0.949   0.954
Union_6                    26.97  0.16   6.23      0.963      0.953   0.958
Ensemble_6                 18.80  0.48   0.44      0.961      0.951   0.956
Ensemble+Categorization_6  18.27  0.14   0.75      0.970      0.959   0.965

6.3 Analysis

There are 64 sentences in the test dataset, each of which is pre-annotated by 8 pre-annotators and then annotated by 8 annotators, yielding 512 instances in total. The time model computed on these instances by means of linear regression is as follows:

$Time = 0.14 \cdot N_{Token} + 2.83 \cdot N_{Add} + 1.84 \cdot N_{Modify} + 8.60$    (5)

where $Time$ is the total time in seconds spent on a sentence, $N_{Token}$ is the number of tokens in the sentence, $N_{Add}$ is the number of adding actions, and $N_{Modify}$ is the number of modifying actions. The model has an intuitive interpretation: the annotator reads each token (0.14 seconds per token); adding an annotation takes 2.83 seconds, and modifying an annotation takes 1.84 seconds. Additionally, there is 8.60 seconds of overhead per sentence. For this model, the Relative Absolute Error (RAE) is 33.2% (a small illustrative sketch of this fit is given after Table 6). As we expected, adding a new annotation takes more time than modifying an existing one.

The estimated time taken by adding and modifying annotations per sentence is listed in Table 6. Ensemble+Categorization_6 outperforms Ensemble_6 because candidate annotations improve recall. Union_6 does not achieve good performance due to too many spurious annotations, which need to be deleted. Since two taggers do not produce as many annotations as six taggers do, Ensemble+Categorization_2 does not perform as well as Ensemble+Categorization_6.

Table 6. Estimated time taken by adding actions and modifying actions per sentence.

Pre-annotator              T̂_add  T̂_modify  T̂_add+modify
None                       20.92  0.00      20.92
Stanford                   1.59   0.63      2.22
Union_2                    1.06   1.98      3.04
Ensemble_2                 2.03   1.12      3.16
Ensemble+Categorization_2  0.97   1.50      2.47
Union_6                    0.44   11.47     11.91
Ensemble_6                 1.37   0.81      2.18
Ensemble+Categorization_6  0.40   1.38      1.78
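The fit behind Eq. (5) and the reported RAE can be reproduced with an ordinary least-squares regression. The sketch below uses NumPy on placeholder per-sentence measurements; the real 512 instances are not reproduced here.

```python
import numpy as np

# Placeholder per-sentence measurements (one entry per annotated sentence):
# number of tokens, adding actions, modifying actions, and observed time in seconds.
n_token = np.array([34, 86, 38, 50], dtype=float)
n_add = np.array([2, 7, 1, 3], dtype=float)
n_modify = np.array([1, 4, 0, 2], dtype=float)
time_s = np.array([19.0, 41.5, 14.2, 24.8])

# Design matrix with an intercept column; least-squares fit of
# Time = a*N_token + b*N_add + c*N_modify + d  (cf. Eq. (5)).
X = np.column_stack([n_token, n_add, n_modify, np.ones_like(n_token)])
coef, *_ = np.linalg.lstsq(X, time_s, rcond=None)
predicted = X @ coef

# Relative Absolute Error: total absolute residual relative to the
# absolute deviation of a constant (mean) predictor.
rae = np.abs(time_s - predicted).sum() / np.abs(time_s - time_s.mean()).sum()
print(coef, rae)
```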

7 Conclusion

In this paper, we employ ensemble learning techniques, including voting and stacking, to combine multiple existing named entity taggers. The proposed pre-annotator achieves high recall, and therefore the number of adding actions is reduced. Based on their estimated confidence, annotations are categorized into normal annotations and candidate annotations to reduce the number of modifying actions. In addition, our approach does not require humans to annotate additional training data. We conduct experiments under various pre-annotation conditions. The experimental results show that our approach outperforms the baseline methods in reducing the number of corrective actions as well as the annotation time, without loss of performance (in terms of F-measure). In future work, we will increase the amount of testing data, evaluate on Chinese datasets, and apply our approach to other NLP tasks.

Acknowledgement. This work is partially funded by the National Science Foundation of China under Grants 61170165, 61602260, and 61502095. We would like to thank all the anonymous reviewers for their helpful comments.

References

1. Wu, D., Ngai, G., Carpuat, M.: A stacked, voted, stacked model for named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 (CoNLL 2003), vol. 4, pp. 200-203. Association for Computational Linguistics, Stroudsburg (2003)
2. Speck, R., Ngonga Ngomo, A.-C.: Ensemble learning for named entity recognition. In: The Semantic Web - ISWC 2014. LNCS, vol. 8796, pp. 519-534. Springer, Heidelberg (2014)
3. Ganchev, K., Pereira, F., Mandel, M., Carroll, S., White, P.: Semi-automated named entity annotation. In: Proceedings of the Linguistic Annotation Workshop, pp. 53-56. Association for Computational Linguistics (2007)
4. Stenetorp, P., Pyysalo, S., Ananiadou, S., Tsujii, J.: Generalising semantic category disambiguation with large lexical resources for fun and profit. Journal of Biomedical Semantics 5, 26 (2014)

5. Lingren, T., Deleger, L., Molnar, K., Zhai, H., Meinzen-Derr, J., Kaiser, M., Stoutenborough, L., Li, Q., Solti, I.: Evaluating the impact of pre-annotation on annotation speed and potential bias: natural language processing gold standard development for clinical named entity recognition in clinical trial announcements. Journal of the American Medical Informatics Association 21(3), 406-413 (2014)
6. Ogren, P. V., Savova, G. K., Chute, C. G.: Constructing evaluation corpora for automated clinical named entity recognition. In: Proceedings of the Language Resources and Evaluation Conference (LREC), pp. 28-30 (2008)
7. Loftsson, H., Yngvason, J. H., Helgadóttir, S., Rögnvaldsson, E.: Developing a PoS-tagged corpus using existing tools. In: Proceedings of "Creation and Use of Basic Lexical Resources for Less-Resourced Languages", workshop at the 7th International Conference on Language Resources and Evaluation (2010)
8. Tjong Kim Sang, E. F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, vol. 4, pp. 142-147. Association for Computational Linguistics (2003)
9. Röder, M., Usbeck, R., Hellmann, S., Gerber, D., Both, A.: N3 - A collection of datasets for named entity recognition and disambiguation in the NLP interchange format. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (2014)
10. Finkel, J. R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: ACL, pp. 363-370 (2005)
11. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009), pp. 147-155. Association for Computational Linguistics, Stroudsburg (2009)
12. Nadeau, D.: Balie - Baseline information extraction: Multilingual information extraction from text with machine learning and natural language techniques. Technical report, University of Ottawa (2005)
13. Baldridge, J.: The OpenNLP Project (2005)
14. Cunningham, H.: GATE: A general architecture for text engineering. Computers and the Humanities 36(2), 223-254 (2001)
15. Boser, B. E., Guyon, I. M., Vapnik, V. N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152. ACM (1992)
16. Chang, C.-C., Lin, C.-J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3), 27 (2011)