University of Sheffield NLP. Exercise I

Julius Lambert Joseph

1 Exercise I
Objective: Implement an ML component based on SVM to identify the following concepts in company profiles:
- company name
- address
- fax
- phone
- web site
- industry type
- creation date
- industry sector
- main products
- market locations
- number of employees
- stock exchange listings

2 Exercise I
Materials: we are working with the material in directory hands-on-resources/ml/entity-learning
- Training documents (corpus/annotated): a set of 5 company profiles annotated with the target concepts. Each document contains an annotation Mention with a feature class representing the target concept (human annotated). The documents also contain annotations produced by ANNIE, plus an annotation called Entity that wraps up named entities of type Person, Organization, Location, Date, Address. All annotations are in the default annotation set.
- Test documents (corpus/testing): a set of company profiles from the same source as the training data, without target concepts and without annotations.
- SVM configuration file: learn-company.xml (experiments/company-profile-learning).
Open the configuration file in a text editor to see how the target concept and the linguistic annotations are encoded. Remember that the target concept is encoded using the <CLASS/> sub-element in the <ATTRIBUTE> element (in this case we are trying to learn a Mention and its class).

3 Exercise I PART I
1. Run an experiment with the training documents to check the performance of the learning component on annotated data. We will use the GATE GUI for this exercise.
- Load the Batch Learning plug-in using the plug-in manager (it has the name learning in the list of plug-ins)
- Create a corpus (ANNOTATED)
- Populate it with the training documents (corpus/annotated), using encoding UTF-8 (you may want to look at one of the documents to see the annotations; the target annotation is Mention)
- Create a Batch Learning PR using the provided configuration file (experiments/company-profile-learning/learn-company.xml); it should appear in the list of processing resources
- Create a corpus pipeline and add the Batch Learning PR to the corpus pipeline
- Set the parameter learningmode of the Batch Learning PR to evaluation
- Run the corpus pipeline over the ANNOTATED corpus (by setting the corpus parameter)
- When finished, evaluation information will be dumped on the GATE console
- Examine the GATE console to see the evaluation results

4 Exercise I PART I
In this exercise we have seen how to evaluate the learning component over annotated documents. Note that we have provided very few documents for training.
According to the configuration file and the number of documents in the corpus, the ML pipeline will execute 2 runs. Each run will use 3 documents for training and 2 documents for testing. In each test document, the automatically produced Mention annotations will be compared to the true Mention annotations (gold standard) to compute precision, recall, and f-measure values.
The evaluation results will be an average over the two runs.
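The evaluation described above can be illustrated with a minimal Python sketch. It is not the Batch Learning PR's own code: the function names, the random holdout split, and the strict matching rule (an annotation is correct only if span and class both agree) are assumptions made for illustration, and GATE may match and average differently.

```python
import random
from statistics import mean


def precision_recall_f1(gold, predicted):
    """Score one test document.

    gold and predicted are sets of (start, end, class) tuples.
    Assumption: strict matching, i.e. span and class must both agree.
    """
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def holdout_runs(documents, n_train, runs, seed=0):
    """Yield (training, testing) splits, one per run.

    Assumption: each run is an independent random split.
    """
    rng = random.Random(seed)
    for _ in range(runs):
        shuffled = list(documents)
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


def average_scores(scores):
    """Average (precision, recall, f1) triples column by column."""
    return tuple(mean(column) for column in zip(*scores))


if __name__ == "__main__":
    # Hypothetical document names standing in for the 5 training profiles.
    docs = ["doc1", "doc2", "doc3", "doc4", "doc5"]
    for train, test in holdout_runs(docs, n_train=3, runs=2):
        print("train:", train, "test:", test)
```

With 5 documents, `n_train=3` and `runs=2`, each run trains on 3 documents and tests on 2, matching the setup on the slide.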

5 Exercise I PART II
1. Run an experiment to TRAIN the machine learning component
- Create a corpus and populate it with the training data (or use ANNOTATED from the previous steps)
- Create a Batch Learning PR using the provided configuration file (or use the same PR as before)
- Create a corpus pipeline containing the Batch Learning PR (or use the one from before)
- In the corpus pipeline, set the learningmode of the Batch Learning PR component to training
- Set the corpus in the corpus pipeline to the ANNOTATED corpus
- Run the corpus pipeline
Now you have trained the ML component to recognise Mentions.

6 Exercise I PART III
1. Run an experiment to apply the trained model to unseen documents. We will use the trained model produced in the previous exercise.
- Create a corpus (TEST) and populate it with the test documents (use UTF-8 encoding)
NOTE: the documents are not annotated, so you need to produce the annotations! The steps below produce the annotations.
- Load the ANNIE system (with defaults)
- Create an ANNIE NE Transducer (call it ENTITY-GRAMMAR) using the grammar file grammars/create_entity.jape
- Add the ENTITY-GRAMMAR as the last component of ANNIE
- Run ANNIE (plus the new grammar) over the TEST corpus
- Verify that the documents contain the ANNIE annotations plus the Entity annotation

7 Exercise I PART III
- Take the corpus pipeline created in the previous exercise and change the parameter learningmode of the Batch Learning PR to application
- The input annotation set should be empty (default) because the ANNIE annotations are there, and the output annotation set can be any set (including the default)
- Apply (run) the corpus pipeline to the TEST corpus (by setting the corpus)
- Examine the result of the annotation process (see if Mention annotations have been produced)
- Mention annotations should contain a feature class (one of the concepts listed in the first slide) and a feature prob, which is a probability produced by the ML component
Now you have applied a trained model to a set of unseen documents.
With parts I, II, and III you have used the evaluation, training, and application modes of the Batch Learning PR.
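When examining the results, it can help to group the produced Mention annotations by their class feature and to ignore those with a low prob value. The Python sketch below works on plain dictionaries, not on GATE objects; the function name, the `text` key, the sample values, and the 0.5 threshold are all hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def group_mentions(mentions, min_prob=0.0):
    """Group mention texts by class, keeping only those with prob >= min_prob.

    Each mention is a dict with keys "class", "prob" and "text"
    (an assumed stand-in for a Mention annotation and its features).
    """
    grouped = defaultdict(list)
    for mention in mentions:
        if mention["prob"] >= min_prob:
            grouped[mention["class"]].append(mention["text"])
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical output of the application step for one document.
    sample = [
        {"class": "company name", "prob": 0.92, "text": "Acme Ltd"},
        {"class": "phone", "prob": 0.31, "text": "0114 000 0000"},
        {"class": "fax", "prob": 0.77, "text": "0114 000 0001"},
    ]
    print(group_mentions(sample, min_prob=0.5))
```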

8 Exercise I PART IV
1. Run your own experiment: copy the configuration file to another directory and edit the copy. You may comment out some of the features used, change the windows used, or change the type of ML. Chapter 11 of the GATE guide describes the options you can adjust.

9 Exercise II
Objective: Implement an ML component based on SVM to learn ANNIE, i.e. to learn to identify the following concepts or named entities: Location, Address, Date, Person, Organization
Materials (under directory hands-on-resources/ml/entity-learning)
- We will need the GATE GUI and the learning plug-in loaded using the plug-in manager (see previous exercise)
- We will use the testing documents provided in Exercise I
- Before starting, it is better to close all documents and resources from the previous exercise
- The configuration file is learn-nes.xml in experiments/learning-nes. It is very similar to the one used previously, but check the target annotation to be learned (Entity and its type)

10 Exercise II PART I
1. Annotate the documents
- Create a corpus (CORPUS) and populate it with the test documents (use UTF-8 encoding)
NOTE: the documents are not annotated, so you need to produce the annotations! The steps below produce the annotations.
- Load the ANNIE system (with defaults)
- Create an ANNIE NE Transducer (call it ENTITY-GRAMMAR) using the grammar file grammars/create_entity.jape
- Add the ENTITY-GRAMMAR as the last component of ANNIE
- Run ANNIE (plus the new grammar) over the CORPUS
- Verify that the documents contain the ANNIE annotations plus the Entity annotation

11 Exercise II PART I
1. Evaluate an SVM to identify ANNIE's named entities
- Create a Batch Learning PR using the provided configuration file (experiments/learning-nes/learn-nes.xml)
- Create a corpus pipeline and add the Batch Learning PR to the corpus pipeline
- Set the parameter learningmode of the Batch Learning PR to evaluation
- Run the corpus pipeline over the CORPUS corpus (by setting the corpus parameter)
- When finished, evaluation information will be dumped on the GATE console
- Examine the GATE console to see the evaluation results
NOTE: For the sake of this exercise we have used annotations produced by ANNIE as the gold standard and learned a named entity recognition system based on those annotations. Note, however, that training should normally be based on human annotations.

12 Exercise II PART II
1. Train an SVM to learn named entities and apply it to unseen documents. We will use the documents you annotated (automatically!) in PART I (corpus CORPUS).
- Using the corpus editor, remove from CORPUS the first 5 documents in the list (profile_a, profile_aa, profile_ab, profile_ac, profile_ad)
- Create a corpus called TESTING
- Add to TESTING (using the corpus editor) the documents profile_a, profile_aa, profile_ab, profile_ac, profile_ad (they should be the last 5 in the list!)
Now we have one corpus for training (CORPUS) and one corpus for testing (TESTING).
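The split performed in the corpus editor amounts to partitioning the document list by name. A minimal Python sketch of that partition follows; it manipulates plain lists of names, not GATE corpora. The five held-out names come from the slide, while `split_corpus` and the extra names `profile_x` and `profile_y` are hypothetical placeholders for the remaining documents.

```python
def split_corpus(corpus, test_names):
    """Partition document names into (training, testing), preserving order."""
    held_out = set(test_names)
    training = [name for name in corpus if name not in held_out]
    testing = [name for name in corpus if name in held_out]
    return training, testing


if __name__ == "__main__":
    held_out = ["profile_a", "profile_aa", "profile_ab", "profile_ac", "profile_ad"]
    # profile_x and profile_y are placeholders for the other documents in CORPUS.
    corpus = held_out + ["profile_x", "profile_y"]
    training, testing = split_corpus(corpus, held_out)
    print("CORPUS:", training)
    print("TESTING:", testing)
```

Keeping the two lists disjoint is the point of the exercise: the model is trained on CORPUS and applied to documents it has not seen.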

13 Exercise II PART II
We will use the learning corpus pipeline we evaluated in PART I of this exercise.
- In the learning corpus pipeline, set the parameter learningmode of the Batch Learning PR to training
- Run the learning corpus pipeline over the CORPUS corpus (by setting the corpus parameter)
Now we have a trained model to recognise Entity and its type.
- In the learning corpus pipeline, set the parameter learningmode of the Batch Learning PR to application
- Also set the output annotation set outputasname to Output (to hold the annotations produced by the system)
- Run the learning corpus pipeline over the TESTING corpus (by setting the corpus parameter)
- After execution, check the annotations produced on any of the testing documents (Output annotation set)

14 Exercise II PART III
- On any of the automatically annotated documents from TESTING, you may want to use the Annotation Diff tool to verify how the learner performed in each document, comparing the Entity annotations in the default annotation set with the Entity annotations in the Output annotation set.
- Run your own experiment, varying any of the parameters of the configuration file, modifying or adding new features, etc.
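The comparison between the default set (used as key) and the Output set (used as response) can be sketched in Python as follows. This is a simplified illustration, not the Annotation Diff tool's algorithm: annotations are assumed to be (start, end, type) tuples, a partial match is assumed to mean same type with overlapping spans, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if the character spans of two (start, end, type) tuples overlap."""
    return a[0] < b[1] and b[0] < a[1]


def diff_annotations(key, response):
    """Classify response annotations against key annotations.

    Returns a dict with lists for "correct" (identical span and type),
    "partial" (same type, overlapping span), "spurious" (no counterpart
    in the key) and "missing" (key annotations never matched).
    """
    remaining_key = list(key)
    unmatched_response = []
    result = {"correct": [], "partial": [], "missing": [], "spurious": []}

    # First pass: exact matches.
    for ann in response:
        if ann in remaining_key:
            remaining_key.remove(ann)
            result["correct"].append(ann)
        else:
            unmatched_response.append(ann)

    # Second pass: partial matches among what is left.
    for ann in unmatched_response:
        match = next(
            (k for k in remaining_key if k[2] == ann[2] and overlaps(k, ann)),
            None,
        )
        if match is not None:
            remaining_key.remove(match)
            result["partial"].append(ann)
        else:
            result["spurious"].append(ann)

    result["missing"] = remaining_key
    return result


if __name__ == "__main__":
    # Hypothetical Entity annotations for one document.
    key = [(0, 5, "Person"), (10, 20, "Organization"), (30, 40, "Date")]
    response = [(0, 5, "Person"), (12, 20, "Organization"), (50, 60, "Location")]
    for category, anns in diff_annotations(key, response).items():
        print(category, anns)
```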
