
DEEPTHOUGHT
Hybrid Deep and Shallow Methods for Knowledge-Intensive Information Extraction

Deliverable D1.4
Report Describing Integration Strategies and Experiments

The Consortium
October 2004

Report Describing Integration Strategies and Experiments (D1.4)

Project ref. no.                 -
Project acronym                  DEEPTHOUGHT
Project full title               Hybrid Deep and Shallow Methods for Knowledge-Intensive Information Extraction
Security (distribution level)    Public
Contractual date of delivery     15.10.2004
Actual date of delivery          15.10.2004
Deliverable number               D1.4
Deliverable name                 Report Describing Integration Strategies and Experiments
Type                             Report
Status & version                 Final
Number of pages                  9
WP contributing to deliverable   WP1b
WP / Task responsible            WP1b
Other contributors               -
Author(s)                        John Carroll, Alex Fang, Melanie Siegel
EC Project Officer               Evangelia Markidou
Keywords                         Hybrid NLP, Named-Entity Recognition, architecture

Abstract: The implemented strategies for hybrid NLP are described and examples are given using screenshots.

Contents

1  Integration Strategies in the Heart of Gold
2  Mobile Phone Name Recognition for English
   2.1  Input and Output Specification
   2.2  Construction of an annotated sub-corpus
   2.3  The recognition program
   2.4  A quantitative evaluation of the mobile phone name recogniser
   2.5  Error Analysis
3  References

1 Integration Strategies in the Heart of Gold

Implemented strategies for hybrid NLP in the project include:

- The analysis results of NLP tools at lower processing levels can be used by components at higher levels.
  o For example, the deep linguistic analysis module PET uses default lexicon entries for the named entities delivered by the named-entity recogniser SProUT.
  o Likewise, PET uses default lexicon entries for the part-of-speech tags delivered by the POS tagger TnT.
- Deliver the deepest result found. If a module of the required depth cannot deliver a result, deliver the next deepest result. This is the approach that the email autoresponse application mainly follows (a minimal sketch of this fallback logic is given at the end of this section).

- Deliver partial results whenever a complete analysis is not available. Partial results are taken from the deepest module that delivers results.
- Combine modules and grammars for different languages. Each language has its own configuration of valid modules and grammars.

- The different modules use a compatible output formalism, RMRS.
  o For the shallower modules, this robust semantic structure allows for underspecification of, e.g., argument structure.
- Refine the data provided by shallower modules through deep parsing. This is a strategy used by the Business Intelligence and Email Autoresponse applications: chunk processing and named-entity recognition are used to find relevant information sources, and deep processing is then applied to the found information snippets, either to verify or to filter the extracted information.
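As a concrete illustration of the fallback strategy above (deliver the deepest result found), the following minimal Python sketch shows the logic. The module names and return values are invented for illustration; the real Heart of Gold components (PET, RASP, TnT, SProUT) exchange XML-encoded RMRS structures rather than strings.

    from typing import Callable, List, Optional

    def analyse(sentence: str,
                modules: List[Callable[[str], Optional[str]]]) -> Optional[str]:
        """Try each module from deepest to shallowest; return the first
        result delivered, or None if no module can analyse the input."""
        for module in modules:
            result = module(sentence)
            if result is not None:
                return result
        return None

    # Toy stand-ins for real components; each returns None when it cannot
    # deliver an analysis at its depth.
    def deep_parser(s: str) -> Optional[str]:
        return None if len(s.split()) > 12 else "deep RMRS for: " + s

    def chunker(s: str) -> Optional[str]:
        return "chunk-level RMRS for: " + s

    def pos_tagger(s: str) -> Optional[str]:
        return "POS-tag-level RMRS for: " + s

    print(analyse("I am thinking of upgrading to the Sony Ericsson T68is",
                  [deep_parser, chunker, pos_tagger]))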

2 Mobile Phone Name Recognition for English

We describe below the construction and evaluation of a module for named-entity recognition of mobile phone names. The module was integrated into the RASP English shallow analysis system, which in turn forms part of the Heart of Gold. On a manually annotated test set, the module achieved a recognition F-score of 81.5%.

2.1 Input and Output Specification

The input to the mobile phone name recognition module is a sequence of sentences in English that have already been marked up in XML style for word boundaries, with part-of-speech tags automatically assigned by RASP (Briscoe and Carroll 2002). Since this happens before the morphological analyser in the RASP pipeline, the tokens have not been lemmatized. For example, given the sentence "I am thinking of upgrading to the Sony Ericsson T68is from a Nokia 8260", the input to the module is:

    ^ ^
    <w s='2' e='2'>i</w> PPIS1
    <w s='4' e='5'>am</w> VBM
    <w s='7' e='14'>thinking</w> VVG
    <w s='16' e='17'>of</w> IO
    <w s='19' e='27'>upgrading</w> NN1
    <w s='29' e='30'>to</w> II
    <w s='32' e='34'>the</w> AT
    <w s='36' e='39'>sony</w> NP1
    <w s='41' e='48'>ericsson</w> NP1
    <w s='50' e='54'>t68is</w> NN1
    <w s='56' e='59'>from</w> II
    <w s='61' e='61'>a</w> AT1
    <w s='63' e='67'>nokia</w> NP1
    <w s='69' e='72'>8260</w> MC
    ^ ^

The task of the module is to mark up the mobile phone named entities in the input, namely Sony Ericsson T68is and Nokia 8260 in this example:

    ^ ^
    <w s='2' e='2'>i</w> PPIS1
    <w s='4' e='5'>am</w> VBM
    <w s='7' e='14'>thinking</w> VVG
    <w s='16' e='17'>of</w> IO
    <w s='19' e='27'>upgrading</w> NN1
    <w s='29' e='30'>to</w> II
    <w s='32' e='34'>the</w> AT
    <w netype='phone'>
      <w s='36' e='39'>sony</w>
      <w s='41' e='48'>ericsson</w>
      <w s='50' e='54'>t68is</w>
    </w> NP
    <w s='56' e='59'>from</w> II
    <w s='61' e='61'>a</w> AT1
    <w netype='phone'>
      <w s='63' e='67'>nokia</w>
      <w s='69' e='72'>8260</w>
    </w> NP
    ^ ^

Here Sony Ericsson T68is and Nokia 8260 are marked up as named entities of type mobile phone (netype='phone'), and each is treated as a single unit tagged NP, i.e. a proper name. The analysis based on this output from the module is then taken further down the RASP pipeline and yields an RMRS representation.

[Figure: RMRS representation of the analysed example sentence]
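Producing the marked-up output above from recognised entity spans is mechanical; the following minimal Python sketch shows one way to do it. The token and span representations, and all names, are our own illustration rather than the module's actual C implementation.

    def mark_up(tokens, phone_spans):
        """Emit RASP-style XML lines, wrapping each phone-name span
        (given as (first, last) token indices) in a <w netype='phone'>
        group that is tagged NP as a whole."""
        starts = {first: last for first, last in phone_spans}
        lines, i = [], 0
        while i < len(tokens):
            s, e, form, tag = tokens[i]
            if i in starts:
                last = starts[i]
                lines.append("<w netype='phone'>")
                for s2, e2, form2, _tag2 in tokens[i:last + 1]:
                    lines.append("  <w s='%d' e='%d'>%s</w>" % (s2, e2, form2))
                lines.append("</w> NP")
                i = last + 1
            else:
                lines.append("<w s='%d' e='%d'>%s</w> %s" % (s, e, form, tag))
                i += 1
        return lines

    tokens = [(32, 34, "the", "AT"),
              (36, 39, "sony", "NP1"),
              (41, 48, "ericsson", "NP1"),
              (50, 54, "t68is", "NN1"),
              (56, 59, "from", "II")]
    print("\n".join(mark_up(tokens, [(1, 3)])))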

2.2 Construction of an annotated sub-corpus

For the work described in Workpackage 2B, a 4,000,000-word corpus of Internet discussions on mobile phones was created for the domain-specific extraction of a verb subcategorisation lexicon (Carroll and Fang 2004). From this corpus, we randomly selected two sets of 200 texts each. Each text was then manually annotated such that each instance of a mobile phone name, a model number, or any combination of the two was marked up as an entity (between <mobile> and </mobile> tags). Here is an example:

    I have for sale the following ORIGINAL <mobile> Nokia </mobile>
    accessories that will fit any of the <mobile> Nokia 6100 </mobile> /
    <mobile> 5100 </mobile> series phones, including but not limited to
    <mobile> 6160 </mobile>, <mobile> 6190 </mobile>, <mobile> 6188 </mobile>,
    <mobile> 6185 </mobile>, <mobile> 6162 </mobile>, <mobile> 6161 </mobile>,
    <mobile> 6185i </mobile>, <mobile> 5160 </mobile>, <mobile> 5190 </mobile>, etc.

The two sets are summarised in Table 1:

              Texts   Sentences    Words   Entities
    Set 1       200        3314    60804        447
    Set 2       200        2624    46117        454
    Total       400        5938   106921        901

    Table 1: A summary of the annotated corpus

2.3 The recognition program

The automatic recogniser was implemented in C. The algorithm was designed around the observation that the distribution of mobile phone names in our corpus is relatively sparse: there is insufficient data to train a purely statistical recogniser (e.g. a maximum entropy model), though it may be possible to train a combined symbolic/statistical model (incorporating, for example, information on manufacturer names).

A set of mobile phone manufacturer names, such as Nokia and Ericsson, was manually drawn up. The remainder of the mobile phone corpus that had not been annotated (ca 2,800,000 tokens) was then used to construct a list of all the alphanumeric strings that contain at least one digit and that immediately follow one of these names. This process resulted in two entity sets:

- a list of mobile phone names
- a list of model numbers with their associated mobile phone names

The automatic recogniser marks the following as an entity (a sketch of the procedure follows this list):

- every occurrence of the mobile phone names
- every occurrence of the model numbers, provided that
  o they are longer than 3 characters
  o they occurred more than once in the training corpus
  o they occurred fewer than 2,000 times in the training corpus (numbers occurring more than 2,000 times are interpreted as genuine "free" cardinals that are unlikely to be used in reference to a mobile phone)
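Below is a minimal Python sketch of this heuristic, under the assumption that the "occurred ... times" conditions refer to overall token frequency in the training corpus. The manufacturer list is truncated to the two names mentioned above, and all function names are our own; the actual recogniser was written in C.

    import re
    from collections import Counter

    # Example entries; the real list was drawn up manually.
    MANUFACTURERS = {"nokia", "ericsson"}

    def build_entity_lists(tokens):
        """Collect candidate model numbers: alphanumeric strings containing
        at least one digit that immediately follow a manufacturer name.
        Also count token frequencies for the filtering conditions below."""
        candidates = set()
        freq = Counter(t.lower() for t in tokens)
        for prev, tok in zip(tokens, tokens[1:]):
            if prev.lower() in MANUFACTURERS and tok.isalnum() and re.search(r"\d", tok):
                candidates.add(tok.lower())
        return candidates, freq

    def accept_model_number(tok, candidates, freq):
        """The three conditions: length > 3, frequency > 1, frequency < 2000
        (very frequent numbers are taken to be 'free' cardinals)."""
        t = tok.lower()
        return t in candidates and len(t) > 3 and 1 < freq[t] < 2000

    # Toy usage; in the project the lists were built from the ~2.8M-token
    # unannotated part of the corpus.
    tokens = ("the nokia 8260 is cheaper than the ericsson t68is and "
              "the nokia 8260 beats the ericsson t68is").split()
    cands, freq = build_entity_lists(tokens)
    print([t for t in tokens
           if t.lower() in MANUFACTURERS or accept_model_number(t, cands, freq)])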

2.4 A quantitative evaluation of the mobile phone name recogniser

For the quantitative evaluation of the mobile phone name recogniser's performance, the first annotated set was used for development and the second was held out for testing. Both sets were sub-divided into four subsets containing the same number of word tokens, in order to expose any variation in performance. The initial run of the recogniser on the development set produced the following results:

                    1       2       3       4   Total
    Precision    63.0    69.7    87.9    73.8    75.8
    Recall       96.7    84.8    79.7    84.3    83.4
    F-Score      76.3    76.5    83.6    78.7    79.4

    Table 2: Performance before tuning on the development set

The F-score on the development set was just under 80%. There is variation across the four subsets, with subset 3 showing the best F-score of 83.6%. The output was manually inspected and changes were made to the lists of mobile phone names and model numbers. Subsequent performance on the development set shows an F-score of 82.1%, an increase of nearly 3 points over the previous 79.4%:

                    1       2       3       4   Total
    Precision    61.7    74.9    89.2    74.1    78.3
    Recall       96.7    90.3    81.3    85.7    86.4
    F-Score      75.3    81.9    85.1    79.5    82.1

    Table 3: Performance after tuning on the development set

When run on the test set, the recogniser achieved an overall F-score of 81.5%, with a precision of 81.0% and a recall of 81.9%:

                    1       2       3       4   Total
    Precision    85.9    69.7    83.5    92.9    81.0
    Recall       97.0    77.1    71.7    81.2    81.9
    F-Score      91.1    73.2    77.1    86.7    81.5

    Table 4: Performance on the test set

As Table 4 shows, the best subset F-score was 91.1% and the worst 73.2%. This considerable variation suggests that the system's performance depends on the type of input it receives.
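For reference, the scores above follow the standard entity-level definitions of precision, recall and balanced F-score; a short sketch (the true/false positive counting against the manual annotation is our own illustration):

    def prf(tp, fp, fn):
        """Entity-level precision, recall and balanced F-score (F1)."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f = 2 * precision * recall / (precision + recall)
        return precision, recall, f

    # Sanity check against the reported totals: precision 81.0 and recall
    # 81.9 give F = 2 * 81.0 * 81.9 / (81.0 + 81.9) = 81.45, consistent
    # with the reported 81.5% given that the inputs are themselves rounded.
    p, r = 0.810, 0.819
    print(2 * p * r / (p + r))   # -> 0.8145...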

2.5 Error Analysis

There are two major sources of errors. First, there is frequent ambiguity between phone names and company names, as in the following example:

    I was wondering if anyone has any information on how the Ericsson
    Bluetooth kits calculates the BER packets when the BER test is run.

where Ericsson should be analysed as referring to the company rather than the phone. Arguably, this is a genuinely ambiguous case. The second major source is ambiguity between plain numbers and model numbers:

    There are 2 connectors on the cable, 1 RS 232 and 1 cigarette lighter.

Since 232 has been observed co-occurring with mobile phone names, the module takes it to refer to a mobile phone product in this context and therefore erroneously marks it as a phone name.

3 References

Briscoe, E. and J. Carroll (2002). Robust accurate statistical annotation of general text. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Gran Canaria, 1499-1504.

Carroll, J. and A.C. Fang (2004). The Automatic Acquisition of Verb Subcategorisations and their Impact on an HPSG Parser. In Proceedings of the 1st International Joint Conference on Natural Language Processing, 22-24 March 2004, Hainan, China.

Uszkoreit, H., U. Callmeier, A. Eisele, U. Schäfer, M. Siegel and J. Uszkoreit (2004). Hybrid Robust Deep and Shallow Semantic Processing for Creativity Support in Document Production. In Proceedings of KONVENS 2004, Vienna, Austria.

Callmeier, U., A. Eisele, U. Schäfer and M. Siegel (2004). The DeepThought Core Architecture Framework. In Proceedings of LREC 2004, Lisbon, Portugal.