Self-tuning ongoing terminology extraction retrained on terminology validation decisions



1 Self-tuning ongoing terminology extraction retrained on terminology validation decisions
Alfredo Maldonado and David Lewis
ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin
TKE 2016, Copenhagen
The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

2 Agenda
Motivation: Why do we need to do terminology extraction on an ongoing basis?
Methodology: Ongoing terminology extraction with and without learning
Experimental Setup and Results: Description of simulation experiments and results
Conclusions and next steps: The feedback loop in machine learning-based ongoing terminology extraction can help in identifying the majority of terms in a batch of new content

3 MOTIVATION

4 A frequent assumption in terminology extraction
"Surely if I do terminology extraction at some point towards the beginning of a content creation project, I will capture the majority of the terms of interest that are ever likely to appear, right? I'm basically taking a representative sample of the terms in the project."

5 Let's test that assumption
Here's an actual example using the term-annotated ACL RD-TEC (QasemiZadeh and Handschuh, 2014).
ACL RD-TEC: a corpus of ACL academic papers written between 1965 and 2006 in which domain-specific terms have been manually annotated.


6 Motivation: new content introduces new terms
The proportion of new terms in a subsequent year never reaches 0.
Between 12% and 20% of all valid terms in any given year will be new.
If you don't do term extraction periodically (e.g. annually), you will start missing out on A LOT of new terms within a few years.
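The proportion of new terms per period can be computed along the following lines. This is a minimal sketch, not the code behind the slide: the function name `new_term_proportions` and the toy term sets are invented for illustration and are not taken from the ACL RD-TEC.

```python
def new_term_proportions(terms_by_period):
    """For each period after the first, return the share of its valid terms
    that were not seen in any earlier period."""
    seen = set()
    proportions = {}
    for i, (period, terms) in enumerate(terms_by_period):
        terms = set(terms)
        if i > 0 and terms:
            proportions[period] = len(terms - seen) / len(terms)
        seen |= terms
    return proportions


# Toy data, invented for illustration only.
toy = [
    (2004, {"parser", "treebank", "language model"}),
    (2005, {"parser", "language model", "word alignment", "phrase table"}),
    (2006, {"parser", "phrase table", "word alignment", "reranking", "treebank"}),
]
print(new_term_proportions(toy))
```

A proportion that stays above 0 in every period is what the slide reports for the annotated corpus.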

7 The reality is...
As content gets updated, new, previously unseen terms will start appearing.
These terms will not have been captured during our initial term extraction and will have to be researched by our users or our terminologists downstream, causing bottlenecks in translation and usage of terminology, perhaps incurring additional costs.

8 THE SOLUTION? (METHODOLOGY)

9 Ongoing terminology extraction
First proposed by Warburton (2013): automatically filtering previously identified terms and non-terms in subsequent extraction exercises.
[Diagram: Content Batch 1 goes through Extraction and Ranking, then Validation. Content Batches 2 and 3 go through Extraction and Ranking, then Automatic filtering, then Validation. Each Validation step yields Selected terms, which enter the Terminology Pipeline, and Rejected terms. The resulting Filtered terms feed the Automatic filtering of later batches.]
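The filtering idea can be sketched as follows. This is an illustrative approximation only, not Warburton's implementation: the function names and the `validate` callback (standing in for the terminologist) are hypothetical.

```python
def filter_candidates(candidates, selected_so_far, rejected_so_far):
    """Drop candidates already validated (accepted or rejected) in earlier batches."""
    already_seen = selected_so_far | rejected_so_far
    return [c for c in candidates if c not in already_seen]


def process_batch(candidates, validate, selected_so_far, rejected_so_far):
    """Filter a batch, validate what is left, and update both exclusion sets.

    `validate` is a hypothetical callback representing manual validation;
    it returns True for a valid term and False otherwise.
    """
    to_review = filter_candidates(candidates, selected_so_far, rejected_so_far)
    for candidate in to_review:
        if validate(candidate):
            selected_so_far.add(candidate)
        else:
            rejected_so_far.add(candidate)
    return to_review


# Toy usage: the "terminologist" accepts anything containing "parser".
selected, rejected = set(), set()
accept = lambda c: "parser" in c
batch1 = process_batch(["parser", "the paper"], accept, selected, rejected)
batch2 = process_batch(["parser", "the paper", "chart parser"], accept, selected, rejected)
print(batch2)
```

In the second batch only the previously unseen candidate reaches the reviewer.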


10 Proposed Solution: Machine Learning ongoing Terminology Extraction (MLTE)
Instead of compiling term lists for filtering, we introduce a Machine Learning classification model that learns from terminologists' validation decisions.
[Diagram: Content Batch 1 goes through Extraction, then Validation. Content Batches 2 and 3 go through Extraction, then Candidate Classification, then Validation. Each Validation step yields Selected terms, which enter the Terminology Pipeline, and Rejected terms. The validation decisions from batch 1 train the model, which is retrained after each subsequent batch.]

11 Proposed System Architecture
[Diagram: the text and the validation decisions (Valid / Not Valid) from the previous k batches are used to train the model for the current batch. That model classifies candidates from the current batch text, which then receive the validation decisions for the current batch.]
Parameter: history size k (number of past batches to use as training data).
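One simple way to keep only the last k batches of validation decisions is a bounded queue. The sketch below is an assumption about how such a history could be held in memory; the class name `BatchHistory` and the example tuples are hypothetical.

```python
from collections import deque


class BatchHistory:
    """Keeps the validation decisions of the most recent k batches."""

    def __init__(self, k):
        # deque(maxlen=k) silently discards the oldest batch once k is exceeded.
        self._batches = deque(maxlen=k)

    def add(self, labelled_candidates):
        """Store one batch of (candidate, is_valid) pairs."""
        self._batches.append(list(labelled_candidates))

    def training_data(self):
        """Concatenate the stored batches, oldest first."""
        return [example for batch in self._batches for example in batch]


history = BatchHistory(k=2)
history.add([("parser", True)])
history.add([("the paper", False)])
history.add([("chart parser", True)])
print(history.training_data())
```

With k = 2, the first batch has dropped out of the training data by the time the third batch is added.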

12 EXPERIMENTAL SETUP AND RESULTS

13 Dataset
Usage of the ACL RD-TEC corpus:
Has terminology gold standard
Has term index info (which terms appear in which docs)
Documents are time-stamped (date of conference): C…_cln.txt, J…_cln.txt
Sample: RD-TEC papers from 2004 till …; …,781 articles; 9,114,767 words; 3,300 words per article on average
Sample divided in chronological batches of approx. 40 articles each: 69 batches
Simulation of ongoing term extraction AND validation using an annotated, time-stamped corpus

14 Simulation
Given current batch b_t:
1. Extract term candidate n-grams from articles in batch (n = 1..7).
2. Automatically remove any term candidates that appeared in any previous batch, like Warburton (2013).
3. Automatically remove any term candidates with POS patterns not associated with any valid terms in previous batches. This is to reduce the amount of non-valid term candidates in training data, to counteract skewness towards non-valid candidates. Notice no need to supply manual POS pattern filters!
4. Using previously trained model (if available), predict whether each term candidate is a valid term or not.
5. Evaluate prediction by comparing predictions with gold standard in ACL RD-TEC annotation. This simulates the manual validation step.
6. Create new training data by concatenating these gold standard data points with those of the previous k-1 batches (history of size k). In our experiments, best results with k = …
7. Train a new model using newly created training data.
8. Go to next batch b_{t+1} and start from 1 until completing all batches.
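The eight steps above can be sketched as a loop. This is a simplified illustration under stated assumptions, not the authors' code: `PatternVoteModel` is a hypothetical placeholder classifier (not the SVM used in the experiments), batches are lists of POS-tagged sentences, the gold standard is a plain set of terms, and `k=2` is an arbitrary default.

```python
from collections import deque


def ngrams(tagged_tokens, max_n=7):
    """Yield (term, pos_pattern) for every n-gram with 1 <= n <= max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tagged_tokens) - n + 1):
            window = tagged_tokens[i:i + n]
            yield " ".join(w for w, _ in window), " ".join(p for _, p in window)


class PatternVoteModel:
    """Placeholder classifier: predicts 'valid' when a POS pattern was
    valid more often than not in the training data."""

    def __init__(self, examples):
        self.votes = {}
        for _, pattern, is_valid in examples:
            self.votes[pattern] = self.votes.get(pattern, 0) + (1 if is_valid else -1)

    def predict(self, pattern):
        return self.votes.get(pattern, 0) > 0


def simulate(batches, gold_terms, k=2, max_n=7):
    """Return (true positives, predicted valid, gold valid) for each batch."""
    seen, valid_patterns = set(), set()
    history = deque(maxlen=k)
    model, results = None, []
    for batch in batches:  # step 8: iterate over batches in chronological order
        # Step 1: extract candidate n-grams with their POS patterns.
        extracted = {}
        for sentence in batch:
            for term, pattern in ngrams(sentence, max_n):
                extracted[term] = pattern
        # Step 2: drop candidates that appeared in any previous batch.
        candidates = {t: p for t, p in extracted.items() if t not in seen}
        seen.update(extracted)
        # Step 3: keep only POS patterns seen on valid terms so far
        # (skipped for the first batch, when no patterns are known yet).
        if valid_patterns:
            candidates = {t: p for t, p in candidates.items() if p in valid_patterns}
        # Step 4: predict with the previously trained model, if there is one.
        predicted = {t for t, p in candidates.items()
                     if model is not None and model.predict(p)}
        # Step 5: the gold standard plays the role of manual validation.
        valid = {t for t in candidates if t in gold_terms}
        results.append((len(predicted & valid), len(predicted), len(valid)))
        # Step 6: add this batch's decisions to the history of size k.
        history.append([(t, p, t in gold_terms) for t, p in candidates.items()])
        valid_patterns.update(p for t, p in candidates.items() if t in gold_terms)
        # Step 7: retrain on the concatenated history.
        model = PatternVoteModel([ex for b in history for ex in b])
    return results


# Toy batches of one tagged sentence each, invented for illustration.
toy_batches = [
    [[("statistical", "JJ"), ("parser", "NN"), ("works", "VBZ")]],
    [[("neural", "JJ"), ("tagger", "NN"), ("works", "VBZ")]],
]
toy_gold = {"parser", "statistical parser", "tagger", "neural tagger"}
print(simulate(toy_batches, toy_gold))
```

The first batch has no model, so nothing is predicted. From the second batch onwards, predictions come from a model trained on the earlier validation decisions.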

15 Model and Features
Model: Support Vector Machine (SVM) classifier with a linear kernel
Features:
Term candidate's POS pattern
Term candidate's character 3-grams
Two domain contrastive features: Domain Relevance (DR) (Navigli and Velardi, 2002) and Term Cohesion (TC) (Park et al., 2002)
Contrastive corpus 1: a 500-way clustering of 2009 Wikipedia documents (Baroni et al., 2009)
Contrastive corpus 2: a dynamic clustering of batch history (each cluster has roughly 40 articles)
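The first two feature types could be encoded roughly as below. This is a sketch under assumptions: the `#` boundary padding, the lowercasing, and the feature-name prefixes are illustrative choices not stated in the slides, and the two contrastive features (DR and TC) are left out.

```python
def char_trigrams(term):
    """Character 3-grams of a term, with '#' marking the term boundaries
    (the padding is an assumption made for this sketch)."""
    padded = f"#{term.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def featurise(term, pos_pattern):
    """Sparse feature dictionary: one POS-pattern indicator plus 3-gram counts."""
    features = {f"pos={pos_pattern}": 1.0}
    for gram in char_trigrams(term):
        key = f"tri={gram}"
        features[key] = features.get(key, 0.0) + 1.0
    return features


print(char_trigrams("parser"))
print(featurise("parser", "NN"))
```

Dictionaries of this shape can be turned into vectors for a linear SVM, for example with scikit-learn's `DictVectorizer` and `LinearSVC`, although the slides do not say which toolkit was used.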

16 Experiments
Our simulated approach, as described
Two baselines:
Baseline 1: An approximation to Warburton's (2013) method using standard, off-the-shelf filter-rankers provided by JATE (Zhang et al., 2008). Automatic filtering across batches takes place. No learning model is trained.
Baseline 2: Train SVM classifier using our features on first batch and use that classifier to predict terms from all subsequent batches. Same as our approach, but no retraining at each batch takes place.

17 Evaluation
Recall (coverage): % of valid terms in a batch that were predicted as valid. Low recall indicates we're missing many valid terms.
Precision (true positives): % of term candidates predicted as valid that are actually valid terms. Low precision indicates we're producing many false positives.
Usually, we want to identify as many true valid terms as possible, potentially at the risk of returning a relatively high number of false positives.
We're interested in achieving high recall (coverage) at the expense of a moderate precision.
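Both measures reduce to set arithmetic over the predicted and the gold-standard terms of a batch. A minimal sketch follows, with a hypothetical function name and invented toy sets.

```python
def precision_recall(predicted_valid, gold_valid):
    """Precision and recall of a set of predicted terms against the gold standard."""
    predicted_valid, gold_valid = set(predicted_valid), set(gold_valid)
    true_positives = len(predicted_valid & gold_valid)
    precision = true_positives / len(predicted_valid) if predicted_valid else 0.0
    recall = true_positives / len(gold_valid) if gold_valid else 0.0
    return precision, recall


# Toy sets, invented for illustration.
predicted = {"parser", "treebank", "the paper", "in this"}
gold = {"parser", "treebank", "word alignment"}
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, two of the four predictions are valid (precision 0.50) and two of the three gold terms are found (recall 0.67).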

18 Results

19 CONCLUSIONS AND NEXT STEPS

20 Conclusions
Obtained good recall (coverage) scores using our method (ONGOING), much better than the two baselines: average recall of 74.16% across all batches.
Precision scores are quite disappointing, meaning that we can expect many false positives in each batch.
Ongoing retraining does help in keeping high recall.
Manual terminology validation already takes place in virtually all terminology extraction tasks. Let's just use those validation decisions to train an ongoing machine-learning classifier automatically!
The lack of a feedback loop mechanism in the statistical filter-rankers does hinder their performance when used on an ongoing basis with automatic exclusion lists.

21 Future work
Conduct human-based benchmarks
Address low precision scores:
Post-processing strategies like re-ranking predicted candidates (e.g. by using statistical rankers)
Exploring new features based on topic models
Exploring reinforcement learning techniques
Experiment on other datasets from several other domains
Further investigate role of contrastive corpus (e.g. not all specialised terms will feature in Wikipedia):
Fall-back strategy like relying on sub-terms
Distributional vector composition techniques in order to estimate feature values of terms missing in contrastive corpus

22 QUESTIONS?
Alfredo Maldonado, Research Fellow, ADAPT Centre at Trinity College Dublin
@alfredomg on Twitter
The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

23 APPENDIX


24 Two assumptions
Assumption 1: We have a terminology pipeline in which extracted terms will be further processed by terminologists and other linguists (e.g. research, translation, etc.) and will end up in a terminology database (termbase) to be used by other professionals (e.g. translators, specialists), organisations and systems.
[Diagram: Content, Terms, Research, Translation, Termbase, Users]
Assumption 2: We have a non-static, ongoing source of new content. Examples:
Academic/scientific papers from journals, conference proceedings, etc.
Technical manuals for industrial, technological or medical equipment
Web-based/online content
Strings for software, mobile apps, web apps
Content that starts getting translated before source text is completed ("sim ship" or "simultaneous ship")
If your content is static and finite, you perhaps won't benefit from ongoing terminology extraction. Is your content really static???

25 Evaluation of filter-rankers
Consider the top N ranked candidates as valid term predictions and all other candidates as non-valid term predictions (Pecina, 2010).
If a batch has v valid terms, we could consider the N = v top candidates as valid terms and the rest as non-valid terms.
However, N = v is too inflexible and will tend to penalise the recall of rankers.
In our experiments we use N = 2v.
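The top-N cut-off can be sketched as follows. This is an illustration only: the function name, the `multiplier` parameter and the toy ranking are hypothetical, and v is taken here as the number of gold-standard terms among the ranked candidates of the batch.

```python
def top_n_predictions(ranked_candidates, gold_valid, multiplier=2):
    """Treat the top N = multiplier * v ranked candidates as predicted valid,
    where v is the number of valid terms among the candidates."""
    v = sum(1 for candidate in ranked_candidates if candidate in gold_valid)
    return set(ranked_candidates[:multiplier * v])


# Toy ranking (best first), invented for illustration.
ranked = ["a", "x", "b", "y", "c", "z"]
gold = {"a", "b", "c"}

for multiplier in (1, 2):
    predicted = top_n_predictions(ranked, gold, multiplier)
    recall = len(predicted & gold) / len(gold)
    print(f"N = {multiplier}v: recall = {recall:.2f}")
```

With N = v the ranker is penalised for every valid term ranked just below the cut-off. Widening the cut-off to N = 2v recovers such terms at the cost of more false positives.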

26 Evaluation of filter-rankers N = 2v

27 Evaluation of filter-rankers N = 7v
