Discriminative Training with Perceptron Algorithm for POS Tagging Task


Mahsa Yarmohammadi
Center for Spoken Language Understanding
Oregon Health & Science University
Portland, Oregon

1 Introduction

One of the most popular algorithms for structured prediction problems in natural language and speech processing is the perceptron algorithm [Rosenblatt1958, Collins2002]. The perceptron algorithm can be used to estimate the model parameters in any structured prediction learning framework. Collins presented a discriminative log-linear model, with the perceptron algorithm to estimate its parameters, as a global framework for discriminative training. This framework can be used for training finite-state tagging models for tasks such as POS tagging, shallow parsing, sentence segmentation, and named entity recognition. In the first part of this study (section 2), we present experimental results for POS tagging of English using the Collins framework.

Structured prediction models, including the perceptron, are supervised machine learning techniques, and they need a large amount of labeled input-output data to improve system performance. Training a model on that much data can be cumbersome. McDonald and colleagues [McDonald et al.2010] investigated distributed training strategies for the structured perceptron to reduce training times on two tasks, named entity recognition and dependency parsing. In the second part of this paper (section 3), we investigate their techniques for another structured prediction task, POS tagging.

2 Discriminative Model and the Perceptron Algorithm

This section describes a discriminative log-linear model and the perceptron algorithm that learns its parameters, used to train a POS tagger. This framework was first presented by Collins [Collins2002].

    Perceptron(T = {(x_i, y_i)}_{i=1}^{N}, ᾱ (default = 0), T)
        For t = 1..T
            For i = 1..N
                calculate z_i = argmax_{z ∈ GEN(x_i)} Φ(x_i, z) · ᾱ
                If z_i ≠ y_i then ᾱ = ᾱ + Φ(x_i, y_i) − Φ(x_i, z_i)
        return ᾱ

Figure 1: The perceptron algorithm presented by [Collins2002] as a global framework for discriminative training.

To train a discriminative POS tagging model, the task is to learn a mapping from inputs x ∈ X to outputs y ∈ Y, where X is the set of all input sentences and Y is the set of all possible POS tag sequences. We are given a set of training examples (x_i, y_i), a function GEN(x) that enumerates the set of possible POS tag sequences of length n (where n is the length of x), a parameter vector ᾱ ∈ R^d, and a representation Φ that maps each (x, y) ∈ X × Y to a feature vector Φ(x, y). The mapping from an input x to an output F(x) is then defined by the formula

    F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · ᾱ        (1)

where Φ(x, y) · ᾱ is the inner product Σ_j α_j Φ_j(x, y). The model learns the parameter values ᾱ during training, and the decoding algorithm searches for the y that maximizes (1). The feature vector Φ(x, y) represents arbitrary features of the sentence and the POS tag sequence; in section 2.1 we describe the feature templates used in our experiments.

To estimate the parameter values ᾱ of the model, we use the perceptron algorithm shown in Figure 1 [Collins2002]. At each training example (x_i, y_i), the algorithm updates the parameter vector ᾱ by adding the feature values of the true hypothesis y_i and subtracting the feature values of the best-scoring hypothesis z_i. The algorithm then moves to the next example. This procedure is repeated T times (epochs) over the training examples. The regular perceptron algorithm suffers from over-fitting; one solution to this problem is the averaged perceptron, which sets the final weight vector to the average of all the parameter vectors seen during training.
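To make the training loop concrete, here is a minimal Python sketch of the algorithm in Figure 1. The candidate generator gen(x) and feature map phi(x, y) are assumptions of this sketch, standing in for whatever enumeration and feature extraction the surrounding system supplies; in practice the argmax is computed by Viterbi search rather than by scoring every candidate sequence.

    from collections import defaultdict

    def perceptron(train, gen, phi, epochs, alpha0=None):
        # Figure 1: Perceptron(T, alpha (default 0), T).
        # train : list of (x, y) pairs, y a tuple of POS tags
        # gen   : gen(x) -> iterable of candidate tag tuples
        # phi   : phi(x, y) -> dict of feature -> count
        alpha = defaultdict(float, alpha0 or {})

        def score(x, z):
            return sum(alpha.get(f, 0.0) * v for f, v in phi(x, z).items())

        for _ in range(epochs):
            for x, y in train:
                # z_i = argmax over GEN(x) of Phi(x, z) . alpha
                z = max(gen(x), key=lambda cand: score(x, cand))
                if z != y:
                    for f, v in phi(x, y).items():   # add gold features
                        alpha[f] += v
                    for f, v in phi(x, z).items():   # subtract predicted features
                        alpha[f] -= v
        return alpha

The averaged variant additionally accumulates a running sum of ᾱ after every example and returns that sum divided by the number of accumulations.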

2.1 Features

We extract features from the input word sequence and the output POS-tag sequence. The feature set includes bigrams of surrounding words, a window of size 2 of the next and previous words, the POS tag of the previous word, and orthographic features, as shown in Table 1. The orthographic feature set includes prefixes and suffixes of the words (up to 4 characters) and the presence of a hyphen, a digit, or an uppercase character. We do not restrict the orthographic features to rare¹ or unknown words; we activate them for all words. This improves accuracy, at the cost of some speed, compared to activating the orthographic features only for rare or unknown words.

Similar to Ratnaparkhi [Ratnaparkhi1997] and Roark et al. [Roark et al.2012], we restricted the search space to the tag dictionary for each word. For known words, the tag dictionary contains the tags that occurred with the word in the training set; for unknown or rare words, it contains all tags in the tag set. Using the tag dictionary speeds up the tagger significantly without hurting accuracy.

¹ Rare words occur fewer than 5 times in the training data.

    Lexical                      Orthographic
    t_i, t_{i-1}                 t_i, w_i[0]
    t_i, w_i                     t_i, w_i[0..1]
    t_i, w_{i-1}                 t_i, w_i[0..2]
    t_i, w_{i+1}                 t_i, w_i[0..3]
    t_i, w_{i-2}                 t_i, w_i[n]
    t_i, w_{i+2}                 t_i, w_i[n-1..n]
    t_i, w_i, w_{i+1}            t_i, w_i[n-2..n]
    t_i, w_i, w_{i-1}            t_i, w_i[n-3..n]
    t_i, w_{i+1}, w_{i+2}        t_i, w_i contains a digit
    t_i, w_{i-1}, w_{i-2}        t_i, w_i contains a hyphen
                                 t_i, w_i contains an uppercase character

Table 1: Feature templates for POS tagging, where t_i is the tag and w_i the word at position i; w_i[0..k] denotes a prefix and w_i[n-k..n] a suffix of the n-character word w_i.
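The templates in Table 1 map directly onto string-valued indicator features. The sketch below instantiates a representative subset for a candidate tag t at position i; the particular feature-string encoding is an assumption of this sketch, not a specification of our implementation. Φ(x, y) for a full sentence is the sum of these local feature vectors over all positions.

    def local_features(words, i, t, t_prev):
        # A subset of the Table 1 templates for tag t at position i.
        w = words[i]
        prev_w = words[i - 1] if i > 0 else "<s>"
        next_w = words[i + 1] if i + 1 < len(words) else "</s>"
        feats = {
            f"t-1={t_prev}|t={t}": 1,    # <t_i, t_i-1>
            f"w={w}|t={t}": 1,           # <t_i, w_i>
            f"w-1={prev_w}|t={t}": 1,    # <t_i, w_i-1>
            f"w+1={next_w}|t={t}": 1,    # <t_i, w_i+1>
        }
        for k in range(1, min(4, len(w)) + 1):
            feats[f"pre={w[:k]}|t={t}"] = 1    # prefixes, up to 4 characters
            feats[f"suf={w[-k:]}|t={t}"] = 1   # suffixes, up to 4 characters
        if any(c.isdigit() for c in w):
            feats[f"digit|t={t}"] = 1          # <t_i, w_i contains digit>
        if "-" in w:
            feats[f"hyphen|t={t}"] = 1         # <t_i, w_i contains hyphen>
        if any(c.isupper() for c in w):
            feats[f"upper|t={t}"] = 1          # <t_i, w_i contains uppercase>
        return feats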

2.2 Experimental Results

We ran our experiments on the WSJ Penn Treebank corpus [Marcus et al.1999], using sections 2-21 for training, section 24 for development, and section 23 for testing. Decoding is performed by a Viterbi search with a Markov order-0 assumption. Table 2 shows the accuracy of POS tagging for English using our tagger.

                accuracy
    dev set     97.1%
    test set    97.3%

Table 2: POS tagging accuracy on development (section 24) and test (section 23) data

To assess how the results of our tagger generalize to an independent test set, we used a k-fold cross-validation approach (k = 20). All labeled examples (the combination of the training and development sets) are sequentially partitioned into k disjoint subsets. At each fold, one of the subsets is used as the development set and the union of the other subsets as the training set, so each of the k subsets is used exactly once as the development set. For each fold, cross-validation determines the best performance on the development set and the corresponding number of epochs; the mean of these epoch counts is ī. Our final run on the test set (section 23) is then performed by training on all labeled examples for ī epochs, with no development set. The mean of the best performances over the 20 folds is 97.1% (σ = 0.2), and the accuracy of the final run on the test set is 97.3%. These results support the generalizability of our tagger to arbitrary test sets.
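The epoch-selection procedure amounts to the following sketch; train_and_eval and train_final are hypothetical helpers standing in for the tagger pipeline, returning the best development accuracy with the epoch at which it was reached, and a model trained for a fixed number of epochs, respectively.

    def select_epochs(examples, k, train_and_eval, train_final):
        # Sequentially partition the labeled examples into k disjoint folds.
        size = (len(examples) + k - 1) // k
        folds = [examples[j * size:(j + 1) * size] for j in range(k)]
        best_epochs = []
        for j in range(k):
            dev = folds[j]
            train = [ex for fold in folds[:j] + folds[j + 1:] for ex in fold]
            _, epoch = train_and_eval(train, dev)  # best dev accuracy, its epoch
            best_epochs.append(epoch)
        i_bar = round(sum(best_epochs) / k)  # mean best epoch count over folds
        # Final run: train on all labeled examples for i_bar epochs.
        return train_final(examples, i_bar)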

3 Distributed Perceptron

In this section, we describe the two strategies we used for distributed training of the perceptron algorithm, both based on the strategies proposed in [McDonald et al.2010].

3.1 Parameter Mixing

Parameter mixing is the straightforward strategy of training separate models on disjoint subsets of the training data in parallel and then mixing all the parameters into the final model. Figure 2 shows this algorithm [McDonald et al.2010]. First, we partition the training data into S shards; then we train S separate perceptron models on these shards in parallel; finally, we mix the parameter values of the shards by averaging them. In a map-reduce framework, training the separate perceptron models is done in the map step, and mixing (averaging) the parameter values in the reduce step. The advantages of this method are that it scales easily to very large data sets and that it is resource-efficient with respect to network usage. The disadvantage is that it can be sub-optimal: it does not necessarily return a separating weight vector, even when the training set is separable.

    ParameterMix(T = {(x_i, y_i)}_{i=1}^{N})
        Shard T into S parts T = {T_1, ..., T_S}
        ᾱ_s = Perceptron(T_s, 0, T)    (for each shard s, in parallel)
        ᾱ = Σ_s µ_s ᾱ_s
        return ᾱ

Figure 2: Parameter mixing method for distributed perceptron

3.2 Iterative Parameter Mixing

A slight modification to the parameter mixing method, called iterative parameter mixing, makes it optimal. Iterative parameter mixing finds a separating hyperplane (assuming that the training set is separable) and yields accuracy comparable to or better than the serially trained perceptron, at the cost of increased network usage. As in parameter mixing, we first shard the training data into S shards. We then train a separate single-epoch perceptron on each shard and mix (average) the model weights. We train another single-epoch perceptron on each shard, this time with the mixed weight vector as the initial value for the perceptrons. The process repeats T times. Figure 3 shows this algorithm [McDonald et al.2010]. In a map-reduce framework, training the single-epoch perceptron models is done in the map step, and mixing the parameter values and re-sending them to the shards in the reduce step. A sketch of both strategies follows Figure 3.

    IterativeParameterMix(T = {(x_i, y_i)}_{i=1}^{N})
        Shard T into S parts T = {T_1, ..., T_S}
        Set ᾱ = 0
        For t = 1..T
            ᾱ_(s,t) = Perceptron(T_s, ᾱ, 1)    (for each shard s, in parallel)
            ᾱ = Σ_s µ_(s,t) ᾱ_(s,t)
        return ᾱ

Figure 3: Iterative parameter mixing method for distributed perceptron
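Given the perceptron function from the sketch in section 2, both mixing strategies reduce to a few lines. The sketch below runs the per-shard training calls serially, mirroring our simulated setup, where a real map-reduce implementation would launch them in parallel; uniform mixing weights (µ = 1/S) are assumed, as in our experiments.

    from collections import defaultdict

    def average(alphas):
        # Uniform mixture: mean of the S shard weight vectors.
        mixed = defaultdict(float)
        for alpha in alphas:
            for f, v in alpha.items():
                mixed[f] += v / len(alphas)
        return mixed

    def parameter_mix(shards, gen, phi, epochs):
        # Figure 2: train S independent perceptrons, mix once at the end.
        return average([perceptron(s, gen, phi, epochs) for s in shards])

    def iterative_parameter_mix(shards, gen, phi, rounds):
        # Figure 3: one epoch per shard per round, re-mixing in between
        # and feeding the mixed weights back in as the initial vector.
        alpha = defaultdict(float)
        for _ in range(rounds):
            alpha = average([perceptron(s, gen, phi, 1, alpha) for s in shards])
        return alpha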

3.3 Experiments

We investigated distributed training of the perceptron algorithm on the POS tagging task, using WSJ Penn Treebank sections 2-21 for training and section 24 for development. Note that we focus on training-phase results in this section. We compared POS tagging accuracy and perceptron training time across three systems:

1. Serial: non-distributed perceptron on all training data.
2. Parameter mix: distributed perceptron using the parameter mixing method.
3. Iterative parameter mix: distributed perceptron using the iterative parameter mixing method.

For all three systems we compared results for the regular and averaged perceptron algorithms. For the parallel systems (2 and 3) we used 10 disjoint, equal-sized shards of training data, built by sequentially splitting the complete training set; each shard contains around 3,900 POS-tagged sentences. To mix the model weights, we used the uniform mixing strategy of taking the mean of the weight vectors, for both the regular and averaged perceptrons. Note that the results reported in this section are not from an actual map-reduce implementation such as Hadoop. Instead, we simulated a map-reduce framework by pipelining the perceptron epochs and the weight-vector mixing procedures iteratively. The training time reported for each training epoch is the maximum time among all parallel shards, as sketched below.
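The timing bookkeeping of the simulation can be written as follows; train_one_epoch is a hypothetical helper standing in for a single-epoch perceptron run on one shard, and the reported cost of a round is the slowest shard's time, as if the shards had run concurrently.

    import time

    def timed_round(shards, train_one_epoch):
        # Run one simulated map step: each shard trains for one epoch.
        results, durations = [], []
        for shard in shards:
            start = time.perf_counter()
            results.append(train_one_epoch(shard))
            durations.append(time.perf_counter() - start)
        # Simulated parallel cost: the maximum over the S shards.
        return results, max(durations)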

3.4 Results

Results of the regular and averaged perceptrons are shown in Figure 4. Both distributed algorithms return models much more quickly than the serially trained perceptron, in wall-clock time as well as in the number of training epochs, for both the regular and averaged variants. The parameter mixing method does not match the performance of training serially on all the data for the averaged perceptron, nor for the regular perceptron except in the first few epochs. The iterative parameter mixing method achieves better performance than the parameter mixing method with both the regular and averaged perceptrons. It also achieves better accuracy than the serial scenario with the regular perceptron, and comparable accuracy with the averaged perceptron. This happens because parameter mixing has an effect similar to the averaged perceptron: mixing the parameters by averaging them reduces the variance between the different weight vectors and produces a regularization effect.

[Figure 4: POS tagging accuracy versus time (ms) for the regular perceptron (left) and averaged perceptron (right), comparing the Serial, Parameter Mix, and Iterative Parameter Mix systems.]

4 Conclusion and Future Work

In this paper, we described a discriminative training method for a POS tagger using the perceptron algorithm in serial and distributed scenarios. Our POS tagger achieves over 97% accuracy for English. The training features are language-independent, so the tagger can be used for arbitrary languages. To reduce training time, we applied two distributed training strategies for the structured perceptron, previously proposed in [McDonald et al.2010], to the task of POS tagging. Both parameter mixing methods are fast and accurate, although simple parameter mixing is not guaranteed to produce an optimal model. Both distributed approaches significantly reduce the time required to train the perceptron algorithm. The results for POS tagging mirror those reported for the other structured prediction tasks, named entity recognition and dependency parsing. Our tagger is available for download as POStagger. Support for other finite-state tagging tasks, such as shallow parsing and hedge segmentation [Yarmohammadi et al.2014], will be added to the toolkit soon. Another direction for future work is to implement a map-reduce version of the distributed perceptron algorithms in the Hadoop framework.

References

[Collins2002] M. Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP '02), pages 1-8, Stroudsburg, PA, USA.

[Marcus et al.1999] Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. 1999. Treebank-3. Linguistic Data Consortium, Philadelphia.

[McDonald et al.2010] Ryan McDonald, Keith Hall, and Gideon Mann. 2010. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '10), Stroudsburg, PA, USA. Association for Computational Linguistics.

[Ratnaparkhi1997] Adwait Ratnaparkhi. 1997. A maximum entropy model for part-of-speech tagging. In Proceedings of EMNLP.

[Roark et al.2012] Brian Roark, Kristy Hollingshead, and Nathan Bodenstab. 2012. Finite-state chart constraints for reduced complexity context-free parsing pipelines. Computational Linguistics, 38(4).

[Rosenblatt1958] Frank Rosenblatt. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6).

[Yarmohammadi et al.2014] Mahsa Yarmohammadi, Aaron Dunlop, and Brian Roark. 2014. Transforming trees into hedges and parsing with hedgebank grammars. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Baltimore, Maryland, June.
