Feature Extraction and Loss Training using CRFs: A Project Report


Ankan Saha
Department of Computer Science, University of Chicago
March 11, 2008

Abstract

POS tagging is an important problem in Natural Language Processing. It has been approached with different tools, including Maximum Entropy Models [3], Cyclic Dependency Networks [2], and Conditional Random Fields. My work revolved around Conditional Random Fields: implementing the CRF as a loss function for an optimization solver, training that loss, and using the Viterbi algorithm to build an estimator for the resulting model. (This work was done as part of a summer internship project at NICTA Australia, May-July 2007.)

1 Introduction

Most natural language tasks involve Parts of Speech (POS) tagging. Conditional Random Fields (CRFs) are a powerful machine learning tool [1], [4] that has been used for POS tagging in NLP with better results than Maximum Entropy Models and other machine learning models. CRFs are a framework for building probabilistic models for segmenting and labeling sequence data. They are undirected discriminative models and have an advantage over HMMs and other generative models because they model the conditional probability p(y|x) instead of the joint probability p(y, x). We can therefore include richer and more informative features while remaining agnostic about the nature of p(x), which would otherwise have to be modeled as in generative approaches. Consequently, CRFs make no independence assumptions about the inputs x; the assumptions are made on the labels instead.

Using CRFs we address the traditional problem of POS tagging. The novel part of our approach lies in the modularity of the structure: the solver is a separate system that, at every iteration of the minimization, calls the CRF loss module to obtain the loss and gradient values for the current weight vector. It also calls the estimator separately to predict the labels of the test data.

2 My Work

My work has consisted of two main parts. The first part was the extraction of features from the input data; we used the Wall Street Journal corpus of Penn Treebank 3 as our data set. Feature extraction was the most important part, because the results of POS tagging experiments can be improved substantially by a good selection of features.

2.1 Feature Extraction

Initially the training data is read and all possible labels are extracted and stored in a file named Label.txt. We then generate the features and store them in Train Feature List.txt. The file consists of three columns under the headers

    feature name    feature id    feature count

where each feature is unique, the id is the position of the feature in the list, and the count is the number of times it occurs in the training data set. Features are generally binary indicator expressions that take the value 1 when certain conditions are satisfied. For example, a typical feature f_1 may be turned on when the word at position i is "make" and the corresponding label is VB; another feature f_2 may be 1 when y_i = VB and the previous label y_{i-1} = NN, and so on (a small sketch is given at the end of this subsection). While generating features we also extract the Xfeatures, stored in Train XFeature List.txt in the same format. These are similar to the context predicates described for the FlexCRF tagger [5]: they are identical to the features except that they do not contain the labels. Finally, we store the sparse matrix corresponding to the input data set in the file Train Sparse Matrix.txt. Each line of the file has the following columns (each title is the header of the corresponding column):

    word    current label    prev label    prev prev label    matrix (a list of integers)

Corresponding to each word, the list of integers gives the ids of the features generated for it, all on the same line of the file. These files are then used as input by the training code that computes the CRF loss.
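To make the indicator-feature idea concrete, here is a minimal sketch of the two example features f_1 and f_2 above. The type and function names are illustrative only and are not the identifiers used in the project code.

    #include <string>

    // One token of a training sequence: the word, its POS label y_i,
    // and the label y_{i-1} of the preceding token.
    struct Token {
        std::string word;
        std::string label;
        std::string prevLabel;
    };

    // f_1: fires when the current word is "make" and its label is VB.
    int f1(const Token& t) {
        return (t.word == "make" && t.label == "VB") ? 1 : 0;
    }

    // f_2: fires when the current label is VB and the previous label is NN.
    int f2(const Token& t) {
        return (t.label == "VB" && t.prevLabel == "NN") ? 1 : 0;
    }

    int main() {
        Token t{"make", "VB", "NN"};
        return (f1(t) == 1 && f2(t) == 1) ? 0 : 1;  // both features fire here
    }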

2.2 Description of features

Each feature consists of a number of values separated by a hash (#), which acts as the delimiter between the fields of the feature. We collect 8 main kinds of features:

1. <t_0, w_0> : state feature, type 1
2. <t_0, w_{-1}> : state feature, type 1
3. <t_0, w_{+1}> : state feature, type 1
4. <t_0, t_{-1}> : edge feature, type 1
5. <t_0, t_{-1}, t_{-2}> : edge feature, type 2
6. <t_0, t_{-1}, w_0> : state feature, type 2
7. <t_0, w_0, w_{-1}> : state feature, type 1
8. <t_0, w_0, w_{+1}> : state feature, type 1

Here t stands for the tag (label) and w for the word (token). The subscript 0 refers to the current word/label; -1 and +1 refer to the previous and next entry respectively. In each case we store the feature type number (1 through 8) followed by the rest of the attributes of the feature, separated by hashes (an example is sketched below).
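As an illustration of the hash-delimited encoding, the sketch below builds the stored name of a type-6 feature <t_0, t_{-1}, w_0>. The exact field order is my reading of the description above, not necessarily the order used in Train Feature List.txt.

    #include <iostream>
    #include <sstream>
    #include <string>

    // Build the name of a type-6 state feature <t_0, t_{-1}, w_0>: the feature
    // type number followed by its attributes, separated by '#'.
    std::string makeType6Feature(const std::string& curTag,
                                 const std::string& prevTag,
                                 const std::string& word) {
        std::ostringstream name;
        name << 6 << '#' << curTag << '#' << prevTag << '#' << word;
        return name.str();
    }

    int main() {
        std::cout << makeType6Feature("VB", "NN", "make") << "\n";  // prints 6#VB#NN#make
        return 0;
    }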

Rare (orthographic) features

These special features were developed to improve the training of the CRF. They are activated only for words whose count in the data set is below a certain threshold (8 in our case). I have not used them in training so far, but they are nevertheless extracted and can be used for training, probably with an adjusted weighting. The rare features extracted are:

9. HasDigit (the word contains a digit)
10. IsNumber (the word is a number)
11. Hyphen
12. Mixed capitals
13. All capitals
14. All capitals + ends with S
15. Word is first in sentence + mixed capitals
16. Negation of 15
17. Mixed capitals + ends with s
18. First in sentence + mixed capitals + ends with s
19. Not first in sentence + mixed capitals + ends with s
20. Suffixes (up to length 4 or word length - 2, whichever is smaller)
21. Prefixes (up to length 4 or word length - 2, whichever is smaller)

Each rare feature stores the feature type number, the label of the word, and the token, separated by hashes. Two passes are made over the input file: the first reads the input, stores the preliminary features and counts the frequency of each word; the second pass generates the rare-word features.

3 Training of the CRF loss function

Conditional random fields (CRFs) define the conditional probability distribution

    p_w(y | x) = exp(w · F(y, x)) / Z_w(x)                               (1)

where

    Z_w(x) = Σ_y exp(w · F(y, x))

is the partition function, a normalizing term. The conditional distribution of a CRF satisfies the Markov property, so it can be decomposed as a product of potentials over the cliques of the independence graph (Hammersley and Clifford, 1971). The optimal label sequence is obtained by maximizing the conditional probability over all possible label sequences [4]. The CRF is trained by maximizing the conditional log likelihood of the training data,

    L(w) = Σ_k [ w · F(y^k, x^k) - log Z_w(x^k) ]                        (2)

where

    w : the weight vector, which the optimizer adjusts to minimize the loss;
    k : the index of a training sequence;
    y^k, x^k : the label sequence and token sequence of the k-th training example;
    log Z_w(x^k) : the log of the partition function for the k-th example.

The corresponding gradient is

    ∇L(w) = Σ_k [ F(y^k, x^k) - E_{p_w(y|x^k)} F(y, x^k) ]               (3)

where the expectation of F(y, x^k) is computed with the forward-backward algorithm using the alpha and beta vectors; details are in [4].
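For completeness, here is a short derivation, not spelled out in the report, of why the expectation term appears in (3). Differentiating the log partition function with respect to w gives

    ∇_w log Z_w(x) = (1 / Z_w(x)) Σ_y F(y, x) exp(w · F(y, x))
                   = Σ_y p_w(y | x) F(y, x)
                   = E_{p_w(y|x)} F(y, x),

so differentiating (2) term by term yields exactly (3): the empirical feature counts minus the model's expected feature counts under p_w.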

3.1 Usage of second order labels

We use second-order labels in order to handle feature types 5 and 6. The training therefore uses second-order linear-chain CRFs, which require second-order labels. A second-order label is a pair of primary labels: the second-order label of every word in the input data set is the label of the previous word coupled with the word's own label. When using second-order labels we must define a previous label for the first word of each sequence; we call this label NA, and it occurs only for the first word of a sequence. With second-order labels, an entry M[ij][jk] refers to row (i * number of primary labels + j) and column (j * number of primary labels + k) of the matrix.

3.2 Input Format

The training data is read by the loss-computing code in the following format:

<data> : a list of data sequences.
<data sequence> : a vector of observation strings.
<observation string> : a word of the sequence along with its label, its previous label, the label before that, and a list of numbers giving the ids of the features that are on for that particular data entry.

A small sketch of this record layout, together with the second-order index arithmetic of Section 3.1, is given below.
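The following minimal sketch illustrates the observation record of Section 3.2 and the flattened second-order label indexing of Section 3.1. The type and field names are my own for illustration and are not taken from the project sources.

    #include <string>
    #include <vector>

    // One observation of a data sequence (Section 3.2).
    struct Observation {
        std::string word;
        int label;                       // current primary label id
        int prevLabel;                   // previous primary label id
        int prevPrevLabel;               // label before the previous one
        std::vector<int> activeFeatures; // ids of the features that are "on"
    };

    using DataSequence = std::vector<Observation>;
    using Data = std::vector<DataSequence>;

    // A second-order label pairs the previous primary label with the current
    // one, flattened into a single index (Section 3.1).
    int secondOrderLabel(int prev, int cur, int numPrimaryLabels) {
        return prev * numPrimaryLabels + cur;
    }

    // Entry M[ij][jk]: the transition from the pair (i, j) to the pair (j, k).
    double transitionEntry(const std::vector<std::vector<double>>& M,
                           int i, int j, int k, int numPrimaryLabels) {
        return M[secondOrderLabel(i, j, numPrimaryLabels)]
                [secondOrderLabel(j, k, numPrimaryLabels)];
    }

    int main() {
        const int L = 5;  // number of primary labels (illustrative)
        Observation o{"make", 3, 1, 0, {12, 57, 103}};
        std::vector<std::vector<double>> M(L * L, std::vector<double>(L * L, 0.0));
        return (o.activeFeatures.size() == 3 && transitionEntry(M, 1, 3, 2, L) == 0.0) ? 0 : 1;
    }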

3.3 Directory Structure

crf_data_storage.hpp : header file.

crf_data_storage.cpp : reads the data from Train Sparse Matrix.txt and creates the first-order and second-order labels.

crf_feature_handler.hpp : header file.

crf_feature_handler.cpp : includes crf_data_storage.cpp; reads the xfeatures and features from Train XFeature List.txt and Train Feature List.txt respectively, generates the features and stores them in a map. State features are handled in a special manner: they are given values based on the formula

    f.value = frequency of the feature / frequency of the corresponding xfeature.

We also create a data structure storing all the state features corresponding to a particular xfeature, which is used later by the Viterbi decoder when generating weights for the different labels from the xfeatures of the test data.

CRFLoss.hpp : includes crf_feature_handler.hpp. The compute_loss_gradient function, called by the main function in bmrm-train.cpp, computes the CRF loss and gradient values supplied to the optimizer. It computes the alpha and beta vectors (and the transition matrix M) [4] for every position of every sequence in the input data set. The beta vectors are all computed together and stored in a vector whose size is the length of the sequence; the alpha values are computed one by one, each new value replacing the current one at the end of an iteration. The expectation of the features is used to compute the gradient. At the end, the gradient and the loss are negated, since the CRF expressions are written with the aim of maximizing the log likelihood while our solver minimizes its objective. The compute_trans_matrix function computes the transition matrix M between any two labels [1] for each position in the input sequence. We use a special vector St for the values contributed by the state features: in principle these entries would have to be added to every element of particular columns of M, but storing them in St means we only need a component-wise multiplication with the St vector, which improves efficiency (a sketch is given below). The code currently developed is for second-order CRFs; I have also created an alternate first-order CRF module, which contains the same code without feature types 5 and 6.
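A minimal sketch of the St-vector trick described above, under the assumption that the state-feature score at a position depends only on the destination label: rather than adding the state score to every entry of a column of M, the column is scaled once by the corresponding St entry after exponentiation. The names and matrix layout are illustrative, not the project's actual code.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    // Build the transition matrix M for one position of a sequence.
    // edgeScore[yPrev][y] : summed weight of edge features firing for yPrev -> y.
    // stateScore[y]       : summed weight of state features firing for label y here.
    Matrix computeTransMatrix(const Matrix& edgeScore,
                              const std::vector<double>& stateScore) {
        const std::size_t L = stateScore.size();

        // St[y] = exp(state score of label y), computed once per column instead of
        // adding stateScore[y] to every entry of column y before exponentiation.
        std::vector<double> St(L);
        for (std::size_t y = 0; y < L; ++y)
            St[y] = std::exp(stateScore[y]);

        Matrix M(L, std::vector<double>(L));
        for (std::size_t yPrev = 0; yPrev < L; ++yPrev)
            for (std::size_t y = 0; y < L; ++y)
                M[yPrev][y] = std::exp(edgeScore[yPrev][y]) * St[y];
        return M;
    }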

4 Future Work: Estimation

Testing mainly involves running the Viterbi algorithm, which maximizes p(y|x) over label sequences using the edge transition matrix and the state feature values. The algorithm makes use of the data structure created during training that stores, for each xfeature, the corresponding state features together with their weights. We maximize the probability at the last entry of a sequence and then backtrack along the path to recover the label at each of the previous positions. All Viterbi information is kept in a vector of vectors of structs: the outer vector has one entry per position of the sequence, each inner vector has one entry per label, and each element stores the value of the corresponding state together with the previous label that led to it. The compute_edge_matrix function stores the edge-feature transition matrix, whereas the compute_state_matrix function computes the matrix corresponding to the state features at a particular position. A minimal sketch of the maximization and backtracking steps is given after the code details below.

Code Details:

viterbi_testing.hpp : header file.

viterbi_testing.cpp : the file where Viterbi is applied; the edge matrix and the state vectors are computed, and the maximization and backtracking are done.

testing.hpp, testing.cpp (not completed) : being developed as an interface to call viterbi_testing.cpp.
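The following is a minimal sketch of the Viterbi maximization and backtracking described above, working in log space over per-position score matrices. It ignores the second-order labels and the edge/state factorization used in the actual code; it is only meant to show the recurrence and the backtracking step.

    #include <vector>

    // scores[i][yPrev][y]: log-score of labeling position i with y when position
    // i-1 carries label yPrev (edge score + state score). For i == 0 only
    // scores[0][0][y] is used, playing the role of the initial state scores.
    std::vector<int> viterbi(const std::vector<std::vector<std::vector<double>>>& scores,
                             int numLabels) {
        const int n = static_cast<int>(scores.size());
        std::vector<std::vector<double>> best(n, std::vector<double>(numLabels));
        std::vector<std::vector<int>> backPtr(n, std::vector<int>(numLabels, 0));

        for (int y = 0; y < numLabels; ++y)
            best[0][y] = scores[0][0][y];

        for (int i = 1; i < n; ++i) {
            for (int y = 0; y < numLabels; ++y) {
                best[i][y] = best[i - 1][0] + scores[i][0][y];
                for (int yPrev = 1; yPrev < numLabels; ++yPrev) {
                    double v = best[i - 1][yPrev] + scores[i][yPrev][y];
                    if (v > best[i][y]) { best[i][y] = v; backPtr[i][y] = yPrev; }
                }
            }
        }

        // Pick the best label at the last position, then backtrack.
        std::vector<int> labels(n);
        int bestLast = 0;
        for (int y = 1; y < numLabels; ++y)
            if (best[n - 1][y] > best[n - 1][bestLast]) bestLast = y;
        labels[n - 1] = bestLast;
        for (int i = n - 1; i > 0; --i)
            labels[i - 1] = backPtr[i][labels[i]];
        return labels;
    }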

References

[1] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.

[2] Kristina Toutanova and Christopher D. Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pages 63-70, 2000.

[3] Adwait Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 1996.

[4] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL, 2003.

[5] Xuan-Hieu Phan, Le-Minh Nguyen, and Cam-Tu Nguyen. FlexCRFs: Flexible Conditional Random Fields toolkit.
