Improving the Performance of Text Categorization using N-gram Kernels
Varsha K. V.*, Santhosh Kumar C., Reghu Raj P. C.*
* Department of Computer Science and Engineering, Govt. Engineering College, Palakkad, Kerala, India
{varshavenugopal9, pcreghu}@gmail.com
Machine Intelligence Research Lab, Department of Electronics and Communication Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, Tamil Nadu, India
cskumar@cb.amrita.edu

ABSTRACT: Kernel methods are known for their robustness in handling large feature spaces and are widely used as an alternative to external feature-extraction-based methods in tasks such as classification and regression. This work applies string kernels, namely n-gram kernels and gappy-n-gram kernels, to text classification. It studies how kernel concatenation and feature combination affect the classification accuracy of the system, and explores how kernel combination algorithms behave on the system. The kernels are implemented as rational kernels, which satisfy Mercer's theorem, ensuring that the kernel matrices are positive definite symmetric. The rational kernels are computed with a general algorithm based on the composition of weighted transducers, which helps in dealing with variable-length sequences. These kernels are then used with an SVM to formulate an efficient classifier for text categorization. Both one-stage and two-stage algorithms are applied for kernel combination, and both achieved better system performance than the individual kernels.

Keywords: Gappy-n-gram kernels, Text Classification, Kernel Methods

Received: 28 September 2014, Revised: 2 November 2014, Accepted: 8 November. © DLINE. All Rights Reserved.
International Journal of Computational Linguistics Research, Volume 6, Number 1, March 2015

1. Introduction

The areas of Natural Language Processing (NLP) and Bioinformatics frequently need to analyze the similarity between strings. Kernel Methods (KM) are powerful machine learning tools that can alleviate the data representation problem.
They substitute feature-based similarities with similarity functions, i.e., kernels, defined directly between training/test instances [4]. Hence they are considered good alternatives to classification systems based on external feature extraction. Additionally, the composition or adaptation of several kernels facilitates the design of effective similarities for new tasks, which also makes them worth exploring. A standard approach (Joachims, 1998) to text categorization makes use of the classical text representation technique (Salton et al., 1975), and was successful with Support Vector Machines. String kernels have been found successful in the area of text classification [1], treating a document simply as a long sequence. In kernel-based methods the choice of the kernel has traditionally been left entirely to the user. This paper uses kernel learning algorithms [2], which require the user only to specify
a family of kernels. This family of kernels can then be used by a learning algorithm to form a combined kernel and derive an accurate predictor. Rational kernels are a family of kernels, including string kernels, that are constructed in terms of transducers [3]. Kernel combination is also an area that can enhance individual kernel performance [4], [5]. Many algorithms exist that can help in achieving a good embedding of candidate kernels to get better accuracy [4], [6].

This paper is built on a string-kernel-based classification system, which classifies documents in terms of the continuous or discontinuous n-grams they share. Different kernel combination algorithms are applied to the system in order to get better performance. The behavior of the system with feature combination and kernel concatenation is also analyzed.

2. Kernel Methods

Obtaining similarity measures between documents is the fundamental task of text classification. Kernel Methods (KMs) naturally induce the similarity between two documents in terms of their dot product in the feature space. Given an input space X, a kernel can be defined [6] as a function k : X × X → R that returns the inner product over the feature space. For every x, y in X it satisfies k(x, y) = k(y, x) and

Σ_{i=1}^{n} Σ_{j=1}^{n} c_i c_j k(x_i, x_j) ≥ 0    (1)

for any n ∈ N, {c_i}_{i=1}^{n} ∈ R^n and {x_i}_{i=1}^{n} ∈ X^n. The matrix formed by all the values k_ij = k(x_i, x_j) is called the kernel matrix. Since the kernel values are computed as inner products, the kernel matrix is positive semidefinite. In terms of the feature space, the kernel function returns the dot product of the feature vectors: there exists a mapping function Φ which maps an input document x ∈ X to a feature space F, and applying the kernel function returns the inner product of the feature vectors,

k(x, y) = ⟨Φ(x), Φ(y)⟩    (2)

This inner product serves as the similarity measure in Kernel Methods.
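As a minimal sketch of equations (1) and (2), the snippet below computes a kernel as an inner product of feature vectors and builds the symmetric kernel matrix. The bag-of-words feature map and the toy documents are hypothetical stand-ins for the n-gram feature spaces used later in the paper.

```python
from collections import Counter

def phi(doc):
    """Map a document to a sparse feature vector (word counts)."""
    return Counter(doc.split())

def k(x, y):
    """k(x, y) = <phi(x), phi(y)>, the inner product in feature space."""
    fx, fy = phi(x), phi(y)
    return sum(fx[w] * fy[w] for w in fx if w in fy)

def kernel_matrix(docs):
    """The n x n kernel (Gram) matrix with K[i][j] = k(docs[i], docs[j])."""
    return [[k(a, b) for b in docs] for a in docs]

docs = ["the cat sat", "the cat ran", "a dog ran"]
K = kernel_matrix(docs)

# Symmetry, as required by equation (1): k(x, y) = k(y, x).
assert all(K[i][j] == K[j][i] for i in range(3) for j in range(3))
```

Here K[0][1] = 2 because the first two documents share the two words "the" and "cat", each with count one.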
The larger this measure, the greater the similarity; thus the kernel matrix, which contains all these n × n similarity measures, serves as the reference for document similarity. Kernel methods can readily be used with an SVM classifier. SVMs are a class of algorithms that combine the principles of statistical learning theory with optimization techniques and the idea of a kernel mapping [6]. Given a sample of N independent and identically distributed training instances {(x_i, y_i)}_{i=1}^{N}, where x_i is the D-dimensional input vector and y_i ∈ {−1, +1} is its class label, the SVM finds the linear discriminant with the maximum margin in the feature space induced by the mapping function Φ : R^D → R^S. The resulting discriminant function is

f(x) = ⟨w, Φ(x)⟩ + b.    (3)

The classifier can be trained by solving a quadratic optimization problem [7].

3. String Kernels

The representation and computation of rational kernels is based on weighted finite-state transducers.

3.1 Weighted Transducers
A weighted transducer can be considered a finite automaton with augmented output labels and real-valued weights that may represent a cost or a probability [7]. Input (output) labels are concatenated along a path to form an input (output) sequence. The weights of the transducers considered here are non-negative real values.

Definition 1 [7]: A weighted finite-state transducer T over a semiring K is an 8-tuple T = (Σ, Δ, Q, I, F, E, λ, ρ) where: Σ is the finite input alphabet of the transducer; Δ is the finite output alphabet; Q is a finite set of states; I ⊆ Q is the set of initial states; F ⊆ Q is the set of final states; E ⊆ Q × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q is a finite set of transitions; λ : I → K is the initial weight function; and ρ : F → K is the final weight function mapping F to K.
Any path from an initial state to a final state is called an accepting path. The weight of an accepting path is the product of its constituent transition weights. For input and output strings a common alphabet Σ is chosen. The weight assigned by a weighted transducer T to a pair of strings (x, y) ∈ Σ* × Σ* is denoted by T(x, y) and is obtained by summing the weights of all accepting paths with input label x and output label y.

There are two main operations on transducers used for kernel computation: inversion and composition. The inverse of a transducer T is obtained by swapping the input and output symbols of the transducer, so that T^{-1}(y, x) = T(x, y) for any x, y in Σ*. The composition T_1 ∘ T_2 is defined as [3], [8]

(T_1 ∘ T_2)(x, y) = Σ_{z ∈ Σ*} T_1(x, z) T_2(z, y)    (4)

where x and y are the input and output sequences. Composing the transducers over x and y yields the count of the common sequences z ∈ Σ* they share; if a sequence z is absent from one of the input strings, the counting term corresponding to that z is zero. This concept is used to obtain the similarity of two input strings.

3.2 Rational Kernels
The computation of rational kernels is done with the help of weighted transducers. The definitions follow [3], [8]. Rational kernels are the family of kernels that can be defined through weighted transducers; most of the kernels widely used in classification belong to this family. They can be defined as a kernel K such that

K(x, y) = U(x, y)    (5)

for every x, y ∈ X, where U is a weighted transducer. The following theorem [3] is the main result that guarantees Positive Definite Symmetric (PDS) kernels for kernel learning.

Theorem 1 [3]: Let T be an arbitrary weighted transducer. Then the function defined by the transducer U = T ∘ T^{-1} is a PDS rational kernel.
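Equation (4) and the U = T ∘ T^{-1} construction of Theorem 1 can be sketched with plain dictionaries mapping (input, output) string pairs to weights. This only illustrates the weighted-sum semantics of composition, not the state-based composition algorithm of [3], [8]; the example transducer is a hypothetical bigram counter for the single string "abab".

```python
def compose(t1, t2):
    """(T1 ∘ T2)(x, y) = sum over z of T1(x, z) * T2(z, y),
    for weighted relations given as {(input, output): weight} dicts."""
    result = {}
    for (x, z1), w1 in t1.items():
        for (z2, y), w2 in t2.items():
            if z1 == z2:  # sum over the shared intermediate string z
                result[(x, y)] = result.get((x, y), 0.0) + w1 * w2
    return result

# T relates "abab" to each of its bigrams with weight = occurrence count:
# "ab" occurs twice, "ba" once. t_inv is its inverse (swapped pairs).
t = {("abab", "ab"): 2.0, ("abab", "ba"): 1.0}
t_inv = {(z, x): w for (x, z), w in t.items()}

u = compose(t, t_inv)  # U = T ∘ T^{-1}, a self-similarity as in Theorem 1
```

U("abab", "abab") = 2·2 + 1·1 = 5, the sum over shared bigrams of squared counts.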
Thus, we refer to the rational kernels K defined by a transducer U = T ∘ T^{-1} as PDS rational kernels. To ensure the finiteness of the kernel values, we also assume that T does not admit any cycle with input ε. This implies that for any x ∈ Σ* there are finitely many sequences z ∈ Σ* for which T(x, z) ≠ 0.

1) Algorithm for constructing rational kernels: Let K be a rational kernel and let T be the associated weighted transducer. Let A and B be two acyclic weighted automata that represent just two strings x, y ∈ Σ*, or possibly more complex weighted automata. By the definition of rational kernels (Theorem 1) and the shortest-distance algorithm [3], K(A, B) can be computed by:
Constructing the composed transducer N = A ∘ T ∘ B.
Computing w[N], the shortest distance from the initial states of N to its final states, using the shortest-distance algorithm [3].
Computing ψ(w[N]), where ψ : K → R is a function such that K(x, y) = ψ(U(x, y)) [3].

3.3 N-gram Kernel
The n-gram kernels measure similarity by counting the common n-grams shared by the documents: the similarity is the sum over shared contiguous n-grams of the products of their counts. The n-gram kernels can be efficiently built from their corresponding n-gram count transducers. To construct the n-gram kernel, the algorithm described above suffices; the only modification needed is that the transducer T should be an n-gram counting transducer. The count-based similarity with an n-gram count transducer T_n can be given as:
A ∘ T_n: expected counts of the n-grams in A
T_n^{-1} ∘ B: expected counts of the n-grams in B
A ∘ T_n ∘ B: expected counts of the matching n-grams in A and B
Thus the similarity based on shared n-grams can be efficiently computed.

3.4 Gappy-n-gram Kernel
The gappy-n-gram kernel works similarly to the n-gram kernel but in a wider context: it takes the discontinuous n-grams shared between the documents as the measure of similarity, so n-grams with internal gaps are also taken into account. For this kernel, in addition to the n-gram length there is another parameter, the decay factor λ. The value of λ lies between zero and one, and for each gap the count is multiplied by this decay factor. Thus the larger the gap incorporated in an n-gram, the less important it is considered to be. The gappy-n-gram kernels can also be created with transducers, provided there are extra self-loops at each state with weight equal to the decay factor; this is done in order to include the gaps in the n-gram kernel. The rest of the kernel construction and similarity measure is the same as for n-gram kernels. The computational cost is much higher for gappy-n-grams, since they induce a much larger feature space.

Consider the three strings cat, car, and cast. The feature spaces generated by the two string kernels are:

Bigram features
        ca   at   ar   as   st
car      1    0    1    0    0
cat      1    1    0    0    0
cast     1    0    0    1    1

Gappy bigram features
        ca   at   ar   as   st   cr   ct   cs
car      1    0    1    0    0    λ    0    0
cat      1    1    0    0    0    0    λ    0
cast     1    λ    0    1    1    0    λ²   λ

Thus cat and cast share only a single bigram under the n-gram kernel, but the gappy-n-gram kernel gives a wider elaboration: it also considers the influence of discontinuous bigrams in the similarity measure, with the decay factor penalizing each gap.

4. String Kernel Classification System

The string-kernel-based classification system is a supervised system. It processes the documents with the help of string kernels and classifies them with an SVM. The different steps in constructing the system are given below.

4.1 Preparing the Data
Every document is converted to its finite-state transducer (FST) representation. This conversion is necessary since the kernel computations are done in terms of transducer composition.
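The quantities these transducer compositions compute can be sketched directly on strings. The plain-Python code below is an illustrative stand-in for the FST implementation: it reproduces the n-gram similarity of Section 3.3, the gappy bigram table above, and the kernel matrix over the toy strings cat, car, cast.

```python
from collections import Counter

def ngram_kernel(x, y, n):
    """Section 3.3: sum over shared contiguous n-grams of count products."""
    cx = Counter(x[i:i + n] for i in range(len(x) - n + 1))
    cy = Counter(y[i:i + n] for i in range(len(y) - n + 1))
    return sum(c * cy[g] for g, c in cx.items() if g in cy)

def gappy_bigram_features(s, lam):
    """Section 3.4: every (possibly non-contiguous) character pair
    contributes lam ** gaps, where gaps counts skipped characters."""
    feats = {}
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            gram = s[i] + s[j]
            feats[gram] = feats.get(gram, 0.0) + lam ** (j - i - 1)
    return feats

def gappy_bigram_kernel(x, y, lam):
    """Inner product of two gappy bigram feature vectors."""
    fx, fy = gappy_bigram_features(x, lam), gappy_bigram_features(y, lam)
    return sum(w * fy[g] for g, w in fx.items() if g in fy)

docs = ["cat", "car", "cast"]
# The N x N kernel matrix of Section 4.3, here via the bigram kernel.
K = [[ngram_kernel(a, b, 2) for b in docs] for a in docs]

f_cast = gappy_bigram_features("cast", 0.5)  # row "cast" of the table
```

For cat and cast, the contiguous bigram kernel sees only the shared bigram ca, while the gappy kernel also credits the discontinuous pairs ct and at, each discounted by λ per gap.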
Every transition in each FST represents a transition from one ASCII character to another. The weight of each transition is calculated in terms of negative log probabilities. The alphabet is taken as the entire character set.

4.2 Creating the String Kernels
Both n-gram and gappy-n-gram kernels are created from the entire dataset. The n-gram kernel is formed with the help of n-gram transducers, which count every accepted n-gram; using transducer composition the corresponding n-gram kernels can be generated. The text documents, which have been converted to FSTs, are then composed with these transducers in order to get the kernel values. Thus each document gets mapped to both the n-gram and the gappy-n-gram feature space.

4.3 Evaluating the Kernels for the Dataset
Evaluating the n-gram kernel amounts to creating the kernel matrices. By applying the kernel function to N input strings, represented as automata, we generate the N × N kernel matrix. The matrix is generated by simply taking the dot product of the feature vectors corresponding to the input strings.

4.4 Training and Testing using SVM
For the training of the system an SVM can readily be used. Classification takes place according to the structural risk minimization principle and the maximum margin criterion [10]. The training and testing are done in a transductive way [4]. In this setting, optimizing the kernel K corresponds to choosing a kernel matrix formed using the entire dataset. This matrix consists of a training-data block, a mixed training/test-data block, and a test-data block, as in [2]. In the transductive setting, the training- and test-data blocks are entangled: tuning the training-data entries of K (to optimize their embedding) implies that the test-data entries are automatically tuned in some way as well [2]. Overfitting is prevented, and good generalization on test data achieved, by constraining the capacity of the search space of possible kernel matrices.

4.5 Evaluation Measures
After obtaining the predicted labels for the test documents, the test accuracy is used as an evaluation measure. In addition, the F1 measure is taken into account. The F1 measure is a trade-off between the precision and recall of the entire system:

F1 = (2 × Precision × Recall) / (Precision + Recall).

A good classification system has both high precision and high recall, and thus a high F1 value.

5. Multiple Kernel Learning

Multiple Kernel Learning (MKL) learns a (linear or nonlinear) combination of kernels with the aim of achieving better results than learning with a single kernel. All kernel-based methods can potentially be extended to the MKL framework. Given a training set S = {(x_1, y_1), ..., (x_n, y_n)} and a set of base kernels {K_1, ..., K_M}, with each K_k ∈ R^{n×n} positive semidefinite, the objective of MKL is to optimize a cost function Q(K, S), where K is a combination of the base kernels, for example K = Σ_{k=1}^{M} μ_k K_k with μ_k ≥ 0 [2].
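A nonnegative combination of PSD base kernels is again a valid (PSD) kernel, which is why the constraint μ_k ≥ 0 appears above. A two-matrix sketch with hypothetical toy matrices, checking the quadratic-form condition of equation (1):

```python
def combine(kernels, mus):
    """K = sum_k mu_k * K_k with mu_k >= 0 (the MKL combined kernel)."""
    assert all(mu >= 0 for mu in mus), "mixture weights must be nonnegative"
    n = len(kernels[0])
    return [[sum(mu * Kk[i][j] for mu, Kk in zip(mus, kernels))
             for j in range(n)] for i in range(n)]

def quad_form(K, c):
    """c^T K c, which equation (1) requires to be >= 0 for a PSD kernel."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

K1 = [[2.0, 1.0], [1.0, 2.0]]  # PSD (eigenvalues 1 and 3)
K2 = [[1.0, 0.0], [0.0, 1.0]]  # PSD (identity)
K = combine([K1, K2], [0.3, 0.7])
```

Here K = 0.3·K1 + 0.7·K2 = [[1.3, 0.3], [0.3, 1.3]], and c^T K c stays nonnegative for any c.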
In MKL, the combined kernel matrix corresponding to the entire dataset is learned by optimizing a cost function that depends on the available labels. The available labels are used to learn a good embedding, which is applied to both the labeled and the unlabeled data; the resulting kernel matrix can then be used in combination with a support vector machine (SVM). Both one-stage and two-stage algorithms are used in MKL. A one-stage method minimizes an objective function with respect to both the kernel combination parameters and the hypothesis chosen [2]. The two-stage algorithms [7] learn kernels in the form of linear combinations of p base kernels K_k, k ∈ [1, p]. In all cases, the final hypothesis learned belongs to the reproducing kernel Hilbert space associated with a kernel K_μ = Σ_{k=1}^{p} μ_k K_k, where the mixture weights are selected subject to the condition μ_k ≥ 0, which guarantees that K_μ is a PDS kernel, and a condition on the norm of μ, ||μ|| = Λ ≥ 0, where Λ is a regularization parameter [7]. In the first stage, these algorithms determine the mixture weights; in the second stage, they train a kernel-based algorithm. The three MKL algorithms used for kernel combination are described below.

5.1 Uniform Combination (unif)
The kernels are combined with uniform weights. In this most straightforward method, equal mixture weights are chosen, so the combined kernel matrix is K = (Λ/p) Σ_{k=1}^{p} K_k [7].

5.2 Alignment-based Combination (align)
This method uses the training sample to independently compute the alignment between each kernel matrix K_k and the target kernel matrix K_Y = y y^T, based on the labels y, and chooses each mixture weight μ_k proportional to that alignment. Thus the resulting kernel matrix is K ∝ Σ_{k=1}^{p} ρ(K_k, K_Y) K_k [7].

5.3 Linear Combination (lin1)
In this algorithm a positive linear combination of kernels [4] is taken, and the regularization restricts the trace of the kernel matrix.
Let {K_1, ..., K_m} be the kernels to be combined. The combination is given as K = Σ_{i=1}^{m} μ_i K_i, with K ⪰ 0 and trace(K) ≤ c. The set {K_1, ..., K_m} could be a set of initial guesses of the kernel matrix with different kernel parameter values. Instead of fine-tuning the kernel parameter for a given kernel using cross-validation, we can evaluate the given kernel for a range of kernel parameters and then optimize the weights in the linear combination of the obtained kernel matrices.
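The align weights and the trace constraint of lin1 can be sketched together. Note the hedges: alignment is taken here in its common form as a normalized Frobenius inner product (the exact normalization and centering in [7] may differ), and lin1 in [2], [4] optimizes the weights by semidefinite programming, whereas this sketch simply plugs the align weights into a trace-normalized positive combination to show the form of the constraint. The label vector and base matrices are toy examples.

```python
import math

def frobenius(A, B):
    """Frobenius inner product <A, B>_F of two square matrices."""
    n = len(A)
    return sum(A[i][j] * B[i][j] for i in range(n) for j in range(n))

def alignment(K, Ky):
    """Alignment rho(K, K_Y) = <K, K_Y>_F / (|K|_F * |K_Y|_F)."""
    return frobenius(K, Ky) / math.sqrt(frobenius(K, K) * frobenius(Ky, Ky))

def align_weights(kernels, y):
    """Section 5.2: mu_k proportional to alignment with K_Y = y y^T."""
    Ky = [[yi * yj for yj in y] for yi in y]
    return [alignment(K, Ky) for K in kernels]

def trace(K):
    return sum(K[i][i] for i in range(len(K)))

def combine_trace_normalized(kernels, mus):
    """Positive combination of trace-normalized kernels, so that
    trace(K) = sum(mus) stays bounded, mirroring trace(K) <= c in lin1."""
    assert all(mu >= 0 for mu in mus)
    n = len(kernels[0])
    normed = [[[v / trace(K) for v in row] for row in K] for K in kernels]
    return [[sum(mu * Kn[i][j] for mu, Kn in zip(mus, normed))
             for j in range(n)] for i in range(n)]

y = [1, 1, -1]
K1 = [[1, 1, -1], [1, 1, -1], [-1, -1, 1]]  # identical to y y^T: alignment 1
K2 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]      # identity: weaker alignment
mus = align_weights([K1, K2], y)
K = combine_trace_normalized([K1, K2], mus)
```

K1, being exactly y y^T, gets the maximal weight 1, while the uninformative identity kernel gets a smaller weight of 1/√3.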
6. Experiments and Results

For the experiments on string kernels, a subset of the Reuters-21578 dataset with the ModApte split is used. The dataset contains a total of 466 documents over four categories: acquisition (acq), corn, crude, and earn. From the 466 documents, 377 were selected for training (154 earn, 114 acq, 76 crude, 38 corn) and the remaining 89 documents (42 earn, 26 acq, 15 crude, 10 corn) constitute the test set. The string kernels constructed on the dataset are the gappy-n-gram kernel and the n-gram kernel, with the n-gram length varying over 3, 4, 5, 6, 7, 8. The decay parameter for the gappy-n-gram kernel was set to 0.5, and the classifier parameter to 1. The results for n-gram and gappy-n-gram classification are given in Table 1; only the best n-gram performance is reported.

Table 1. Results (F1, precision, recall, accuracy %) per category on the subset of the Reuters-21578 dataset for the n-gram and gappy-n-gram kernels

The classification accuracy was found to decrease as the string length of the kernel increased, and increasing the decay parameter also decreased the accuracy. Feature combination and the weighted combination of kernels did not give significant improvements in classification accuracy, but kernel concatenation did. The results with individual kernels are given in Table 2, and the improvement in accuracy obtained by concatenating the n-gram and gappy-n-gram kernels is given in Table 3; through concatenation, all categories show a significant change in accuracy. The kernel combination algorithms used belong to both the one-stage (lin1) and two-stage (unif, align) families of learning algorithms.

Table 2. Classification accuracy (%) with individual kernels
Before the algorithms are applied, each base kernel is centered and normalized to have trace equal to one. The results are reported in Table 4. The one-stage algorithm does not bring any improvement in accuracy, but the remaining algorithms showed improvement over the individual kernels. For this set of experiments only the 3-, 4-, and 5-gram kernels are used, since the remaining kernels seemed to contribute little when combined.

Table 3. Classification accuracy (%) with combined kernels
Table 4. Results obtained by applying the kernel combination algorithms (unif, lin1, align)

7. Conclusion

The n-gram kernel and gappy-n-gram kernel based classification system delivered good performance. The performance of the two string kernels is comparable; thus the gappy-n-gram kernel is found worthwhile for analyzing text documents in a wider context. The results achieved on the Reuters subset were comparable to those reported in [1]. A few differences exist, however, since the exact documents used in [1] are not used in this work, and the preprocessing applied there is not used here. The classification accuracy of the system was found to increase with kernel concatenation and with the algorithmic combination of string kernels. The experiments conducted using kernel combination algorithms show the two-stage algorithms to be more efficient than the one-stage algorithm on this dataset.

References

[1] Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C. (2002). Text classification using string kernels, J. Mach. Learn. Res., 2, Mar. 2002.
[2] Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming, J. Mach. Learn. Res., 5, Dec. 2004.
[3] Cortes, C., Haffner, P., Mohri, M. (2004). Rational kernels: Theory and algorithms, Journal of Machine Learning Research, 5.
[4] Martins, A. (2006). String kernels and similarity measures for information retrieval, Tech. Rep.
[5] Cortes, C., Mohri, M., Rostamizadeh, A. (2008). Learning sequence kernels, Oct. 2008.
[6] Ben-Hur, A., Weston, J. (2010). A user's guide to support vector machines, Methods in Molecular Biology, 609.
[7] Cortes, C., Mohri, M., Rostamizadeh, A. Two-stage learning kernel algorithms.
[8] Cortes, C., Mohri, M. (2009). Learning with weighted transducers, In: Proceedings of the 2009 Conference on Finite-State Methods and Natural Language Processing: Post-proceedings of the 7th International Workshop FSMNLP, Amsterdam, The Netherlands: IOS Press.
More informationOpinion Mining by Transformation-Based Domain Adaptation
Opinion Mining by Transformation-Based Domain Adaptation Róbert Ormándi, István Hegedűs, and Richárd Farkas University of Szeged, Hungary {ormandi,ihegedus,rfarkas}@inf.u-szeged.hu Abstract. Here we propose
More informationEfficient Iterative Semi-supervised Classification on Manifold
. Efficient Iterative Semi-supervised Classification on Manifold... M. Farajtabar, H. R. Rabiee, A. Shaban, A. Soltani-Farani Sharif University of Technology, Tehran, Iran. Presented by Pooria Joulani
More informationGenerative and discriminative classification techniques
Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14
More informationArithmetic in Quaternion Algebras
Arithmetic in Quaternion Algebras Graduate Algebra Symposium Jordan Wiebe University of Oklahoma November 5, 2016 Jordan Wiebe (University of Oklahoma) Arithmetic in Quaternion Algebras November 5, 2016
More informationSupport Vector Machines
Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines CS 536: Machine Learning Littman (Wu, TA) Administration Slides borrowed from Martin Law (from the web). 1 Outline History of support vector machines (SVM) Two classes,
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:
More informationModule 4. Non-linear machine learning econometrics: Support Vector Machine
Module 4. Non-linear machine learning econometrics: Support Vector Machine THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Introduction When the assumption of linearity
More informationSupport Vector Machines and their Applications
Purushottam Kar Department of Computer Science and Engineering, Indian Institute of Technology Kanpur. Summer School on Expert Systems And Their Applications, Indian Institute of Information Technology
More informationKeyword Extraction by KNN considering Similarity among Features
64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,
More informationLecture 9: Support Vector Machines
Lecture 9: Support Vector Machines William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 8 What we ll learn in this lecture Support Vector Machines (SVMs) a highly robust and
More informationFraud Detection using Machine Learning
Fraud Detection using Machine Learning Aditya Oza - aditya19@stanford.edu Abstract Recent research has shown that machine learning techniques have been applied very effectively to the problem of payments
More informationExponentiated Gradient Algorithms for Large-margin Structured Classification
Exponentiated Gradient Algorithms for Large-margin Structured Classification Peter L. Bartlett U.C.Berkeley bartlett@stat.berkeley.edu Ben Taskar Stanford University btaskar@cs.stanford.edu Michael Collins
More informationRule extraction from support vector machines
Rule extraction from support vector machines Haydemar Núñez 1,3 Cecilio Angulo 1,2 Andreu Català 1,2 1 Dept. of Systems Engineering, Polytechnical University of Catalonia Avda. Victor Balaguer s/n E-08800
More informationSupport Vector Machines.
Support Vector Machines srihari@buffalo.edu SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions
More informationSketchable Histograms of Oriented Gradients for Object Detection
Sketchable Histograms of Oriented Gradients for Object Detection No Author Given No Institute Given Abstract. In this paper we investigate a new representation approach for visual object recognition. The
More informationArithmetic in Quaternion Algebras
Arithmetic in Quaternion Algebras 31st Automorphic Forms Workshop Jordan Wiebe University of Oklahoma March 6, 2017 Jordan Wiebe (University of Oklahoma) Arithmetic in Quaternion Algebras March 6, 2017
More informationClassification. 1 o Semestre 2007/2008
Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class
More informationMulti-label classification using rule-based classifier systems
Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar
More informationDS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University
DS 4400 Machine Learning and Data Mining I Alina Oprea Associate Professor, CCIS Northeastern University September 20 2018 Review Solution for multiple linear regression can be computed in closed form
More informationTransductive Learning: Motivation, Model, Algorithms
Transductive Learning: Motivation, Model, Algorithms Olivier Bousquet Centre de Mathématiques Appliquées Ecole Polytechnique, FRANCE olivier.bousquet@m4x.org University of New Mexico, January 2002 Goal
More informationPositive Definite Kernel Functions on Fuzzy Sets
Positive Definite Kernel Functions on Fuzzy Sets FUZZ 2014 Jorge Guevara Díaz 1 Roberto Hirata Jr. 1 Stéphane Canu 2 1 Institute of Mathematics and Statistics University of Sao Paulo-Brazil 2 Institut
More informationSupport Vector Machines
Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining
More informationDivide and Conquer Kernel Ridge Regression
Divide and Conquer Kernel Ridge Regression Yuchen Zhang John Duchi Martin Wainwright University of California, Berkeley COLT 2013 Yuchen Zhang (UC Berkeley) Divide and Conquer KRR COLT 2013 1 / 15 Problem
More informationProfile-based String Kernels for Remote Homology Detection and Motif Extraction
Profile-based String Kernels for Remote Homology Detection and Motif Extraction Ray Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund and Christina Leslie. Department of Computer Science
More informationSupport Vector Machines (SVM)
Support Vector Machines a new classifier Attractive because (SVM) Has sound mathematical foundations Performs very well in diverse and difficult applications See paper placed on the class website Review
More informationSupport Vector Machines for Face Recognition
Chapter 8 Support Vector Machines for Face Recognition 8.1 Introduction In chapter 7 we have investigated the credibility of different parameters introduced in the present work, viz., SSPD and ALR Feature
More informationScale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract
Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses
More informationIncorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data
Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data Ryan Atallah, John Ryan, David Aeschlimann December 14, 2013 Abstract In this project, we study the problem of classifying
More informationKernels and representation
Kernels and representation Corso di AA, anno 2017/18, Padova Fabio Aiolli 20 Dicembre 2017 Fabio Aiolli Kernels and representation 20 Dicembre 2017 1 / 19 (Hierarchical) Representation Learning Hierarchical
More informationModifying Kernels Using Label Information Improves SVM Classification Performance
Modifying Kernels Using Label Information Improves SVM Classification Performance Renqiang Min and Anthony Bonner Department of Computer Science University of Toronto Toronto, ON M5S3G4, Canada minrq@cs.toronto.edu
More informationData mining with Support Vector Machine
Data mining with Support Vector Machine Ms. Arti Patle IES, IPS Academy Indore (M.P.) artipatle@gmail.com Mr. Deepak Singh Chouhan IES, IPS Academy Indore (M.P.) deepak.schouhan@yahoo.com Abstract: Machine
More informationImproving Image Segmentation Quality Via Graph Theory
International Symposium on Computers & Informatics (ISCI 05) Improving Image Segmentation Quality Via Graph Theory Xiangxiang Li, Songhao Zhu School of Automatic, Nanjing University of Post and Telecommunications,
More informationKernel SVM. Course: Machine Learning MAHDI YAZDIAN-DEHKORDI FALL 2017
Kernel SVM Course: MAHDI YAZDIAN-DEHKORDI FALL 2017 1 Outlines SVM Lagrangian Primal & Dual Problem Non-linear SVM & Kernel SVM SVM Advantages Toolboxes 2 SVM Lagrangian Primal/DualProblem 3 SVM LagrangianPrimalProblem
More informationKernel Principal Component Analysis: Applications and Implementation
Kernel Principal Component Analysis: Applications and Daniel Olsson Royal Institute of Technology Stockholm, Sweden Examiner: Prof. Ulf Jönsson Supervisor: Prof. Pando Georgiev Master s Thesis Presentation
More informationBagging for One-Class Learning
Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one
More informationClassification by Support Vector Machines
Classification by Support Vector Machines Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Practical DNA Microarray Analysis 2003 1 Overview I II III
More informationSecond Order SMO Improves SVM Online and Active Learning
Second Order SMO Improves SVM Online and Active Learning Tobias Glasmachers and Christian Igel Institut für Neuroinformatik, Ruhr-Universität Bochum 4478 Bochum, Germany Abstract Iterative learning algorithms
More informationSupervised vs unsupervised clustering
Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful
More informationSEAFLOOR SEDIMENT CLASSIFICATION OF SONAR IMAGES
SEAFLOOR SEDIMENT CLASSIFICATION OF SONAR IMAGES Mrs K.S.Jeen Marseline 1, Dr.C.Meena 2 1 Assistant Professor, Sri Krishna Arts & Science College, Coimbatore 2 Center Head Avinashilingam Institute For
More informationA Comparative Study of SVM Kernel Functions Based on Polynomial Coefficients and V-Transform Coefficients
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 6 Issue 3 March 2017, Page No. 20765-20769 Index Copernicus value (2015): 58.10 DOI: 18535/ijecs/v6i3.65 A Comparative
More informationContent-based image and video analysis. Machine learning
Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all
More informationUnsupervised Feature Selection for Sparse Data
Unsupervised Feature Selection for Sparse Data Artur Ferreira 1,3 Mário Figueiredo 2,3 1- Instituto Superior de Engenharia de Lisboa, Lisboa, PORTUGAL 2- Instituto Superior Técnico, Lisboa, PORTUGAL 3-
More informationThe Effects of Outliers on Support Vector Machines
The Effects of Outliers on Support Vector Machines Josh Hoak jrhoak@gmail.com Portland State University Abstract. Many techniques have been developed for mitigating the effects of outliers on the results
More informationKernel Methods & Support Vector Machines
& Support Vector Machines & Support Vector Machines Arvind Visvanathan CSCE 970 Pattern Recognition 1 & Support Vector Machines Question? Draw a single line to separate two classes? 2 & Support Vector
More informationFacial expression recognition using shape and texture information
1 Facial expression recognition using shape and texture information I. Kotsia 1 and I. Pitas 1 Aristotle University of Thessaloniki pitas@aiia.csd.auth.gr Department of Informatics Box 451 54124 Thessaloniki,
More informationApprenticeship Learning for Reinforcement Learning. with application to RC helicopter flight Ritwik Anand, Nick Haliday, Audrey Huang
Apprenticeship Learning for Reinforcement Learning with application to RC helicopter flight Ritwik Anand, Nick Haliday, Audrey Huang Table of Contents Introduction Theory Autonomous helicopter control
More informationAn introduction to random forests
An introduction to random forests Eric Debreuve / Team Morpheme Institutions: University Nice Sophia Antipolis / CNRS / Inria Labs: I3S / Inria CRI SA-M / ibv Outline Machine learning Decision tree Random
More informationThe Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem
Int. J. Advance Soft Compu. Appl, Vol. 9, No. 1, March 2017 ISSN 2074-8523 The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem Loc Tran 1 and Linh Tran
More informationSupport Vector Machines + Classification for IR
Support Vector Machines + Classification for IR Pierre Lison University of Oslo, Dep. of Informatics INF3800: Søketeknologi April 30, 2014 Outline of the lecture Recap of last week Support Vector Machines
More informationData Mining in Bioinformatics Day 1: Classification
Data Mining in Bioinformatics Day 1: Classification Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institute Tübingen and Eberhard Karls
More information9. Support Vector Machines. The linearly separable case: hard-margin SVMs. The linearly separable case: hard-margin SVMs. Learning objectives
Foundations of Machine Learning École Centrale Paris Fall 25 9. Support Vector Machines Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech Learning objectives chloe agathe.azencott@mines
More informationText Classification using String Kernels
Text Classification using String Kernels HUlna Lodhi John Shawe-Taylor N ello Cristianini Chris Watkins Department of Computer Science Royal Holloway, University of London Egham, Surrey TW20 OEX, UK {huma,
More informationA Review on Plant Disease Detection using Image Processing
A Review on Plant Disease Detection using Image Processing Tejashri jadhav 1, Neha Chavan 2, Shital jadhav 3, Vishakha Dubhele 4 1,2,3,4BE Student, Dept. of Electronic & Telecommunication Engineering,
More informationSemi supervised clustering for Text Clustering
Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering
More informationLinear Models. Lecture Outline: Numeric Prediction: Linear Regression. Linear Classification. The Perceptron. Support Vector Machines
Linear Models Lecture Outline: Numeric Prediction: Linear Regression Linear Classification The Perceptron Support Vector Machines Reading: Chapter 4.6 Witten and Frank, 2nd ed. Chapter 4 of Mitchell Solving
More informationSupport Vector Machine Learning for Interdependent and Structured Output Spaces
Support Vector Machine Learning for Interdependent and Structured Output Spaces I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, ICML, 2004. And also I. Tsochantaridis, T. Joachims, T. Hofmann,
More informationMultiple cosegmentation
Armand Joulin, Francis Bach and Jean Ponce. INRIA -Ecole Normale Supérieure April 25, 2012 Segmentation Introduction Segmentation Supervised and weakly-supervised segmentation Cosegmentation Segmentation
More informationCHAPTER 3 FUZZY RELATION and COMPOSITION
CHAPTER 3 FUZZY RELATION and COMPOSITION The concept of fuzzy set as a generalization of crisp set has been introduced in the previous chapter. Relations between elements of crisp sets can be extended
More informationChoosing the kernel parameters for SVMs by the inter-cluster distance in the feature space Authors: Kuo-Ping Wu, Sheng-De Wang Published 2008
Choosing the kernel parameters for SVMs by the inter-cluster distance in the feature space Authors: Kuo-Ping Wu, Sheng-De Wang Published 2008 Presented by: Nandini Deka UH Mathematics Spring 2014 Workshop
More informationDECISION TREE INDUCTION USING ROUGH SET THEORY COMPARATIVE STUDY
DECISION TREE INDUCTION USING ROUGH SET THEORY COMPARATIVE STUDY Ramadevi Yellasiri, C.R.Rao 2,Vivekchan Reddy Dept. of CSE, Chaitanya Bharathi Institute of Technology, Hyderabad, INDIA. 2 DCIS, School
More information