The Comparative Study of Machine Learning Algorithms in Text Data Classification*

Wang Xin
School of Science, Beijing Information Science and Technology University
Beijing, China

Abstract
Classification is one of the most important research fields in data mining. Accurate classification of text data is an important basis for information processing and text retrieval, and has a wide range of applications. Traditional text categorization models based on knowledge engineering and expert systems lack flexibility, so it is of great theoretical and practical significance to study the performance of machine learning algorithms in text categorization. In this paper, spam filtering methods based on several machine learning algorithms are discussed. The Python language was used to classify the text data. The performance of four algorithms in text classification was then compared: the polynomial (multinomial) model of the naive Bayesian algorithm, the Bernoulli model of the naive Bayesian algorithm, the support vector machine algorithm and the K-nearest neighbor algorithm. To filter out noise words that appear frequently in the text but carry no valid information, we applied the Chi-square test to reduce the feature dimension, which improved both the performance of the classification algorithms and the running speed of the classifiers. Furthermore, the accuracy, recall rate and F1-score of the four algorithms were compared under different feature dimensions. The numerical example showed that the support vector machine algorithm had the highest accuracy in text categorization, but it ran slowly. The naive Bayesian algorithm was simple and fast, which gave it an obvious advantage.

Keywords
naive Bayesian algorithm; text classification; support vector machine; K-nearest neighbor; Chi-square test method

I. INTRODUCTION
With the rapid development of computer-related technology, the internet and its derivative resources produce huge amounts of text data.
How to classify text data logically according to the needs of consultation, storage and application has become an increasingly important issue. Therefore, data mining and classification technology based on text content has gradually become a focus of attention. The function of a text classification algorithm is to determine the category of a text according to some of its characteristics and a set of category labels given in advance. The traditional text classification method is based on knowledge engineering and expert systems, and has serious shortcomings in flexibility and classification quality. It is less and less suitable for the demands of increasingly complicated text classification systems. Since the 1990s, the application of machine learning to text classification has received extensive attention [1-7]. A variety of machine learning algorithms have been widely used in text classification research, such as decision trees, support vector machines, the naive Bayesian algorithm, the K-nearest neighbor algorithm, boosting and random forests.

The general principle of a text classification algorithm is that the system summarizes the regularities of classification from samples whose classes are known, and establishes discrimination rules from them. When new text data is encountered, its class is determined according to these rules. That is, automatic text classification constructs a classifier through supervised learning, so that newly given text can be categorized automatically.

This paper first introduces the classification rules and verification process of the naive Bayesian algorithm, the support vector machine and the K-nearest neighbor algorithm. It then uses a spam classification example to compare the performance and running speed of the naive Bayesian polynomial model, the naive Bayesian Bernoulli model, the support vector machine algorithm and the K-nearest neighbor algorithm.

II.
ALGORITHM CLASSIFICATION RULE

Suppose the input space $\mathcal{X} \subseteq \mathbf{R}^n$ is the set of $n$-dimensional vectors and the output space $\mathcal{Y} = \{c_1, c_2, \dots, c_m\}$ is the set of classes. The input is a feature vector $x \in \mathcal{X}$, and the output $y \in \mathcal{Y}$ is a class label.

A. Naive Bayesian Algorithm
The naive Bayesian algorithm is a common classification algorithm that is easy to implement and efficient in both learning and prediction. Let $X$ be a random vector defined on the input space, $Y$ a random variable defined on the output space, and $P(X, Y)$ the joint probability distribution of $X$ and $Y$. The training data set

$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$$

is generated independently and identically distributed according to $P(X, Y)$. The algorithm learns the joint probability distribution from the training data set under the assumption that the features are conditionally independent given the class, that is, $P(X = x \mid Y = c_k) = \prod_{l=1}^{n} P(X^{(l)} = x^{(l)} \mid Y = c_k)$. For a given input vector $x$, the posterior probability is calculated according to Bayes' theorem, and the class with the largest posterior probability is used as the output:

$$y = \arg\max_{c_k} P(Y = c_k \mid X = x) = \arg\max_{c_k} \frac{P(X = x \mid Y = c_k)\, P(Y = c_k)}{\sum_{j=1}^{m} P(X = x \mid Y = c_j)\, P(Y = c_j)}, \quad k = 1, 2, \dots, m$$

Wang xin volume III issue xi nov 2017 Page 42

B. K-Nearest Neighbor Algorithm
The K-nearest neighbor algorithm is one of the simplest machine learning algorithms. Its basic idea is to find the $K$ samples nearest to the input instance $x$ in the training data set $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, and to assign $x$ to the class that accounts for the largest proportion of these $K$ samples. The common distance measure is

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{1/p}$$

where $x_i^{(l)}$ is the $l$th component of the vector $x_i$ and $p \ge 1$; $L_p$ is the Euclidean distance when $p = 2$. When $K$ equals the number $N$ of training samples, any input instance is assigned to the class with the largest proportion in the training data set. When $K = 1$, the input instance $x$ is assigned to the class of its nearest neighbor. In general, a relatively small and appropriate value of $K$ is chosen by cross-validation.

C. Support Vector Machine Algorithm
The support vector machine (SVM) is a commonly used algorithm in machine learning. It was originally proposed for two-class classification problems and, after many years of research, has been extended to multi-class problems. The basic principle is to find the maximum-margin separating hyperplane in the feature space, which divides the samples of the training data set into two classes. By seeking the minimum structural risk, SVM improves the generalization ability of the learner and minimizes both the empirical risk and the confidence interval. It can therefore obtain good statistical rules even when the sample size is small [8]. For a training data set $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where $y_i \in \{+1, -1\}$, the following optimization problem is constructed and solved:

$$\min_{w, b} \ \frac{1}{2} \|w\|^2$$
$$\text{s.t.} \quad y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \dots, N$$

The optimal solution $(w^*, b^*)$ is then obtained, and the classification decision function can be expressed as

$$f(x) = \operatorname{sign}(w^* \cdot x + b^*)$$

D.
Evaluation Indexes of Algorithm

For classification algorithms, especially two-class algorithms, evaluation indexes such as accuracy, recall rate and the comprehensive evaluation index (F1-measure) are commonly used. The four possible outcomes of the classifier on the test data set are counted as follows:

TP: the number of positive instances classified as positive;
FN: the number of positive instances classified as negative;
FP: the number of negative instances classified as positive;
TN: the number of negative instances classified as negative.

The accuracy, which is defined here as the precision, is the proportion of correctly classified positive instances among all instances predicted to be positive:

$$P = \frac{TP}{TP + FP}$$

The recall rate is the proportion of correctly classified positive instances among all instances that are actually positive:

$$R = \frac{TP}{TP + FN}$$

The comprehensive evaluation index is the weighted harmonic mean of accuracy and recall rate:

$$F_\beta = \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R}$$

For simplicity, $\beta = 1$ is generally used, which gives the F1-measure. The confusion matrix is often used to inspect the classification results of a classifier visually. It is shown in Table I.

TABLE I. THE CONFUSION MATRIX

                          Prediction result
                          Positive    Negative
True result   Positive    TP          FN
              Negative    FP          TN

III. CONSTRUCTION OF CLASSIFIER

In order to evaluate the performance of the model better, it is necessary to validate the model. Before training, the whole data set is divided into a training data set and a test data set. To reduce the error introduced by a single split (simple cross-validation), this paper uses 10-fold cross-validation: the data set is divided randomly into ten subsets and the classification model is tested ten times. In each of the ten tests, nine of the ten subsets are used as the training set and the remaining subset is used as the test set.
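As a rough illustration of the evaluation indexes and the fold construction described above, the following standard-library Python sketch computes $P$, $R$ and $F_\beta$ from confusion-matrix counts and builds shuffled 10-fold splits. The helper names `precision_recall_f` and `k_fold_indices` are hypothetical, and reporting an undefined ratio as 0.0 is an assumed convention; this is not the scikit-learn code used in this paper.

```python
import random


def precision_recall_f(tp, fn, fp, beta=1.0):
    """Return (P, R, F_beta) computed from confusion-matrix counts.

    P = TP / (TP + FP), R = TP / (TP + FN),
    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    A ratio with a zero denominator is reported as 0.0 (assumed convention).
    """
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * p + r
    f = (1 + beta ** 2) * p * r / denom if denom else 0.0
    return p, r, f


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # All folds except fold i form the training set of test i.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    print(precision_recall_f(tp=8, fn=2, fp=2))
    for train, test in k_fold_indices(25):
        print(len(train), len(test))
```

Each sample appears in exactly one test fold, so every e-mail is used for testing once and for training nine times.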
The accuracy, recall rate and F1-measure are calculated in each test, and the average of each index over the 10 tests is used as the evaluation index of model performance. In practice, the K-fold method of the model selection module in the third-party Python library scikit-learn is used to conduct the 10-fold cross-validation. The metrics module of scikit-learn is used to form the confusion matrix and analyze the performance of the classifier.

A serious problem in text classification is that taking the words of the text as features leads to the curse of dimensionality when the sample size is large, and training consequently becomes very slow. It is therefore necessary to reduce the feature dimension, which also helps the accuracy of the algorithm. The dimensionality reduction in this paper is mainly implemented by the Chi-square method in the feature selection module of scikit-learn. The processes of verification and application of the algorithm are shown in Fig. 1 and Fig. 2, respectively.

Fig. 1. Algorithm validation process (flowchart: raw data set; regular filtering and text segmentation; Chinese data set; vectorization; data matrix; matrix dimension reduction; training and testing data sets; training with classification algorithms; classification of test data sets; 10-fold cross validation; calculation of evaluation indexes; model evaluation)

Fig. 2. Algorithm application process (flowchart: raw data set and prediction sample; regular filtering and text segmentation; stop word list; vectorization; matrix dimension reduction; training data set; classifier; prediction matrix; classification result)

A.
Numerical Example

In recent years, e-mail has replaced traditional mail as a tool for people's daily communication because of its simplicity, convenience, fast propagation and wide dissemination. However, according to the "2016 spam and phishing attacks report" of Kaspersky Lab [9], about 20% of spam e-mails spread ransomware Trojans. Spam e-mails not only occupy storage space, but also carry commercial advertising, fraudulent information and even viruses, which seriously affects people's lives. Therefore, we use spam discrimination as the example for the text classification algorithms.

The numerical example uses 16000 e-mails as the training data. Because the format of the e-mails is complex, we extract the Chinese characters with the Python regular expression library. Because spam is sent repeatedly, the data contain duplicate items; after deleting them, 7062 distinct e-mails are retained in the data set. Word segmentation of each e-mail is done by a segmentation library, and the key words are extracted to form the feature matrix. Part of the feature matrix is shown in Table II.

TABLE II. THE FEATURE MATRIX

integration  demand  form  software  delete  consult  meeting  sponsor
0            0       0     0         0       1        0        0
0            0       0     0         1       0        0        0
0            0       0     0         0       0        1        0
0            0       1     0         0       0        0        1
1            0       0     0         0       0        0        0
0            1       0     0         0       0        0        0
0            0       0     1         0       0        0        0

Following the algorithm application process shown in Fig. 2, the naive Bayesian polynomial model, the naive Bayesian Bernoulli model, the support vector machine algorithm and the K-nearest neighbor algorithm are applied to the training data set, and the performances of the four algorithms are compared. The confusion matrices of the four algorithms are shown in Fig. 3 to Fig. 6.

Fig. 3. The confusion matrix of Naive Bayesian polynomial model

From Fig. 3 it can be calculated that the accuracy of the naive Bayesian polynomial model on this training data set is 0.963, the recall rate is 0.961, and the F1-measure is 0.962.

Fig. 4. The confusion matrix of Naive Bayesian Bernoulli model

From Fig. 4, the accuracy of the naive Bayesian Bernoulli model on this training data set is 0.948, the recall rate is 0.964, and the F1-measure is 0.956.

Fig. 5. The confusion matrix of support vector machine algorithm

From Fig. 5, the accuracy of the SVM algorithm on this training data set is 0.978, the recall rate is 0.975, and the F1-measure is 0.977.

Fig. 6. The confusion matrix of the K-nearest neighbor algorithm

From Fig. 6, the accuracy of the K-nearest neighbor algorithm on this training data set is 0.967, the recall rate is 0.882, and the F1-measure is 0.922.

The training times of the four algorithms are shown in Table III.

TABLE III. THE REQUIRED TRAINING TIME FOR THE FOUR ALGORITHMS

Algorithm                          Required time (seconds)
naive Bayesian polynomial model    6
naive Bayesian Bernoulli model     11
support vector machine             780
K-nearest neighbor                 447

According to the classification results, all three performance indexes of the SVM algorithm are better than those of the other models, but its training time is far longer, which will lead to a poor user experience on larger amounts of data in practical applications.
The naive Bayesian polynomial model has higher precision, a slightly lower recall rate, a higher F1-measure and a shorter training time than the naive Bayesian Bernoulli model, so overall the polynomial model is better than the Bernoulli model. Moreover, the K-nearest neighbor algorithm falls well behind the other algorithms in recall rate (0.882) and consequently in F1-measure (0.922).
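The comparison in this numerical example can be mimicked on a toy scale with scikit-learn. The sketch below is only an illustration under several assumptions, not the code behind the reported figures: the tiny English corpus is a made-up stand-in for the Chinese e-mail data, `MultinomialNB` is assumed to correspond to the polynomial model, the linear SVM kernel and the value of `N_FEATURES` are arbitrary choices, and the out-of-fold predictions are pooled into a single confusion matrix instead of averaging the indexes over the ten folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-in corpus (hypothetical): 10 "spam" and 10 "normal" messages.
SPAM = ["free", "winner", "prize", "cash", "offer", "click", "discount", "bonus"]
HAM = ["meeting", "report", "schedule", "project", "lunch", "review", "deadline", "notes"]
texts, labels = [], []
for i in range(10):
    texts.append(" ".join(SPAM[(i + j) % 8] for j in range(3)))
    labels.append(1)  # 1 = spam (positive class)
    texts.append(" ".join(HAM[(i + j) % 8] for j in range(3)))
    labels.append(0)  # 0 = normal mail

N_FEATURES = 12  # hypothetical number of features kept by the Chi-square filter

classifiers = {
    "naive Bayes (polynomial/multinomial)": MultinomialNB(),
    "naive Bayes (Bernoulli)": BernoulliNB(),
    "support vector machine": SVC(kernel="linear"),  # kernel choice is an assumption
    "K-nearest neighbor": KNeighborsClassifier(n_neighbors=4),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, clf in classifiers.items():
    # Vectorization and feature selection are refitted inside every fold,
    # so the held-out messages never influence the selected features.
    model = make_pipeline(CountVectorizer(), SelectKBest(chi2, k=N_FEATURES), clf)
    predicted = cross_val_predict(model, texts, labels, cv=cv)
    results[name] = {
        # Rows are true classes, columns predicted classes: [[TP, FN], [FP, TN]].
        "confusion": confusion_matrix(labels, predicted, labels=[1, 0]),
        "precision": precision_score(labels, predicted, zero_division=0),
        "recall": recall_score(labels, predicted, zero_division=0),
        "f1": f1_score(labels, predicted, zero_division=0),
    }

for name, scores in results.items():
    print(name, scores["precision"], scores["recall"], scores["f1"])
```

Wrapping the vectorizer, the Chi-square filter and the classifier in one pipeline keeps the three steps of Fig. 2 together, so the same object can later be fitted on all training e-mails and applied to new samples.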
B. Comparison of Algorithms in Different Dimensions

The Chi-square test used for dimensionality reduction has a sound statistical basis. It is a widely used hypothesis testing method, often applied to test the correlation between two categorical variables. The basic idea is to calculate the Chi-square value between the theoretical values and the observed values, $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, where $O_i$ denotes an observed value and $E_i$ the corresponding theoretical value. The smaller the Chi-square value, the smaller the deviation between the observed and theoretical values; conversely, the larger the Chi-square value, the greater the deviation between the observed and theoretical values.

In Fig. 7 to Fig. 9 the horizontal axis represents the feature dimension, and the vertical axes represent accuracy, recall rate and F1-measure, respectively. It can be seen that the naive Bayesian polynomial model, the naive Bayesian Bernoulli model and the support vector machine algorithm are almost unaffected by the dimension and have stable performance. The K-nearest neighbor algorithm is sensitive to the dimension, possibly because the effect of the value of K differs between dimensions. In this paper K=4 and 8000-28000 dimensions are chosen, a range in which the algorithm is relatively stable and performs better.

Fig. 9. The F1-measure of four algorithms under different dimensions

IV. CONCLUSION

Of the two models of the naive Bayesian algorithm, the polynomial model has higher accuracy and faster speed in text classification. Spam classification should pay more attention to accuracy, so that users still receive their normal mail while spam is removed effectively. The naive Bayesian algorithm has a simple training process; although it is less accurate than SVM, it is much faster. When the amount of data is huge, this speed advantage is particularly obvious. The Chi-square test can help the naive Bayesian algorithm to reduce the dimensionality and improve the performance of the algorithm.
It filters out noisy and meaningless words, and makes the algorithm more effective.

ACKNOWLEDGMENT
This research was supported by the National Natural Science Foundation of China (71501016) and Qin Xin Talents Cultivation Program (QXTCP B201705), Beijing Information Science & Technology University.

Fig. 7. The accuracy of four algorithms under different dimensions

Fig. 8. The recall rate of four algorithms under different dimensions

REFERENCES
[1] Y. Yang. An evaluation of statistical approaches to text categorization, Journal of Information Retrieval, vol. 1, pp. 69-90, 1999.
[2] F. Sebastiani. Machine learning in automated text categorization, Journal of ACM Computing Surveys, vol. 34, pp. 1-47, 2002.
[3] J. Sun, J. Xiao. Study on feedback learning of SVM-based Chinese text classification, Control and Decision, vol. 19, pp. 927-930, August 2004.
[4] H. Kim, P. Howland, H. Park. Dimension reduction in text classification with support vector machines, Journal of Machine Learning Research, vol. 6, pp. 37-53, 2005.
[5] Z. Yang. Research on text classification algorithms based on machine learning, University of Guangxi, 2007.
[6] X. Zhang. Review of machine learning in automatic text categorization, Journal of the China Society for Scientific and Technical Information, vol. 25, pp. 730-739, December 2006.
[7] J. Lai. Simulation research of text categorization based on data mining, Computer Simulation, vol. 28, pp. 195-198, December 2011.
[8] H. Li. Statistical learning method, Beijing: Tsinghua University Press, 2012.
[9] http://news.kaspersky.com.cn/news2017/02n/170220.htm. [2017-02-20].