The Comparative Study of Machine Learning Algorithms in Text Data Classification*


Wang Xin
School of Science, Beijing Information Science and Technology University, Beijing, China

Abstract

Classification algorithms form one of the most important research fields in data mining. Accurate classification of text data is an important basis for information processing and text retrieval technology and has a wide range of applications. Traditional text categorization models based on knowledge engineering and expert systems lack flexibility, so it is of great theoretical and practical significance to study the performance of machine learning algorithms in text categorization. In this paper, spam filtering methods using several machine learning algorithms are discussed. The Python language was applied to classify the text data, and the performance of different algorithms in text classification, namely the multinomial (polynomial) model of the naive Bayesian algorithm, the Bernoulli model of the naive Bayesian algorithm, the support vector machine algorithm and the K-nearest neighbor algorithm, was compared. In order to filter out noise words that appear frequently in the text but carry no valid information, we applied the Chi-square test method to reduce the feature dimension, which improved both the classification performance and the running speed of the classifiers. Furthermore, the accuracy, recall rate and F1-score of the above four algorithms under different feature dimensions were all compared. The numerical example showed that the support vector machine algorithm had higher accuracy in text categorization but ran slowly, while the naive Bayesian algorithm was simple, fast and clearly advantageous.

Keywords: naive Bayesian algorithm; text classification; support vector machine; K-nearest neighbor; Chi-square test method

I. INTRODUCTION

With the rapid development of computer-related technology, the internet and its derivative resources produce huge amounts of text data, and how to classify text data logically according to the needs of retrieval, storage and application has become an increasingly important issue. Data mining and classification technology based on text content has therefore gradually become a focus of attention. The task of a text classification algorithm is to determine the category of a text according to some of its characteristics and a set of category labels given in advance. The traditional text classification method, based on knowledge engineering and expert systems, has serious defects in flexibility and classification effectiveness, and is increasingly unsuited to the demands of ever more complicated text data classification systems. Since the 1990s, the application of machine learning to text classification has received extensive attention [1-7], and a variety of machine learning algorithms have been widely used in text classification research, such as decision trees, support vector machines, the naive Bayesian algorithm, the K-nearest neighbor algorithm, boosting algorithms and random forests. The general approach of a text classification algorithm is that the system summarizes the regularities of classification from sample data whose classes are known and establishes discrimination rules; when new text data is encountered, its class is determined according to these rules. That is, automatic text classification constructs a classifier through supervised learning, so as to categorize newly given text automatically.
This paper first introduces the classification rules and the validation process of the naive Bayesian algorithm, the support vector machine and the K-nearest neighbor algorithm; it then uses a spam classification example to compare the performance and running speed of the naive Bayesian multinomial model, the naive Bayesian Bernoulli model, the support vector machine algorithm and the K-nearest neighbor algorithm.

II. ALGORITHM CLASSIFICATION RULES

Suppose the input space $\mathcal{X} \subseteq \mathbb{R}^n$ is a set of $n$-dimensional vectors and the output space $\mathcal{Y} = \{c_1, c_2, \ldots, c_K\}$ is the set of classes. The input is a feature vector $x \in \mathcal{X}$, and the output variable $Y$ is a class label.

A. Naive Bayesian Algorithm

The naive Bayesian algorithm is a common classification algorithm; it is easy to implement and highly efficient in both learning and prediction. Let $X$ be a random vector defined on the input space, $Y$ a random variable defined on the output space, and $P(X, Y)$ the joint probability distribution of $X$ and $Y$. The training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ is generated independently and identically distributed according to $P(X, Y)$. The algorithm learns the joint probability distribution from the training data set under the assumption that the features are conditionally independent. For a given input vector $x$, the posterior probability of each class is calculated according to Bayes' theorem,

$$P(Y = c_k \mid X = x) = \frac{P(X = x \mid Y = c_k)\, P(Y = c_k)}{\sum_{k} P(X = x \mid Y = c_k)\, P(Y = c_k)}, \quad k = 1, 2, \ldots, K,$$

and the class with the largest posterior probability is taken as the output classification:

$$y = \arg\max_{c_k} P(Y = c_k \mid X = x).$$
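As a minimal sketch of this decision rule (not the paper's own code; the tiny word-count matrix, the labels and the smoothing constant are hypothetical), the multinomial model can be trained with scikit-learn, the library used later in this paper:

    # A minimal sketch of the naive Bayes decision rule with scikit-learn.
    # The word-count matrix and labels are hypothetical toy data.
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    X_train = np.array([[2, 0, 1],    # word counts of four short documents
                        [0, 3, 0],
                        [1, 0, 2],
                        [0, 1, 0]])
    y_train = np.array([1, 0, 1, 0])  # 1 = spam, 0 = normal mail

    clf = MultinomialNB(alpha=1.0)    # alpha = 1.0 is Laplace smoothing
    clf.fit(X_train, y_train)

    x_new = np.array([[1, 0, 1]])
    print(clf.predict_proba(x_new))   # posterior P(Y = c_k | X = x) per class
    print(clf.predict(x_new))         # the argmax of the posterior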

B. K-Nearest Neighbor Algorithm

The K-nearest neighbor algorithm is one of the simplest machine learning algorithms. Its basic idea is to find the $K$ samples nearest to the input instance $x$ in the training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ and to assign $x$ to the class that accounts for the largest proportion of those $K$ samples. Distance is commonly measured with the $L_p$ metric,

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{1/p},$$

where $x_i^{(l)}$ is the $l$-th component of the vector $x_i$ and $p \ge 1$; $L_p$ is the Euclidean distance when $p = 2$. When $K$ equals the number $N$ of training samples, any input instance is assigned to the class with the largest proportion in the training data set. When $K = 1$, the input instance $x$ is assigned to the class of its nearest neighbor. In general, a smaller, appropriate value of $K$ is chosen by cross-validation.

C. Support Vector Machine Algorithm

The support vector machine (SVM) is a commonly used machine learning algorithm. It was originally proposed for two-class classification problems and, after many years of research, has also been applied to multi-class problems. Its basic principle is to find the maximum-margin separating hyperplane in the feature space that divides the training samples into two classes. By minimizing the empirical risk and the confidence interval together, it realizes structural risk minimization and improves the generalization ability of the learned model, so it can achieve good statistical rules even when the sample size is small [8]. For a training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $y_i \in \{-1, +1\}$, the following optimization problem is constructed and solved:

$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \ldots, N.$$

The optimal solution $(w^*, b^*)$ is then obtained, and the classification decision function can be expressed as

$$f(x) = \operatorname{sign}(w^* \cdot x + b^*).$$
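As an illustrative sketch of the two rules above (the toy points, the linear kernel and the choice of K are assumptions, not the paper's experimental settings), both classifiers are available in scikit-learn:

    # A sketch of linear SVM and KNN classification on hypothetical toy data.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([-1, -1, 1, 1])

    svm = SVC(kernel="linear").fit(X, y)  # learns w*, b* of the hyperplane
    knn = KNeighborsClassifier(n_neighbors=1, p=2).fit(X, y)  # p=2: Euclidean

    x_new = np.array([[0.8, 0.9]])
    print(svm.predict(x_new))  # sign(w* . x + b*)
    print(knn.predict(x_new))  # majority vote among the K nearest samples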
D. Evaluation Indexes of the Algorithms

For classification algorithms, especially two-class classification algorithms, evaluation indexes such as the accuracy (precision), the recall rate and the comprehensive evaluation index (F-measure) are commonly used. The numbers of the four cases predicted by the classifier on the test data set are recorded as follows:

TP: the number of positive-class instances classified as the positive class;
FN: the number of positive-class instances classified as the negative class;
FP: the number of negative-class instances classified as the positive class;
TN: the number of negative-class instances classified as the negative class.

The accuracy (precision) indicates the proportion of correctly classified positive instances among all instances predicted to be positive:

$$P = \frac{TP}{TP + FP}.$$

The recall rate indicates the proportion of correctly classified positive instances among all instances that are actually positive:

$$R = \frac{TP}{TP + FN}.$$

The comprehensive evaluation index is the weighted harmonic average of the accuracy and the recall rate:

$$F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R}.$$

For simplicity, $\beta = 1$ is generally taken, which gives the F1-measure $F_1 = 2PR/(P + R)$. The confusion matrix is often used to observe the classification results of a classifier visually, and it is as follows:

TABLE I. THE CONFUSION MATRIX

                            Prediction result
                            Positive    Negative
True result    Positive     TP          FN
               Negative     FP          TN

III. CONSTRUCTION OF THE CLASSIFIER

In order to evaluate the performance of the model better, it is necessary to validate the model. Before training, the total data set is divided into a training data set and a test data set; to reduce the error introduced by such a single simple split, this paper uses 10-fold cross-validation: the data set is divided randomly into ten subsets, and a total of ten tests are done for the classification model. In each of the ten tests, 9 of the 10 subsets are used as the training set and the remaining subset is used as the testing set. The accuracy, recall rate and F1-index are calculated each time, and the average value of each index over the 10 test results is taken as the evaluation index of the model's performance.
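The following sketch shows this 10-fold procedure computing the three indexes with scikit-learn; the feature matrix X, the labels y and the random seeds are hypothetical stand-ins for the paper's spam data:

    # A sketch of 10-fold cross-validation with precision, recall and F1.
    # X and y stand in for an already-vectorized feature matrix and labels.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import precision_score, recall_score, f1_score

    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(100, 20))  # hypothetical word-count features
    y = rng.integers(0, 2, size=100)        # hypothetical spam/ham labels

    scores = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                     random_state=0).split(X):
        clf = MultinomialNB().fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[test_idx])
        scores.append((precision_score(y[test_idx], y_pred, zero_division=0),
                       recall_score(y[test_idx], y_pred, zero_division=0),
                       f1_score(y[test_idx], y_pred, zero_division=0)))

    # the averages over the 10 folds serve as the model's evaluation indexes
    print(np.mean(scores, axis=0))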

In practice, the K-fold method of the model selection module of scikit-learn, a third-party Python library, is used to conduct the 10-fold cross-validation. Further, the metrics module of scikit-learn is used to form the confusion matrix and analyze the performance of the classifier model.

In the process of text classification, a serious problem is that taking the words of a text as features leads to the curse of dimensionality when the sample size is large, so that training becomes too slow. It is therefore necessary to reduce the feature dimension and enhance the accuracy of the algorithm. The dimensionality reduction in this paper is implemented mainly by the Chi-square method in the feature selection module of scikit-learn. The validation and application processes of the algorithm are shown in Fig. 1 and Fig. 2, respectively.

Fig. 1. Algorithm validation process (raw data set; regular-expression filtering and word segmentation of the Chinese text; vectorization; matrix dimension reduction; training and testing data sets under 10-fold cross-validation; calculation of the evaluation indexes; model evaluation)

Fig. 2. Algorithm application process (raw data set; regular-expression filtering and word segmentation; stop-word filtering and vectorization; dimension reduction; classifier formation; vectorization of the prediction sample; classification result)

A. Numerical Example

In recent years, e-mail has replaced traditional mail as a tool for people's daily communication because of its simplicity, convenience, fast propagation and wide dissemination. However, according to the 2016 spam and phishing attacks report of the Kaspersky laboratory [9], about 20% of spam e-mails spread ransomware Trojans. Spam e-mails not only occupy storage space but also carry commercial advertising, fraud information and even viruses, which seriously affect people's lives. We therefore use spam discrimination as the example of text classification.

The numerical example uses 16000 e-mails as the training data. Because the format of e-mail is complex, the Chinese characters are filtered out with Python's regular expression library. Because spam mail is often sent repeatedly, the data set may contain duplicate items; after deleting duplicates, 7062 non-duplicate e-mails are retained. Each e-mail is segmented by the word segmentation library, and the key words are extracted to form the feature matrix. Part of the feature matrix is shown in Table II:

TABLE II. THE FEATURE MATRIX (EXCERPT)

integration  demand  form  software  delete  consult  meeting  sponsor
     0          0      0       0        0       1        0        0
     0          0      0       0        1       0        0        0
     0          0      0       0        0       0        1        0
     0          0      1       0        0       0        0        1
     1          0      0       0        0       0        0        0
     0          1      0       0        0       0        0        0
     0          0      0       1        0       0        0        0
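A sketch of this preparation pipeline is given below. The paper does not name its segmentation library, so jieba is an assumption here, and the two toy mails, their labels and the selected dimension k are likewise hypothetical:

    # A sketch of the data preparation pipeline: extract Chinese characters
    # with a regular expression, deduplicate, segment into words (jieba is an
    # assumed choice), vectorize, and reduce dimension with the Chi-square test.
    import re
    import jieba  # assumed segmentation library
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    raw_mails = ["Re: 会议通知，请查收附件 123", "特价促销！软件广告，欢迎订购"]
    labels = [0, 1]  # 0 = normal mail, 1 = spam (hypothetical)

    # keep only Chinese characters
    texts = ["".join(re.findall(r"[\u4e00-\u9fa5]+", m)) for m in raw_mails]

    # delete duplicate mails, keeping the label of the first occurrence
    seen = {}
    for text, label in zip(texts, labels):
        seen.setdefault(text, label)
    texts, labels = list(seen), list(seen.values())

    # word segmentation, then a word-count feature matrix
    segmented = [" ".join(jieba.cut(t)) for t in texts]
    X = CountVectorizer().fit_transform(segmented)

    # Chi-square feature selection from scikit-learn's feature selection module
    X_reduced = SelectKBest(chi2, k=min(2, X.shape[1])).fit_transform(X, labels)
    print(X_reduced.shape)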

According to the algorithm application process shown in Fig. 2, the naive Bayesian multinomial model, the naive Bayesian Bernoulli model, the support vector machine algorithm and the K-nearest neighbor algorithm are applied to the training data set, and the performances of the four algorithms are compared. The confusion matrices of the four algorithms are shown in Fig. 3 to Fig. 6.

Fig. 3. The confusion matrix of the naive Bayesian multinomial model

From Fig. 3 it can be calculated that the accuracy of the naive Bayesian multinomial model on this training data set is 0.963, the recall rate is 0.961, and the F1-measure is 0.962.

Fig. 4. The confusion matrix of the naive Bayesian Bernoulli model

From Fig. 4, the accuracy of the naive Bayesian Bernoulli model on this training data set is 0.948, the recall rate is 0.964, and the F1-measure is 0.956.

Fig. 5. The confusion matrix of the support vector machine algorithm

From Fig. 5, the accuracy of the SVM algorithm on this training data set is 0.978, the recall rate is 0.975, and the F1-measure is 0.977.

Fig. 6. The confusion matrix of the K-nearest neighbor algorithm

From Fig. 6, the accuracy of the K-nearest neighbor algorithm on this training data set is 0.967, the recall rate is 0.882, and the F1-measure is 0.922.

The training times of the four algorithms are shown in Table III:

TABLE III. THE TRAINING TIME REQUIRED BY THE FOUR ALGORITHMS

Algorithm                           Required time (seconds)
Naive Bayesian multinomial model    6
Naive Bayesian Bernoulli model      11
Support vector machine              780
K-nearest neighbor                  447

According to the classification results, the three performance indexes of the SVM algorithm are better than those of the other models, but its training time is too long, which brings a poor user experience for the larger amounts of data found in practical applications. The naive Bayesian multinomial model has higher accuracy, a slightly lower recall rate, a higher F1-measure and a shorter training time than the naive Bayesian Bernoulli model, so the multinomial model is the better of the two. Moreover, there is a large gap between the K-nearest neighbor algorithm and the other algorithms on every index.

B. Comparison of the Algorithms in Different Dimensions

The Chi-square test used for dimensionality reduction has a solid statistical foundation. It is a widely used hypothesis testing method, often applied to test the correlation between two categorical variables. Its basic idea is to calculate the Chi-square value between the theoretical values and the observed values: the smaller the Chi-square value is, the smaller the deviation between the observed values and the theoretical values; conversely, a larger Chi-square value indicates a greater deviation between the observed values and the theoretical values.

In Fig. 7 to Fig. 9, the horizontal coordinate represents the feature dimension, and the vertical coordinates represent the accuracy, the recall rate and the F1-measure, respectively. It can be seen that the naive Bayesian multinomial model, the naive Bayesian Bernoulli model and the support vector machine algorithm are almost unaffected by the dimension size and have stable performance. The K-nearest neighbor algorithm is sensitive to the dimension size, which may be because it is affected by the K value differently in different dimensions. In this paper K = 4 and dimensions of 8000 to 28000 are chosen, which is relatively stable and gives better results.

Fig. 7. The accuracy of the four algorithms under different dimensions

Fig. 8. The recall rate of the four algorithms under different dimensions

Fig. 9. The F1-measure of the four algorithms under different dimensions
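This dimension comparison can be reproduced in outline by refitting each classifier on feature sets of increasing size. In the sketch below, the 8000 to 28000 dimension grid and K = 4 follow the paper, while the stand-in data, the linear kernel and the coarse step size are assumptions (on the real corpus this loop would run for a long time):

    # A sketch of comparing the four classifiers under different feature
    # dimensions selected by the Chi-square test. X_counts and y stand in
    # for the full word-count matrix and the spam labels.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X_counts = rng.integers(0, 5, size=(200, 30000))  # hypothetical data
    y = rng.integers(0, 2, size=200)

    classifiers = {
        "NB multinomial": MultinomialNB(),
        "NB Bernoulli":   BernoulliNB(),
        "SVM":            SVC(kernel="linear"),
        "KNN (K=4)":      KNeighborsClassifier(n_neighbors=4),
    }

    for k in range(8000, 28001, 10000):  # feature dimensions to test
        X_k = SelectKBest(chi2, k=k).fit_transform(X_counts, y)
        for name, clf in classifiers.items():
            f1 = cross_val_score(clf, X_k, y, cv=10, scoring="f1").mean()
            print(f"dim={k:5d}  {name:15s}  F1={f1:.3f}")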

IV. CONCLUSION

Of the two models of the naive Bayesian algorithm, the multinomial model has the higher accuracy and the faster speed in text classification. Spam classification should pay particular attention to accuracy, ensuring that users still receive their normal mail while spam is filtered out effectively. The naive Bayesian algorithm has a simple training process; although it is less accurate than the SVM, it is much faster, and when the amount of data is huge this speed advantage is particularly obvious. The Chi-square test helps the naive Bayesian algorithm reduce the dimensionality and improve its performance: it filters out noisy and uninformative words and makes the algorithm more effective.

ACKNOWLEDGMENT

This research was supported by the National Natural Science Foundation of China (71501016) and the Qin Xin Talents Cultivation Program (QXTCP B201705) of Beijing Information Science & Technology University.

REFERENCES

[1] Y. Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, vol. 1, pp. 69-90, 1999.
[2] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, vol. 34, pp. 1-47, 2002.
[3] J. Sun, J. Xiao. Study on feedback learning of SVM-based Chinese text classification. Control and Decision, vol. 19, pp. 927-930, August 2004.
[4] H. Kim, P. Howland, H. Park. Dimension reduction in text classification with support vector machines. Journal of Machine Learning Research, vol. 6, pp. 37-53, 2005.
[5] Z. Yang. Research on text classification algorithms based on machine learning. University of Guangxi, 2007.
[6] X. Zhang. Review of machine learning in automatic text categorization. Journal of the China Society for Scientific and Technical Information, vol. 25, pp. 730-739, December 2006.
[7] J. Lai. Simulation research of text categorization based on data mining. Computer Simulation, vol. 28, pp. 195-198, December 2011.
[8] H. Li. Statistical Learning Method. Beijing: Tsinghua University Press, 2012.
[9] Kaspersky Lab. 2016 spam and phishing attacks report. http://news.kaspersky.com.cn/news2017/02n/170220.htm, 2017-02-20.