On Effective E-mail Classification via Neural Networks

Bin Cui 1, Anirban Mondal 2, Jialie Shen 3, Gao Cong 4, and Kian-Lee Tan 1

1 Singapore-MIT Alliance, National University of Singapore, {cuibin, tankl}@comp.nus.edu.sg
2 University of Tokyo, Japan, anirban@tkl.iis.u-tokyo.ac.jp
3 University of New South Wales, Australia, jls@cse.unsw.edu.au
4 The University of Edinburgh, UK, gao.cong@ed.ac.uk

Abstract. To address the growing problem of junk E-mail on the Internet, this paper proposes an effective E-mail classifying and cleansing method. E-mail messages can be modelled as semi-structured documents consisting of a set of fields with pre-defined semantics and a number of variable-length free-text fields. Our proposed method deals with both the fields having pre-defined semantics and the variable-length free-text fields to obtain higher accuracy. The main contributions of this work are two-fold. First, we present a new model based on the Neural Network (NN) for classifying personal E-mails. In particular, we treat E-mail files as a particular kind of plain-text file, the implication being that our feature set is relatively large (since there are thousands of different terms in different E-mail files). Second, we propose the use of Principal Component Analysis (PCA) as a preprocessor of the NN to reduce the data in terms of both size and dimensionality, so that the input data become more classifiable and the training process used in the NN model converges faster. The results of our performance evaluation demonstrate that the proposed algorithm is indeed effective in performing E-mail filtering with reasonable accuracy.

1 Introduction

The ever-increasing number of Internet users, coupled with the widespread proliferation of E-mail as one of the fastest and most economical forms of communication, has resulted in a dramatically increasing number of junk E-mails during the past few years. Consequently, users typically need to spend a non-trivial portion of their valuable time deleting junk E-mails. Additionally, junk E-mails can also fill up file server storage space quickly, especially at large sites with thousands of users, who may all be receiving duplicate copies of the same junk mail. To address the problem of growing volumes of junk E-mails, automated methods for filtering are now beginning to be deployed in many commercial products, which typically allow users to define a set of logical rules for filtering junk E-mails. However, these solutions have two serious drawbacks. First, they require the users to be savvy enough to create a set of robust rules for filtering purposes. Second, they require the users to constantly tune and refine the filtering rules in response to the changes in junk E-mails over time.

K.V. Andersen, J. Debenham, and R. Wagner (Eds.): DEXA 2005, LNCS 3588, pp. 85-94, © Springer-Verlag Berlin Heidelberg 2005.

Understandably, the problem of filtering junk E-mails is challenging in practice due to the dynamically changing nature of junk E-mail and the tremendously large number of E-mails. In essence, an effective E-mail filter, which requires minimal manual work from the user, has now become a necessity.

The problem of dealing with junk E-mails has been extensively researched. Existing approaches to filtering junk E-mails involve the deployment of data mining techniques [6,7], the usage of E-mail addresses [9] and the application of text classification techniques [5,1]. In the realm of text classification, an E-mail message is viewed as a document, and a judgement of its interestingness is viewed as a class label associated with the document. While text classification has been extensively researched [4,3,12], empirical study on the document type of E-mail and the features of building an effective personal E-mail filter within the framework of text classification has received relatively little attention. In this regard, the main contributions of this paper are two-fold:

1. We present a new model based on the Neural Network (NN) for classifying personal E-mails. In particular, we treat E-mail files as a specific kind of plain-text file, the implication being that our feature set is relatively large (since there are thousands of different terms in different E-mail files).
2. We propose the use of Principal Component Analysis (PCA) as a preprocessor of the NN to reduce the data in terms of both size and dimensionality, so that the input data become more classifiable. This facilitates fast convergence of the training process used in the NN model. Notably, PCA only pre-processes the input to the NN.

The results of our extensive performance evaluation on a real dataset of personal E-mails demonstrate that our proposed method is indeed effective in providing reasonable performance in terms of recall, precision and total accuracy rate, especially for interesting E-mails. To our knowledge, this is the first work on E-mail classification that uses a neural-network-based strategy for classifying E-mails.

The remainder of the paper is organized as follows. Section 2 discusses existing related work, while Section 3 provides the details of our design for the NN method to classify E-mails. Section 4 reports the results of our performance evaluation. Finally, we conclude in Section 5 with directions for future work.

2 Related Work

This section reviews related work in the areas of junk E-mail filtering [2,5,13], Neural Networks [8] and Principal Component Analysis [10].

The Bayesian approach to filtering junk E-mail [13] considered domain-specific features as well as the raw text of E-mail messages, and enhanced the performance of a Bayesian classifier by hand-crafting and incorporating many features that are indicative of junk E-mail. Representing each individual E-mail message as a binary vector, the proposal in [13] detects junk mail in a straightforward manner using a given pre-classified set of training messages. In [2], the author compares methods for learning text classifiers, focussing on the kinds of classification problems that might arise in filtering personal E-mail messages. In [5], the E-mail documents to be classified are regarded as semi-structured textual documents comprising two parts: while one part is a set of structured fields with well-defined semantics, the other is a number of variable-length sections of free text. However, not many text classifiers take both portions into consideration. Additionally, conventional classification techniques may not be effective when dealing with variable-length free text. Notably, our work differs from the proposals in [2,5,13] in that these works focus on language processing, while we focus on general electronic text classification.
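To make the binary-vector representation used by the Bayesian filter of [13] concrete, here is a minimal sketch; the vocabulary, the messages and the Bernoulli Naive Bayes formulation are our own illustrative assumptions, not code from any of the cited systems.

```python
import numpy as np

# Hypothetical toy vocabulary and pre-classified training messages.
vocabulary = ["free", "money", "meeting", "project", "winner"]
train_msgs = [("free money winner", 1),      # 1 = junk
              ("project meeting notes", 0),  # 0 = legitimate
              ("free project draft", 0),
              ("money money winner", 1)]

def to_binary_vector(text, vocab):
    """Represent a message as a 0/1 vector: 1 iff the term occurs in it."""
    words = set(text.split())
    return np.array([1 if w in words else 0 for w in vocab])

X = np.array([to_binary_vector(t, vocabulary) for t, _ in train_msgs])
y = np.array([label for _, label in train_msgs])

def fit(X, y):
    """Bernoulli Naive Bayes with Laplace smoothing."""
    priors = np.array([(y == c).mean() for c in (0, 1)])
    cond = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                     for c in (0, 1)])  # P(term present | class)
    return priors, cond

def predict(x, priors, cond):
    # log P(c) + sum_j log P(x_j | c), maximized over the two classes.
    logp = np.log(priors) + (x * np.log(cond)
                             + (1 - x) * np.log(1 - cond)).sum(axis=1)
    return int(np.argmax(logp))

priors, cond = fit(X, y)
print(predict(to_binary_vector("free money", vocabulary), priors, cond))  # likely 1
```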

Neural networks [8,11] have been widely adopted in various fields of application such as pattern recognition and identification. A neural network consists of simple elements, i.e., neurons, operating in parallel, and the connections between them. Training a neural network means adjusting the weights of the connections so as to minimize the difference between the output of the neural network and the target of the training data. We adopt the supervised back-propagation neural network in our classification system. The advantage of a neural network arises from its computing power (due to its massively parallel distributed structure), its ability to learn and, more importantly, its capability to generalize. We only need to design the network structure and then input the training data. The results may be affected by the selection of the network structure and the input attributes, the training data and the stopping criteria.

PCA [10] is a widely used method for applications in signal/image filtering and pattern classification. It transforms data in the original space into another feature space, reducing the dimensionality of the input data while keeping the most significant information. It examines the variance structure in the dataset and determines the directions along which the data exhibit high variance. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Working as a pre-processor for neural networks, PCA can make the input data more classifiable and reduce the dimensionality of the training and validation datasets by using only the first several features, thereby speeding up the convergence of the training process.
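The variance-maximization property just described can be stated formally; the following is the standard textbook formulation (cf. [10]), reproduced here for reference rather than taken from this paper.

```latex
% Standard variance-maximization formulation of PCA; X is the mean-centred
% n x d data matrix and w a unit direction in feature space.
\mathbf{w}_1
  = \arg\max_{\lVert \mathbf{w} \rVert = 1} \operatorname{Var}(X\mathbf{w})
  = \arg\max_{\lVert \mathbf{w} \rVert = 1}
      \mathbf{w}^{\top}\Big(\tfrac{1}{n} X^{\top} X\Big)\mathbf{w}
% The k-th principal component solves the same problem subject to
% orthogonality to w_1, ..., w_{k-1}; the solutions are the eigenvectors of
% the covariance matrix, and the eigenvalues are the variances they capture.
```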

3 System Design

This section discusses the design of our proposed system.

3.1 Model of NN for E-mail Classification

This section discusses how the NN method is used for E-mail filtering. Processing via the NN method involves three steps, namely data pre-processing, training and testing, as depicted in Figure 1. Feature Extraction refers to the data pre-processing, the details of which will be discussed shortly. For the training dataset, once we obtain the selected features, we feed them into the neural network and generate an E-mail classifier. For each test E-mail, we use the classifier to verify the efficiency of the NN model. We adopt the error Back-propagation training algorithm in our model, since it is one of the simplest as well as one of the most useful neural network algorithms.

[Fig. 1. Overview of the NN method: feature extraction turns training and testing E-mail data into feature vectors; the training vectors drive NN learning to produce a classifier, which is applied to the testing vectors to produce the output.]

Now let us discuss the method used for presenting data to the network for training as well as for testing. We employ cross-validation and early stopping. The available data is divided into the following three disjoint sets:

1. Training set: This data is used to train the network.
2. Validation set: The error of the network averaged over this data is used to decide when the training algorithm has found the best approximation to the data without overfitting.
3. Testing set: The best network, determined by the validation set, is applied to the test set, and the performance over the test set is reported.

Training is accomplished by calculating the derivative of the network's error with respect to each weight in the network when presented with a particular input pattern. This derivative indicates the direction in which each weight should be adjusted to reduce the error. Each weight is modified by taking a small step in this direction. With a nonzero momentum factor, a fraction of the previous weight change is added to the new weight value; notably, this accelerates learning in some cases. The patterns in the training set are traversed one by one. A pass through all the training patterns is called an epoch. The training data is repetitively presented for multiple epochs, until a specified number of epochs has been reached. After each epoch, the error of the network applied to the validation set of patterns is calculated. If the current network scores the lowest error so far on the validation set, this network's weights are saved. At the conclusion of training, the network's best weights are used to calculate the network's error on the testing set.
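The training procedure just described (per-pattern gradient steps with a momentum term, plus validation-based saving of the best weights) can be sketched as follows. This is a minimal single-unit illustration under our own assumptions: squared error, a sigmoid output, and random stand-in data; the paper's actual network has up to two hidden layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: X_* are feature matrices, y_* are 0/1 class labels.
X_train, y_train = rng.random((100, 20)), rng.integers(0, 2, 100)
X_val,   y_val   = rng.random((30, 20)),  rng.integers(0, 2, 30)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = rng.normal(scale=0.1, size=20)   # weights of a single sigmoid unit
prev_dw = np.zeros_like(w)           # previous weight change, for momentum
lr, momentum, epochs = 0.1, 0.9, 200
best_w, best_err = w.copy(), np.inf

for epoch in range(epochs):
    # One epoch: traverse the training patterns one by one.
    for x, t in zip(X_train, y_train):
        y = sigmoid(w @ x)
        grad = (y - t) * y * (1 - y) * x        # dE/dw for squared error
        dw = -lr * grad + momentum * prev_dw    # gradient step plus momentum
        w += dw
        prev_dw = dw
    # After each epoch, compute the averaged validation error and keep
    # the weights that score lowest so far (early stopping).
    val_err = np.mean((sigmoid(X_val @ w) - y_val) ** 2)
    if val_err < best_err:
        best_err, best_w = val_err, w.copy()

w = best_w  # the best weights are later applied to the testing set
```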

Data Pre-processing. In general, E-mail messages are semi-structured documents that possess a set of structured fields with predefined semantics and a number of variable-length free-text fields. The headers of E-mails are structured fields and usually contain information pertaining to the document, such as sender, date, domain, etc. The major contents of the E-mail are variable-length free-text fields, such as the subject and the body. Understandably, both the structured fields and the free-text portion could contain important information which could help in determining the class to which an E-mail belongs. Therefore, an effective E-mail classifier should be able to include features from both the structured fields and the free text. To make optimal use of the information in an E-mail message, we generate two kinds of features for each E-mail, which we use in our NN model. The details are as follows.

Structured features: Features represented by structured fields in the header part of an E-mail document. In our model, seven structured features were generated:

Attachment: if an attachment occurs in the E-mail, true; else false.
Content type: if the content type of the E-mail is plain text, true; else false.
Sender domain: we extract the sender domain from the header; if it contains "edu", 1; if it contains "com", 2; else 3.
FW: if the Subject of the header starts with the word "FW", true; else false.
Re: if the Subject of the header starts with the word "Re", true; else false.
To group: if the E-mail is sent to a group, not a single person, true; else false.
CC: if the content of the CC field in the header of the E-mail is not empty, true; else false.

Textual features: We use a general text processing method for dealing with the textual features. The terms occurring in the body of a given E-mail and in its Subject are extracted and pre-processed, and are regarded as the features of the body of the E-mail. We use a Document Frequency Threshold to remove those features that have little influence on the classification work. Finally, the feature values of the terms are the term weights calculated by the simple tf.idf method. The threshold should not be too low, since we should try to find fewer important features for the NN, according to the characteristics of the NN itself.
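To illustrate how the seven structured features above might be computed from a raw message, here is a minimal sketch using Python's standard email parser; the example message, the helper name, and the group/attachment heuristics are our own assumptions rather than the paper's implementation (the textual tf.idf features are omitted).

```python
from email import message_from_string

raw = """From: alice@cs.example.edu
To: project-team@lists.example.com
CC: bob@example.com
Subject: Re: meeting agenda
Content-Type: text/plain

See the attached agenda for tomorrow.
"""

msg = message_from_string(raw)

def sender_domain_code(sender):
    # "edu" -> 1, "com" -> 2, anything else -> 3 (the paper's encoding).
    if "edu" in sender:
        return 1
    return 2 if "com" in sender else 3

subject = msg.get("Subject", "")
features = {
    "attachment": msg.is_multipart(),                # crude attachment test
    "plain_text": msg.get_content_type() == "text/plain",
    "sender_dom": sender_domain_code(msg.get("From", "")),
    "fw":         subject.startswith("FW"),
    "re":         subject.startswith("Re"),
    # Heuristic: treat multiple recipients or a list address as a group.
    "to_group":   "," in msg.get("To", "") or "lists" in msg.get("To", ""),
    "cc":         bool(msg.get("CC", "")),
}
print(features)
```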

Training of NN Model. With the results of data pre-processing, we can carry out the training and generate a classifier to filter the uninteresting E-mails.

1. After processing both the header and body of all the training samples, a 2-D array of feature vectors is obtained for the interesting and uninteresting E-mails. Each feature vector includes the features extracted from both the header and the body of the E-mail. The class labels of the E-mails are also recorded in a 2-D array; the class label is 1 for interesting E-mails and 0 for uninteresting E-mails.
2. The Error Back-propagation algorithm is used to generate a classifier from the feature vectors, and the classifier is then returned as a file.

Testing of NN Model. In this stage, we test the efficiency of our classifier.

1. Generate the feature vector for the header of each E-mail, group the feature vectors of the header and the body together for each E-mail, and form a complete feature vector for each E-mail. The process is the same as in the training part.
2. Using the feature vector generated for each new E-mail, use the classifier built in the training stage to compute an output score for the E-mail. If the score is greater than a threshold (specified by the user), the E-mail is regarded as interesting; otherwise, it is regarded as uninteresting.

3.2 Principal Component Analysis

Finding principal components (PCs) is basically the mathematical problem of finding the principal singular vectors of the input dataset using the well-known singular value decomposition method. The PCA transformation has two steps. First, we perform principal component analysis of the training and validation dataset: we transform the training and validation dataset into the singular vector space and obtain the eigenvectors and eigenvalues of each dimension. Second, we transform the testing dataset into the same space as the training and validation data. This is done by simply multiplying it with the eigenvector matrix produced in the first step.
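A minimal NumPy sketch of the two-step transformation described above, assuming random stand-in data and mean-centring with the training statistics; it is an illustration of the method, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
X_trainval = rng.random((200, 747))   # stand-in for the 747-feature vectors
X_test     = rng.random((50, 747))
k = 37                                # number of leading PCs to keep

# Step 1: PCA of the training+validation data via SVD.
mean = X_trainval.mean(axis=0)
U, s, Vt = np.linalg.svd(X_trainval - mean, full_matrices=False)
eigvecs = Vt[:k].T                          # d x k eigenvector matrix
eigvals = (s ** 2) / (len(X_trainval) - 1)  # variances along each PC

Z_trainval = (X_trainval - mean) @ eigvecs  # data in the reduced PC space

# Step 2: map the testing data into the same space by multiplying it
# with the eigenvector matrix produced in step 1.
Z_test = (X_test - mean) @ eigvecs
```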

In order to get some idea of the data after the PCA transformation, we plot the first two principal vectors of a training dataset with 747 features after transformation in Figure 2. The points are the vectors of the interesting E-mails, while the circles are the vectors of the uninteresting E-mails. From the figure, we observe that by using only the first two features, the vectors corresponding to the interesting and uninteresting E-mails can be distinguished relatively easily. Figure 3 depicts the eigenvalues of the principal components. It is clear that the first few eigenvalues are much more important than the later ones.

[Fig. 2. Plot of the first two PCs of the 747-feature dataset. Fig. 3. Plot of the eigenvalues, against the number of dimensions, of the 747-feature dataset.]

4 Performance Evaluation

This section reports the performance evaluation of our proposed NN model for E-mail classification. All the experiments have been conducted on a SUN E450 machine with SUN OS 5.7. We have used a total of 2000 real personal E-mails as the dataset for our experiments. Notably, a dataset of 2K E-mails is considered large in our context, primarily because of the difficulty associated with the collection of personal E-mails. We manually label each E-mail as interesting or uninteresting, i.e., the E-mails are divided into two classes for our experiments. For the dataset used in our experiments, the numbers of interesting and uninteresting E-mails were 1500 and 500 respectively. During the experiments, the whole dataset was randomly divided into three portions, namely training data, validation data and testing data. While the training data is used to train a classifier, the validation data scores the error of training and provides the best model, which is then evaluated on the testing data.

We have used precision and recall as the metrics for evaluating the performance of our proposed E-mail classification approach, because all the methods provide very fast response. Although the training stage of the NN is time-consuming, the filtering stage of the NN is extremely efficient and typically requires less than one second; hence we do not present response time as a metric in this paper. The definitions of recall and precision for interesting and uninteresting E-mails are quite similar. Here, we provide the definitions for interesting E-mails only: recall = N_ii / N_i and precision = N_ii / N'_i, where N_i is the number of total interesting E-mails, N'_i is the number of E-mails classified as interesting, and N_ii is the number of interesting E-mails correctly classified as interesting.

The output of the neural network is a float from 0 to 1, and we classify an E-mail as interesting only when the output is greater than some pre-defined threshold. We label the interesting E-mails 1 and the uninteresting E-mails 0; hence, when we test the E-mails, the output of an interesting E-mail approaches 1 and the output of an uninteresting E-mail approaches 0 if we make a correct classification. Initially, we set the threshold to a default value of 0.5. In all the experiments, we set a large epoch number in the training stage and only use the best weights in the testing phase.
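The metrics and thresholding just described can be computed as in the following minimal sketch; the score and label arrays are hypothetical.

```python
import numpy as np

# Hypothetical network outputs in [0, 1] and true labels (1 = interesting).
scores = np.array([0.91, 0.72, 0.40, 0.88, 0.15, 0.65])
labels = np.array([1,    1,    0,    1,    0,    0])

threshold = 0.5
pred = (scores > threshold).astype(int)  # interesting iff score > threshold

N_i  = (labels == 1).sum()                   # total interesting E-mails
N_ip = (pred == 1).sum()                     # classified as interesting
N_ii = ((pred == 1) & (labels == 1)).sum()   # correctly classified interesting

recall    = N_ii / N_i
precision = N_ii / N_ip
print(f"recall={recall:.2f}, precision={precision:.2f}")
```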

4.1 Effect of Neuron Number

First, we test the effect of the neuron number of our neural network model on the classification performance. Basically, the neural network model has two hidden layers, and we can tune the number of neurons in both of the hidden layers of the network. The neuron number in the first layer must be at least 1, while the neuron number in the second layer can be 0, in which case the neural network has only one hidden layer. We denote a neuron pair as n1/n2, where n1 is the number of first-layer neurons and n2 is the number of second-layer neurons.

[Fig. 4. Effect of the neuron number on the different layers, for neuron pairs 1/0, 1/1, 1/5, 5/0, 5/1, 5/5, 10/0, 10/1 and 10/5: (a) precision (%), (b) recall (%), for interesting and uninteresting E-mails.]

The first experiment examines the performance of the classifier when different neuron pairs are used. The performance of the NN for different neuron numbers is shown in Figure 4: 10 sets of features and 9 neuron pairs are used in our experiment to test which pair of neurons performs well with high precision and high recall. For each neuron pair, Figures 4(a) and 4(b) show the average precision and the average recall over all these feature sets, respectively. A few observations can be made from this experiment. First, the precision and recall curves of interesting E-mails are better than those of uninteresting E-mails, no matter which neuron pair is used. Second, it is obvious that the neuron pairs 1/5 and 10/0 perform better than the other neuron pairs for both the interesting and uninteresting E-mails, with high average recall for both classes. Finally, it is easier for the neural network to identify interesting E-mail files in our experiment. The reason could be that we selected more interesting E-mail files than uninteresting ones; hence they can be clearly characterized by some features. In the following experiments, we fix the structure of the NN model according to the performance shown in Figure 4: the NN has two hidden layers, where the first layer has one neuron and the second layer has 5 neurons.

4.2 Effect of Feature Selection

As mentioned earlier, we extracted two kinds of features from E-mails: structured features and textual features. 8 structured features and up to 1305 textual features are generated for the E-mail files. One question under exploration is how important these features are in classification. From this experiment, we can see how many features are sufficient to classify E-mails well, and how these features influence the recall, precision and accuracy rate of E-mail classification. The performance of the NN for different feature selections is shown in Figure 5.

[Fig. 5. Effect of feature selection: (a) precision (%), (b) recall (%), against the number of features, for interesting and uninteresting E-mails.]

10 feature sets are selected to study the performance of the NN classifier. The sizes of the 10 feature sets are 8, 28, 37, 50, 92, 121, 243, 508, 747, and 1313 respectively. Each feature set consists of all 8 structured features plus some textual features selected by the tf.idf formula; for example, 28 stands for the 8 structured features plus 20 textual features. Figure 5 presents the precision and recall of the different feature sets for interesting and uninteresting E-mails. The results show that the performance of classification is very unstable when the number of features we use is fewer than 200. In particular, when we only use the structured features, the recall of uninteresting E-mails is surprisingly high, whereas the recall of interesting E-mails is very low, at 79%. However, the overall recall and precision of the classification increase and become more stable when we use more features. As the number of features increases, the accuracy rate increases gradually. From the figures, we can see that although 1313 features provide the best performance, 508 features perform only a little worse, while requiring less than half the space for storing the features and weights and less than half the execution time. So, based on these figures, we select 508 features as the optimal strategy.
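One plausible reading of the selection procedure used here (a Document Frequency Threshold followed by ranking terms with a simple tf.idf weight) is sketched below; the toy corpus, the threshold, and the aggregate ranking heuristic are our own assumptions.

```python
import math
from collections import Counter

# Hypothetical tokenized E-mail bodies.
docs = [["meeting", "agenda", "project"],
        ["free", "offer", "project"],
        ["meeting", "notes"],
        ["free", "free", "winner"]]

N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

# Document Frequency Threshold: drop terms occurring in fewer than min_df docs.
min_df = 2
kept = {t for t, c in df.items() if c >= min_df}

# Rank the surviving terms by an aggregate tf.idf weight and keep the top k.
tf = Counter(term for doc in docs for term in doc if term in kept)
weight = {t: tf[t] * math.log(N / df[t]) for t in kept}
k = 3
selected = sorted(weight, key=weight.get, reverse=True)[:k]
print(selected)
```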

The results of this experiment show that feature selection plays an important role in E-mail classification. Below, we use the PCA method to perform the feature selection, to see more clearly how PCA influences the performance of classification.

4.3 Effect of PCA

The performance of the NN after PCA processing is shown in Figure 6. We use the 1313-feature dataset, perform the principal component analysis, and generate new feature spaces with different numbers of dimensions (PCs).

[Fig. 6. Effect of PCA: (a) precision (%), (b) recall (%), against the number of dimensions, for interesting and uninteresting E-mails.]

Using PCA, most of the information in the original space is condensed into a few dimensions along which the variances in the data distribution are the largest. The PCA method aims to transform the feature space into a new space where the first dimensions are the most important, so we can easily select the more important features to do the classification effectively and efficiently. From the results, we can see that for 8 features, the average precision and recall rate is more than 93% after PCA processing, compared with about 85% when the PCA method is not used. Additionally, when the number of dimensions is low, say less than 200, the performance after PCA processing is more stable and better. Although in the second experiment we select the most important words for each feature selection, PCA can capture more information. When the number of dimensions is 37, the performance is optimal, and the space and time cost is only 10% of that of the NN without PCA. Furthermore, we also find that the fewer features selected by PCA can describe the characteristics of the E-mails as well as the whole set of original features. This shows that adding many unimportant features does not necessarily enhance the performance of classification.

4.4 Comparison with Other Schemes

We also compared our NN model with the decision tree and Naive Bayesian classifier methods used in [5,13]. Because of the different feature selections, we only compare the optimal performance of the three methods. To simplify the comparison, we adopt the error rate, which is the number of falsely classified messages divided by the total number of classified messages. The error rate of the proposed NN model is only 4%, which is more than 50% better than the two competing methods.
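Spelled out as a formula, the measure used in this comparison is:

```latex
% Error rate used in the comparison of Section 4.4.
\text{error rate} \;=\;
  \frac{\#\,\text{falsely classified messages}}{\#\,\text{classified messages}}
% An error rate of 4% that is "more than 50% better" implies that the
% competing methods' error rates exceed 8%.
```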

5 Conclusion

In examining the growing problem of dealing with junk E-mail, we have provided an NN model, which embeds PCA as a preprocessor, to eliminate junk E-mail from a user's mail stream. The efficiency of such filters can also be greatly enhanced by considering not only the full text of the E-mail messages, but also a set of structural features. Different ways of feature selection for the model were evaluated, and the performance of the classifier was compared with respect to feature selection, parameter tuning and PCA processing. Our experimental results show that our model provides good performance in filtering junk E-mails. In the near future, we plan to incorporate other techniques (e.g., E-mail address analysis, real-time user feedback) into our proposed NN model. Moreover, we intend to examine the cost-effective integration of our classification scheme into existing E-mail systems.

References

1. X. Carreras and L. Marquez. Boosting trees for anti-spam email filtering. In Proc. Recent Advances in Natural Language Processing, 2001.
2. W. W. Cohen. Learning rules that classify e-mail. In Proc. AAAI Spring Symposium on Machine Learning in Information Access, 1996.
3. W. W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. In Proc. SIGIR, 1996.
4. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. In Proc. 15th National Conference on Artificial Intelligence, 1998.
5. Y. L. Diao, H. J. Lu, and D. K. Wu. A comparative study of classification based personal E-mail filtering. In Proc. 4th PAKDD, 2000.
6. T. Fawcett. "In vivo" spam filtering: A challenge problem for data mining. KDD Explorations, 5(2), 2003.
7. K. R. Gee. Using latent semantic indexing to filter spam. In Proc. ACM Symposium on Applied Computing, Data Mining Track, 2003.
8. S. Haykin. Neural Networks: A Comprehensive Foundation. 2nd Ed., Prentice-Hall, 1999.
9. J. Ioannidis. Fighting spam by encapsulating policy in email addresses. In Proc. Network and Distributed Systems Security Conference (NDSS), 2003.
10. I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.
11. S. Y. Kung. Digital Neural Networks. Prentice-Hall, 1993.
12. D. D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. In Proc. Third Annual Symposium on Document Analysis and Information Retrieval, 1994.
13. M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk E-mail. In Proc. AAAI Workshop on Learning for Text Categorization, 1998.
