Kyaw Thet Khaing, University of Computer Studies, Yangon


Enhanced Features Ranking and Selection using Recursive Feature Elimination (RFE) and k-Nearest Neighbor Algorithms in Support Vector Machine for Intrusion Detection

Abstract

As use of the Internet increases, more and more information is exchanged over it, so methods for detecting intrusions and protecting network systems against diverse attacks have become very important. The available intrusion detection systems (IDS) are getting more powerful. The Support Vector Machine (SVM), a classical pattern recognition tool, has been widely used for intrusion detection. The features used to build an intrusion detection model differ in their characteristics, but the conventional SVM does not take this into account. We propose an enhanced SVM model that uses Recursive Feature Elimination (RFE) and the k-Nearest Neighbor (KNN) method to perform the feature ranking and selection task. RFE can reduce redundant and recursive features, and KNN can select more precisely than the conventional SVM. Experiments and comparisons are conducted on an intrusion dataset, the KDD Cup 1999 dataset.

Key words: Enhanced features ranking and selection, recursive feature elimination, k-nearest neighbor algorithms, intrusion detection system

1 Introduction

Various intrusion detection systems have been studied and proposed to meet the challenges of a vulnerable Internet environment [1], [3]. It is not an overstatement that an intrusion detection system is a must for a modern computer system. Intrusion detection technologies can be classified into two groups: misuse detection and anomaly detection [1]. A misuse detection system detects intrusion events that follow known patterns. These patterns describe a suspect set of sequences of actions or tasks that may be harmful. The main limitation of this approach is that it cannot detect novel intrusions, i.e., events that have never happened and been captured before. An anomaly detection system analyzes event data and recognizes patterns of activity that appear to be normal. If an event lies outside these patterns, it is reported as a possible intrusion. It is considered a self-learning approach. We focus on anomaly intrusion detection in this study.

Many artificial intelligence techniques have been used for anomaly intrusion detection. Lee et al. [9] established an anomaly detection model that integrates association rules and frequency episodes with fuzzy logic to produce patterns for intrusion detection. Mohajerani et al. [10] developed an anomaly intrusion detection system that combines neural networks and fuzzy logic to analyze the KDD dataset. Wang et al. [14] applied genetic algorithms to optimize the membership functions for mining fuzzy association rules. Support Vector Machines (SVMs) have become one of the popular techniques for anomaly intrusion detection due to their good generalization and their ability to overcome the curse of dimensionality [3], [13]. Although there are some improvements, the number of dimensions still affects the performance of SVM-based classifiers [3]. Another issue is that an SVM treats every feature of the data equally.
In real intrusion detection datasets, many features are redundant or less important (John et al., 1994). It would be better to consider feature weights during SVM training. Recursive Feature Elimination (RFE) has the advantage of reducing redundant and recursive features. This paper presents a study that incorporates RFE into SVM for intrusion detection. We propose a new SVM algorithm that considers the weighting levels of different features and the dimensionality of intrusion data. Experiments and comparisons are conducted on an intrusion dataset, the KDD Cup 1999 dataset.

2 Feature ranking with support vector machines

2.1 Support Vector Machines (SVM)

To test the idea of using the weights of a classifier to produce a feature ranking, we used a state-of-the-art classification technique: Support Vector Machines (SVMs) [13]. SVMs have recently been intensively studied and benchmarked against a variety of techniques. They are presently one of the best-known classification techniques with computational advantages over their contenders [21]. Although SVMs handle non-linear decision boundaries of arbitrary complexity, we limit ourselves in this paper to linear SVMs because of the nature of the data sets under investigation. Linear SVMs are particular linear discriminant classifiers. An extension of the algorithm to the non-linear case can be found in the discussion section. If the training data set is linearly separable, a linear SVM is a maximum margin classifier. The decision boundary (a straight line in the case of a two-dimensional separation) is positioned to leave the largest possible margin on either side. A particularity of SVMs is that the weights $w_i$ of the decision function D(x) are a function only of a small subset of the training examples, called support vectors. Those are the examples that are closest to the decision boundary and lie on the margin.
The existence of such support vectors is at the origin of the computational properties of SVMs and their competitive classification performance. While SVMs base their decision function on the support vectors, which are the borderline cases, other methods base their decision function on the average case. We use the soft-margin algorithm in this research. Training consists in executing the following quadratic program:

Algorithm SVM-train:

Inputs: Training examples $\{x_1, x_2, \ldots, x_k, \ldots, x_l\}$ and class labels $\{y_1, y_2, \ldots, y_k, \ldots, y_l\}$.

Minimize over $\alpha_k$:

$$J = \frac{1}{2} \sum_{h,k} y_h y_k \alpha_h \alpha_k \left( x_h \cdot x_k + \lambda \delta_{hk} \right) - \sum_k \alpha_k \qquad (1)$$

subject to: $0 \le \alpha_k \le C$ and $\sum_k \alpha_k y_k = 0$

Outputs: Parameters $\alpha_k$.

The summations run over all training patterns $x_k$, which are n-dimensional feature vectors; $x_h \cdot x_k$ denotes the scalar product; $y_k$ encodes the class label as a binary value +1 or -1; $\delta_{hk}$ is the Kronecker symbol ($\delta_{hk} = 1$ if h = k and 0 otherwise); and $\lambda$ and C are positive constants (soft margin parameters). The soft margin parameters ensure convergence even when the problem is non-linearly separable or poorly conditioned. In such cases, some of the support vectors may not lie on the margin. Most authors use either $\lambda$ or C. We use a small value of $\lambda$ (of the order of $10^{-14}$) to ensure numerical stability. For the problems under study, the solution is rather insensitive to the value of C because the training data sets are linearly separable down to just a few features. A value of C = 100 is adequate.

The resulting decision function of an input vector x is:

$$D(x) = w \cdot x + b, \quad \text{with} \quad w = \sum_k \alpha_k y_k x_k \quad \text{and} \quad b = \langle y_k - w \cdot x_k \rangle$$

The weight vector w is a linear combination of training patterns. Most weights $\alpha_k$ are zero. The training patterns with non-zero weights are support vectors. Those with weights satisfying the strict inequality $0 < \alpha_k < C$ are marginal support vectors. The bias value b is an average over marginal support vectors. Many resources on support vector machines, including computer implementations, are available online.

2.2 SVM-RFE Algorithm

SVM-RFE is an application of RFE using the weight magnitude as ranking criterion. We present below an outline of the algorithm in the linear case, using SVM-train in Equation (1).
Algorithm SVM-RFE:

Inputs:
  Training examples $X_0 = [x_1, x_2, \ldots, x_k, \ldots, x_l]^T$
  Class labels $y = [y_1, y_2, \ldots, y_k, \ldots, y_l]^T$
Initialize:
  Subset of surviving features s = [1, 2, ..., n]
  Feature ranked list r = [ ]
Repeat until s = [ ]:
  Restrict training examples to good feature indices: X = X0(:, s)
  Train the classifier: $\alpha$ = SVM-train(X, y)
  Compute the weight vector of dimension length(s): $w = \sum_k \alpha_k y_k x_k$
  Compute the ranking criteria: $c_i = (w_i)^2$, for all i
  Find the feature with smallest ranking criterion: f = argmin(c)
  Update feature ranked list: r = [s(f), r]
  Eliminate the feature with smallest ranking criterion: s = s(1:f-1, f+1:length(s))
Output: Feature ranked list r.

The algorithm can be generalized to remove more than one feature per step for speed reasons.

3 SVM-KNN Algorithm

3.1 KNN Algorithm

The 1-Nearest Neighbor (1NN) classifier is an important pattern recognition method based on representative points (Bian et al. 2000). In the 1NN algorithm, all training samples are taken as representative points, and the distances from a test sample to each representative point are computed. The test sample is assigned the class label of the representative point nearest to it. KNN is an extension of 1NN that determines the class of a test sample by finding its k nearest neighbors.

3.2 SVM-KNN Algorithm

First, by analyzing the classifying process of SVM, a relationship between SVM and 1NN is found. This relationship is the theoretical basis of SVM-KNN and is stated in Theorem 1.

[Theorem 1] An SVM classifier is equivalent to a 1NN classifier that chooses one representative point for the support vectors in each class.

We examined the distribution of the samples misclassified by the SVM and found that they almost always lie near the separating hyperplane. This suggests that the information in the region around the hyperplane should be used as much as possible in order to improve the classifying accuracy. Samples lying near the separating hyperplane are basically support vectors. The SVM algorithm chooses only one representative point for the support vectors in each class, and this point cannot efficiently represent the whole class. In this region we therefore classify with KNN instead, taking each support vector as a representative point, so that more useful information can be utilized. Specifically, for samples far from the separating hyperplane (Region II in Fig. 1), the SVM classifying algorithm is used, while for samples close to the hyperplane (Region I), the KNN classifying algorithm is used. The main steps of the new classifying algorithm are as follows:

step1 if $T_{test} \neq \Phi$, get $x \in T_{test}$; if $T_{test} = \Phi$, stop;
step2 calculate $g(x) = \sum_i \alpha_i y_i k(x_i, x) + b$;
step3 if $|g(x)| > \varepsilon$, calculate directly $f(x) = \mathrm{sgn}(g(x))$ as output; if $|g(x)| < \varepsilon$, put it into the KNN algorithm to classify;
step4 $T_{test} \leftarrow T_{test} - \{x\}$, go to step1.
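The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names (`linear_kernel`, `kernel_dist2`, `svm_knn_predict`), the toy support vectors, the multipliers, and the bias are all hypothetical. The bias is added as +b to match D(x) = w·x + b, a linear kernel stands in for a general k(·,·), and a tied KNN vote falls back to the sign of g(x), which the text does not specify.

```python
def linear_kernel(u, v):
    """Scalar product; any kernel k(u, v) could be substituted."""
    return sum(a * b for a, b in zip(u, v))


def kernel_dist2(kernel, x, xi):
    """Squared feature-space distance: k(x,x) - 2k(x,xi) + k(xi,xi)."""
    return kernel(x, x) - 2.0 * kernel(x, xi) + kernel(xi, xi)


def svm_knn_predict(x, svs, labels, alphas, b,
                    kernel=linear_kernel, eps=0.5, k=1):
    """Classify one sample with the SVM far from the hyperplane and
    with KNN over the support vectors close to it (steps 2 and 3)."""
    # step2: SVM output g(x) computed from the support vectors
    g = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, labels, svs)) + b
    # step3: far from the hyperplane, trust the SVM sign
    if abs(g) > eps:
        return 1 if g > 0 else -1
    # step3: close to the hyperplane, vote among the k nearest support vectors
    order = sorted(range(len(svs)),
                   key=lambda i: kernel_dist2(kernel, x, svs[i]))
    vote = sum(labels[i] for i in order[:k])
    if vote != 0:
        return 1 if vote > 0 else -1
    return 1 if g >= 0 else -1  # assumed tie-break, not from the text


# Hypothetical trained model: three support vectors, so that g(x) = x[0].
SVS = [(1.0, 1.0), (1.0, -1.0), (-1.0, 0.0)]
LABELS = [1, 1, -1]
ALPHAS = [0.25, 0.25, 0.5]
B = 0.0

# steps 1 and 4: iterate over the test set until it is empty
test_set = [(2.0, 0.0), (-3.0, 1.0), (-0.1, 1.0)]
predictions = [svm_knn_predict(x, SVS, LABELS, ALPHAS, B) for x in test_set]
print(predictions)
```

In this toy setting the third sample lies close to the hyperplane (|g(x)| is about 0.1), so it is passed to KNN; its nearest support vector belongs to the positive class even though the sign of g(x) is negative, which is the kind of case the combined classifier is meant to handle differently from the SVM alone.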

Figure 1. The distances from the test sample $\varphi(x)$ to the two representative points $\varphi(x)^+$ and $\varphi(x)^-$ are calculated in a high-dimensional feature space, and the threshold $\varepsilon$ and classifying algorithm are then decided.

In the steps described above, $T_{test}$ refers to the test set and $\Phi$ represents the empty set. The distance threshold $\varepsilon$ should satisfy $0 < \varepsilon < 1$. Note that the distance used in this algorithm is calculated in a high-dimensional feature space. The distance formula used here is based on the kernel function and takes the following form:

$$\|\varphi(x) - \varphi(x_i)\|^2 = k(x, x) - 2k(x, x_i) + k(x_i, x_i). \qquad (2)$$

4 Experiments and result analysis

The KDD dataset is used in the experiments to evaluate the performance of the proposed new model. The KDD dataset consists of network connection records generated by a TCP/IP dump. It contains 4,940,000 connection records. There are 41 features in each record. Ten percent of the original data are training data, each record carrying a label that identifies the category to which it belongs. We only discuss binary classification.

Four measures adapted from information retrieval [4] are used to evaluate the performance of an SVM model: precision = A/(A+B), recall = A/(A+C), false negative rate = C/(A+C), and false positive rate = B/(B+D). A, B, C, and D represent the numbers of detected intrusions, non-intrusions detected as intrusions, undetected intrusions, and non-intrusions not detected as intrusions, respectively. A false negative occurs when an intrusion has occurred but the system considers it non-intrusive behavior. A false positive occurs when the system classifies an action as an intrusion while it is a legitimate action. A good intrusion detection system should achieve a high precision and a high recall, as well as a low false positive rate and a low false negative rate. Considering both the precision and the false negative rate is very important because normal data usually significantly outnumber intrusion data in practice. Measuring only the precision of a system is misleading in such a situation.
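As a small worked example of the four measures, the sketch below computes them from the counts A, B, C, and D. The function name `detection_measures` and the counts are hypothetical, chosen only to illustrate the point about precision; they are not experimental results. The sketch assumes all denominators are non-zero.

```python
def detection_measures(A, B, C, D):
    """Return the four measures from the confusion counts.

    A: detected intrusions
    B: non-intrusions detected as intrusions
    C: undetected intrusions
    D: non-intrusions not detected as intrusions
    """
    return {
        "precision": A / (A + B),
        "recall": A / (A + C),
        "false_negative_rate": C / (A + C),
        "false_positive_rate": B / (B + D),
    }


# Hypothetical counts: 100 intrusions among 10,000 records, 30 of them detected.
measures = detection_measures(A=30, B=1, C=70, D=9899)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts the precision is about 0.97 while the false negative rate is 0.70, so 70 of the 100 intrusions go unnoticed despite the high precision.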
A poor intrusion detection system may have a high precision but a high false negative rate.

There are four steps in our experiments. The first step is to remove redundant intrusion records. The KDD dataset has more intrusion data than normal data. We filter the redundant intrusion records until the resulting dataset consists of 1.5% intrusions and 98.5% normal records. There are no obvious feature-value pairs in the dataset, so we use a mapping method to convert the dataset to feature-value format. The second step is to use RFE feature ranking and selection to calculate the weight of each feature and delete unimportant features. After processing, the number of features of the KDD dataset is narrowed down from 41 to 16. The third step is to train the SVM. We generate one training set and three test sets, each with 50,000 randomly selected records. The last step is to build a decision function to classify the test data. Experimental results for the dataset are presented in Table 1.
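The second step (RFE ranking followed by keeping the top-ranked features) can be sketched in Python as below. This is an illustration under stated assumptions, not the code used in the experiments. In particular, `train_linear_svm` is a hypothetical stand-in for SVM-train: it minimizes a primal soft-margin objective by batch subgradient descent instead of solving the dual quadratic program of Equation (1), and its parameters `lam`, `lr`, and `epochs` are arbitrary choices. Feature indices are 0-based here, whereas the algorithm outline uses 1-based indices, and the toy data are made up.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Stand-in trainer (assumption): batch subgradient descent on
    lam/2 * ||w||^2 + mean hinge loss. Returns (w, b)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # sample violates the margin
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b


def svm_rfe(X0, y, train=train_linear_svm):
    """Return feature indices ranked from most to least important."""
    s = list(range(len(X0[0])))  # surviving features
    r = []                       # feature ranked list
    while s:
        X = [[row[j] for j in s] for row in X0]  # X = X0(:, s)
        w, _ = train(X, y)
        c = [wi * wi for wi in w]                # ranking criteria c_i = w_i^2
        f = c.index(min(c))                      # smallest criterion
        r.insert(0, s[f])                        # r = [s(f), r]
        del s[f]                                 # eliminate feature f
    return r


# Toy data: feature 0 separates the classes strongly, feature 2 weakly,
# and feature 1 is uncorrelated with the label.
X_toy = [[2.0, 1.0, 0.5], [2.0, -1.0, 0.5],
         [-2.0, 1.0, -0.5], [-2.0, -1.0, -0.5]]
y_toy = [1, 1, -1, -1]

ranking = svm_rfe(X_toy, y_toy)
selected = ranking[:2]  # keep the top-ranked features, as in 41 -> 16
print(ranking, selected)
```

On this toy data the uninformative feature is eliminated first and the strongly separating feature is ranked highest. Truncating the ranked list to a fixed length mirrors the reduction from 41 to 16 features described above; how the cut-off of 16 was chosen is not stated in the text.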

Table 1. Comparisons of the experimental results on the KDD dataset

                              Nrecord  Nfeature  Precision (%)  False Negative (%)  CPU-Second
Test set 1  Conventional SVM
            Enhanced SVM
            Improvement                60.0%     0.4%           16.9%               66.0%
Test set 2  Conventional SVM
            Enhanced SVM
            Improvement                60.0%     0.5%           16.2%               65.0%
Test set 3  Conventional SVM
            Enhanced SVM
            Improvement                60.0%     0.3%           26.3%               66.0%

Here are some observations from the experiments. The improvements in performance are consistent across all three test sets. This suggests that the new model has good generalization ability. The new model outperforms the conventional SVM in all three measures, namely precision, false negative rate, and CPU time, for the KDD dataset. Although the improvement in precision is only 0.4% on average, the improvements in the other two are significant. The improvements in false negative rate are between 16.2% and 26.8%. The time used by the new model is only one third of that of the conventional SVM model. These results are similar to the results from other researchers with other methods on this dataset [29]. However, the CPU time is significantly reduced with the new model.

5 Conclusion

We propose an enhanced SVM model for intrusion detection. The new model adopts Recursive Feature Elimination (RFE) to rank the features of intrusion detection data and k-Nearest Neighbor (KNN) to further improve classification accuracy. Only the important features are counted when training an SVM. The results suggest that the proposed new model is effective for the KDD dataset. Although the precision levels of the conventional SVM and the new model do not differ much, the false negative rates of the new model are lower than those of the conventional SVM model. In addition, the time the new model needs to detect an intrusion is much less than that of the conventional SVM. The new model performed well in terms of accuracy. It also has the advantage that its running time is much shorter, because fewer features are used for classification.
References

[1] Bace, R.G.: Intrusion Detection. Macmillan Technical Publishing. (2000)
[2] Burges, C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery. 2(2) (1998) 121-167
[3] Dasarathy, B.V.: Intrusion detection, Information Fusion. 4(4) (2003)
[4] Frakes, W.B., Baeza-Yates, R., Ricardo, B.Y.: Information Retrieval: Data Structures and Algorithms, Prentice-Hall
[5] Han, J.C., Sanchez, R., Hu, X.H.: Feature Selection Based on Relative Attribute Dependency: An Experimental Study. RSFDGrC'05, I, LNAI (2005)
[6] Hu, K., Lu, Y., Shi, C.: Feature Ranking in Rough Sets. AI Communications. 16 (2003)
[7] Joachims, T.: Making large-scale SVM Learning Practical, Advances in Kernel Methods: Support Vector Learning, MIT-Press, (1999)
[8] John, G.H., Kohavi, R., Pfleger, K.: Irrelevant features and the subset selection problem. Proc. of the 11th Int. Conf. on Machine Learning. (1994)
[9] Lee, W., Stolfo, S.J.: Data Mining Approaches for Intrusion Detection. The 7th USENIX Security Symposium. (1998)
[10] Mohajerani, M., Moeini, A., Kianie, M.: NFIDS: A Neuro-fuzzy Intrusion Detection System. Proc. of the 10th IEEE Int. Conf. on Electronics, Circuits and Systems. (2003)
[11] Pawlak, Z., Grzymala-Busse, J., Slowinski, R., Ziarko, W.: Rough Sets. Communications of the ACM. 38(11) (1995)
[12] Qiao, Y., Xin, X.W., Bin, Y., Ge, S.: Anomaly Intrusion Detection Method Based on HMM. Electronics Letters. 38(13) (2002)
[13] Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer (1995)
[14] Wang, W.D., Bridges, S.: Genetic Algorithm Optimization of Membership Functions for Mining Fuzzy Association Rules. Proc. of the 7th Int. Conf. on Fuzzy Theory & Technology. (2000)
[15] Warrender, C., Forrest, S., Pearlmutter, B.: Detecting Intrusions Using System Calls: Alternative Data Models. Proc. of the IEEE Symposium on Security and Privacy. (1999)
[16] Yao, J.T., Zhang, M.: Feature Selection with Adjustable Criteria. RSFDGrC'05, I, LNAI (2005)

[17] Yao, J.T., Zhao, S.L., Saxton, L.V.: A Study on Fuzzy Intrusion Detection. Proc. of Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security, SPIE (2005)
[18] Amari S., Wu S., 1999, Neural Networks, 12(6), 783
[19] Bian Z. Q., Zhang X. G., 2000, Pattern Recognition, Beijing: TsingHua Univ. Press
[20] Chang C. C., Lin C. J., 2001, LIBSVM: a library for support vector machines (Version 2.3.1), http://citeseer.ist.psu.edu/chang01libsvm.html
[21] Cristianini N., Campbell C., Shawe-Taylor J., 1999, Neural Networks (ESANN)
[22] Cui Y. M., Li R., Zhang L. Y., He Y. L., Wang H. N., 2006, Sol. Phys., 237, 45
[23] Gallagher P. T., Moon Y.-J., Wang H. M., 2002, Sol. Phys., 209, 171
[24] Gavrishchaka Valeriy V., Ganguli S. B., 2001, J. Geophys. Res., 106,
[25] Li R., Ye S. W., Shi Z. Z., 2002, Chinese Journal of Electronics, 30(5), 745
[26] Lundstedt H., 1997, Geophys. Monogr., 98, 243
[27] Gong J. C., Xue B. S., Liu S. Q. et al., 2004, Chinese Astronomy and Astrophysics, 28, 174
[28] McIntosh P. S., 1990, Sol. Phys., 125, 251
[29] Wang J. L., 2000, Chinese Astronomy and Astrophysics, 24, 10
[30] Wheatland M. S., 2004, AJ, 609, 1134
[31] Zhang G. Q., Wang J. L., 1994, Progress in Geophysics, 9, 54
[32] Zhu C. L., Wang J. L., 2003, Chin. J. Astron. Astrophys. (ChJAA), 3, 563
[33] Wu M. R., 2000, The research on classifier design for pattern recognition problems of large scale, Ph.D, Beijing: Tsinghua University
