A Genetic-Fuzzy Classification Approach to improve High-Dimensional Intrusion Detection System


Imen Gaied 1, Farah Jemili 1, and Ouajdi Korbaa 1

1 MARS Laboratory, ISITCom, University of Sousse, Tunisia
gaiedimen@gmail.com, Jmili_farah@yahoo.fr, Ouajdi.Korbaa@centraliens-lile.org

Abstract. With the growing number of attacks and the increasing scale of connected networks over the past few years, researchers have had to find new ways to judge the relevance, severity and correlation of network attacks. A high-dimensional intrusion detection system is a promising dynamic protection component in the security field. In this work we propose an optimized classification scheme that coordinates several techniques for generating fuzzy association rules from a large data set. Our main goal is to improve the detection rate of attacks in a real-time environment by using the one-versus-one decomposition, while keeping the false alarm rate as low as possible. In addition, we aim to reduce the loss of knowledge by modelling the conjunction in fuzzy rules with a suitable n-dimensional overlap function, so as to provide sufficient classification accuracy. An aggregation method is then used to obtain the final decision. The performance of our approach is evaluated in an experimental study. The results show that our approach outperforms the other classifiers considered, providing the highest detection accuracy together with a low false alarm rate and low time consumption.

Keywords: Intrusion detection system, OVO decomposition, N-dimensional overlap function, Fuzzy association rules, Detection rate, False positives.

1 Introduction

Network attacks have recently grown exponentially in number and increased in severity. Developing an efficient Intrusion Detection System (IDS) is therefore an important open problem that is receiving considerable attention from the research community.
The main focus of this work is to put forward an intelligent and accurate intrusion detection approach that uses linguistic Fuzzy Rule-Based Classification Systems (FRBCS) in synergy with preprocessing techniques, pairwise learning and n-dimensional overlap functions. This approach aims to explore the large search space, achieve more accurate results, maintain high confidence and good coverage of the dataset, and provide the user with high-quality rules. To our knowledge, this is the first research paper which suggests this approach for intrusion detection problems.

To reach these goals, we first have to deal with the imbalanced distribution of classes in high-dimensional problems. In fact, the accuracy of a classifier is directly affected by the quality of the training data used to construct the final model. We therefore preprocess the datasets to remove or correct noisy patterns. Then, we use the One-Versus-One (OVO) decomposition to decrease the complexity and the noise effect of the original problem [1]. To further decrease the effects of noise, we use the fuzzy association rule learning algorithm known as FARC-HD (Fuzzy Association Rule-based Classification model for High-Dimensional problems) [2] in order to obtain the most accurate and highest-quality rules. Nevertheless, using the product as the T-norm in the baseline FARC-HD algorithm with an OVO decomposition yields confidence values with low variation, especially when the fuzzy rules have a large number of antecedents. These values are used in the aggregation step to reach the final decision. Some robust aggregation methods are therefore affected by this undesirable condition and can lead to poorer classification than the original FARC-HD. To address this problem, we substitute the product T-norm with a suitable overlap function [3][4]. These prior works defined n-dimensional overlap functions in order to obtain outputs from the base classifiers that span a wider range. Hence, more information is preserved for the subsequent aggregation process, which can improve classification in the OVO scheme. To evaluate the performance of this methodology on the intrusion detection problem, we carried out a comparison study with another approach based on the original FARC-HD and the OVO strategy [5]. Accordingly, we show the validity of these proposals in improving intrusion detection accuracy compared to other models.

The remainder of this paper is organized as follows. Section 2 first describes the operating principle of the FARC-HD algorithm and then the notion of classification via pairwise techniques. After that, it presents the Binary Tree of SVM as an aggregation method, and finally gives a brief definition of n-dimensional overlap functions. The experimental framework is presented in Section 3. The experiments and results are discussed in Section 4. The conclusion and future work are given in Section 5.

2 Proposed methodology

2.1 FARC-HD algorithm

In this work we use the FARC-HD algorithm [2] to provide high-quality rules in a high-dimensional problem. The classification problem consists of m training examples $x_p = (x_{p1}, \ldots, x_{pn}, C_p)$, $p = 1, 2, \ldots, m$, from M classes, where $x_{pi}$ is the i-th attribute value ($i = 1, 2, \ldots, n$). A fuzzy rule has the following form:

Rule $R_j$: If $x_1$ is $A_{j1}$ and ... and $x_n$ is $A_{jn}$ then Class = $C_j$ with $RW_j$ (1)

where $R_j$ is the label of the j-th rule, $x = (x_1, \ldots, x_n)$ is an n-dimensional pattern vector, $A_{ji}$ is an antecedent fuzzy set, $C_j$ is the class label and $RW_j$ is the rule weight.
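To make the rule form (1) concrete, the following minimal Python sketch represents one rule and computes how strongly a normalized pattern activates its antecedents. It is illustrative only and is not the authors' implementation: the uniform partition of [0, 1] into five triangular labels, the names `FuzzyRule`, `label_membership` and `matching_degree`, and the sample rule and pattern are all assumptions. The conjunction operator is passed as a parameter so that the product can be compared with the minimum, the geometric mean and the harmonic mean.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership of x in a triangular fuzzy set with feet a, c and peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def label_membership(x: float, label: int, n_labels: int = 5) -> float:
    """Membership of x in the label-th set of a uniform partition of [0, 1].

    The uniform partition is an assumption made for this sketch.
    """
    step = 1.0 / (n_labels - 1)
    peak = label * step
    return triangular(x, peak - step, peak, peak + step)


# Candidate conjunction operators (all map values in [0, 1] to [0, 1]).
def prod(d: Sequence[float]) -> float:
    return math.prod(d)


def minimum(d: Sequence[float]) -> float:
    return min(d)


def geometric_mean(d: Sequence[float]) -> float:
    return math.prod(d) ** (1.0 / len(d))


def harmonic_mean(d: Sequence[float]) -> float:
    if min(d) == 0.0:
        return 0.0
    return len(d) / sum(1.0 / v for v in d)


@dataclass(frozen=True)
class FuzzyRule:
    labels: Tuple[int, ...]  # one linguistic label index per attribute
    cls: str                 # consequent class C_j
    weight: float            # rule weight RW_j


def matching_degree(rule: FuzzyRule, x: Sequence[float],
                    conj: Callable[[Sequence[float]], float]) -> float:
    """Combine the antecedent memberships of x with the given conjunction."""
    degrees = [label_membership(v, lab) for v, lab in zip(x, rule.labels)]
    return conj(degrees)


# Hypothetical rule and pattern, chosen only for illustration.
rule = FuzzyRule(labels=(1, 3, 2), cls="DOS", weight=0.9)
pattern = (0.30, 0.70, 0.50)
degrees_by_operator = {
    op.__name__: matching_degree(rule, pattern, op)
    for op in (prod, minimum, harmonic_mean, geometric_mean)
}
```

For inputs in [0, 1] the product never exceeds the minimum, and it shrinks as more antecedents are added, which is the low-variation effect mentioned in the Introduction.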

The learning process consists of three main steps: fuzzy association rule mining for classification, candidate rule prescreening, and genetic rule selection with lateral tuning. A search tree is built for each class to obtain the fuzzy rule base. The search space is reduced by generating only rules with high support and high confidence. To preselect the strongest rules, a rule assessment criterion based on the "subgroup discovery" mechanism [8] is used. To decrease computational cost, an evolutionary algorithm then performs a lateral tuning of the fuzzy sets [9] and selects the final best rules from the rule base.

To classify a new example, FARC-HD applies a fuzzy reasoning method called additive combination, composed of four steps: matching degree, association degree, confidence degree and classification.

In the first step, the strength of activation of the if-part of every rule in the Rule Base (RB) for the pattern $x_p$ is computed with a conjunction operator (T-norm):

$$\mu_{A_j}(x_p) = T\big(\mu_{A_{j1}}(x_{p1}), \ldots, \mu_{A_{jn}}(x_{pn})\big), \quad j = 1, \ldots, L \qquad (2)$$

In the second step, the association degree of the pattern $x_p$ with each rule in the RB is computed by combining the matching degree with the Rule Weight (RW) through a combination operator h, where k refers to the consequent class of rule $R_j$:

$$b_j^k = h\big(\mu_{A_j}(x_p), RW_j^k\big), \quad k = 1, \ldots, M; \; j = 1, \ldots, L \qquad (3)$$

In the third step, the confidence degree of each class is obtained by combining the positive association degrees computed in the previous step with an aggregation function f (the sum in the case of FARC-HD):

$$y_k = f\big(b_j^k,\; j = 1, \ldots, L \text{ and } b_j^k > 0\big), \quad k = 1, \ldots, M \qquad (4)$$

In the last step, a decision function F predicts the class that obtains the highest confidence degree:

$$F(y_1, \ldots, y_M) = \arg\max_{k = 1, \ldots, M} y_k \qquad (5)$$

2.2 Classification by using decomposition strategies: OVO

In the OVO strategy, the original multi-class problem is divided into m(m-1)/2 binary sub-problems (m denoting here the number of classes), each of which aims to distinguish a pair of classes $C_i$, $C_j$.
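As a small illustration of the decomposition, the sketch below enumerates the class pairs and extracts the training subset for one pair. The function names and the toy samples are hypothetical, and the class list (the four attack categories plus a "Normal" class) is an assumption about how the labels are encoded.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def ovo_pairs(classes: Sequence[str]) -> List[Tuple[str, str]]:
    """Return every unordered pair of classes, one per binary sub-problem."""
    return list(combinations(classes, 2))


def subset_for_pair(samples: Sequence[Tuple[tuple, str]],
                    pair: Tuple[str, str]) -> List[Tuple[tuple, str]]:
    """Keep only the samples whose label is one of the two classes in the pair."""
    return [(x, y) for x, y in samples if y in pair]


# Assumed label set: four attack categories plus normal traffic.
classes = ["Normal", "DOS", "Probe", "R2L", "U2R"]
pairs = ovo_pairs(classes)

# Toy samples (feature tuple, label), invented for illustration only.
samples = [((0.1, 0.2), "Normal"), ((0.9, 0.8), "DOS"), ((0.5, 0.4), "Probe")]
normal_vs_dos = subset_for_pair(samples, ("Normal", "DOS"))
```

With five classes this gives ten binary classifiers, each trained only on the examples of its two classes.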
When a new pattern is presented to each binary classifier, a pair of confidence degrees $r_{ij}$, $r_{ji}$ is given in favor of the two classes $C_i$, $C_j$ (the class with the larger confidence is the output class of this classifier). The outputs (confidence degrees) of all binary classifiers are collected in a score matrix R and combined by an aggregation model to make the final class prediction:

$$R = \begin{pmatrix} - & r_{12} & \cdots & r_{1m} \\ r_{21} & - & \cdots & r_{2m} \\ \vdots & & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & - \end{pmatrix} \qquad (6)$$

The literature [6] distinguishes several aggregation methods: the Max-Wins rule, known as the Voting strategy (VOTE); the Winner Weighted Voting (WINW), whose validity was shown in [4]; and the methods based on preference relations (ND, LVPC). Please refer to [6][4][10] for more information. In this work, however, we adopt the Binary Tree of SVM (BTS) architecture [10] for the final decision.

The BTS is a tree-structured architecture that can easily be extended to any type of binary classifier and that recursively constructs a binary tree. At the beginning, the root node considers all classes in the list of classes, and a binary classifier is selected randomly for training in order to obtain a separating plane. Then, all the samples in the node are assigned to two subnodes derived from the two classes. To complete the binary tree, this strategy is applied in every node until a leaf node containing only one class is reached. At each node, the binary classifier that discriminates between two classes is also used to separate the remaining classes at the same time, so more than one class can be removed from the list in a single decision. In addition, to avoid any false assumption, especially when a leaf contains more than one class, the output is computed with the voting strategy.

Fig. 1. Six-class problem determined by BTS [10]: (a) multiclass problem; (b) architecture of BTS.

Fig. 1 illustrates this concept on a six-class problem. The first node classifier discriminates classes 1 and 2. On the one hand, when class 1 is predicted, classes 4 and 6 are removed (when a testing sample is at position A, it is unlikely to belong to class 4 or 6). On the other hand, when class 2 is predicted, only class 1 is removed. Classes 3 and 5 are therefore kept in both child nodes by using probabilistic outputs, because class 3 is near the decision function (a testing sample at position B has a chance of being classified as class 3) and because class 5 cannot be distinguished by the classifier in the root node.
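To illustrate how a score matrix of the form (6) is turned into a prediction, the following sketch implements the VOTE rule and a winner-weighted variant. It is a simplified reading of these aggregation methods, not the authors' code: the function names and the example matrix are assumptions, ties are broken by taking the first class, and the tree-based BTS strategy is not reproduced here.

```python
from typing import List, Sequence


def vote(R: Sequence[Sequence[float]], classes: Sequence[str]) -> str:
    """Max-Wins: each binary classifier gives one vote to its winning class."""
    m = len(classes)
    votes: List[int] = [
        sum(1 for j in range(m) if j != i and R[i][j] > R[j][i])
        for i in range(m)
    ]
    return classes[max(range(m), key=votes.__getitem__)]


def winner_weighted_vote(R: Sequence[Sequence[float]],
                         classes: Sequence[str]) -> str:
    """Like VOTE, but each win is weighted by the winner's confidence degree."""
    m = len(classes)
    totals: List[float] = [
        sum(R[i][j] for j in range(m) if j != i and R[i][j] > R[j][i])
        for i in range(m)
    ]
    return classes[max(range(m), key=totals.__getitem__)]


# Hypothetical score matrix for three classes; R[i][j] is the confidence in
# favor of class i given by the classifier for the pair (i, j). The diagonal
# is unused.
classes = ["Normal", "DOS", "Probe"]
R = [
    [0.00, 0.90, 0.60],
    [0.10, 0.00, 0.55],
    [0.40, 0.45, 0.00],
]
predicted_by_vote = vote(R, classes)
predicted_by_winw = winner_weighted_vote(R, classes)
```

When the confidence degrees in R all lie in a narrow range, weighted schemes carry little more information than plain voting, which is why the range of the base classifiers' outputs matters for the aggregation step.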

2.3 Classification by using N-Dimensional Overlap Functions

As mentioned above, the confidence degrees provided by the original FARC-HD have low variation, which is a disadvantage for the aggregation process performed in the OVO scheme. This negative effect is explained by the use of the product T-norm to model the conjunction. To provide a better synergy between FARC-HD and the OVO decomposition, the proposed solution substitutes the product in the association degree with an n-dimensional overlap function, so as to obtain results in a wider range and to stop penalizing rules that have a large number of antecedents. In this way, the outputs of the base classifiers become more suitable for the subsequent aggregation step. In our case we use the MIN operation as the overlap function, which satisfies the discrimination capability property of FARC-HD and preserves idempotence:

$$O_n(x_1, \ldots, x_n) = \min(x_1, \ldots, x_n) \qquad (7)$$

3 Experimental framework

In this section, we first describe the datasets selected for the experimental study (Section 3.1). Next, we give details of the base classifiers used in the study by describing their configuration parameters and aggregation strategies (Section 3.2). Finally, we present the measures employed to evaluate the performance of each classifier (Section 3.3).

3.1 Datasets: KDD CUP 99

The experiments were carried out on the KDD99 benchmark dataset, which has been used by many researchers to evaluate their IDS [5][7][12]. It is the most widespread and complete dataset used in data mining applications. It is divided into labelled data for the training phase (about 5,000,000 records) and unlabelled data for the test base, named "corrected KDD" (about 311,000 records, including 14 new types of attacks). Each record describes a TCP/IP connection with 41 attributes, and the data comprise four main classes of attacks, namely Probe, R2L, U2R and DOS.
Unfortunately, this dataset has some disadvantages: the redundant nature of the alerts, and a large base of 41 attributes with an imbalanced class distribution. These problems affect the detection of rare attacks more than that of dominant ones. To resolve this, some researchers have used data sets of different sizes prepared by random selection [5][7].

3.2 Configuration parameters used for study

In this section, we briefly present the configuration of the FARC-HD algorithm used as a base classifier and the different combination methods for the OVO decomposition scheme. We chose 5 labels per attribute for the fuzzy sets, which take the form of triangular membership functions (MFs). We considered five aggregation methods: our proposed BTS method together with VOTE, the Non-Dominance Criterion (ND), the Learning Valued Preference for Classification (LVPC) and WINW. As conjunction operators, we selected the Product (PROD), the Geometric Mean (GM), the Harmonic Mean (HM) and the Minimum operator (MIN). As the inference procedure we used the additive combination. Furthermore, we fixed the minimum support to 0.05, the minimum confidence to 0.8, the maximum depth of the tree to 3, the k parameter for the prescreening to 2, the maximum number of evaluations to 4,327 and the population size to 50 [2].

3.3 Performance measure of intrusion detection

It is not easy to design an evaluation strategy that leads to a meaningful conclusion. Indeed, the methods used previously show three major drawbacks: the use of a non-representative test base, the lack of a rigorous evaluation methodology, and the use of improper metrics. For these reasons, we considered the following evaluation measures:

Classification Rate (CR): The classical metric, also called overall accuracy. It is the fraction of correctly predicted instances (TP+TN) over the total number of instances (n).

Cohen's kappa (Kappa) [11]: An alternative measure to the CR. It scores the successes independently for each class and aggregates them, whereas the CR scores all the successes over all the classes together. Kappa is well suited to measuring the classification rate because it is less sensitive to the randomness caused by a different number of instances in each class.

Mean F-Score: The average of the F-Scores of the individual classes. The F-Score represents the trade-off between precision and recall, and these measures are commonly used to evaluate the prediction of rare classes.

Detection Rate (DR): An important measure for intrusion detection. It is the number of correctly detected attacks (TP) divided by the total number of attacks (TP+FN).

False Positive Rate (FPR): The ratio between the number of normal examples detected as attacks (FP) and the total number of normal examples (FP+TN).

Receiver Operating Characteristic (ROC): Used to compare detectors with one another. The ROC curve is generated by plotting the true positive rate (sensitivity) against the false positive rate (1 - specificity) for each threshold or decision cutoff. A data point in the upper left corner corresponds to the optimal case of a high DR with a low FPR.

Area under ROC Curve (AUC): A performance measure used to compare IDSs, with an area of 1.0 representing a perfect test that classifies all positive and negative cases correctly.

4 Experimental Study

In this section, we discuss the experiments conducted to show the performance of our suggested approach on large datasets. All the experiments were implemented in Java on an Intel Pentium IV personal computer with a 2.40 GHz processor and 7.88 GB of RAM.
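The measures of Section 3.3 can be computed from a confusion matrix, as in the short Python sketch below. It is given for illustration only and is not the Java code used in the experiments; the function names, the convention that rows are actual classes and columns are predicted classes, and the example counts are all assumptions.

```python
from typing import Sequence

Matrix = Sequence[Sequence[int]]


def classification_rate(conf: Matrix) -> float:
    """Overall accuracy: correctly classified instances over all instances."""
    total = sum(sum(row) for row in conf)
    return sum(conf[k][k] for k in range(len(conf))) / total


def cohen_kappa(conf: Matrix) -> float:
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = sum(sum(row) for row in conf)
    m = len(conf)
    observed = sum(conf[k][k] for k in range(m)) / n
    expected = sum(
        sum(conf[k]) * sum(conf[i][k] for i in range(m)) for k in range(m)
    ) / (n * n)
    return (observed - expected) / (1.0 - expected)


def mean_f_score(conf: Matrix) -> float:
    """Average of the per-class F-scores (harmonic mean of precision, recall)."""
    m = len(conf)
    scores = []
    for k in range(m):
        tp = conf[k][k]
        fn = sum(conf[k]) - tp
        fp = sum(conf[i][k] for i in range(m)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return sum(scores) / m


def detection_rate(tp: int, fn: int) -> float:
    """Correctly detected attacks over all attacks."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Normal examples flagged as attacks over all normal examples."""
    return fp / (fp + tn)


# Invented two-class example: row 0 = actual normal, row 1 = actual attack.
conf = [
    [95, 5],   # 95 true negatives, 5 false positives
    [10, 90],  # 10 false negatives, 90 true positives
]
cr = classification_rate(conf)
kappa = cohen_kappa(conf)
mfs = mean_f_score(conf)
dr = detection_rate(tp=90, fn=10)
fpr = false_positive_rate(fp=5, tn=95)
```

The multi-class functions accept any square confusion matrix, so the same code applies to the five-class setting once attack categories are kept separate.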

4.1 Datasets and pre-processing

We applied three main preprocessing phases [7]. First, we removed all the repeated records in the entire KDD training and test sets and retained only one copy of each record. Second, to reduce the complexity of the large dataset, we performed feature selection by applying factorial multiple correspondence analysis [12] to extract the most relevant features of the data; we used just 9 relevant attributes. Third, a normalization step is necessary to make the data suitable for fuzzy logic; it is performed with the Min-Max function.

4.2 Classifier structure

Here we study how the different overlap functions and the different aggregation strategies affect the rule base size on the one hand and the computation time of the learning algorithm on the other hand. To do so, a 5-fold cross-validation model was considered. We used the whole of the 10% KDD dataset and split it randomly into 90% of the instances for training and 10% for testing. We considered the average number of rules and of antecedents per rule for each overlap function. The results are presented in Table 1. Table 2 shows the computation time of each proposed method. Our aim is an optimized IDS with low-cost rules, which is achieved with the MIN conjunction function and the BTS aggregation method.

Table 1. Number of rules and antecedents by rule for each conjunction function. Columns: Conjunction function, Number of rules, Number of antecedents by rule. Rows: PROD, GM, HM, MIN.

Table 2. Computation time for each conjunction function. Columns: Conjunction function, Computation time (min) for VOTE, ND, LVPC, WINW, BTS. Rows: PROD, GM, HM, MIN.

4.3 Intrusion detection performance comparison

This section analyzes which method is the most robust strategy for IDS problems. The experimental study was carried out in three steps. First, we studied the effect of our proposed overlap function and aggregation method, which are considered the most appropriate to improve the final performance of our suggested IDS. Second, we emphasize the effectiveness of our IDS in contrast with the other approaches. Finally, we adopted ROC graphs and the AUC measure, which are a convenient way of comparing detectors with one another.

Table 3 shows the results on the test partition (10% of the dataset) of the baseline FARC-HD algorithm within the OVO strategy (FARCHD-OVO) for each aggregation method. Table 4, Table 5 and Table 6 give the results of FARCHD-OVO with the GM, HM and MIN overlap functions respectively. It is important to note that the MIN function reaches a good trade-off between DR and FPR, with a DR of 95.80% and an FPR of 1.41%, compared to the other overlap functions: GM yields a DR of 94.25% and an FPR of 0.67%, and HM a DR of 94.95% and an FPR of 0.65%.

Table 3. Test evaluation of the FARCHD-OVO-PROD approach. Columns: Approach, CR, Kappa, MFM, DR, FPR. Rows: FARCHD-ND-PROD [5], FARCHD-VOTE-PROD, FARCHD-LVPC-PROD, FARCHD-WINW-PROD, FARCHD-BTS-PROD.

Table 4. Test evaluation of the FARCHD-OVO-GM approach. Columns: Approach, CR, Kappa, MFM, DR, FPR. Rows: FARCHD-ND-GM, FARCHD-VOTE-GM, FARCHD-LVPC-GM, FARCHD-WINW-GM, FARCHD-BTS-GM.

Table 5. Test evaluation of the FARCHD-OVO-HM approach. Columns: Approach, CR, Kappa, MFM, DR, FPR. Rows: FARCHD-ND-HM, FARCHD-VOTE-HM, FARCHD-LVPC-HM, FARCHD-WINW-HM, FARCHD-BTS-HM.

Table 6. Test evaluation of the FARCHD-OVO-MIN approach. Columns: Approach, CR, Kappa, MFM, DR, FPR. Rows: FARCHD-ND-MIN, FARCHD-VOTE-MIN, FARCHD-LVPC-MIN, FARCHD-WINW-MIN, FARCHD-BTS-MIN.

Comparing the results of our suggested FARCHD-BTS-MIN scheme with those of the other schemes, we see a significant improvement in all cases, in particular in the classification rate, Cohen's kappa and the detection rate. It must also be pointed out that our approach is well suited to correctly detecting the boundaries of all classes, because the mean F-Score, as with the FARCHD-VOTE-PROD method, achieves a high score compared with the other methods. These schemes prove their ability to detect rare attacks alongside the other categories of attacks. Additionally, our approach provides the best trade-off between two important metrics, the detection rate and the false alarm rate, with 95.80% and 1.41% respectively.

It is important to stress that few existing methods in the cited references have addressed this strategy in intrusion detection. Among these studies, the work in [5] opted for the FARCHD-ND-PROD scheme and aimed to prove its effectiveness in detecting rare attacks alongside the other classes of attacks. Our method clearly outperforms it: it detects rare attacks with a rate of 68.24%, compared to 64.75% for the FARCHD-ND-PROD method.

To complete our experimental study, Fig. 2 illustrates the performance of our method using ROC graphs and the AUC measure, compared to the other detection methods that use the MIN operator. To calculate the multi-class AUC ("Total AUC"), we computed the average of the AUC of each class. The ROC curve of the FARCHD-BTS-MIN classifier always lies above those of the other classifiers. The AUC values are given in Table 7.

Fig. 2. ROC Curve

Table 7. AUC value for each aggregation method using the MIN overlap function. Columns: Aggregation method (ND, VOTE, LVPC, WINW, BTS), AUC.

5 Conclusion

In this work we put forward an intelligent misuse intrusion detection approach based on an interpretable model, using linguistic rules from the FARC-HD algorithm, pairwise learning built on the architecture known as the BTS method, and the MIN operator as the conjunction function.
Our new approach is characterized by good interpretability, a high attack detection rate with a low false alarm rate, and high computational speed. This conclusion is supported by the experiments, which have shown the effectiveness of our approach. Nevertheless, the detection of rare attacks still needs improvement.

References

1. Saez, J.A., Galar, M., Luengo, J., Herrera, F.: Analyzing the presence of noise in multiclass problems: alleviating its influence with the one-vs-one decomposition. Knowledge and Information Systems, vol. 38, no. 1 (2014)
2. Alcala-Fdez, J., Alcala, R., Herrera, F.: A fuzzy association rule based classification model for high-dimensional problems with genetic rule selection and lateral tuning. IEEE Transactions on Fuzzy Systems, vol. 19, no. 5 (2011)
3. Elkano, M., Galar, M., Sanz, J., Bustince, H.: Fuzzy rule-based classification systems for multi-class problems using binary decomposition strategies: On the influence of n-dimensional overlap functions in the fuzzy reasoning method. Information Sciences, vol. 332 (2016)
4. Elkano, M., Galar, M., Sanz, J.A., Fernandez, A., Barrenechea, E., Herrera, F., Bustince, H.: Enhancing multiclass classification in FARC-HD fuzzy classifier: On the synergy between n-dimensional overlap functions and decomposition strategies. IEEE Transactions on Fuzzy Systems, vol. 23, no. 5 (2015)
5. Elhag, S., Fernandez, A., Bawakid, A., Alshomrani, S., Herrera, F.: On the combination of genetic fuzzy systems and pairwise learning for improving detection rates on intrusion detection systems. Expert Systems with Applications, vol. 42, no. 1 (2015)
6. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes. Pattern Recognition, vol. 44, no. 8 (2011)
7. Gaied, I., Jemili, F., Korbaa, O.: Intrusion detection based on Neuro-Fuzzy classification. International Conference on Computer Systems and Applications AICCSA 2015, vol. 5, pp. 1-8, November (2015)
8. Kavsek, B., Lavrac, N.: APRIORI-SD: Adapting association rule learning to subgroup discovery. Applied Artificial Intelligence, vol. 20, no. 7 (2006)
9. Alcala, R., Herrera, F.: A proposal for the genetic lateral tuning of linguistic fuzzy systems and its interaction with rule selection. IEEE Transactions on Fuzzy Systems, vol. 15, no. 4 (2007)
10. Fei, B., Liu, J.: Binary tree of SVM: a new fast multiclass training and classification algorithm. IEEE Transactions on Neural Networks, vol. 17, no. 3 (2006)
11. Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement, vol. 20 (1960)
12. Jemili, F., Zaghdoud, M., Ben Ahmed, M.: Intrusion Detection based on Hybrid Propagation in Bayesian Networks. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics, Dallas (2009)
