CHAPTER 5 CONTRIBUTORY ANALYSIS OF NSL-KDD CUP DATA SET

An IDS monitors network activity through incoming and outgoing data to assess how the data is being used, thereby identifying any suspicious activity and raising an alert on signs of intrusion [5]. There are two types of intrusion detection techniques, known as misuse detection and anomaly detection. Misuse detection is possible only for those attacks whose prior knowledge is present in the data set used for training the model [13]. The challenge is to develop an efficient model for real-time intrusion detection that can work on online data. Anomaly detection [14], also called the profile-based detection approach, adapts to the normal behavior of the user/network and applies statistical measures to events or activities to decide whether an encountered event is normal [15]. Although a number of measures are available to analyze the performance of an IDS, the focus of this study is on two key performance metrics: DR and FAR. The efficiency of an IDS can be expressed in terms of these two metrics, which can be depicted in the form of an ROC curve [46].

5.1 INTRODUCTION

The ultimate aim in the development of an IDS is to achieve the highest accuracy. The two basic intrusion detection techniques have their own advantages and disadvantages: misuse detection can detect known attacks very well with a low FAR but fails to identify novel attacks, whereas the strength of anomaly detection is its ability to detect unknown attacks, at the cost of a high FAR [70], [71]. The KDD Cup data set has played a key role in studying and analyzing IDS; its attributes can be grouped into four labeled classes. The objective of this study is to assess the contribution of the attributes of each of these four classes towards achieving high DR and low FAR. Machine learning algorithms are employed to study the classification of the KDD Cup data set into the two classes of normal and anomalous data.
Different variants of the KDD Cup data set are created with respect to the four labels, and each variant is simulated on a set of three algorithms. The results derived from each data variant are analyzed and compared to draw a broad conclusion. This pragmatic study compiles the findings for DR and FAR in IDS with respect to the data under each of the four labels.

The study contributes to the estimation of the attributes needed to achieve maximum DR and minimum FAR simultaneously, while adhering to earlier findings that signify the obligatory connection of the basic labeled attributes to intrusion detection. Further, an attribute ranking technique is used to rank the 41 attributes of the KDD Cup data set with reference to IDS. The label of each ranked attribute is identified, and the ranked attributes are studied with respect to the four attribute labels. The results are compared with the existing attribute label studies [43], [72] presented in this and the last chapter and with feature selection studies, thus validating the contribution of each of the four attribute labels for IDS. The study can be useful to researchers experimenting in the area of feature selection/reduction, and its results can also be fruitful when a new database is to be developed for IDS. The study does not focus on individual attributes, because an attribute's characteristics may change with platform and protocol; instead it clarifies the role of the attribute labels, which remains largely the same. It can thus help reduce data complexity while identifying the major attributes of a particular label that are significant for obtaining high DR and low FAR at the same time.

Attribute Class/Label    Abbreviation    Attributes
Basic                    B               1-9
Content                  C               10-22
Traffic                  T               23-31
Host                     H               32-41

Table 5.1: Categorization of attributes with four labels.

In this practical study, the NSL-KDD Cup data set is used. As discussed in the last chapter, this data set has 42 attributes, of which 41 are classified under one of the following labels: Basic, Content, Traffic, or Host [31], [35]. The categorization of the 41 attributes under the four labels is given in Table 5.1.
The selected data set admits several class arrangements: the records may be classified into two classes, normal/anomalous, or into five classes, namely normal, Denial of Service (DoS) attack, User to Root (U2R) attack, Remote to Local (R2L) attack and Probe attack.
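The label ranges of Table 5.1 can be captured in a small lookup. The helper below is an illustrative sketch (the function name is not part of the original toolchain) mapping a 1-based attribute index to its class abbreviation:

```python
# Hypothetical helper: map a 1-based NSL-KDD attribute index to its class
# label, following the ranges of Table 5.1 (B: 1-9, C: 10-22, T: 23-31, H: 32-41).
def attribute_label(index: int) -> str:
    if 1 <= index <= 9:
        return "B"   # Basic
    if 10 <= index <= 22:
        return "C"   # Content
    if 23 <= index <= 31:
        return "T"   # Traffic
    if 32 <= index <= 41:
        return "H"   # Host
    raise ValueError(f"attribute index {index} outside 1-41")
```

The ranges imply the per-label attribute counts used throughout this chapter: B = 9, C = 13, T = 9, H = 10 (41 in total).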

5.1.1 FEATURE RANKING RULES

Feature selection refers to the process of identifying prominent attributes with respect to their contribution towards the desired goal. The lower the dimensionality of the data set, the lighter the system developed on top of it [5]. Reducing the number of attributes always entails some loss of information, so the basic requirement of the developed system must be understood first; the results obtained with the original attribute set can then be compared with the results obtained with the reduced set. Rules designed to ascertain the significance of an attribute for IDS are listed in Table 5.2, where A stands for Accuracy, FP for False Positive and FN for False Negative. If A increases while FP and FN both decrease, the feature under study is concluded to be insignificant. If A decreases while FP and FN increase, the feature is identified as important. If FN increases while A and FP remain constant, the feature is again treated as important. In all other cases, the feature is considered important.

A           FP          FN          Feature Significance
Increases   Decreases   Decreases   Insignificant
Decreases   Increases   Increases   Important
Constant    Constant    Increases   Important
X           X           X           Important

Table 5.2: Rules to determine feature significance.

5.2 OBJECTIVE

The objective of this research is to study and interpret the role of the 41 attributes of the NSL-KDD Cup data set, grouped under the four labels of Table 5.1, on DR and FAR for IDS. The focus is not to analyze the contribution of each of the 41 attributes individually for feature selection, but to study their cumulative effect as per the four labels; the results of this study can nevertheless be used to improve feature selection at a later stage. The goal of any efficient IDS is to achieve maximum DR with minimum FAR [47].
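The decision logic of Table 5.2 can be encoded in a few lines; the helper below is an illustrative sketch of those rules, not part of the original workflow:

```python
def feature_significance(d_acc: str, d_fp: str, d_fn: str) -> str:
    """Apply the rules of Table 5.2.

    Each argument describes how Accuracy, FP and FN change when the
    feature under study is removed: 'increases', 'decreases' or 'constant'.
    Only one pattern marks a feature insignificant; every other case,
    including the wildcard (X X X) row, marks it important.
    """
    if (d_acc, d_fp, d_fn) == ("increases", "decreases", "decreases"):
        return "insignificant"
    return "important"
```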
Further, the objective of this study is also to validate the contributions of the above-mentioned labels reported for IDS in previous studies [43], [72].

This is done in two steps: first by ranking the individual attributes of the KDD Cup data set and converting the results as per the four labels, and second by comparing the previously observed label contributions with the ranker results and with already available feature selection results. This chapter aims to deduce which of the four labeled attribute categories contribute significantly to achieving high DR and low FAR. The conclusions drawn from this empirical study can help overcome the limitations of the training data, which in the case of anomaly detection tends to over-protect the network from intrusions, thereby increasing the FAR. Hence the audit data used in anomaly detection to detect novel attacks can be enhanced so that the FAR becomes negligible. In misuse detection, also known as signature-based IDS, performance rests mainly on the known signatures of attacks. These signatures are obtained from the data set used for detecting intrusions, generally derived from online data exchange over a period of time and covering different types of possible intrusion attacks. The quality of this data is therefore of utmost importance: the chances of detection are high only if the data set referenced by the IDS encompasses most of the attacks. Hence the attributes of the data referenced by the IDS for detecting attacks should be critically selected to ensure maximum coverage of attacks. Duplicated and unnecessary attributes also need to be identified and eliminated, because this elimination lowers the complexity of the data set and hence the time consumed in detecting an attack. The contribution of the various attributes of the referenced data set towards detecting attacks therefore needs to be estimated.
Studying the contribution of each attribute to intrusion detection leads to ranking the attributes in order of their usability for detecting intrusions effectively. The ranking can help eliminate the attributes least important for IDS; this exclusion reduces the dimensionality of the data set, thereby adding efficiency to the IDS.

5.2.1 DESIGN

Fig. 5.1 shows the design of the proposed work. A systematic approach is used to build the fifteen possible configurations of the KDD Cup data set based on the four labels given to the attributes.

Figure 5.1: Architectural design of proposed work (list the fifteen label configurations of the data; prepare the training and test data files; training and simulation; confusion matrix; result analysis and tabulation; repeated for the fifteen data files).

Sr. No.   Attribute Class Combination   # Attributes   B   C   T   H
1         BCTH                          41
2         BCT                           31                         X
3         BCH                           32                     X
4         BTH                           28                 X
5         CTH                           32             X
6         BC                            22                     X   X
7         BT                            18                 X       X
8         BH                            19                 X   X
9         CT                            22             X           X
10        CH                            23             X       X
11        TH                            19             X   X
12        B                             9                  X   X   X
13        C                             13             X       X   X
14        T                             9              X   X       X
15        H                             10             X   X   X

Table 5.3: Combinations of attributes with maximum four labels for KDD Cup data set (X marks an excluded label).

The total number of attribute labels is four (N = 4), hence sixteen different combinations are possible (2^N). The NULL combination, comprising no label and zero attributes, is excluded, leaving fifteen combinations to form the different configurations of the data set (2^N - 1) [43]. The data set, which includes the training as well as the test files, is preprocessed individually to develop the fifteen configurations as per Table 5.3. Out of the total 41 attributes (excluding the class attribute), those not required for a selected configuration are removed from the training and test data files. The last attribute, Class, which remains present in all fifteen configurations, describes whether an instance is a normal record or an anomalous one.

5.3 EXPERIMENTAL SETUP

The experimental setup covers the data set employed for the study, the tool used for simulation and the research methodology applied to conduct the tests and generate the results. Weka 3.7.11 [64], [67] is used for preprocessing and simulation of the KDD Cup data set on the chosen classification algorithms. The KDD Cup training and test files are preprocessed in Weka, and the fifteen data set configurations are simulated on three classification algorithms: Random Forest, OneR and Naïve Bayes. This study considers the binary classification data set whose details are listed in Table 5.4.

Data File                              Normal Class Instances   Anomalous Class Instances   Total
KDDTrain+_20Percent (Training Data)    13449                    11743                       25192
KDDTest+ (Test Data)                   9711                     12833                       22544

Table 5.4: Data instances of NSL-KDD Cup data set.

For the validation study, the 41 attributes are under scrutiny, with the last attribute, class, giving the classification result. The data set used is KDDTrain+_20Percent with 25192 instances. The Ranker algorithm [64] is used to rank the 41 attributes of the KDD Cup data set; it assesses each attribute individually, with Ranker as the search technique and InfoGainAttributeEval as the attribute evaluator.
The higher the information gain, the better the attribute's ability to discriminate between classes.
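The fifteen configurations of Table 5.3 can be enumerated mechanically from the four labels and their attribute counts. The sketch below (names illustrative; not part of the Weka workflow) reproduces the combinations and their attribute totals:

```python
from itertools import combinations

# Attribute counts per label, from Table 5.1 / Table 5.3.
LABEL_SIZES = {"B": 9, "C": 13, "T": 9, "H": 10}

def label_configurations():
    """Enumerate the 2^4 - 1 non-empty label combinations, largest first,
    together with the number of attributes each configuration retains."""
    labels = list(LABEL_SIZES)  # ["B", "C", "T", "H"]
    configs = []
    for r in range(len(labels), 0, -1):
        for combo in combinations(labels, r):
            configs.append(("".join(combo),
                            sum(LABEL_SIZES[l] for l in combo)))
    return configs
```

With these counts, the enumeration yields exactly the fifteen rows of Table 5.3, from ("BCTH", 41) down to ("H", 10).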

5.3.1 EVALUATION METRICS

Evaluation metrics help assess the performance of an IDS. The metrics most used to measure IDS efficiency are accuracy, DR, FAR, precision and F-score. All of these are derived from the four basic result elements of any classification algorithm, presented in the form of a confusion matrix that sets the actual instance classes against the predicted classification results. A good IDS tries to achieve the maximum possible accuracy, F-score and DR with minimum FAR.

5.4 SIMULATION RESULTS

5.4.1 IMPLEMENTATION

The implementation of the design presented in the last section is shown with the help of Weka tool snapshots. It is the same as presented in the last chapter, except that instead of the Random Tree algorithm, the algorithms used are Naïve Bayes, Random Forest and OneR. Fig. 5.2 depicts the preprocessing of the data set, Fig. 5.3 shows the list of training files and Fig. 5.4 shows the list of test data files.

Figure 5.2: Preprocessing of data set.

Figure 5.3: List of training files.

Figure 5.4: List of test files.

Fifteen training and fifteen test files are again considered for implementation. The purpose is to test the results on another set of algorithms and to further validate them against the existing studies. The number of instances in each of the fifteen training and test data files is the same.

5.4.2 OBSERVATIONS

The simulation results from the confusion matrices of the fifteen data set configurations are shown in Table 5.5 for Naïve Bayes, Table 5.6 for Random Forest and Table 5.7 for OneR. The results comprise the TP, TN, FP and FN values for each of the fifteen combinations with respect to the three selected classifiers.

Sr. No.   Attribute Class Combination   TN     FN     FP     TP
1         BCTH                          9010   4582   701    8251
2         BCT                           9005   4354   706    8479
3         BCH                           9357   4861   354    7972
4         BTH                           9021   4623   690    8210
5         CTH                           8997   4546   714    8287
6         BC                            9563   6250   148    6583
7         BT                            9020   4502   691    8331
8         BH                            9375   4929   336    7904
9         CT                            8915   4386   796    8447
10        CH                            9325   4683   386    8150
11        TH                            9008   4574   703    8259
12        B                             9625   9365   86     3468
13        C                             7350   2619   2361   10214
14        T                             8962   4411   749    8422
15        H                             9340   4738   371    8095

Table 5.5: Result set for Naïve Bayes algorithm.

The summary of results for DR is presented in Table 5.8 and for FAR in Table 5.9. The key metrics used in the study are DR and FAR. The classification results in the form of DR and FAR for all fifteen attribute class combinations are presented for the Random Forest, OneR and Naïve Bayes classifiers; these results are then used to compute the evaluation metrics, thereby assessing and comparing the performance of the IDS.

Sr. No.   Attribute Class Combination   TN     FN     FP     TP
1         BCTH                          9446   4067   265    8766
2         BCT                           8975   3616   736    9217
3         BCH                           9218   4039   493    8794
4         BTH                           9168   4297   543    8536
5         CTH                           9043   5267   668    7566
6         BC                            8859   3172   852    9661
7         BT                            8890   3588   821    9245
8         BH                            9434   4079   277    8754
9         CT                            8973   5125   738    7708
10        CH                            9001   5569   710    7264
11        TH                            9026   5342   685    7491
12        B                             8848   2608   863    10225
13        C                             7349   2743   2362   10090
14        T                             9081   5627   630    7206
15        H                             8982   5475   729    7358

Table 5.6: Result set for Random Forest algorithm.

Sr. No.   Attribute Class Combination   TN     FN     FP     TP
1         BCTH                          9300   3652   411    9181
2         BCT                           9300   3652   411    9181
3         BCH                           9300   3652   411    9181
4         BTH                           9300   3652   411    9181
5         CTH                           9544   7187   167    5646
6         BC                            9300   3652   411    9181
7         BT                            9300   3652   411    9181
8         BH                            9300   3652   411    9181
9         CT                            9544   7187   167    5646
10        CH                            9083   5878   628    6955
11        TH                            9544   7187   167    5646
12        B                             9300   3652   411    9181
13        C                             7350   2619   2361   10214
14        T                             9544   7187   167    5646
15        H                             9083   628    5878   6955

Table 5.7: Result set for OneR algorithm.

Sr. No.   Attribute Class Combination   Detection Rate (%)
                                        Random Forest   OneR    Naïve Bayes
1         BCTH                          68.31           71.54   64.30
2         BCT                           71.82           71.54   66.07
3         BCH                           68.53           71.54   62.12
4         BTH                           66.52           71.54   63.98
5         CTH                           58.96           44.00   64.58
6         BC                            75.28           71.54   51.30
7         BT                            72.04           71.54   64.92
8         BH                            68.21           71.54   61.59
9         CT                            60.06           44.00   65.82
10        CH                            56.60           54.20   63.51
11        TH                            58.37           44.00   64.36
12        B                             79.68           71.54   27.02
13        C                             78.63           79.59   79.59
14        T                             56.15           44.00   65.63
15        H                             57.34           91.72   63.08

Table 5.8: Detection rate for Random Forest, OneR and Naïve Bayes algorithm.

Sr. No.   Attribute Class Combination   False Alarm Rate (%)
                                        Random Forest   OneR    Naïve Bayes
1         BCTH                          2.73            4.23    7.22
2         BCT                           7.58            4.23    7.27
3         BCH                           5.08            4.23    3.65
4         BTH                           5.59            4.23    7.11
5         CTH                           6.88            1.72    7.35
6         BC                            8.77            4.23    1.52
7         BT                            8.45            4.23    7.12
8         BH                            2.85            4.23    3.46
9         CT                            7.60            1.72    8.20
10        CH                            7.31            6.47    3.97
11        TH                            7.05            1.72    7.24
12        B                             8.89            4.23    0.89
13        C                             24.32           24.31   24.31
14        T                             6.49            1.72    7.71
15        H                             7.51            39.29   3.82

Table 5.9: False alarm rate for Random Forest, OneR and Naïve Bayes algorithm.
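The DR and FAR figures in Tables 5.8 and 5.9 follow directly from the confusion-matrix entries of Tables 5.5 to 5.7. A minimal sketch of the two metrics, using the standard definitions with the anomalous class treated as positive:

```python
def detection_rate(tp: int, fn: int) -> float:
    # DR (recall on the attack class): fraction of anomalous instances detected.
    return 100.0 * tp / (tp + fn)

def false_alarm_rate(fp: int, tn: int) -> float:
    # FAR: fraction of normal instances wrongly flagged as anomalous.
    return 100.0 * fp / (fp + tn)

# BCTH row for Naïve Bayes (Table 5.5): TN=9010, FN=4582, FP=701, TP=8251.
dr = detection_rate(8251, 4582)     # ≈ 64.30, matching Table 5.8
far = false_alarm_rate(701, 9010)   # ≈ 7.22, matching Table 5.9
```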

Attribute Rank   Ranked Attribute (Highest on top)   Average Merit   Attribute Class
1                src_bytes                           0.803           B
2                service                             0.672           B
3                dst_bytes                           0.632           B
4                flag                                0.519           B
5                diff_srv_rate                       0.516           T
6                same_srv_rate                       0.507           T
7                dst_host_srv_count                  0.473           H
8                dst_host_same_srv_rate              0.439           H
9                dst_host_diff_srv_rate              0.413           H
10               dst_host_serror_rate                0.404           H
11               logged_in                           0.402           C
12               dst_host_srv_serror_rate            0.396           H
13               serror_rate                         0.39            T
14               count                               0.382           T
15               srv_serror_rate                     0.377           T
16               dst_host_srv_diff_host_rate         0.269           H
17               dst_host_count                      0.195           H
18               dst_host_same_src_port_rate         0.192           H
19               srv_diff_host_rate                  0.144           T
20               srv_count                           0.094           T
21               dst_host_srv_rerror_rate            0.088           H

Table 5.10: Simulation results on ranker algorithm (ranking 1-21).

Attribute Rank   Ranked Attribute (Highest on top)   Average Merit   Attribute Class
22               protocol_type                       0.064           B
23               rerror_rate                         0.057           T
24               dst_host_rerror_rate                0.054           H
25               srv_rerror_rate                     0.052           T
26               duration                            0.033           B
27               hot                                 0.011           C
28               wrong_fragment                      0.01            B
29               num_compromised                     0.006           C
30               num_root                            0.004           C
31               num_access_files                    0.002           C
32               is_guest_login                      0.001           C
33               num_file_creations                  0.001           C
34               su_attempted                        0.001           C
35               root_shell                          0               C
36               land                                0               B
37               num_shells                          0               C
38               num_failed_logins                   0               C
39               urgent                              0               B
40               num_outbound_cmds                   0               C
41               is_host_login                       0               C

Table 5.11: Simulation results on ranker algorithm (ranking 22-41).
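InfoGainAttributeEval scores each attribute by its information gain with respect to the class. A minimal, self-contained sketch of the criterion for an already discretised attribute (Weka additionally discretises numeric attributes before scoring):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attribute_values, class_labels):
    """Information gain of a discretised attribute w.r.t. the class:
    class entropy minus the value-weighted entropy of the class within
    each attribute-value subset."""
    n = len(class_labels)
    by_value = {}
    for v, c in zip(attribute_values, class_labels):
        by_value.setdefault(v, []).append(c)
    remainder = sum(len(s) / n * entropy(s) for s in by_value.values())
    return entropy(class_labels) - remainder

# A perfectly discriminating attribute gains the full class entropy
# (1 bit for a balanced binary class); an irrelevant one gains 0.
g1 = info_gain(["tcp", "tcp", "udp", "udp"],
               ["normal", "normal", "anomaly", "anomaly"])
g0 = info_gain(["a", "b", "a", "b"],
               ["normal", "normal", "anomaly", "anomaly"])
```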

In this chapter, the results are analyzed with respect to DR and FAR individually, emphasizing the one-, two- and three-label attribute class combinations for all three classification algorithms under study. Conclusions are drawn only for those highlighted points on the plots where each of the three classification algorithms shows significant conduct; this emphasizes the dominant behavior of each label class and thus of its associated attributes. Algorithms from different classes of machine learning are deliberately considered to ensure that the results are unbiased and consistent. The results of the ranker algorithm simulated on the NSL-KDD data set attributes are listed in Table 5.10 and Table 5.11. The observations include the average merit of each attribute, its observed ranking and the label of each ranked attribute. Since the focus of this validation part of the study is an analysis according to the four labels rather than the individual attributes, the per-attribute results are grouped under the four labels.

5.4.3 DISCUSSION

Fig. 5.5 to Fig. 5.7 present the plots for the three classification algorithms with respect to one-, two- and three-label attribute combinations respectively.

Figure 5.5: Detection rate distribution considering single attribute class for Random Forest, OneR and Naïve Bayes.

Considering the analysis of DR, Fig. 5.5 depicts the plot of DR with respect to single classes of attributes. The green arrow highlights the high DR of the content labeled attributes for all three classifiers, and the red arrow highlights the low DR of the traffic labeled attributes. It can therefore be concluded from Fig. 5.5 that the content class attributes contribute significantly towards achieving high DR, whereas the traffic class attributes deteriorate it.

Figure 5.6: Detection rate distribution considering two attribute classes for Random Forest, OneR and Naïve Bayes.

Figure 5.7: Detection rate distribution considering three attribute classes for Random Forest, OneR and Naïve Bayes.

In Fig. 5.6, the red arrow highlights the combination of basic and host (BH) labeled attributes, which shows a lower DR than the basic and traffic (BT) label for all three classifiers. Hence, attributes of the host label perform worse for DR than the traffic labeled attributes. Similarly, Fig. 5.7 plots all three-label classes against the original set of four labeled attributes. The green arrow highlights the BCT label showing a DR nearly equal to that of the BCTH labeled attributes. The red arrow highlights the CTH label, where the basic attributes are absent, and it can be observed that the DR falls significantly for all three classification algorithms.

Considering FAR for the three classifiers, Fig. 5.8 to Fig. 5.10 are plotted with respect to one-, two- and three-class attributes respectively. Fig. 5.8 shows the FAR of the three classifiers with respect to single classes of attributes. The green arrow highlights that the FAR is minimum for the basic attributes for all three classifiers, whereas the red arrow highlights the contribution of the content labeled attributes towards a higher FAR. It can also be observed that the FAR is on the lower side for the traffic labeled attributes compared to the content labeled attributes.

Figure 5.8: False alarm rate distribution considering single attribute class for Random Forest, OneR and Naïve Bayes.

The arrow in Fig. 5.9 highlights the BH labeled attributes presenting a better FAR than the BT class of attributes for two of the three classifiers, while the third classifier, OneR, shows a constant value. Similarly, the arrow in Fig. 5.10 points to the BCT labeled attributes, where the host attributes are absent, showing a comparatively high FAR for the three classifiers; hence it can be concluded that the host attributes contribute positively to trimming down the FAR.

Figure 5.9: False alarm rate distribution considering two attribute classes for Random Forest, OneR and Naïve Bayes.

Figure 5.10: False alarm rate distribution considering three attribute classes for Random Forest, OneR and Naïve Bayes.

In this section, the results are discussed highlighting the key observations. The three classification algorithms under study are intentionally selected from different classes of machine learning algorithms to make sure that the outcome of the simulation is independent of any particular classifier. The analysis shows that the outcome of the BC (22 attributes) configuration is on the moderate side, whereas the BTH and CTH configurations perform poorly in comparison to the BCTH configuration. Another observation of prime concern is that the contribution of the BH (19 attributes) configuration is almost equivalent to that of the BCTH (41 attributes) configuration. Hence, the BH labeled attributes give computationally similar results to the BCTH label at lower cost, since the number of attributes is reduced significantly.

Table 5.12 summarizes the studied behavior of DR and FAR with respect to the class-wise distribution of attributes for the KDD Cup data set. It is observed that the basic label attributes contribute the most to achieving the highest DR, whereas the contribution of the host attributes is the least. Similarly, the contribution of the basic label attributes is highest in achieving minimum FAR, whereas the content label attributes have the least significant role in reducing FAR. The four classes of attributes are therefore ranked for high DR and low FAR separately, with rank 1 depicting maximum dominance.

Ranking of Attribute Labels   High Detection Rate   Low False Alarm Rate
1                             Basic                 Basic
2                             Content               Host
3                             Traffic               Traffic
4                             Host                  Content

Table 5.12: Summary of results.

Hence, Table 5.12 summarizes the class-wise contribution of the 41 attributes in accomplishing high DR and low FAR, which can help recognize the significant attributes with respect to the four labels. The results of this study can further be used for feature selection, indicating that instead of working with all 41 attributes, feature selection can be applied to selective labels as well.
It can be inferred from Table 5.10 and Table 5.11 that the basic labeled attributes need minimum feature reduction whereas traffic class attributes need maximum feature reduction followed by content and host classes. Considering the validation part of the study, Table 5.13 presents the number of attributes of each label contributing in the top 10 ranks, ranks 11 to 20, ranks 21 to 30 80

and ranks 31 to 41. Based on the ranking of attributes and their relevant labels given in Table 5.10 and Table 5.11, Fig. 5.11 is plotted. For example, in ranking 1 to 10, four attributes of the basic label, two attributes of the traffic label and four attributes of the host label are present. Similarly, the distribution of all four labels is presented for every ranking category.

                Basic   Content   Traffic   Host
Ranking 1-10    4       0         2         4
Ranking 11-20   0       1         5         4
Ranking 21-30   3       3         2         2
Ranking 31-41   2       9         0         0

Table 5.13: Distribution of ranked attributes in labels.

Figure 5.11: Ranked attributes under four attribute labels.

Figure 5.11 highlights the contribution of each of the four labels in the attribute ranking, going from the highest to the lowest. It can be observed from the figure that 44% of the basic attributes lie in the top ten ranks, along with 40% of the host attributes and 22% of the traffic attributes. It is interesting to note that no attribute of the content label stands in the top ten ranks. Another remarkable observation is that 69% of the content attributes fall in the ranking category 31-41. All the traffic and host attributes stand within the top 30 ranks, and 77% of the basic attributes lie within the top 30 ranks as well.
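The attribute ranking discussed here is based on an average merit score such as information gain. A self-contained sketch of information gain for a discrete feature, on toy values rather than the actual NSL-KDD attributes:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Information gain of a discrete feature with respect to the class label."""
    n = len(labels)
    # Weighted entropy of the label subsets induced by each feature value
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy records: a hypothetical 'protocol' feature against the class label
protocol = ["tcp", "udp", "tcp", "icmp", "udp", "tcp"]
label    = ["attack", "normal", "attack", "attack", "normal", "attack"]

print(round(info_gain(protocol, label), 3))   # 0.918
print(round(info_gain(["c"] * 6, label), 3))  # 0.0: a constant feature has zero merit
```

An attribute with zero information gain, like the constant feature above, splits the classes no better than chance, which is exactly why zero-merit attributes can be dropped without hurting classification.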

It can be observed that the attributes ranked from 35 to 41 have zero average merit, i.e. zero information gain. In other words, these attributes do not contribute to the classification process. Out of these seven attributes, five belong to the content class; that is, 38% of the content class attributes have zero significance. Similarly, over ranks 28 to 34, the contribution of the attributes is least significant, and out of these seven attributes, six belong to the content class. From this discussion, it can be concluded that, out of the 13 attributes of the content class, 85% stand unimportant. Finally, it is worth mentioning that 80% of the host and 78% of the traffic attributes stand in the top 20 ranks.

Feature Selection Based Study   Reduced Features   Basic   Content   Traffic   Host
Ranker Variants                 33                 8       8         9         8
Multiple Feature Evaluation     11                 5       2         3         1
BestFirst+CFSSubsetEval         11                 7       1         2         1
GeneticSearch+CFSSubsetEval     17                 6       1         4         6
GreedyStepwise+CFSSubsetEval    11                 7       1         2         1

Table 5.14: Feature selection based studies.

These results can be validated against studies on the four attribute labels and feature selection techniques. Table 5.14 lists the results of some of the feature selection based studies for an overall comparison. In the case of ranker variants [26], different feature selection methods are used, such as information gain attribute evaluation, gain ratio attribute evaluation and correlation attribute evaluation with the ranker algorithm. Using this, the features are reduced to 33, whose label detail is provided in the table. In the multiple feature evaluation technique [27], the focus is on removing those attributes that have no role in identifying an attack. This is done by preparing two lists of significant features, one identified by the classifiers and the other central to all attack classes.
A common list of considerable features is identified, on which the Gradually ADD feature and Gradually DELETE feature techniques are applied, resulting in 11 reduced features. The other entries in Table 5.14 present the search methods BestFirst, GeneticSearch and GreedyStepwise with CFSSubsetEval [28]. The results obtained from BestFirst and GreedyStepwise are the same not only in terms of the number of reduced attributes but also in the number of attributes selected from each label.
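The forward-search style shared by techniques such as Gradually ADD and GreedyStepwise can be sketched as follows. This is a generic greedy forward selection, with a hypothetical additive scoring function standing in for the CFS merit or classifier accuracy used in the cited studies:

```python
def greedy_forward_select(features, score, max_features=None):
    """Greedily add the feature that most improves the score; stop at no gain."""
    selected = []
    best = score(selected)
    while features and (max_features is None or len(selected) < max_features):
        candidate, cand_score = None, best
        for f in features:
            s = score(selected + [f])
            if s > cand_score:
                candidate, cand_score = f, s
        if candidate is None:   # no remaining feature improves the score
            break
        selected.append(candidate)
        features = [f for f in features if f != candidate]
        best = cand_score
    return selected

# Hypothetical per-feature merits; the toy score is additive with a size penalty
merit = {"duration": 0.9, "src_bytes": 0.7, "hot": 0.0, "num_failed_logins": 0.1}
score = lambda subset: sum(merit[f] for f in subset) - 0.2 * len(subset)

print(greedy_forward_select(list(merit), score))  # ['duration', 'src_bytes']
```

The size penalty makes the search stop once weak features cost more than they add, mirroring how these methods discard low-merit (largely content-label) attributes.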

Fig. 5.12 is prepared with reference to the results presented in Table 5.14. The figure presents the percentage contribution of each label for each feature selection technique. For example, when variants of the ranker algorithm are used, 89% of the 9 basic attributes, 62% of the 13 content attributes, 100% of the 9 traffic attributes and 80% of the 10 host attributes are finally present in the reduced set of 33 attributes. It can be inferred from this figure that, in all five evaluations, the minimum contribution comes from the content label. In other words, most of the content class attributes are not considered contributing elements; for example, in the case of BestFirst, only 8% of the content attributes are present in the reduced data set, which means that 92% of the attributes of this label are redundant. In four of the cases, the traffic label attributes prevail over the host attributes, the exception being GeneticSearch. Finally, the basic label attributes count above all in the entire set of studies, with a definite inclusion of these attributes ranging from 67% to 89%.

Figure 5.12: Contribution of attribute labels in feature selection.

According to the studies done on attribute labels presented in this chapter and the previous chapter [43, 72] with different classification algorithms, the results are almost the same as those observed in the ranker study and the other feature selection based studies.
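The percentages quoted above are simply the retained count for a label divided by that label's total attribute count. A small sketch reproducing the ranker-variants row of Table 5.14 (label totals in the data set: 9 basic, 13 content, 9 traffic, 10 host):

```python
label_totals = {"basic": 9, "content": 13, "traffic": 9, "host": 10}
# Attributes retained per label by the ranker variants (Table 5.14)
retained = {"basic": 8, "content": 8, "traffic": 9, "host": 8}

for label, total in label_totals.items():
    pct = 100 * retained[label] / total
    print(f"{label:8s} {pct:.0f}%")
# basic 89%, content 62%, traffic 100%, host 80%
```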

The overall comparison of the work, comprising the label based study, the ranker results and the feature selection based studies, is presented in Table 5.15, thereby validating the label based studies. The results for the basic and content labels are exactly the same for each style of label evaluation, whereas there isn't any significant difference in the contribution of the traffic and host labels.

Ranking of Attribute Labels   Ranker Based   Label Based   Feature Selection Based
1                             Basic          Basic         Basic
2                             Host           Traffic       Traffic
3                             Traffic        Host          Host
4                             Content        Content       Content

Table 5.15: Ranking of attribute labels by feature selection and label based studies.

5.5 CONCLUSION

The KDD Cup data set was used to examine the behavior of the DR and FAR metrics for IDS. The 41 attributes of the KDD Cup data set were classified under the Basic, Content, Traffic and Host labels. This thesis explores the contribution of the KDD Cup data set attributes with respect to these four labels in improving detection and FAR. The study was done with the Random Forest, OneR and Naïve Bayes classification algorithms. A significant contribution of the basic class attributes was observed for IDS, along with remarkable observations with respect to the attributes of the other labels. Finally, the four attribute labels were ranked for their dominance in enhancing detection and reducing FAR. Further, the study validates the contribution of the four labels of the KDD data set attributes by ranking the individual attributes. The validation study was appended with research related to feature selection, which focuses on the imperative attributes only. The attribute label evaluation and validation were done by analyzing the ranker results according to the attribute labels rather than individually, thus comparing the attribute label dominance in IDS with the results of various feature selection techniques rearranged into labeled attribute results.
It is concluded that the basic label attributes are the most significant and the content attributes are the least significant. The contribution of the traffic and host attributes is quite close, but a difference is observable when detection and FAR are considered.

This study can help improve the data set by reducing the predisposition of results towards the attributes of a particular label, which results in a high FAR, and hence enhance the data set to attain an efficient IDS for anomaly detection. The results of this study can be used for selective feature selection on particular labeled attributes rather than on all the individual attributes.

5.6 SUMMARY

This chapter analyses the contribution of the four-class-labeled NSL-KDD Cup data set with respect to detection and FAR. The process is implemented for three machine learning algorithms: Naïve Bayes, Random Forest and OneR. The results for all fifteen configurations are compared for each of the three algorithms. Chapter 6 presents the proposed Negative-Positive Ratio (NPR) metric, whose aim is to provide a generalized evaluation of a machine learning algorithm. This metric is designed keeping in view the imbalance in the data sets used for IDS.