CHAPTER 4 DATA PREPROCESSING AND FEATURE SELECTION

In this work, an intelligent approach for building an efficient NIDS has been proposed and implemented. It involves data preprocessing, feature extraction and classification. Techniques of this kind are necessary because it is quite complex to process a huge amount of network traffic data in real time in order to detect intruders and take corrective action. Offline preprocessing of network data and extraction of the most relevant features can therefore be used to detect network attacks efficiently.

Any decision-making system that handles a very large volume of data requires effective data preprocessing. Because the new ensemble soft computing techniques proposed in this research work handle a huge volume of data directly during training, preprocessing is particularly important for this work. Feature selection for an IDS (Mitra 2002, Liu and Yu 2005, Chen et al 2006) helps to select the minimal subset of features essential for intrusion detection. Moreover, feature extraction reduces the dimensionality of each instance and thus the overhead of the detection process. In this research work, GA is used to identify the most relevant features of the dataset, which are then used in the classification of traffic data. The proposed soft computing paradigms are applied to these selected features to capture accurately the difference between intrusions and normal activities, so that intrusions are detected effectively.

4.1 DATA SOURCE

Regardless of the detection paradigm used, it is vital to use relevant and essential features in order to build a NIDS. Network traffic log data is normally not released by organizations because of privacy concerns. As a result, most IDSs lack good-quality data, and obtaining relevant data becomes their first concern. Many existing IDSs overcome this issue by relying on an expert to program the system (Axelsson 1998) and to simulate the necessary data. In such systems it is the role of the expert to extract and refine the relevant features provided to the IDS, so the accuracy of the output is highly dependent on the individual domain expert who provides the data. The major limitations of this approach are that it is costly and that it does not consider novel attacks.

Therefore, this research work uses the benchmark dataset compiled for the 1999 KDD intrusion detection contest by MIT Lincoln Labs (McHugh 2000, Tavallaee et al 2009). The main advantage of using this dataset is that the results obtained can be shared easily with other researchers and developers, whose feedback helps to improve the results of this work. The KDD Cup 99 dataset was selected mainly because it is currently the most widely used comprehensive dataset shared among researchers. In this dataset, 41 attributes (Table 4.1) are used in each record to characterize network traffic behaviour. Among these 41 attributes, 38 are numeric and 3 are symbolic. The features present in the KDD dataset are grouped into three categories, which are discussed below.

Table 4.1 List of Features Available in KDD Cup 99 Dataset

S.No  Feature Name                 Type        Description
1.    Duration                     Continuous  length (number of seconds) of the connection
2.    Protocol_type                Discrete    type of the protocol, e.g. tcp, udp, etc.
3.    Service                      Discrete    network service on the destination, e.g. http, telnet, etc.
4.    Src_bytes                    Continuous  number of data bytes from source to destination
5.    Dst_bytes                    Continuous  number of data bytes from destination to source
6.    Flag                         Discrete    normal or error status of the connection
7.    Land                         Discrete    1 if connection is from/to the same host/port; 0 otherwise
8.    Wrong_fragment               Continuous  number of "wrong" fragments
9.    Urgent                       Continuous  number of urgent packets
10.   Hot                          Continuous  number of "hot" indicators
11.   Num_failed_logins            Continuous  number of failed login attempts
12.   Logged_in                    Discrete    1 if successfully logged in; 0 otherwise
13.   Num_compromised              Continuous  number of "compromised" conditions
14.   Root_shell                   Discrete    1 if root shell is obtained; 0 otherwise
15.   Su_attempted                 Discrete    1 if "su root" command attempted; 0 otherwise
16.   Num_root                     Continuous  number of "root" accesses
17.   Num_file_creations           Continuous  number of file creation operations
18.   Num_shells                   Continuous  number of shell prompts
19.   Num_access_files             Continuous  number of operations on access control files
20.   Num_outbound_cmds            Continuous  number of outbound commands in an ftp session
21.   Is_host_login                Discrete    1 if the login belongs to the "host" list; 0 otherwise
22.   Is_guest_login               Discrete    1 if the login is a "guest" login; 0 otherwise
23.   Count                        Continuous  number of connections to the same host as the current connection in the past two seconds

Table 4.1 (Continued)

S.No  Feature Name                 Type        Description
24.   Serror_rate                  Continuous  % of connections that have "SYN" errors
25.   Rerror_rate                  Continuous  % of connections that have "REJ" errors
26.   Same_srv_rate                Continuous  % of connections to the same service
27.   Diff_srv_rate                Continuous  % of connections to different services
28.   Srv_count                    Continuous  number of connections to the same service as the current connection in the past two seconds
29.   Srv_serror_rate              Continuous  % of connections that have "SYN" errors
30.   Srv_rerror_rate              Continuous  % of connections that have "REJ" errors
31.   Srv_diff_host_rate           Continuous  % of connections to different hosts
32.   Dst_host_count               Continuous  count of connections having the same destination host
33.   Dst_host_srv_count           Continuous  count of connections having the same destination host and using the same service
34.   Dst_host_same_srv_rate       Continuous  % of connections having the same destination host and using the same service
35.   Dst_host_diff_srv_rate       Continuous  % of different services on the current host
36.   Dst_host_same_src_port_rate  Continuous  % of connections to the current host having the same src port
37.   Dst_host_srv_diff_host_rate  Continuous  % of connections to the same service coming from different hosts
38.   Dst_host_serror_rate         Continuous  % of connections to the current host that have an S0 error
39.   Dst_host_srv_serror_rate     Continuous  % of connections to the current host and specified service that have an S0 error
40.   Dst_host_rerror_rate         Continuous  % of connections to the current host that have an RST error
41.   Dst_host_srv_rerror_rate     Continuous  % of connections to the current host and specified service that have an RST error

a. Basic Features: Basic features comprise all the attributes that are extracted from a TCP/IP connection. These features are extracted from the packet header and include src_bytes, dst_bytes, protocol, etc.

b. Content Features: These features are used to evaluate the payload of the original TCP packet and look for suspicious behavior in the payload portion. They include features such as the number of failed login attempts, the number of file creation operations, etc. Unlike DoS and Probing attacks, most R2L and U2R attacks do not have frequent sequential patterns. This is because DoS and Probing attacks involve many connections to some host(s) in a very short duration of time, whereas R2L and U2R attacks are embedded in the data portions of the packets and generally involve only a single connection. Content-based features are therefore used to detect these kinds of attacks.

c. Traffic Features: These are features computed with respect to a window interval, and they are divided into two categories:
i) Same host features: These features are derived by examining only the connections in the past two seconds that have the same destination host as the current connection, and they compute statistics related to protocol behavior, service, etc.
ii) Same service features: These features examine only the connections in the past two seconds that have the same service as the current connection.
The above two types are called time-based traffic features.
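The two time-based counts described above can be illustrated with a minimal sketch. The record layout (`Conn` with `time`, `dst_host` and `service` fields) and the function name are assumptions made for illustration only, and the sketch counts earlier connections without including the current one, which may differ from how the dataset itself was generated.

```python
from collections import namedtuple

# Hypothetical minimal connection record; the field names are assumptions.
Conn = namedtuple("Conn", "time dst_host service")

def time_based_counts(history, current, window=2.0):
    """Return (count, srv_count) for `current` over the past `window` seconds.

    count     : earlier connections to the same destination host
    srv_count : earlier connections to the same service
    """
    recent = [c for c in history if 0 <= current.time - c.time <= window]
    count = sum(1 for c in recent if c.dst_host == current.dst_host)
    srv_count = sum(1 for c in recent if c.service == current.service)
    return count, srv_count

history = [
    Conn(0.0, "10.0.0.5", "http"),    # older than two seconds, ignored
    Conn(1.2, "10.0.0.5", "http"),    # same host, same service
    Conn(1.5, "10.0.0.9", "http"),    # same service only
    Conn(1.9, "10.0.0.5", "telnet"),  # same host only
]
current = Conn(2.5, "10.0.0.5", "http")
print(time_based_counts(history, current))  # (2, 2)
```

The same function could be adapted to a window of the last 100 connections by slicing `history` by position instead of filtering on time.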

Apart from these, there are various slow probing attacks that scan the hosts or ports using a time interval greater than two seconds. As a result, these attacks do not generate intrusion patterns within a time window of two seconds. To overcome this problem, the same host and same service features are normally re-computed using a connection window of 100 connections. These features are called connection-based traffic features. The distribution of records in this dataset is given in Table 4.2.

Table 4.2 Distribution of Data in KDD Cup 99 Dataset

                          Attacks
Data           Normal   Probe    DoS      U2R     R2L
Training Data  19.69%   0.83%    79.24%   0.01%   0.23%
Test Data      19.48%   1.34%    73.9%    0.07%   5.21%

4.2 DATA PREPROCESSING

The need for data preprocessing arises from the fact that redundant data and insignificant features may confuse the classification algorithm, leading to the discovery of inaccurate or ineffective knowledge. Moreover, the processing time increases when all features are used. Preprocessing helps to remove redundant and incomplete data and transforms the data into a uniform format. The preprocessing module of the proposed system performs the following functions:
i. Performs a redundancy check and handles null values
ii. Converts categorical data to numerical data

4.2.1 Redundancy Check

The major limitation of the KDD Cup 99 dataset is the presence of redundant records. Redundant instances cause the learning algorithm to be biased towards frequent records and against infrequent records. The redundancy check showed that more of the redundant records are present in the anomaly classes than in the normal class. The detection accuracy increased when these redundant records were removed. For instance, the learning algorithm is biased against the R2L class, because the percentage of R2L records in the KDD Cup 99 dataset is very small and because of the enormous number of redundant records present in other classes such as DoS.

Table 4.3 Distribution of Records in Training and Test Dataset Before and After Redundancy Removal

          Training Set          Test Set
          Total      Unique     Total               Unique
Class                           Known    Unknown    Known    Unknown
Normal    97781      81814      60593    -          47911    -
DoS       3883370    4597       398      6555       5741     1717
Probe     4110       11656      377      1789       1106     1315
R2L       116        995        5993     10196      199      555
U2R       5          5          39       189        37       163

Total: total number of records in KDD Cup 99. Unique: number of unique records after redundancy removal.

Table 4.3 shows the number of records present in the KDD Cup 99 dataset and the number of records obtained after redundancy removal for both the training and test datasets. There is a large reduction in the number of records for DoS attacks compared with the other classes. Additionally, there are no duplicate records in the training set for U2R attacks.
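A redundancy check of this kind can be sketched as follows. The record layout is hypothetical, and dropping records that contain null values is an assumption, since the text does not state how null values are handled.

```python
def remove_redundant(records):
    """Keep the first occurrence of each record and skip records with nulls.

    Dropping incomplete records is an assumption made for this sketch.
    """
    seen = set()
    unique = []
    for rec in records:
        if any(value is None or value == "" for value in rec):
            continue  # incomplete record
        key = tuple(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical records: (protocol, service, flag, src_bytes, label)
records = [
    ("icmp", "ecr_i", "SF", 1032, "smurf"),
    ("icmp", "ecr_i", "SF", 1032, "smurf"),   # duplicate
    ("tcp", "http", "SF", 215, "normal"),
    ("icmp", "ecr_i", "SF", 1032, "smurf"),   # duplicate
    ("tcp", "http", None, 215, "normal"),     # null flag
]
unique = remove_redundant(records)
print(len(records), "->", len(unique))  # 5 -> 2
```

Using a set of tuples keeps the check linear in the number of records, which matters for a dataset with millions of rows.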

4.2.2 Influence Calculation

The 41 features in the KDD Cup 99 dataset are converted into a standardized numerical representation. Since the dataset contains both categorical and numerical attributes, the categorical attribute values are converted into numerical ones. This conversion calculates an influence value for each value of a categorical attribute using Equation (4.1), so that classification can be carried out effectively on uniform data.

\[ \mathrm{Influence}(I) = \frac{\#\,\mathrm{AttributeAbnormal}}{\#\,\mathrm{Abnormal}} \tag{4.1} \]

where #AttributeAbnormal is the number of abnormal records in which the attribute value is present and #Abnormal is the total number of abnormal records. For example, to find the influence value of the service type HTTP, the number of abnormal records in which HTTP is present is divided by the total number of abnormal records. The influence value is higher for service types that occur more frequently in abnormal records, since attackers use these service types to attack the network more often than other service types. Table 4.4, Table 4.5 and Table 4.6 list sample influence values calculated for protocol, flag and service type, respectively.

Table 4.4 Influence Values Obtained for Protocols Using the Proposed Method

Protocol   Influence Value
UDP        0.0026
TCP        0.1837
ICMP       0.8137
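As an illustration of Equation (4.1), the following sketch computes influence values for one categorical attribute. The dictionary-based record layout, the `label` key and the function name are assumptions made for this example, and the toy data is not taken from the dataset.

```python
from collections import Counter

def influence_values(records, attribute, label_key="label", normal_label="normal"):
    """Influence of each value of `attribute`, following Equation (4.1):
    (# abnormal records with that value) / (# abnormal records)."""
    abnormal = [r for r in records if r[label_key] != normal_label]
    if not abnormal:
        return {}
    counts = Counter(r[attribute] for r in abnormal)
    return {value: n / len(abnormal) for value, n in counts.items()}

# Toy records (hypothetical): four abnormal and one normal
records = [
    {"protocol_type": "icmp", "label": "smurf"},
    {"protocol_type": "icmp", "label": "smurf"},
    {"protocol_type": "icmp", "label": "pod"},
    {"protocol_type": "tcp", "label": "neptune"},
    {"protocol_type": "udp", "label": "normal"},
]
influence = influence_values(records, "protocol_type")
print(influence)  # {'icmp': 0.75, 'tcp': 0.25}

# A value that never appears in an abnormal record has influence 0.
print(influence.get("udp", 0.0))  # 0.0
```

Because every abnormal record carries exactly one value of the attribute, the influence values of a single attribute sum to 1.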

Table 4.5 Sample Influence Values Obtained for Flags Using the Proposed Method

Flag   Influence Value
Rstr   0.0017
S0     0.1706
Rej    0.0073

Table 4.6 Sample Influence Values Obtained for Service Types Using the Proposed Method

Service Type   Influence Value
Ecr_i          0.813
ftp_data       0.007
Eco_i          0.0015
Private        0.169

4.3 GENETIC FEATURE SELECTION

Most real-life problems call for an acceptable, near-optimal solution rather than an exact one computed at the cost of degraded performance and high time and space complexity. It is therefore necessary to carry out the analysis using selected features. The problem of selecting significant features from the KDD Cup 99 dataset for intrusion detection is too complex to be expressed as a formula. Moreover, when all the features are used without feature selection, it takes a very long time to calculate a solution precisely. The feasible approach is therefore to use a heuristic method that performs feature selection effectively. GA (Goldberg 1989) is a heuristic,

which means that it estimates a solution and generates optimized results. Among the various heuristic methods, GA (Stein et al 2005) is more promising since it differs in many ways from other heuristics. First, GA works on a population of possible solutions, while other heuristic methods use a single solution in their iterations. Second, most heuristics are probabilistic or stochastic in nature and hence not deterministic; in a GA, on the other hand, each individual in the population contributes towards a possible solution to the problem. The algorithm starts with a set of possible solutions, represented by chromosomes, called the population. A potential solution to a specific problem is encoded in the form of a chromosome. Solutions from one population are taken and used to form a new population. The solutions selected to form new solutions (offspring) are chosen according to their fitness value: the more suitable they are, the more chances they have to reproduce. Finally, GAs are well suited to reducing the search space, so the algorithm converges faster when GA is employed.

4.3.1 Proposed Feature Selection Technique Using GA

A genetic feature selection algorithm has been used in this work to select a suitable subset of features that are potentially useful in classification. Another advantage of GA-based feature selection in this work is that it finds and eliminates redundant features, if any, since redundant features may mislead clustering or classification. The reduction in the number of features reduces the training time and ambiguity. In this thesis work, a weighted sum genetic feature selection algorithm has therefore been proposed; it has increased global search capability and handles attribute interaction better than other algorithms such as greedy methods.

4.3.2 Proposed Framework for Genetic Feature Selection

The framework for the genetic feature selection proposed and implemented in this research work is given in Figure 4.1.

4.3.2.1 Random Subset Generation

Subset generation (Curry et al 2007) is a method of heuristic search in which each instance in the search space specifies a candidate solution for subset evaluation. The decision process of this method is determined by some basic issues. First, the starting point of the search must be decided, since it controls the direction of the search. Feature selection search starts either with a null set, to which features are added one by one, or with the full set of features, from which features are eliminated one by one. However, these methods have the drawback of being trapped in local optima (Doak 1992). To avoid this, the proposed work employs a random search.

Figure 4.1 Proposed Framework for Genetic Feature Extraction Sub-Module (flowchart elements: Preprocessed KDD Dataset with 41 Features; Random Subset Generation; Evaluation of Subset using Fitness Function; Newly Generated Subset; Unoptimized Feature Set; Genetic Operations; Relevant Features; Optimal Feature Set; Validation of Result)

Next, a search strategy is decided. A dataset with N features has 2^N candidate subsets. This value is very large for moderate and large values of N. In the proposed case, there are 2^41 candidate subsets, which is quite large. There are three types of search strategy: complete, sequential and random. Complete searches, such as branch and bound, are exhaustive. Sequential searches, such as greedy hill climbing, add or remove features one at a time in search of the best feature subset. Random search generates the subset in a completely random manner, i.e., it does not follow any deterministic rule. Compared with the first two approaches, the use of randomness helps to escape local optima in the search space so that an optimal subset can be obtained.

4.3.2.2 Evaluation of Subset

After the subset is generated, it is evaluated using an evaluation criterion. The best or optimal subset of features obtained using one criterion may not be optimal according to another criterion. Depending on whether the evaluation of a subset relies on the classification or clustering algorithm applied at the end, feature subset evaluation criteria can be classified as independent or dependent. Commonly used independent criteria are distance measures, information measures, dependency measures and consistency measures. Under these criteria, a feature that produces a greater difference (as computed by the chosen measure) than other features is preferred. Such a criterion uses the intrinsic characteristics of the dataset without applying any classification or clustering algorithm. A dependent criterion, on the other hand, uses the performance of the classification or clustering algorithm on the selected feature subset to identify essential features. This approach gives superior performance, as it selects features based on the classification or clustering algorithm applied.
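To make the size of the search space concrete, and to show one simple way of drawing a random candidate subset, the following sketch may help. The function name and the two-step sampling scheme (first pick how many features to select, then pick their positions) are assumptions made for illustration.

```python
import random

N_FEATURES = 41

# Number of candidate subsets of 41 features
print(2 ** N_FEATURES)  # 2199023255552

def random_subset(n_features, rng):
    """Return a 0/1 mask with at least one selected feature."""
    k = rng.randint(1, n_features)                  # how many features to select
    chosen = set(rng.sample(range(n_features), k))  # which positions get a 1
    return [1 if i in chosen else 0 for i in range(n_features)]

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
mask = random_subset(N_FEATURES, rng)
selected = [i + 1 for i, bit in enumerate(mask) if bit]  # feature numbers as in Table 4.1
print(len(selected), "features selected")
```

At one evaluation per microsecond, visiting all 2^41 subsets would take roughly 25 days, which is why an exhaustive search is impractical when each evaluation involves training a classifier.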

The approach proposed in this thesis uses a dependent criterion for selecting the significant features used in the detection process. Here, predictive accuracy and feature count are used as the primary measures. Even though the computational complexity of this approach is higher than that of an independent measure, it provides higher detection accuracy. Since feature selection is performed offline, its complexity does not affect the detection process, and hence the time taken is immaterial.

4.3.2.3 Stopping Criteria

A stopping criterion determines when the feature extraction algorithm should stop. The proposed algorithm terminates when either of the following conditions is met:
i. The maximum number of iterations is reached
ii. A good subset is selected, i.e., the difference between the previous fitness and the current fitness is less than the given tolerance value

4.3.2.4 Validation of Results

One direct way of validating results is based on prior knowledge about the data. In real-world applications, however, such prior knowledge is not available. Hence, the proposed approach relies on an indirect method that monitors the change in the performance of the detection algorithm as the features change. Experiments have been conducted with the full set of features and with the selected subset of features to compare the performance of the classifier. These experiments show that the detection accuracy is almost the same in both cases. Therefore, feature selection can be carried out to improve the performance of the system without sacrificing detection accuracy.

4.3.3 Proposed Algorithm for Genetic Feature Selection

Algorithm: Feature set selection using weighted sum GA.
Input: Network traffic pattern (all features), number of generations, population size, crossover probability (Pc), mutation probability (Pm).
Output: Set of selected features.

Genetic_Feature_Selection( )
{
1. Initialize the population randomly, with each chromosome of size 41. Each gene value in the chromosome can be 0 or 1. A bit value of 0 indicates that the corresponding feature is not present in the chromosome and 1 indicates that the feature is present.
2. Initialize the weights W1 = 0.7 and W2 = 0.3, N (total number of records in the training set), Pc and Pm.
3. For each chromosome in the new population
   {
   a. Apply uniform crossover with a probability Pc.
   b. Apply the mutation operator to the chromosome with a probability Pm.
   c. Evaluate fitness = W1 * Accuracy + W2 * (1 / Count of Ones)
   }
4. If (Current_fitness - Previous_fitness < 0.0001) then exit.
5. Select the top 60% of chromosomes into the new population using tournament selection.
6. If the number of generations is not reached, go to step 3.
}
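The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the implementation used in this work: `toy_accuracy` is a stand-in for the validation accuracy of the real classifier, the tournament size of 2, the parameter defaults and the way the remaining 40% of the population is refilled with offspring are all illustrative choices, and fitness is evaluated at the start of each generation, which is one common ordering of the same steps.

```python
import random

W1, W2 = 0.7, 0.3     # weights from step 2
N_FEATURES = 41
TOLERANCE = 0.0001    # threshold from step 4

def toy_accuracy(chromosome):
    """Stand-in for classifier accuracy (assumption): pretends that the
    first nine features are the informative ones."""
    return sum(chromosome[:9]) / 9.0

def fitness(chromosome, accuracy_fn):
    ones = sum(chromosome)
    if ones == 0:                      # an empty feature set is not allowed
        return 0.0
    return W1 * accuracy_fn(chromosome) + W2 * (1.0 / ones)

def uniform_crossover(a, b, rng):
    mask = [rng.randint(0, 1) for _ in a]
    return [x if m else y for x, y, m in zip(a, b, mask)]

def mutate(chromosome, pm, rng):
    return [1 - g if rng.random() < pm else g for g in chromosome]

def tournament(population, scores, rng, size=2):
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: scores[i])]

def genetic_feature_selection(accuracy_fn, generations=50, pop_size=20,
                              pc=0.8, pm=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(N_FEATURES)]
                  for _ in range(pop_size)]
    best, best_score, previous = None, -1.0, None
    for _ in range(generations):
        scores = [fitness(c, accuracy_fn) for c in population]
        top = max(scores)
        if top > best_score:           # remember the best chromosome seen so far
            best, best_score = population[scores.index(top)][:], top
        if previous is not None and abs(top - previous) < TOLERANCE:
            break                      # fitness no longer changes (step 4)
        previous = top
        n_keep = (pop_size * 6) // 10  # 60% survive via tournaments (step 5)
        survivors = [tournament(population, scores, rng)[:] for _ in range(n_keep)]
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            child = uniform_crossover(p1, p2, rng) if rng.random() < pc else p1[:]
            children.append(mutate(child, pm, rng))
        population = survivors + children
    return best, best_score

selected, score = genetic_feature_selection(toy_accuracy)
print([i + 1 for i, bit in enumerate(selected) if bit], round(score, 4))
```

In a real run, `accuracy_fn` would train and validate the chosen classifier on the features marked by the chromosome, which is by far the most expensive part of each generation.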

4.3.4 Experimental Topology

Experiments have been conducted using the KDD Cup 99 dataset for effective feature extraction. This dataset contains different attack types; it contains more instances of attacks than of normal patterns, and the attack types are not represented equally. The attack types present in both the KDD Cup 99 training and test datasets, together with their associated classes, are listed in Table 4.7. In addition to the attack types present in the KDD Cup 99 training set, 17 new attack types are included in the KDD Cup 99 test set. The test set therefore contains both known and unknown types of attack. Attacks that are present in both the training set and the test set are known attacks, whereas attacks that are absent from the training set but present in the test set are unknown attacks. The inclusion of known and unknown types of attack in the KDD Cup 99 test set makes intrusion detection more realistic. In addition, the training and test datasets do not follow the same probability distribution. Different feature sets have been obtained for the three intrusion detection modules proposed in this research work for effective decision making.

Table 4.7 lists the specific known class types and the number of records present in the training and test sets before and after redundancy removal. Table 4.8 lists the unknown attack types and their associated class labels in the test dataset. It also shows the number of records present in the test set after redundancy removal. The presence of repeated records in the test set would bias the validation and test results towards algorithms that have better accuracy on the frequent records.

Table 4.7 Different Known Attack Types Present in KDD Cup 99 Dataset

S.No  Class    Specific Class Types
1     Normal   Normal
2     DoS      Smurf, Neptune, Back, Teardrop, Pod, Land
3     Probe    Satan, Ipsweep, Portsweep, Nmap
4     R2L      Warezclient, Guess_passwd, Warezmaster, Imap, Ftp_write, Multihop, Phf, Spy
5     U2R      Buffer_overflow, Rootkit, Loadmodule, Perl

Columns: Total No. of Samples (Train, Test); Unique Samples (Train, Test)

Normal: 97781 60593 81814 47911
Attack types: 807886 107017 03 979 64 1 1589 1481 10413 316 100 53 0 1 8 7 4 30 10 9 3 164091 58001 1098 1 87 9 1633 306 354 84 0 4367 160 1 3 18 0 646 4114 956 89 01 18 3633 3599 931 1493 890 53 0 11 8 7 4 665 4657 359 1 41 7 735 141 157 73 0 131 944
Total: 4898431 9300 871444 56994 13 30 10 9 3 1 3 18 0 0 13

Table 4.8 Different Unknown Attack Types Present in Test Dataset

S.No  Attack Type     Class   Total No. of Samples   Unique Samples
1     Mailbomb        DoS     5000                   93
      Processtable    DoS     759                    685
      Apache2         DoS     794                    737
      Udpstorm        DoS
2     Mscan           Probe   1053                   996
      Saint           Probe   736                    319
3     Snmpgetattack   R2L     7741                   178
      Snmpguess       R2L     406                    331
      Named           R2L     17                     17
      Xsnoop          R2L     4                      4
      Worm            R2L
      Xlock           R2L     9                      9
      Sendmail        R2L     17                     14
4     HTTPtunnel      U2R     158                    133
      Ps              U2R     16                     15
      Xterm           U2R     13                     13
      Sqlattack       U2R
      Total                   1879                   375

4.3.5 Chromosome Representation

Each network traffic pattern is represented as a vector of 41 features, which are the signatures of the respective network behavior. Every chromosome in the population has 41 genes. Each feature is linked with one bit in the chromosome. If the i-th bit is 1, then the i-th feature is selected and

used in the classification of patterns for intrusion detection; otherwise, that feature is not selected. Each chromosome thus represents a different subset of features. A sample chromosome is shown below.

10011110000000000001111110101011111000111

4.3.6 Initial Population

The initial population is generated randomly. The number of 1s for each individual is generated randomly, to form different subsets of features. Then, the 1s are randomly placed in the chromosome.

4.3.7 Weighted Sum Fitness Evaluation

The aim of weighted sum fitness evaluation is to use fewer features to attain similar or better performance. The fitness of a chromosome is evaluated from the accuracy on the validation dataset and the number of features present in the chromosome. Accuracy is calculated using the formula (TP+TN)/(P+N), where TP and TN are the numbers of records correctly classified in the normal and abnormal classes respectively, and P and N are the total numbers of records in the normal and abnormal classes respectively. Each feature subset contains a list of features. If two subsets attain the same performance with different numbers of features, the subset with fewer features is chosen. Of accuracy and number of features, accuracy is the key concern, so more weight is given to accuracy (W1 = 0.7) than to the number of features to be selected (W2 = 0.3). The fitness function is obtained by combining the above terms:

\[ \mathrm{fitness} = 0.7 \times \mathrm{Accuracy} + 0.3 \times \frac{1}{\text{Count of Ones}} \tag{4.2} \]

where Accuracy is the classification rate that an individual achieves on the validation dataset and Count of Ones is the number of ones in the chromosome. The number of ones ranges from 1 to 41, where 41 is the size of the chromosome. It is ensured that not all of the 41 bits in the chromosome are zero, as at least one feature is required for the classification of normal and anomalous patterns. In general, higher accuracy implies higher fitness. Fitness also increases if fewer features are used, i.e., if fewer 1s are present in a chromosome. Note that, because accuracy carries the larger weight and the term 0.3 × (1 / Count of Ones) changes only slightly between subsets of more than a few features, a chromosome with clearly higher accuracy generally outweighs a chromosome with lower accuracy.

4.3.8 Crossover and Mutation Operators

The crossover operator explores combinations of the current chromosomes, while the mutation operator generates new chromosomes. The types of crossover operator include single-point crossover, two-point crossover and uniform crossover. There are 41 features present in the traffic pattern, and these features may be independent of or dependent on each other. If dependent features are far from each other in the chromosome, single-point crossover may destroy the schemata. To overcome this difficulty, uniform crossover is used. In uniform crossover, bits are copied from the first or from the second parent chromosome depending on the value of a mask. The mask is generated randomly, with length equal to the length of the chromosomes used for crossover, and it determines which bits are copied from one parent and which from the other. Mutation inverts a bit in the population with a probability Pm. The role of the mutation operator is to restore lost genetic material. The parameters Pc and Pm are adjusted to achieve good results in the experiments conducted.
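The fitness trade-off and the two operators described above can be illustrated with a short sketch. The function names, the example accuracy of 0.95 and the 8-bit chromosomes are assumptions chosen to keep the example small; the actual chromosomes have 41 bits.

```python
import random

def weighted_sum_fitness(accuracy, count_of_ones, w1=0.7, w2=0.3):
    """Weighted sum fitness: w1 * Accuracy + w2 * (1 / Count of Ones)."""
    if count_of_ones < 1:
        raise ValueError("at least one feature must be selected")
    return w1 * accuracy + w2 * (1.0 / count_of_ones)

def uniform_crossover(parent1, parent2, mask):
    """Copy bit i from parent1 where mask[i] is 1, otherwise from parent2."""
    return [a if m else b for a, b, m in zip(parent1, parent2, mask)]

def mutate(chromosome, pm, rng):
    """Invert each bit independently with probability pm."""
    return [1 - bit if rng.random() < pm else bit for bit in chromosome]

# Equal accuracy: the subset with fewer features gets the higher fitness.
small = weighted_sum_fitness(0.95, 9)
large = weighted_sum_fitness(0.95, 17)
print(small > large)  # True

parent1 = [1, 1, 1, 1, 0, 0, 0, 0]
parent2 = [0, 0, 0, 0, 1, 1, 1, 1]
mask    = [1, 0, 1, 0, 1, 0, 1, 0]
child = uniform_crossover(parent1, parent2, mask)
print(child)  # [1, 0, 1, 0, 0, 1, 0, 1]

rng = random.Random(7)
print(mutate(child, 0.1, rng))
```

Because the mask decides each position independently, two dependent features that sit far apart in the chromosome are as likely to be inherited together as two adjacent ones, which is the reason given above for preferring uniform crossover.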

4.3.9 Selection Operator

The selection operator selects chromosomes from the population of individuals for the next generation. Selection operators include roulette selection, rank selection, tournament selection and random selection. The proposed work uses tournament selection to select the fittest chromosomes. Tournament selection draws subgroups of chromosomes from the population, and the individuals within each subgroup compete against each other; only one chromosome from each subgroup is chosen for the next generation.

4.3.10 Features Selected

In this work, a wrapper approach has been followed to select a subset of features from the original feature set. The wrapper cascades the weighted sum GA with neurotree, neuro-genetic or genetic-x-means. The predictive accuracy of the classification algorithm (neurotree or neuro-genetic) or of the clustering algorithm (genetic-x-means) is used as the metric for extracting significant features, so the importance of the features selected by GA is evaluated by neurotree, neuro-genetic or genetic-x-means. Different feature subsets are therefore generated for the neurotree, neuro-genetic and genetic-x-means paradigms. The weighted sum GA generates seventeen significant features for the neurotree classification algorithm and nine for the neuro-genetic classification algorithm. Similarly, it generates 13 relevant features from the 41 features when genetic-x-means is used as the clustering algorithm controlling the selection of features. Experiments conducted using these features show that feature selection reduces the training and testing time while producing accuracy similar to that of the full feature set.
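Tournament selection as described in Section 4.3.9 can be sketched as follows. The function name, the subgroup size of 3 and the toy population are assumptions made for illustration; the text does not specify the subgroup size used in this work.

```python
import random

def tournament_selection(population, fitnesses, n_select, rng, group_size=3):
    """Run `n_select` tournaments; each draws `group_size` distinct
    individuals at random and keeps only the fittest of them."""
    selected = []
    for _ in range(n_select):
        group = rng.sample(range(len(population)), group_size)
        winner = max(group, key=lambda i: fitnesses[i])
        selected.append(population[winner])
    return selected

# Toy population of five labelled chromosomes with hypothetical fitness values
population = ["A", "B", "C", "D", "E"]
fitnesses = [0.20, 0.90, 0.50, 0.70, 0.10]

rng = random.Random(3)
n_select = (len(population) * 6) // 10  # keep 60% of the population
survivors = tournament_selection(population, fitnesses, n_select, rng)
print(survivors)
```

With a subgroup size of at least 2, the least fit individual can never win a tournament, while larger subgroups increase the selection pressure towards the fittest chromosome.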