Network Intrusion Detection Using Fast k-nearest Neighbor Classifier


K. Swathi and D. Sree Lakshmi, Asst. Professors, Prasad V. Potluri Siddhartha Institute of Technology, Vijayawada

Abstract: The Fast k-Nearest Neighbor algorithm (FkNN) is used to find intrusion patterns for network intrusion detection, and the experiments are carried out on the KDD Cup 99 dataset. The FkNN algorithm has previously been applied to texture image classification problems. The objective of this paper is to implement this algorithm, together with the traditional kNN classification algorithm, on the KDD Cup 99 dataset. Experimental results for FkNN are presented and its performance is compared with that of the kNN algorithm. The method reduces the computational and time complexity of the k-Nearest Neighbor (kNN) classifier, and the results show that FkNN is also the more accurate of the two methods.

Keywords: k-Nearest Neighbor classifier, Intrusion Detection.

1. Introduction

Data mining is concerned with extracting useful insights from large and detailed collections of data. With the increasing possibilities in modern society for institutions and industries to acquire data cheaply and efficiently, data mining has become increasingly important. This interest has inspired a rapidly maturing research field, with developments on both the theoretical and the practical level and with a range of commercial tools now available. The growth of storage capacity in modern computers allows more and more data to be collected, and analyzing such amounts of data without computational techniques is almost impossible. Methods described as Knowledge Discovery in Databases (KDD) can be used for this purpose, especially its main step, data mining. Several strategies can be used for clustering, classification, or pattern discovery; among these, we have chosen classification for the Intrusion Detection System (IDS). IDSs produce a large amount of intrusion data from various sources over the internet, which can be kept on storage media. The aspect of the problem that intrusion detection addresses is alerting users or networks that they are under attack; in the case of web applications, an attack may not even involve any malware but may instead be based on abusing a protocol. Today's IDSs are still very far from the level of performance that would make such comparisons relevant. Many security and privacy problems cannot be solved optimally because of their complexity. In these situations heuristic approaches should be used, and data mining has proven to be extremely useful and well suited to such problems. Data mining is a vast field that ranges from the rather "primitive" to the very sophisticated.

More recently there has been interest in the more sophisticated approaches, such as knowledge-based approaches to data mining. As a result, most attempts to introduce data mining into intrusion detection have consisted of applying existing tools developed or used in data mining. IDSs are used in many places, both inside and outside security perimeters, and in many different ways. What is quintessential is that the information collected through detection can be turned into powerful intelligence if it is used to strengthen computer security in the areas of intrusion prevention, preemption, deterrence, deflection, and countermeasures. Understandably, a protected system or network is only as secure as its defenses are strong. In the intrusion detection systems we focus on in this paper, pattern matching is a critical ability and must be the strength of the system.

Security in networks is becoming more and more challenging as network usage increases drastically. Apart from encryption and defense mechanisms, the Intrusion Detection System (IDS) plays a vital role in network security. As security audit data is available in large volumes, and as intrusion behaviors have complex and dynamic properties, optimizing the performance of IDSs is receiving attention from the research community. An IDS monitors critical security events and detects attacks and malicious users. Whenever suspicious activity is identified by the IDS, it calls for investigation, and a thorough analysis is made of the details of that activity. An IDS is generally used as a secondary line of defense, as it cannot be relied upon completely. Misuse detection and anomaly detection are the main categories of IDS. Misuse detection aims to detect known attacks by characterizing the rules that govern them. Anomaly detection attempts to find data patterns that do not conform to expected behavior; these deviations or non-conforming patterns are the anomalies.

kNN Classification Approach

k-nearest neighbor (kNN) classification finds a group of k patterns in the training set that are closest to the test pattern and bases the assignment of a label on the predominance of a particular class in this neighborhood. This addresses the fact that, in many data sets, one pattern is unlikely to match another exactly, as well as the fact that the patterns closest to a test pattern may provide conflicting information about its class. The key elements of this model are:
(i) the set of labeled patterns used to evaluate a test pattern's class,
(ii) a distance or similarity metric used to compute the closeness of patterns,
(iii) the value of k, the number of nearest neighbors, and

(iv) the method used to determine the class of the target pattern based on the classes and distances of the k nearest neighbors.

In its simplest form, kNN assigns to a pattern the class of its nearest neighbor or the majority class of its k nearest neighbors. kNN is a special case of instance-based learning and is also an example of a lazy learning technique, that is, a technique that waits until a query arrives before generalizing beyond the training data. Although kNN classification is easy to understand and implement, it performs well in many situations. Because of its simplicity, kNN is also easy to adapt to more complicated classification problems; for instance, it is particularly well suited to multimodal classes and to applications in which an object can carry several class labels. The storage complexity of this algorithm is O(n), where n is the number of training patterns. The time complexity is also O(n), since the distance must be computed between the target and each training object. kNN therefore differs from most other classification techniques, which have moderately to quite expensive model-building stages but very inexpensive, constant-time classification steps.

There are several other techniques for classification, such as Bayesian classification, decision tree induction, and neural networks. The kNN classifier has been widely used in classification problems [4]. It is based on a distance function that measures the difference or similarity between two instances. The standard Euclidean distance d(x, y) between two instances x and y is defined as

d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2},

where x_i is the i-th feature of instance x and m is the total number of features in the data set. When all the attributes are nominal, the distance can be measured as

d(x, y) = \sum_{i=1}^{m} \delta(x_i, y_i), where \delta(x_i, y_i) = 0 if x_i = y_i and \delta(x_i, y_i) = 1 if x_i \neq y_i.

The portion of the DARPA data that contains only network data is termed the KDD Cup 99 dataset. It contains seven weeks of training data and two weeks of test data. The KDD dataset is widely used as a benchmark for offline network traffic analysis and helps researchers test and implement their algorithms [7]. The 10% subset of the KDD Cup 99 dataset was chosen for the training and test datasets. Each KDD Cup 99 record contains 41 features. Because class labels are provided, this data set is widely used for classification algorithms. Each sample is labeled as either normal or attack, and Denial of Service (DoS), Probe, U2R, and R2L are the available attack categories. Even though the kNN classifier is one of the efficient data mining techniques that gives very good accuracy, it suffers from a severe problem when large datasets such as KDD Cup 99 are used.
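As a concrete illustration of these two distance measures, the following minimal Java sketch (our own class and method names, not the authors' implementation) computes the Euclidean distance for numeric features and the mismatch count for nominal features.

```java
// Minimal sketch of the two distance measures defined above (illustrative only;
// class and method names are our own, not the authors' implementation).
public final class Distances {

    // Standard Euclidean distance between two numeric feature vectors x and y.
    public static double euclidean(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Distance for nominal attributes: delta(x_i, y_i) is 0 when the values
    // match and 1 when they differ, summed over all features.
    public static int nominal(String[] x, String[] y) {
        int mismatches = 0;
        for (int i = 0; i < x.length; i++) {
            if (!x[i].equals(y[i])) {
                mismatches++;
            }
        }
        return mismatches;
    }
}
```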

Computing the distance from a given query to every available training sample, and then storing and sorting all of those distances, kills performance. Our approach therefore implements both the kNN classifier and the Fast kNN classifier algorithm to predict attacks on the KDD Cup 99 dataset, and we perform a comparative study of the two classifiers in terms of performance. The remainder of the paper is organized as follows: Section 2 reviews related work on IDS. Section 3 describes the methodology, including the kNN and Fast kNN classification algorithms. Section 4 presents the results and the discussion of those results. Section 5 gives the conclusions and remarks.

2. Related Work

Intrusion detection can be thought of as a classification problem: we wish to classify each audit record into one of a discrete set of possible categories, normal or a particular kind of intrusion [9]. There are many variations of kNN classifiers that reduce the running time or increase the accuracy. Because KDD Cup 99 is a large data set, identifying intrusions with the kNN classifier is time consuming. One variation of the kNN classifier, known as the Fast kNN classifier, reduces both the processing time and the storage space. The kNN classifier described above classifies a new data object by finding its k nearest neighbors with respect to a suitable distance function. Although the kNN classifier solves the classification problem quite quickly, its performance and storage behavior become poor when large datasets are used [6]. Many researchers have focused on improving the performance of the kNN method and have proposed alternatives such as Fast kNN and modified kNN classifiers [5]. Other work on kNN, such as the weighted kNN classifier, the class-based kNN classifier, and the variable kNN classifier, has concentrated on increasing the accuracy of the algorithm [4].

3. Methodology

Some of the features of the KDD data set are categorical and must be converted into numeric form. In the data preprocessing stage we therefore converted the KDD data set into numeric form.

Simple kNN

A k-nearest neighbor (kNN) classifier does not build a classifier in advance, which is what makes it suitable for data streams. When a new sample arrives, kNN finds the k neighbors nearest to the new sample in the training space based on some suitable similarity or distance metric. The plurality class among those nearest neighbors is taken as the class label of the new sample.
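A brute-force version of this simple kNN classifier can be sketched in Java as follows (our own naming and structure, not the authors' code); it computes and sorts all n distances for every query, which is exactly the cost that the Fast kNN classifier described next avoids.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Brute-force sketch of the simple kNN classifier described above (our own
// naming, not the authors' code): compute the distance from the query to every
// training sample, sort all of them, and take the plurality class of the k nearest.
public final class SimpleKnn {

    public static int classify(double[][] train, int[] labels, double[] query, int k) {
        int n = train.length;
        Integer[] order = new Integer[n];
        double[] dist = new double[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
            dist[i] = euclidean(train[i], query);
        }
        // Storing and sorting all n distances is the expensive part for KDD-sized data.
        Arrays.sort(order, (a, b) -> Double.compare(dist[a], dist[b]));

        // Plurality vote among the k nearest neighbors.
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, n); i++) {
            votes.merge(labels[order[i]], 1, Integer::sum);
        }
        int bestClass = -1, bestVotes = -1;
        for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) {
                bestClass = e.getKey();
                bestVotes = e.getValue();
            }
        }
        return bestClass;
    }

    private static double euclidean(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```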

Fast kNN

The Fast kNN classifier works just like the kNN classifier, except that it does not store all of the distances that are calculated; it stores and sorts only the first k distances. From the (k+1)-th instance onward, each new distance is compared with the k-th stored distance: if the k-th distance is greater, it is replaced by the new distance and the stored distances are re-sorted; otherwise the new distance is discarded. In this approach only k distances are stored at any time, which reduces the storage space, and sorting is performed on only k elements. The algorithm is widely used, especially for larger data sets.

Fast kNN Classifier Algorithm

To find the k closest vectors, the S training vectors are first sorted according to their approximate coefficients C_yi; the query x has the approximate coefficient C_x.

Step 1: Set p = argmin_i |C_x - C_yi|. Initialize the k current closest distances as d_i = d(x, y_ji), i = 1, 2, ..., k, with j_i = S - k + i. Let Loc1 = j_1 and Loc2 = j_k. Reorder the k distances so that d_1 <= d_2 <= ... <= d_k. Set l = 1 and const (the scaling constant of the fast search in [5]), and let d_min = d_k, C_max = C_x + const * d_min, C_min = C_x - const * d_min.

Step 2: If Loc2 + l > S, or the vectors Y_(Loc2+l) to Y_S have been deleted, go to Step 3. Otherwise check Y_(Loc2+l).

Step 2.1: Compare C_y(Loc2+l) with C_max. If C_y(Loc2+l) >= C_max, then Y_(Loc2+l) to Y_S can be deleted; go to Step 3. Otherwise, go to Step 2.2.

Step 2.2: Perform the WKPDS algorithm. If d(x, Y_(Loc2+l)) < d_min, update and reorder the k distances as in [4], and set d_min = d_k, C_max = C_x + const * d_min, C_min = C_x - const * d_min. Go to Step 3.

Step 3: If Loc1 - l < 1, or the vectors Y_(Loc1-l) to Y_1 have been deleted, go to Step 4. Otherwise check Y_(Loc1-l).

Step 3.1: Compare C_y(Loc1-l) with C_min. If C_y(Loc1-l) <= C_min, then Y_(Loc1-l) to Y_1 can be deleted; go to Step 4. Otherwise, go to Step 3.2.

Step 3.2: Perform the WKPDS algorithm. If d(x, Y_(Loc1-l)) < d_min, update and reorder the k distances as in Step 2.2, and set d_min = d_k, C_max = C_x + const * d_min, C_min = C_x - const * d_min. Go to Step 4.

Step 4: Set l = l + 1. If Loc2 + l > S and Loc1 - l < 1, or all candidate vectors have been deleted, terminate the algorithm. Otherwise go to Step 2.

At termination we have the k nearest neighbors of the test sample, and its class label is the class c with the highest count among those k nearest samples, i.e.

c^* = \arg\max_c \sum_{i=1}^{k} I(\text{class}(y_i) = c),

where I(.) is 1 when its argument is true and 0 otherwise.
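The bookkeeping that the description above and Steps 2.2 and 3.2 rely on, keeping only the k best distances seen so far instead of all n of them, can be sketched in Java as follows. This is our own illustration of that one idea; the coefficient-based pruning (C_max, C_min) and the WKPDS partial distance search of the full algorithm are not reproduced here.

```java
import java.util.Arrays;

// Sketch of the "store only k distances" bookkeeping used by Fast kNN (our own
// naming; the coefficient pruning and WKPDS partial distance search used in the
// full algorithm are omitted).
public final class FastKnnNeighborList {

    private final double[] dist;   // the k best distances seen so far, sorted ascending
    private final int[] label;     // class labels of the corresponding training samples
    private int size = 0;

    public FastKnnNeighborList(int k) {
        dist = new double[k];
        label = new int[k];
    }

    // Offer one candidate neighbor. Only k distances are ever stored, so the
    // insertion below touches at most k elements instead of sorting all n.
    public void offer(double d, int classLabel) {
        if (size < dist.length) {
            insert(d, classLabel, size++);
        } else if (d < dist[size - 1]) {
            insert(d, classLabel, size - 1);   // replace the current k-th distance
        }                                       // otherwise the candidate is discarded
    }

    // Shift larger entries to the right and place the new distance in sorted position.
    private void insert(double d, int classLabel, int pos) {
        while (pos > 0 && dist[pos - 1] > d) {
            dist[pos] = dist[pos - 1];
            label[pos] = label[pos - 1];
            pos--;
        }
        dist[pos] = d;
        label[pos] = classLabel;
    }

    // Current k-th (largest stored) distance, used as the rejection threshold.
    public double kthDistance() {
        return size < dist.length ? Double.POSITIVE_INFINITY : dist[size - 1];
    }

    public int[] labels() {
        return Arrays.copyOf(label, size);
    }
}
```

A query is then classified by offering each surviving candidate to this list and taking the plurality class of labels(), so memory stays at O(k) and each candidate costs at most O(k) work instead of contributing to a sort over all n distances.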

4. Results & Discussion

We have implemented the algorithms in Java. The KDD Cup 99 data set is separated into a training data set and a test data set, selected as shown in Figure 1.

Figure 1: Selecting the training and test data sets.

After the data sets are selected, data preprocessing is performed, in which all categorical data is converted into numeric form; for example, the class label "normal" is converted into 22 and the protocol type "tcp" into 3. The following two data samples illustrate this: Figure 2 shows the original KDD records and Figure 3 shows the same records after the categorical values have been transformed into numeric values.

0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.

Figure 2: The original data of KDD Cup 99.

0,3,19,10,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0,0,0,0,1,0,0,9,9,1,0,0.11,0,0,0,0,22
0,3,19,10,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0,0,0,0,1,0,0,19,19,1,0,0.05,0,0,0,0,22

Figure 3: The same records after data transformation.

After the transformation process, the classification algorithm is applied against the training data set for each test sample in order to predict the class of every test record. Given a value of k, applying the Fast kNN classification algorithm yields the actual and predicted class labels for each test sample, as shown in Figure 4.

Figure 4: Results after providing the k value.
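The preprocessing step illustrated by Figures 2 and 3 can be sketched in Java as follows. This is our own illustration: the paper shows the codes produced for one record (tcp becomes 3, http becomes 19, SF becomes 10, and the label normal becomes 22) but does not state how the full code tables were built, so this sketch simply numbers the values of each categorical column in order of first appearance.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the preprocessing step: every categorical KDD field is replaced,
// column by column, with an integer code. How the authors assigned their codes
// is not stated in the paper, so this sketch assigns them in order of first
// appearance, which is an assumption.
public final class KddPreprocessor {

    // One value-to-code dictionary per categorical column index.
    private final Map<Integer, Map<String, Integer>> codes = new HashMap<>();

    // Convert one comma-separated KDD record into an all-numeric record.
    public double[] transform(String line) {
        String[] fields = line.split(",");
        double[] out = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            try {
                out[i] = Double.parseDouble(fields[i]);   // already numeric
            } catch (NumberFormatException notNumeric) {
                Map<String, Integer> col = codes.computeIfAbsent(i, c -> new HashMap<>());
                Integer code = col.get(fields[i]);
                if (code == null) {
                    code = col.size() + 1;                // next unused code for this column
                    col.put(fields[i], code);
                }
                out[i] = code;
            }
        }
        return out;
    }
}
```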

We provided the algorithm with four different sizes of training data sets. The test data set contains 35 instances covering normal traffic and the various attack types, and the value of k was set to 20. The experiment yields an accuracy of 91.4%. The total number of iterations taken by the kNN classifier and the Fast kNN classifier for the four training set sizes is tabulated in Table 1.

Table 1: Comparison of the two algorithms in terms of number of iterations

S.No    No. of Instances    kNN Classifier    Fast kNN Classifier
1       271                 201495            13173
2       516                 381570            17664
3       1362                1003380           16909
4       1977                1455405           17246

The results show that the Fast kNN classifier takes far fewer iterations than the kNN classifier.

Figure 5: Kappa statistics of Fast kNN and kNN for the Probe, U2R, R2L, and DoS classes.

Figure 6: Accuracy of Fast kNN and kNN for the Probe, U2R, R2L, and DoS classes.

Figure 7: Comparison of the two algorithms in terms of the number of iterations for each training set size.
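The accuracy and kappa values reported above can be computed as in the following Java sketch. The paper does not show its evaluation code, so this uses the standard definitions (the fraction of correctly classified test samples for accuracy, and Cohen's kappa for chance-corrected agreement) and is our own illustration, not the authors' implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Standard accuracy and Cohen's kappa over actual and predicted class labels
// (our own illustration; the paper does not spell out its evaluation code).
public final class Evaluation {

    public static double accuracy(int[] actual, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == predicted[i]) correct++;
        }
        return (double) correct / actual.length;
    }

    // Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).
    public static double kappa(int[] actual, int[] predicted) {
        int n = actual.length;
        Map<Integer, Integer> actualCounts = new HashMap<>();
        Map<Integer, Integer> predictedCounts = new HashMap<>();
        for (int i = 0; i < n; i++) {
            actualCounts.merge(actual[i], 1, Integer::sum);
            predictedCounts.merge(predicted[i], 1, Integer::sum);
        }
        double chance = 0.0;
        for (Map.Entry<Integer, Integer> e : actualCounts.entrySet()) {
            int predCount = predictedCounts.getOrDefault(e.getKey(), 0);
            chance += (e.getValue() / (double) n) * (predCount / (double) n);
        }
        double observed = accuracy(actual, predicted);
        return (observed - chance) / (1.0 - chance);
    }
}
```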

5. Conclusion and Future Work

In this paper we used the Fast kNN classifier algorithm for intrusion detection on a large, mixed data set. The analysis gives good predictions for the different data sets drawn from KDD Cup 99, but the approach also suffered problems in alarm generation. The processing speed of the algorithm is reported in terms of the number of iterations and is compared with the general kNN classifier, and the complexity of the two algorithms is also shown. Rough Set Theory and the Support Vector Machine have been used as tools to enhance the accuracy of existing intrusion detection algorithms [10]; as future work, we plan to use Rough Set Theory for detecting intrusions.

References:

[1] Stephen Northcutt and Judy Novak, Network Intrusion Detection, Third Edition, New Riders Publishing.
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, an imprint of Elsevier, ISBN 978-1-55860-901-3.
[3] Ramasamy Mariappan, "An intelligent approach for intrusion detection system using KNN classifier," pp. 355-360, ISDA 2004, IEEE 4th International Conference on Intelligent Systems Design and Applications, August 26-28, 2004.
[4] Zacharias Voulgaris and George D. Magoulas, "Extensions of the k Nearest Neighbor Methods for Classification Problems," AIA '08: Proceedings of the 26th IASTED International Conference on Artificial Intelligence and Applications, 2008.
[5] Jeng-Shyang Pan, Yu-Long Qiao, and Sheng-He Sun, "A Fast K Nearest Neighbors Classification Algorithm," IEICE Trans. Fundamentals, Vol. E87-A, No. 4, April 2004.
[6] T. Abidin and W. Perrizo, "SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining," Proceedings of ACM SAC-06, Dijon, France, ACM Press, New York, NY, pp. 536-540, April 23-27, 2006.
[7] R. Shanmugavadivu and N. Nagarajan, "Network Intrusion Detection System Using Fuzzy Logic," Indian Journal of Computer Science and Engineering (IJCSE).
[8] S. Siva Sathya, R. Geetha Ramani, and K. Sivaselvi, "Discriminant Analysis based Feature Selection in KDD Intrusion Dataset," International Journal of Computer Applications, Vol. 31, No. 11, October 2011.
[9] Wenke Lee, Salvatore J. Stolfo, and Kui W. Mok, "A Data Mining Framework for Building Intrusion Detection Models," Proceedings of the 1999 IEEE Symposium on Security and Privacy, pp. 120-132, May 1999.
[10] Shailendra Kumar Shrivastava and Preeti Jain, "Effective Anomaly based Intrusion Detection using Rough Set Theory and Support Vector Machine," International Journal of Computer Applications (0975-8887), Vol. 18, No. 3, March 2011.
[11] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
