RULE BASED CLASSIFICATION FOR NETWORK INTRUSION DETECTION SYSTEM USING UNSW-NB15 DATASET

Dr C Manju
Assistant Professor, Department of Computer Science
Kanchi Mamunivar Center for Post Graduate Studies, Lawspet, Puducherry

ABSTRACT: Communication plays a vital role in information technology. It involves the transfer of data from one place to another. An intrusion detection system (IDS) is used to detect and manage internal and external attacks and other threats such as botnets, phishing and spoofing. This paper evaluates network intrusion detection systems (NIDS) using the UNSW-NB15 dataset and rule based classifiers. Direct and indirect methods are analysed using the RIPPER, One-R, RIDOR, Decision Table and PART procedures. The evaluation shows that the PART classifier, an indirect rule based method, gives the best accuracy and the lowest error among the classifiers studied.

Keywords: Intrusion Detection System, UNSW-NB15 dataset, Rule Based Classifiers, Direct Method, Indirect Method

[1] INTRODUCTION

Security in information technology is very important whenever data is transmitted. An IDS detects and manages the various attacks that occur during communication. An IDS can be classified as host based or network based. A host based IDS is concerned with local attacks, whereas a network based IDS monitors overall network activity [1][2].
An IDS can be modelled using a signature based analysis approach, which monitors traffic against a predetermined list of attacks or signatures. Because it relies on signature matching, it can detect only known attacks. The alternative is an anomaly based approach, which models the state of network traffic and reports whether the traffic is normal or anomalous. The main aim of an IDS is to integrate network and host based approaches for better detection. Many IDS schemes can be developed for detecting novel attacks rather than individual instances. Network data is evaluated using various available datasets.

[2] DATA SET DESCRIPTION

Network intrusion detection systems have been evaluated using the KDD98, KDDCUP99 and NSL-KDD benchmark datasets. These datasets are old and do not reflect current network topology and traffic. The KDDCUP [3] dataset contains a large number of redundant records and many missing values. NSL-KDD is a modification of KDDCUP, but it too is unsuitable as a benchmark for modern network and traffic environments [4]. The Australian Centre for Cyber Security (ACCS) research group created the UNSW-NB15 [4] dataset to evaluate NIDS. The IXIA PerfectStorm tool was used in the ACCS cyber range lab to generate modern normal and abnormal network traffic. The dataset contains 49 fields and nine types of attack, namely Fuzzers, Backdoors, Analysis, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms [4]. The fields can be grouped into flow features (source and destination address, port address, transmission protocol), basic features (data transfer details, load, services), time features, connection features and labelled features.

[3] RULE BASED CLASSIFICATION

Mining is the process of extracting knowledge from available datasets.
Analysis of data can be used to extract models, describe classes or predict future outcomes. Classification analyses categorical labels and predicts the label of new data. Different classification models are available, such as statistical models, fuzzy models, rule based models, ensemble methods and probabilistic methods [5]. A rule based model generates a set of rules for predicting the output. A rule has the form (Condition) → y, where Condition is a conjunction of attribute tests and y is a class label. A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule. The main advantages of rule based classifiers are that they are as expressive as decision trees and are easy to interpret and generate. New instances can be classified easily by a rule based system. A rule set is mutually exclusive if its rules are independent of each other, so that no instance triggers more than one rule, and exhaustive if it accounts for every combination of attribute values [6].

There are two ways of building rules: the direct method and the indirect method. The direct method extracts rules directly from data, and the indirect method extracts them from other classification models. This paper analyses both methods using classification algorithms and evaluates the results.

[3.1] DIRECT METHOD

The direct method starts with an empty rule set and generates rules directly from the data. The rules are then pruned and simplified. The quality of a classification rule can be evaluated by its coverage and accuracy. Coverage is the fraction of records that satisfy the antecedent of the rule, and accuracy is the fraction of the records covered by the rule that belong to the class on its right-hand side. Several classifiers are available under this method. Here, the RIPPER, RIDOR and One-R classifiers are studied and analysed.
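The coverage and accuracy measures defined above can be illustrated with a minimal Python sketch. The function name rule_quality, the toy records and the rule condition are assumptions made for illustration only; they are not taken from the paper's experiments, although the attribute names sttl, dmean and attack_cat follow the UNSW-NB15 fields that appear in the sample rules.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def rule_quality(
    records: List[Record],
    condition: Callable[[Record], bool],
    predicted_class: str,
    label_key: str = "attack_cat",
) -> Tuple[float, float]:
    """Return (coverage, accuracy) of the rule `condition -> predicted_class`."""
    # Records that satisfy the antecedent of the rule.
    covered = [r for r in records if condition(r)]
    coverage = len(covered) / len(records) if records else 0.0
    # Covered records whose true label matches the rule's right-hand side.
    correct = sum(1 for r in covered if r[label_key] == predicted_class)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy


# Hypothetical toy records (not real UNSW-NB15 rows).
records = [
    {"sttl": 254, "dmean": 120, "attack_cat": "Exploits"},
    {"sttl": 254, "dmean": 110, "attack_cat": "Exploits"},
    {"sttl": 254, "dmean": 130, "attack_cat": "DoS"},
    {"sttl": 31, "dmean": 50, "attack_cat": "Normal"},
    {"sttl": 62, "dmean": 200, "attack_cat": "Generic"},
]

# Rule: (sttl >= 254) and (dmean >= 107) -> Exploits
coverage, accuracy = rule_quality(
    records,
    lambda r: r["sttl"] >= 254 and r["dmean"] >= 107,
    "Exploits",
)
```

In this toy example the rule covers three of the five records, and two of the three covered records carry the predicted label.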
A. One-R

One-R finds a relation between a single predictor and the target class. It creates a rule for each predictor that assigns a target class to each value of that predictor, calculates the total error of each predictor's rule, and selects the rule with the smallest total error. The rules generated are as below:

    < 131580.5 -> Exploits
    < 131883.5 -> Generic
    < 131895.5 -> Fuzzers
    >= 131895.5 -> Generic
    (55580/81694 instances correct)

B. RIDOR

RIDOR is a ripple-down rule learner. It generates a default rule first and then uses incremental reduced-error pruning to find the exceptions with the smallest error rate [7]. Sample exceptions are:

    Except (id > 123179.5) and (id <= 123213.5) => attack_cat = Generic (3.0/0.0) [2.0/1.0]
    Except (id > 123520.5) and (id <= 123957.5) and (id > 123781) => attack_cat = DoS (33.0/0.0) [13.0/1.0]
    Except (id <= 123931) and (id > 123796) => attack_cat = Generic (25.0/0.0) [9.0/0.0]

The values after each rule specify its accuracy and coverage. The total number of rules (including the default rule) is 141272 and the time taken to build the model is 3015.4 seconds.

C. DECISION TABLE

A decision table specifies only logic rules and is used to assess the quality of a decision. It contains classifier rules created by a simple decision table majority classifier, which returns the majority class of the training set if the decision table cell matching the new instance is empty [8]. Testing used a forward search with 47 evaluated subsets. The number of rules generated is 4326 and the time taken to generate them is 16.71 s.

D. RIPPER

RIPPER stands for Repeated Incremental Pruning to Produce Error Reduction. It divides the training set into a growing set and a pruning set [7]. Its results are easy to interpret and it is applicable to certain kinds of problems. Sample rules generated are as follows:

    (label = Attacked) and (dmean >= 45) and (dmean >= 107) and (sttl >= 254) => attack_cat=exploits (230.0/48.0)
    (label = Attacked) and (sinpkt >= 0.024) and (sinpkt <= 103.884333) and (dloss >= 2) and (dmean <= 55) => attack_cat=exploits (88.0/17.0)

The value (230.0/48.0) specifies the coverage.
It means that the rule covers 230 instances in the dataset, of which 48 are misclassified. The number of rules generated is 39 and the execution time is 8049 s.
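Of the direct methods above, One-R is simple enough to sketch in a few lines. The following is a minimal illustration, assuming categorical attributes only; the rules reported in this paper split a numeric attribute at thresholds, which this sketch does not attempt. The function name one_r and the toy rows are hypothetical and are not the WEKA implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def one_r(
    rows: List[Dict[str, str]], attributes: List[str], target: str
) -> Tuple[str, Dict[str, str], int]:
    """Return (best attribute, value -> class rule, total errors) for One-R."""
    best = None
    for attr in attributes:
        # Count the class frequencies for each value of this attribute.
        counts: Dict[str, Counter] = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # Each value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances that do not belong to the majority class.
        errors = sum(
            sum(c.values()) - c.most_common(1)[0][1] for c in counts.values()
        )
        # Keep the attribute whose rule makes the fewest errors.
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical toy rows (not real UNSW-NB15 records).
rows = [
    {"proto": "tcp", "service": "http", "attack_cat": "Exploits"},
    {"proto": "tcp", "service": "http", "attack_cat": "Exploits"},
    {"proto": "tcp", "service": "dns", "attack_cat": "Generic"},
    {"proto": "udp", "service": "dns", "attack_cat": "Generic"},
    {"proto": "udp", "service": "dns", "attack_cat": "Generic"},
    {"proto": "udp", "service": "http", "attack_cat": "Exploits"},
]

best_attr, best_rule, best_errors = one_r(rows, ["proto", "service"], "attack_cat")
```

Here the rule on proto misclassifies two rows while the rule on service misclassifies none, so service is selected as the single predictor.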
[3.2] INDIRECT METHOD

The rules in this method are extracted from other classification models. The rules generated here are mutually exclusive and exhaustive.

A. PART

PART is a rule induction method that extracts rules from decision trees in an attempt to avoid the problems of earlier rule learners. Unlike both C4.5 and RIPPER, it does not need to perform global optimization to produce accurate rule sets, and this added simplicity is its main advantage. It builds a partial decision tree on the current set of instances and creates a rule from that tree. It combines the separate-and-conquer strategy of rule learning with the divide-and-conquer strategy of decision trees and was proposed by Eibe Frank and Ian Witten [8][9]. It generates a decision list, which is an ordered set of rules. Sample rules are as below:

    id <= 120977 AND inpkt <= 0.004 AND id <= 120774: Fuzzers (5.0/2.0)
    id <= 120977 AND sinpkt <= 0.004 AND d > 120784: DoS (3.0/1.0)

[4] EXPERIMENTAL ANALYSIS

The classifiers above are analysed using the WEKA tool [8]. The evaluation uses mining techniques that include pre-processing and filtering. Pre-processing is done using CfsSubsetEval, which evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. The corresponding search is done using Best First, which searches the space of attribute subsets by greedy hill climbing augmented with a backtracking facility. After pre-processing and filtering, the number of attributes was reduced from 49 to 10. The data is then passed through 10-fold cross validation. This method splits the dataset into 10 folds; for each fold, it builds a model on the other 9 folds and records the error of each prediction on the held-out fold. The process repeats until each of the 10 folds has served as the test set. The dataset is analysed with the direct and indirect classifiers described above.
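The 10-fold cross validation procedure described above can be sketched in plain Python as follows. This is only an illustration of the fold logic: the helper names, the majority-class stand-in classifier and the toy labels are assumptions, and WEKA's own implementation may differ, for example by stratifying the folds.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(
    X: Sequence, y: Sequence, fit: Callable, predict: Callable,
    k: int = 10, seed: int = 0,
) -> float:
    """Return the accuracy accumulated over k held-out folds."""
    errors = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Build the model on the other k-1 folds.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        # Record the errors made on the held-out fold.
        errors += sum(predict(model, X[i]) != y[i] for i in test_idx)
    return 1.0 - errors / len(X)


# Stand-in classifier (assumption): always predict the majority training class.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Hypothetical toy data: 40 "Generic" and 10 "DoS" instances.
X = [[i] for i in range(50)]
y = ["Generic"] * 40 + ["DoS"] * 10
cv_accuracy = cross_validate(X, y, fit_majority, predict_majority)
```

Any rule learner with a fit and predict pair could be substituted for the majority-class stand-in; each instance is used for testing exactly once.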
The classifiers are evaluated for accuracy and error parameters.

[4.1] ACCURACY PARAMETERS

The accuracy parameters include precision, true positive rate, F-measure, ROC area and Kappa statistic. Precision is evaluated per attack class and measures how many of the instances assigned to an attack class actually belong to it [8]. Accuracy refers to the ability of the model to correctly predict the attack class of new or previously unseen data. It is the percentage of test set instances that the classifier classifies correctly. Precision is defined as P = TP/(TP+FP). Recall R is the number of correctly classified positive instances divided by the number of actual positive instances in the dataset: R = TP/(TP+FN). The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate and also indicates the accuracy of the classifier on the data [9]. It shows the trade-off between sensitivity and specificity. The area under the ROC curve is a measure of accuracy.
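As a worked illustration of these accuracy parameters, the sketch below computes precision, recall, false positive rate, F-measure and the Kappa statistic from the four cells of a binary confusion matrix. The precision and recall formulas are those given above; the F-measure (harmonic mean of precision and recall) and Cohen's Kappa use their standard definitions, which this paper does not spell out. The function name and the counts are hypothetical, and the sketch assumes all denominators are non-zero.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute accuracy parameters from binary confusion matrix counts."""
    precision = tp / (tp + fp)            # P = TP / (TP + FP)
    recall = tp / (tp + fn)               # R = TP / (TP + FN), the TP rate
    fp_rate = fp / (fp + tn)              # false positive rate
    f_measure = 2 * precision * recall / (precision + recall)

    total = tp + fp + fn + tn
    observed = (tp + tn) / total          # observed agreement (accuracy)
    # Agreement expected by chance from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)

    return {
        "precision": precision,
        "recall": recall,
        "fp_rate": fp_rate,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(tp=80, fp=20, fn=20, tn=80)
```

With these counts, precision, recall and F-measure are all 0.8, the false positive rate is 0.2, and Kappa is (0.8 - 0.5) / (1 - 0.5) = 0.6. For a multi-class problem such as the nine attack categories, per-class values are typically averaged, weighted by class size.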
Fig. 1: ROC curve for PART classification

The rule based classifiers are evaluated and their accuracy parameters are given in Table 1. JRIP is the WEKA implementation of RIPPER and DT denotes Decision Table.

    Classifier   TP      FP      Precision   F-measure   ROC     Kappa
    DT           0.72    0.061   0.682       0.691       0.9     0.626
    JRIP         0.679   0.169   0.66        0.621       0.795   0.541
    One-R        0.628   0.099   0.555       0.57        0.788   0.498
    PART         0.771   0.041   0.753       0.754       0.944   0.697
    RIDOR        0.739   0.041   0.725       0.729       0.849   0.6565

Table 1: Accuracy parameters of various classifiers

The graph representing the above data is shown in Fig. 2.

Fig. 2: Graph representing accuracy parameters

From the graph and table, it is found that the PART classification algorithm has the highest precision, Kappa statistic, true positive rate and F-measure, and the lowest false positive rate (tied with RIDOR). The area under the ROC curve, which measures accuracy, is also highest for the PART classifier. From this,
we can conclude that the PART classifier, an indirect method of classification, is the most suitable method for evaluating the UNSW-NB15 dataset.

[4.2] ERROR RATE EVALUATION PARAMETERS

The error evaluation parameters include root mean squared error (RMSE), which shows the error between the predicted and actual classes of the instances in the dataset [10][11]. Lower RMSE values indicate more accurate classification rules. Mean absolute error (MAE) measures the average magnitude of the errors. The classifiers are also evaluated for relative absolute error (RAE) and root relative squared error (RRSE).

    Classifier   MAE      RMSE     RAE (%)   RRSE (%)
    DT           0.0767   0.1901   49.9626   68.6313
    RIDOR        0.0591   0.2286   36.0723   82.6019
    JRIP         0.0986   0.2224   64.2821   80.3133
    One-R        0.0743   0.2726   48.4527   98.4417
    PART         0.0523   0.1748   28.5934   63.1169

Table 2: Error parameters of various classifiers

Fig. 3: Graph representing error rate

In the figure and table, the PART algorithm has the lowest error on every measure and hence the best performance. RIDOR has the next lowest error in terms of MAE and RAE, although Decision Table is second on RMSE and RRSE. So we can conclude that PART and RIDOR provide higher performance than the other classifiers under study.

[5] CONCLUSION

Intrusion detection systems play a vital role in the secure communication of data. The system is evaluated on the UNSW-NB15 dataset using various rule based classification algorithms. In this paper, the performance of the rule classifiers RIDOR, RIPPER, Decision Table, PART and One-R is analysed using 10-fold cross validation. Performance is evaluated on accuracy and error parameters. The results show that PART, an indirect method of rule based classification, achieves higher accuracy and lower error than the other classifiers under study.
REFERENCES

[1] Krishna Kant Tiwari, Susheel Tiwari, Sriram Yadav, Intrusion Detection Using Data Mining Techniques, International Journal of Advanced Computer Technology (IJACT).
[2] Trupti Phutane, Apashabi Pathan, A Survey of Intrusion Detection System Using Different Data Mining Techniques, International Journal of Innovative Research in Computer and Communication Engineering, Vol. 2, Issue 11, November 2014.
[3] Nour Moustafa, UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems (UNSW-NB15 Network Data Set), University of New South Wales at the Australian Defence Force Academy, Canberra, Australia. Conference paper, November 2015. DOI: 10.1109/MilCIS.2015.7348942.
[4] Safaa O. Al-mamory, Firas S. Jassim, Evaluation of Different Data Mining Algorithms with KDD CUP 99 Data Set, Journal of Babylon University/Pure and Applied Sciences, No. (8), Vol. (21), 2013.
[5] Dr C Manju, Performance Evaluation of Intrusion Detection System Using Classification Algorithms, International Journal of Innovative Research in Science, Engineering and Technology, Vol. 6, Issue 7, July 2017.
[6] S. Vijayarani, M. Muthulakshmi, Evaluating the Efficiency of Rule Techniques for File Classification, International Journal of Research in Engineering and Technology, eISSN: 2319-1163, ISSN: 2321-7308.
[7] Gaines, B.R., Compton, P., Induction of Ripple-Down Rules Applied to Modeling Large Databases, J. Intell. Inf. Syst. 5(3):211-228, 1995.
[8] Petra Kralj Novak, Classification in WEKA, 2009.
[9] Ali, Shawkat, and Kate A. Smith. "On learning algorithm selection for classification." Applied Soft Computing 6.2 (2006): 119-138.
[10] Pankaj Singh, Sudhakar Singh, Comparative Study of Data Mining Algorithms through Weka, International Journal of Emerging Research in Management & Technology, ISSN: 2278-9359 (Volume-4, Issue-9).
[11] Qin, Biao, et al. "A rule-based classification algorithm for uncertain data." Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on. IEEE, 2009.