INTRUSION DETECTION SYSTEM

Project Trainee
Muduy Shilpa
B.Tech Pre-final year, Electrical Engineering
IIT Kharagpur, Kharagpur

Supervised By:
Dr. V. Radha
Assistant Professor, IDRBT-Hyderabad

Guided By:
Mr. Jagannath Kranthi
Research Fellow, IDRBT-Hyderabad

CERTIFICATE

This is to certify that this project has been successfully completed to my satisfaction and that the goals set at the outset of this endeavor have been worked upon to the best of the student's abilities and resources. I hereby allow this project to be presented for evaluation with my full consent.

Supervisor
Dr. V. Radha

Intrusion Detection using Support Linear Machine

M. Shilpa a, b, Jagannath Kranthi *, V. Radha *, a

a Institute for Development and Research in Banking Technology, Castle Hills Road #1, Masab Tank, Hyderabad 500 057 (A P) INDIA
b Department of Electrical Engineering, Indian Institute of Technology-Kharagpur, Kharagpur 721 302 (W B) INDIA
* Corresponding Author. Ph: +91-40-2353 4981

Introduction

Intrusion Detection
Intrusion detection is the process of monitoring the events occurring in a computer system or network and analyzing them for signs of intrusions, defined as attempts to bypass the security mechanisms of a computer or network (to compromise the confidentiality, integrity, or availability of information resources).

Intrusion Detection System (IDS)
- A combination of software and hardware that attempts to perform intrusion detection.
- Raises an alarm when a possible intrusion happens.
- Attack categories: Probe, DoS, U2R, R2L

Due to the proliferation of high-speed Internet access, more and more organizations are becoming vulnerable to potential cyber attacks, such as network intrusions. The sophistication of cyber attacks, as well as their severity, has also increased recently.

Cyber attacks (intrusions) are actions that attempt to bypass the security mechanisms of computer systems. They are caused by:
- Attackers accessing the system from the Internet
- Insider attackers: authorized users attempting to gain and misuse non-authorized privileges

[Figure: typical intrusion scenario (scanning activity on a computer network)]

Why do we need Intrusion Detection

[Figure: machine with vulnerability]

- Security mechanisms always have inevitable vulnerabilities
- Current firewalls are not sufficient to ensure security in computer networks
- Security holes are caused by allowances made to users/programmers/administrators
- Insider attacks
- Multiple levels of data confidentiality in commercial and government organizations need multilayer protection in firewalls

Traditional Intrusion Detection Systems

Traditional intrusion detection system (IDS) tools (e.g. SNORT) are based on signatures of known attacks.

Example of a SNORT rule (MS-SQL Slammer worm), from www.snort.org:
  any -> udp port 1434 (content:"|81 F1 03 01 04 9B 81 F1 01|"; content:"sock"; content:"send")

Limitations
- The signature database has to be manually revised for each new type of discovered intrusion
- They cannot detect emerging cyber threats
- Substantial latency in the deployment of newly created signatures

Data mining based IDSs can alleviate these limitations.

Taxonomy of Computer Attacks

Intrusions can be classified according to several categories:
- Attack type (Denial of Service (DoS), Scan, worms/Trojan horses, compromises (R2L, U2R), ...)
- Number of network connections involved in the attack
  - single connection cyber attacks
  - multiple connections cyber attacks
- Source of the attack
  - multiple vs. single
  - inside vs. outside
- Environment (network, host, P2P, wireless networks, ...)
- Automation (manual, automated, semi-automated attacks)
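To make the idea of signature-based detection concrete, the following is a minimal sketch of content matching in the spirit of the Slammer rule quoted above. It is not how SNORT itself is implemented; the function name `matches_signature`, the dictionary layout and the sample payloads are assumptions made for illustration, and only the port number and content patterns come from the rule in the text.

```python
# Simplified signature check modelled on the Slammer rule quoted in the text.
# Assumption: a packet is described by protocol, destination port and raw payload.

SLAMMER_SIGNATURE = {
    "protocol": "udp",
    "dst_port": 1434,
    "contents": [
        bytes.fromhex("81 F1 03 01 04 9B 81 F1 01"),  # byte pattern from the rule
        b"sock",
        b"send",
    ],
}


def matches_signature(protocol, dst_port, payload, signature=SLAMMER_SIGNATURE):
    """Return True when the header fields match and every content pattern occurs in the payload."""
    if protocol.lower() != signature["protocol"]:
        return False
    if dst_port != signature["dst_port"]:
        return False
    return all(pattern in payload for pattern in signature["contents"])


# Hypothetical payloads, built only to exercise the check.
suspicious = b"\x04" + bytes.fromhex("81 F1 03 01 04 9B 81 F1 01") + b"..sock..send"
benign = b"ordinary request to the SQL resolution service"

print(matches_signature("udp", 1434, suspicious))
print(matches_signature("udp", 1434, benign))
```

A payload that differs from the stored patterns in even one byte is not matched, which illustrates why a signature database has to be revised for every newly discovered intrusion.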

Types of Computer Attacks

- DoS (Denial of Service) attacks
  - DoS attacks attempt to shut down a network, computer, or process, or otherwise deny the use of resources or services to the authorized users
  - Distributed DoS attacks
- Probe (probing, scanning) attacks
  - The attacker uses network services to collect information about a host (e.g. the list of valid IP addresses, what services it offers, what the operating system is)
- Compromises: attackers use known vulnerabilities such as buffer overflows and weak security to gain privileged access to hosts
  - R2L (Remote to Local) attacks: an attacker who has the ability to send packets to a machine over a network (but does not have an account on that machine) gains access (either as a user or as root) to the machine and performs harmful operations
  - U2R (User to Root) attacks: an attacker who has access to a local account on a computer system is able to elevate his or her privileges by exploiting a bug in the operating system or in a program that is installed on the system
- Trojan horses / worms: attacks that aggressively replicate on other hosts (worms are self-replicating; Trojan horses are downloaded by users)

Source of Computer Attacks

- Attacks may be launched from a single location or from several different locations
- Attacks may also be targeted at a single destination or at many different destinations
- Network data from several sites needs to be analyzed in order to detect these distributed attacks
- Single source attacks
- Distributed/coordinated attacks

Intrusion Detection Taxonomy

- Information source: host-based ID, network-based ID, wireless-network ID, application logs, sensor alerts
- Analysis strategy: anomaly detection vs. misuse detection

- Data mining approach vs. traditional techniques
- Time aspects in analysis: real-time analysis vs. off-line analysis
- Architecture: single centralized vs. distributed & heterogeneous
- Activeness: active reaction vs. passive reaction
- Continuality: continuous analysis vs. periodic analysis

IDS - Analysis Strategy

Misuse detection is based on extensive knowledge of patterns associated with known attacks, provided by human experts.
- Existing approaches: pattern (signature) matching, expert systems, state transition analysis, data mining
- Major limitations:
  - Unable to detect novel & unanticipated attacks
  - The signature database has to be revised for each new type of discovered attack

Anomaly detection is based on profiles that represent the normal behavior of users, hosts, or networks, and detects attacks as significant deviations from this profile.
- Major benefit: potentially able to recognize unforeseen attacks
- Major limitation: possibly a high false alarm rate, since detected deviations do not necessarily represent actual attacks
- Major approaches: statistical methods, expert systems, clustering, neural networks, support vector machines, outlier detection schemes

IDS - Time Aspects in Analysis

Real-time IDS
- Analyzes the data while the sessions are in progress (e.g. network sessions for network intrusion detection, login sessions for host-based intrusion detection)
- Raises an alarm immediately when the attack is detected

Off-line IDS
- Analyzes the data after the information about the sessions has already been collected (post-analysis)
- Useful for understanding the attackers' behavior

Standard measures for evaluating IDSs:
- Detection rate: ratio between the number of correctly detected attacks and the total number of attacks
- False alarm (false positive) rate: ratio between the number of normal connections that are incorrectly classified as attacks (False Alarms in Table) and the total number of normal connections
- Trade-off between detection rate and false alarm rate
- Performance (processing speed + propagation + reaction)
- Fault tolerance (resistant to attacks, recovery, resist subversion)

Data

DARPA Data Set

The DARPA evaluation data set has been made available by MIT Lincoln Laboratory under DARPA sponsorship (Lippmann & Cunningham, 1999). To date, three data sets are available for evaluation: DARPA 98, 99 and 2000 (DARPA 1998 data). Each more recent data set contains new attacks. A subset of the DARPA intrusion detection data set is used for off-line analysis. In the DARPA intrusion detection evaluation program, an environment was set up to acquire raw TCP/IP dump data for a network by simulating a typical U.S. Air Force LAN. The LAN was operated like a real environment, but was blasted with multiple attacks (Kendall, 1998) (Webster, 1998). For each TCP/IP connection, 41 quantitative and qualitative features were extracted (Lee & Stolfo, 1998).

The 41 extracted features fall into three categories:
- Intrinsic features describe the individual TCP/IP connections and can be obtained from network audit trails.
- Content-based features describe the payload of the network packet and can be obtained from the data portion of the packet.
- Traffic-based features are computed using a specific window (connection time or number of connections).

DoS and Probe attacks involve several connections in a short time frame, whereas R2L and U2R attacks are embedded in the data portions of the connection and often involve just a single connection. Traffic-based features therefore play an important role in deciding whether a particular network activity is engaged in probing or not.

Attack types fall into four main categories:
1. Denial-of-Service (DoS): deny legitimate requests to a system.
2. Remote-to-Local (R2L): unauthorized access from a remote machine.
3. User-to-Root (U2R): unauthorized access to local super user (root) privileges.
4. Probing: surveillance and information gathering attacks.

Network-based data is provided in tcpdump files and host-based data in BSM audit log files.

1. PROPOSED SCHEME

In this research work, we propose a scheme for network intrusion detection using 5 SVMs, one for each class of data. In the proposed approach, SVM-RFE (Guyon, 2002) is first employed for feature selection. The 5 SVMs are then trained on the selected features. The proposed approach is applied to build a network intrusion detector, a predictive model capable of distinguishing between bad connections, called intrusions or attacks, and good normal connections.

The dataset used in this study is obtained from KDDcup99, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining (Lee et al., 1998). The dataset is very large and unbalanced, with 41 features that describe basic TCP features as well as time- and window-based features. Table 1 presents the attribute information for the data analyzed in this research study. There are approximately 5 million training records and 0.5 million testing records. Both the training and testing subsets cover four major attack categories, viz. Probing/Scanning, DoS, User-to-Root and Remote-to-Local. Table 2 presents these attack categories based on the attributes.

Attribute selection has been done using SVM. First, the SVM is used to obtain a rank for every attribute, and then a threshold limit is set. All attributes whose rank falls above the threshold are kept. A new dataset is prepared using the reduced attributes, and this dataset is employed for training. By following this procedure with various threshold limits, 22 features have been selected which give almost the same accuracy and sensitivity as the 41-attribute dataset. Using the 22-feature dataset considerably reduces the testing time, the training time and the memory space required.

Table 1. Attribute information of the dataset analyzed

No  Feature              No  Feature              No  Feature
1   duration             15  su_attempted         29  same_srv_rate
2   protocol_type        16  num_root             30  diff_srv_rate
3   service              17  num_file_creations   31  srv_diff_host_rate
4   flag                 18  num_shells           32  dst_host_count
5   src_bytes            19  num_access_files     33  dst_host_srv_count
6   dst_bytes            20  num_outbound_cmds    34  dst_host_same_srv_rate
7   land                 21  is_host_login        35  dst_host_diff_srv_rate
8   wrong_fragment       22  is_guest_login       36  dst_host_same_src_port_rate
9   urgent               23  count                37  dst_host_srv_diff_host_rate
10  hot                  24  srv_count            38  dst_host_serror_rate
11  num_failed_logins    25  serror_rate          39  dst_host_srv_serror_rate
12  logged_in            26  srv_serror_rate      40  dst_host_rerror_rate
13  num_compromised      27  rerror_rate          41  dst_host_srv_rerror_rate
14  root_shell           28  srv_rerror_rate

PROPOSED 5 CLASS SVM INTRUSION DETECTION ARCHITECTURE

[Figure: proposed 5-class SVM intrusion detection architecture. Components shown: Internet, Firewall, Network Data Pre-Processor, SVM1 (normal), SVM2 (DoS), SVM3 (R2L), SVM4 (U2R), SVM5 (Probe), Flag?, Server, System Administrator]
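The threshold step of the attribute selection described above can be sketched as follows. This is only an illustration of "keep every attribute whose ranking score is above the threshold"; it does not reproduce SVM-RFE itself, and the function names and the score values are hypothetical, not figures from this study.

```python
def select_features(rank_scores, threshold):
    """Keep the features whose ranking score is above the threshold, best first."""
    kept = [name for name, score in rank_scores.items() if score > threshold]
    return sorted(kept, key=lambda name: rank_scores[name], reverse=True)


def project(record, selected):
    """Build a reduced record that contains only the selected features."""
    return {name: record[name] for name in selected}


# Hypothetical ranking scores for four of the 41 attributes.
scores = {"land": 0.91, "flag": 0.74, "num_shells": 0.08, "urgent": 0.02}
selected = select_features(scores, threshold=0.5)

# One connection record restricted to the selected attributes.
record = {"land": 0, "flag": "SF", "num_shells": 0, "urgent": 0}
reduced = project(record, selected)

print(selected)
print(reduced)
```

Repeating this with different threshold values and retraining on each reduced dataset corresponds to the procedure by which the 22-feature subset was chosen.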

3.1 DATA PREPROCESSING

The dataset under consideration is very large, with 0.5 million training records and 0.05 million testing records, and many records appear more than once. We considered such records to be duplicates that add redundancy to the data. Training a machine learning model on redundant data leads to overtraining, which degrades the performance of the resulting model. We therefore removed the redundant records from both the training data and the test data. After removing the redundant records, the training data is reduced to 0.145 million records and the test data to 0.077 million records. This modified data is then used for detecting network intrusions.
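The duplicate-removal step can be sketched as below, assuming each connection record is available as a sequence of field values and that two records count as duplicates only when all fields are identical. The function name and the sample records are illustrative and are not taken from the actual preprocessing code.

```python
def remove_duplicates(records):
    """Return the records with exact repeats dropped, keeping first occurrences in order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records: (duration, protocol_type, service, flag, label)
records = [
    [0, "tcp", "http", "SF", "normal"],
    [0, "icmp", "ecr_i", "SF", "smurf"],
    [0, "icmp", "ecr_i", "SF", "smurf"],
    [0, "tcp", "http", "SF", "normal"],
    [2, "tcp", "ftp", "SF", "normal"],
]

unique = remove_duplicates(records)
print(len(records), "->", len(unique))
```

Keeping a set of already seen records makes the pass linear in the number of records, which matters for a dataset of this size.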

Table 2. Transformation of the classes

Attack types                                                          Class (five-way)  Class (binary)
normal                                                                normal            normal
apache2, back, land, mailbomb, neptune, pod, processtable, smurf,     DoS               attack
  teardrop, udpstorm
multihop, phf, ftp_write, guess_passwd, spy, snmpattack, snmpguess,   R2L               attack
  sendmail, named, warezclient, warezmaster, worm, xlock, xsnoop
buffer_overflow, loadmodule, ps, perl, rootkit, xterm, httptunnel,    U2R               attack
  sqlattack, imap
mscan, nmap, ipsweep, portsweep, saint, satan                         Probe             attack

SVMs Briefly Explained

Support vector machines, or SVMs, are learning machines that place the training vectors in a high-dimensional feature space, labeling each vector by its class. SVMs classify data by determining a set of vectors from the training set, called support vectors, which outline a hyperplane in the feature space.

SVMs provide a generic mechanism to fit the surface of the hyperplane to the data through the use of a kernel function. The user may provide a function (e.g., linear, polynomial, or sigmoid) to the SVM during the training process, which selects support vectors along the surface of this function. The number of free parameters used in an SVM depends on the margin that separates the two classes but not on the number of input features. SVMs therefore do not require a reduction in the number of features in order to avoid overfitting, an apparent advantage in applications such as intrusion detection. Another primary advantage of SVMs is the low expected probability of generalization errors.

There are other reasons that we use SVMs for intrusion detection. The first is speed: as real-time performance is of primary importance to IDSs, any classifier that can potentially run fast is worth considering. The second is scalability: SVMs are relatively insensitive to the number of data points, and the classification complexity does not depend on the dimensionality of the feature space, so they can potentially learn a larger set of patterns and thus scale better than neural networks. Finally, SVMs give highly accurate classification of the patterns.

Final results

REDUCED ATTRIBUTES (22) AFTER USING SVM

No  Feature                      No  Feature                  No  Feature
1   land                         9   srv_serror_rate          17  serror_rate
2   num_failed_logins            10  dst_host_srv_count       18  dst_host_rerror_rate
3   dst_host_srv_diff_host_rate  11  dst_host_same_srv_rate   19  logged_in
4   flag                         12  num_compromised          20  is_guest_login
5   dst_host_srv_serror_rate     13  srv_count                21  su_attempted
6   dst_host_diff_srv_rate       14  duration                 22  root_shell
7   service                      15  dst_host_count
8   srv_diff_host_rate           16  srv_serror_rate

Table 3. Performance of the 5 SVMs using 41 features

Class    Accuracy (%)
Normal   92.26
DoS      95.67
Probe    99.46
R2L      94.86
U2R      99.9

Table 4. Results obtained using SVM (all values in %)

Classifier  Full features (41 features)           Reduced features (22 features)
            Accuracy  Sensitivity  Specificity    Accuracy  Sensitivity  Specificity
SVM         92.26     90.54        99.82          90.34     88.16        99.83

The results presented in this report are not strictly comparable to (Chen et al., 2009), because they used the original data without removing the redundant records during preprocessing. The accuracy obtained by our proposed approach with reduced features and SVM as a classifier is 90.34%, with 88.16% sensitivity and 99.83% specificity. (Chen et al., 2009) reported an accuracy of 89.13%, a sensitivity of 86.72% and a specificity of XX%, where they also trained the SVM using reduced feature data. The results obtained by our proposed approach are therefore comparable to those of the (Chen et al., 2009) approach.
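For reference, the three measures reported in Table 4 can be computed from confusion counts as sketched below. The sketch assumes the usual definitions with "attack" as the positive class, under which sensitivity equals the detection rate and the false alarm rate equals one minus specificity. The function name and the counts are illustrative and are not the counts behind Table 4.

```python
def evaluate(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity from confusion counts.

    tp: attacks flagged as attacks      fn: attacks missed
    tn: normal connections passed       fp: normal connections flagged (false alarms)
    """
    total = tp + fn + tn + fp
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("need at least one attack and one normal record")
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # detection rate
    specificity = tn / (tn + fp)   # 1 - false alarm rate
    return accuracy, sensitivity, specificity


# Hypothetical counts: 4 attacks (3 detected) and 5 normal connections (1 false alarm).
accuracy, sensitivity, specificity = evaluate(tp=3, fn=1, tn=4, fp=1)
print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

On an unbalanced dataset such as this one, accuracy alone can look high even when few attacks are caught, which is why sensitivity and specificity are reported alongside it.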