CHAPTER 2 DARPA KDDCUP99 DATASET

2.1 THE DARPA INTRUSION-DETECTION EVALUATION PROGRAM

The intrusions to be found in computer and network audit data are plentiful as well as ever-changing. They are also thoroughly scattered, and attempts to structure or catalogue audit data are extremely effort-intensive. In order to create effective detection models, model-building algorithms typically require a large amount of labelled data. One major difficulty in deploying intrusion-detection systems (IDS) is therefore the need to label system audit data for the algorithms. Misuse-detection systems need the data to be accurately labelled as either normal or attack, whereas for anomaly-detection systems the data must be verified to ensure that it is exclusively normal, namely attack-free. This requires the same effort (Eskin et al 2000; Lee et al 2001), and preparation of the data in this manner is both time-consuming and costly.

A generous sponsor for the production of intrusion-detection audit data was found in the US government agency DARPA (Defense Advanced Research Projects Agency). An innovator and promoter of technology, this organization has funded many projects in the last few decades. In 1969, one such research and development project was subsidized to create an experimental packet-switched network. This one venture saw the modest beginnings of what grew into the omnipresent Internet known today. DARPA also supports the evaluation of developing technologies: focusing effort, documenting existing capabilities and guiding research.

The 1998 DARPA Off-line Intrusion-Detection Evaluation Program (Lippmann et al 2000; http://www.ll.mit.edu/ist/ideval/data/data_index.html 1999; Lippmann et al 2000; Haines et al 2001) was one such project. Aware of the lack of suitable audit data sets for intrusion detection, DARPA set out (1) to generate an intrusion-detection evaluation corpus which could be shared by many researchers, (2) to evaluate many intrusion-detection systems, (3) to include a wide variety of attacks and (4) to measure both attack-detection rates and false-alarm rates for realistic normal traffic. To avoid publicizing confidential information concerning any real network, and in order not to disrupt the operation of an on-line network, an extensive test bed was set up at MIT's Lincoln Laboratory to synthesize the data. This test bed simulated the operation of a typical US Air Force LAN for over two months, allowing a considerable amount of audit data to be collected from it.

2.2 ATTACK TYPES IN THE 1999 DARPA DATA SET

Each attack type falls into one of the following four main categories:

Denial-of-service (DOS) attacks have the goal of limiting or denying service(s) provided to a user, computer or network. A common tactic is to severely overload the targeted system, as in a SYN flood.

Probing or surveillance attacks have the goal of gaining knowledge of the existence or configuration of a computer system or network. Port scans or sweeps of a given IP-address range, such as ipsweep, are typically used in this category.

Remote-to-Local (R2L) attacks have the goal of gaining local access to a computer or network to which the attacker previously had only remote access. Examples are attempts to gain control of a user account, such as the dictionary attack.

User-to-Root (U2R) attacks have the goal of gaining root or super-user access on a particular computer or system on which the attacker previously had user-level access. These are attempts by a non-privileged user to gain administrative privileges (e.g. eject).

A total of 24 attack types were included in the training data, and a further 14 novel attacks were added to the test data in order to compare the performance of IDS on known and on as-yet-unseen attacks. A further aim of the evaluation was to determine whether systems could detect stealthy attacks. These are variations of an attack which have been modified from the standard form available on the Internet in an attempt to evade detection. Methods of being stealthy vary, depending on the attack type (Kendall 1999).

The attacks are grouped according to category and type. The number of occurrences of each is detailed, distinguishing between attacks launched in the clear and those performed stealthily, and specifying whether each appeared in the training or the test data. For example, there were 46 eject attacks in the simulation. Of these, 10 were stealthy and 36 were performed in the clear. Of those in the clear, 29 figured in the training data and 7 in the test data. In the DARPA programmes, detection rates for each attack category were estimated for comparative purposes when evaluating the performance of IDS.
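The per-category detection rates mentioned above can be illustrated with a minimal sketch. It assumes that each connection carries a true category label and a boolean alarm raised by the IDS under test; the function name and the fraction-based definition of the false-alarm rate are assumptions made for illustration, not the scoring procedure of the DARPA evaluation itself.

```python
from collections import Counter

def detection_rates(true_categories, flagged):
    """Return (per-category detection rate, false-alarm rate on normal traffic).

    true_categories: category of each connection, e.g. "dos" or "normal"
    flagged:         True where the IDS under test raised an alarm
    Both names are hypothetical; rates are fractions between 0 and 1.
    """
    total = Counter(true_categories)
    hits = Counter(cat for cat, alarm in zip(true_categories, flagged) if alarm)
    # Detection rate: share of the attacks in a category that were flagged.
    rates = {cat: hits[cat] / n for cat, n in total.items() if cat != "normal"}
    # False-alarm rate: share of the normal connections that were flagged.
    normal = total["normal"]
    false_alarm = hits["normal"] / normal if normal else 0.0
    return rates, false_alarm

truth = ["dos", "dos", "probe", "normal", "normal", "normal", "normal", "u2r"]
alarms = [True, True, False, True, False, False, False, True]
rates, false_alarm = detection_rates(truth, alarms)
```

Reporting the rates per category rather than as a single figure keeps the rare R2L and U2R attacks from being swamped by the far more numerous DOS connections.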

2.2.1 Different Attack Types

The category of an attack is determined by its ultimate goal, so that within a given category attacks may closely resemble each other.

The DOS attacks are designed to disrupt a host or network service. Some DOS attacks (e.g. smurf) excessively load a legitimate network service; others (e.g. teardrop, Ping of Death) create malformed packets which are incorrectly handled by the victim machine. Others still (e.g. apache2, back, syslogd) take advantage of software bugs in network daemon programmes.

Probe attacks are launched by programmes which can automatically scan a network of computers to gather information or find known vulnerabilities. Such probes are often precursors to more dangerous attacks, because they provide a map of machines and services and pinpoint weak links in a network. Some of these scanning tools (satan, saint and mscan) enable even an unskilled attacker to check hundreds of machines on a network for known vulnerabilities.

In the R2L attacks, an attacker who does not have an account on a victim machine sends packets to that machine and gains local access. Some R2L attacks exploit buffer overflows in network server software (e.g. imap, named, sendmail); others exploit weak or misconfigured security policies (e.g. dictionary, ftp-write and guest); and one (xsnoop) is a Trojan password-capture programme. The snmp-get R2L attack against the router is a password-guessing attack in which the community password of the router is guessed and the attacker then uses SNMP to monitor the router.

During U2R attacks, a local user on a machine tries to obtain privileges normally reserved for the UNIX root or super-user. Some U2R attacks exploit poorly written system programmes which run at root level and are susceptible to buffer overflows (e.g. eject, ffbconfig, fdformat). Others may exploit weaknesses in path-name verification (e.g. loadmodule), bugs in some versions of perl (e.g. suidperl) or other software weaknesses.

2.2.2 Attack Descriptions

back - Denial-of-service attack against the Apache web server, where a client requests a URL containing many backslashes.
dict - Guess passwords for a valid user, using simple variants of the account name over a telnet connection.
eject - Buffer overflow using the eject program on Solaris. Leads to a user-to-root transition if successful.
ffb - Buffer overflow using the ffbconfig UNIX system command; leads to a root shell.
format - Buffer overflow using the fdformat UNIX system command; leads to a root shell.
ftp-write - Remote FTP user creates a .rhost file in a world-writable anonymous FTP directory and obtains a local login.
guest - Try to guess the password via telnet for the guest account.
ipsweep - Surveillance sweep performing either a port sweep or a ping on multiple host addresses.
land - Denial of service where a remote host is sent a UDP packet with the same source and destination.
loadmodule - Non-stealthy loadmodule attack which resets IFS for a normal user and creates a root shell.
multihop - Multi-day scenario in which a user first breaks into one machine.
neptune - SYN-flood denial of service on one or more ports.
nmap - Network mapping using the nmap tool. The mode of exploring the network will vary; options include SYN.
perlmagic - Perl attack which sets the user id to root in a perl script and creates a root shell.
phf - Exploitable CGI script which allows a client to execute arbitrary commands on a machine with a misconfigured web server.
pod - Denial-of-service ping of death.
portsweep - Surveillance sweep through many ports to determine which services are supported on a single host.
rootkit - Multi-day scenario where a user installs one or more components of a rootkit.
satan - Network probing tool which looks for well-known weaknesses. Operates at three different levels; level 0 is light.
smurf - Denial-of-service ICMP echo-reply flood.
spy - Multi-day scenario in which a user breaks into a machine with the purpose of finding important information, and in which the user tries to avoid detection. Uses several different exploit methods to gain access.
syslog - Denial of service for the syslog service; connects to port 514 with an unresolvable source IP.
teardrop - Denial of service where mis-fragmented UDP packets cause some systems to reboot.
warez - User logs into an anonymous FTP site and creates a hidden directory.
warezclient - Users downloading illegal software which was previously posted via anonymous FTP by the warezmaster.
warezmaster - Anonymous FTP upload of warez (usually illegal copies of copyrighted software) onto an FTP server.

2.3 DATA-SET DESCRIPTION

The KDDCUP99 Data (Irvine 1999) are the data sets which were issued for use in the KDDCUP 99 Classifier-Learning Competition. These sets of training and test data were made available by Stolfo and Lee (http://kdd.ics.uci.edu/databases/kddcup99/task.htm 1999) and consist of a pre-processed version of the 1998 DARPA Evaluation Data. This team's IDS had performed particularly well in the Intrusion-Detection Evaluation Program of that year, using data mining as early as the pre-processing stage to extract characteristic intrusion features from raw TCP/IP audit data.

The original raw training data were about four gigabytes of compressed binary tcpdump data obtained from the first seven weeks of network traffic at MIT. These data were pre-processed with the feature-construction framework MADAM ID (Mining Audit Data for Automated Models for Intrusion Detection) to produce about five million connection records. A connection is defined to be a sequence of TCP packets starting and ending at some well-defined times, between which data flow between a source IP address and a destination IP address under some well-defined protocol. Each connection is labelled either as normal or with the name of its specific attack type. A connection record consists of about 100 bytes.

Ten percent of the complementary two weeks of test data were likewise pre-processed to yield a further set of fewer than half a million connection records. For the information of contestants, it was stressed that these test data were not from the same probability distribution as the training data, and that they included specific attack types which are not found in the training data. The full amount of labelled test data, with some two million records, was not included in this data set.

2.3.1 Set of Features used in the Connection Records

In the KDDCUP99 Data, the initial features extracted for a connection record (Eskin 2002; Lee 1994-1999) include the basic features of an individual TCP connection, such as its duration, protocol type, number of bytes transferred and the flag indicating the normal or error status of the connection. These intrinsic features provide information for general network-traffic analysis purposes.

Since most DOS and Probe attacks involve sending a lot of connections to the same host(s) at the same time, they can have frequent sequential patterns which are different from those of normal traffic. For these patterns, a "same host" feature examines all other connections in the previous 2 seconds which had the same destination as the current connection. Similarly, a "same service" feature examines all other connections in the previous 2 seconds which had the same service as the current connection. These temporal and statistical characteristics are referred to as the time-based traffic features.

There are several Probe attacks which use a much longer interval than 2 seconds (for example, one minute) when scanning the hosts or ports. For these, a mirror set of host-based traffic features was constructed, based on a connection window of 100 connections.

The R2L and U2R attacks are embedded in the data portions of the TCP packets and may involve only a single connection. To detect these, content features of individual connections were constructed using domain knowledge. These features suggest whether the data contain suspicious behaviour, such as the number of failed logins, whether the login succeeded, whether the user logged in as root, whether a root shell was obtained, etc.

In total, there are 42 features (including the attack type) in each connection record, with most of them taking on continuous values. The individual features are listed and briefly described in Tables 2.2 to 2.5. Table 2.1 shows the different types of attacks and their categories:

Table 2.1 Class Labels that Appear in the 10% KDDCUP99 Dataset

Attack             Number of Samples   Category
smurf.             280790              DOS
neptune.           107201              DOS
back.              2203                DOS
teardrop.          979                 DOS
pod.               264                 DOS
land.              21                  DOS
normal.            97277               Normal
satan.             1589                Probe
ipsweep.           1247                Probe
portsweep.         1040                Probe
nmap.              231                 Probe
warezclient.       1020                R2L
guess_passwd.      53                  R2L
warezmaster.       20                  R2L
imap.              12                  R2L
ftp_write.         8                   R2L
multihop.          7                   R2L
phf.               4                   R2L
spy.               2                   R2L
buffer_overflow.   30                  U2R
rootkit.           10                  U2R
loadmodule.        9                   U2R
perl.              3                   U2R
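As a small worked example, the sample counts of Table 2.1 can be summed per category to show how unevenly the classes are represented. The sketch below simply transcribes the table; the dictionary and function names are chosen here for illustration.

```python
# Sample counts and categories transcribed from Table 2.1.
TABLE_2_1 = {
    "smurf.": (280790, "DOS"), "neptune.": (107201, "DOS"),
    "back.": (2203, "DOS"), "teardrop.": (979, "DOS"),
    "pod.": (264, "DOS"), "land.": (21, "DOS"),
    "normal.": (97277, "Normal"),
    "satan.": (1589, "Probe"), "ipsweep.": (1247, "Probe"),
    "portsweep.": (1040, "Probe"), "nmap.": (231, "Probe"),
    "warezclient.": (1020, "R2L"), "guess_passwd.": (53, "R2L"),
    "warezmaster.": (20, "R2L"), "imap.": (12, "R2L"),
    "ftp_write.": (8, "R2L"), "multihop.": (7, "R2L"),
    "phf.": (4, "R2L"), "spy.": (2, "R2L"),
    "buffer_overflow.": (30, "U2R"), "rootkit.": (10, "U2R"),
    "loadmodule.": (9, "U2R"), "perl.": (3, "U2R"),
}

def category_totals(table):
    """Sum the number of samples per category."""
    totals = {}
    for count, category in table.values():
        totals[category] = totals.get(category, 0) + count
    return totals

totals = category_totals(TABLE_2_1)
for category, count in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{category:<7}{count:>8}")
# DOS      391458
# Normal    97277
# Probe      4107
# R2L        1126
# U2R          52
```

On these figures the DOS category alone accounts for roughly four fifths of the 494020 records, while U2R contributes only 52, an imbalance worth keeping in mind when detection rates are compared across categories.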

Connection Features, KDDCUP99

Table 2.2 Basic Features of Individual TCP Connections

Feature name      Description                                                     Type
duration          length (number of seconds) of the connection                    continuous
protocol_type     type of the protocol, e.g. tcp, udp, etc.                       discrete
service           network service on the destination, e.g. http, telnet, etc.     discrete
src_bytes         number of data bytes from source to destination                 continuous
dst_bytes         number of data bytes from destination to source                 continuous
flag              normal or error status of the connection                        discrete
land              1 if connection is from/to the same host/port; 0 otherwise      discrete
wrong_fragment    number of wrong fragments                                       continuous
urgent            number of urgent packets                                        continuous

Table 2.3 Content Features Within a Connection Suggested by Domain Knowledge

Feature name         Description                                             Type
hot                  number of "hot" indicators                              continuous
num_failed_logins    number of failed login attempts                         continuous
logged_in            1 if successfully logged in; 0 otherwise                discrete
num_compromised      number of "compromised" conditions                      continuous
root_shell           1 if root shell is obtained; 0 otherwise                discrete
su_attempted         1 if "su root" command attempted; 0 otherwise           discrete
num_root             number of "root" accesses                               continuous
num_file_creations   number of file creation operations                      continuous
num_shells           number of shell prompts                                 continuous
num_access_files     number of operations on access control files            continuous
num_outbound_cmds    number of outbound commands in an ftp session           continuous
is_hot_login         1 if the login belongs to the "hot" list; 0 otherwise   discrete
is_guest_login       1 if the login is a "guest" login; 0 otherwise          discrete
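To make the record layout concrete, the following sketch reads the basic features of Table 2.2 and the class label from one connection record. It assumes that a record is a single comma-separated line of 42 fields, that the nine basic features come first in the order duration, protocol_type, service, flag, src_bytes, dst_bytes, land, wrong_fragment, urgent, and that the label is the last field. The class name, function name and sample line are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class BasicFeatures:
    duration: int
    protocol_type: str
    service: str
    flag: str
    src_bytes: int
    dst_bytes: int
    land: int
    wrong_fragment: int
    urgent: int
    label: str

def parse_record(line: str) -> BasicFeatures:
    """Parse the basic features and the label from one comma-separated record.

    Assumes 42 fields per record (41 features plus the label) in the
    field order described in the lead-in above.
    """
    fields = line.strip().split(",")
    if len(fields) != 42:
        raise ValueError(f"expected 42 fields, got {len(fields)}")
    return BasicFeatures(
        duration=int(fields[0]),
        protocol_type=fields[1],
        service=fields[2],
        flag=fields[3],
        src_bytes=int(fields[4]),
        dst_bytes=int(fields[5]),
        land=int(fields[6]),
        wrong_fragment=int(fields[7]),
        urgent=int(fields[8]),
        label=fields[-1],
    )

# Illustrative record laid out in the assumed field order.
SAMPLE = ("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,"
          "0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,"
          "0.00,0.00,0.00,0.00,normal.")
record = parse_record(SAMPLE)
```

The discrete features (protocol_type, service, flag) are kept as strings here; a classifier that needs numeric input would have to encode them in a separate step.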

Table 2.4 Traffic Features Computed Using a Two-Second Time Window

Feature name         Description                                                                                   Type
count                number of connections to the same host as the current connection in the past two seconds     continuous

Note: The following features refer to these same-host connections.

serror_rate          % of connections that have "SYN" errors                                                       continuous
rerror_rate          % of connections that have "REJ" errors                                                       continuous
same_srv_rate        % of connections to the same service                                                          continuous
diff_srv_rate        % of connections to different services                                                        continuous

srv_count            number of connections to the same service as the current connection in the past two seconds  continuous

Note: The following features refer to these same-service connections.

srv_serror_rate      % of connections that have "SYN" errors                                                       continuous
srv_rerror_rate      % of connections that have "REJ" errors                                                       continuous
srv_diff_host_rate   % of connections to different hosts                                                           continuous
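The time-based features of Table 2.4 can be sketched as follows. This is an illustration under stated assumptions, not the MADAM ID implementation: the Conn record and its field names are hypothetical, the flags S0 to S3 are assumed to denote "SYN" errors and REJ a "REJ" error, only connections other than the current one are examined, and rates are returned as fractions between 0 and 1 rather than percentages.

```python
from dataclasses import dataclass

@dataclass
class Conn:
    time: float     # start time in seconds (hypothetical field)
    dst_host: str
    service: str
    flag: str

SYN_ERROR_FLAGS = {"S0", "S1", "S2", "S3"}   # assumption, see lead-in
REJ_ERROR_FLAGS = {"REJ"}                    # assumption, see lead-in

def _rate(conns, predicate):
    """Fraction of conns satisfying predicate; 0.0 for an empty list."""
    return sum(1 for c in conns if predicate(c)) / len(conns) if conns else 0.0

def time_window_features(history, current, window=2.0):
    """Compute the Table 2.4 features for `current` from earlier connections."""
    recent = [c for c in history if 0.0 <= current.time - c.time <= window]
    same_host = [c for c in recent if c.dst_host == current.dst_host]
    same_srv = [c for c in recent if c.service == current.service]
    same_srv_rate = _rate(same_host, lambda c: c.service == current.service)
    return {
        "count": len(same_host),
        "serror_rate": _rate(same_host, lambda c: c.flag in SYN_ERROR_FLAGS),
        "rerror_rate": _rate(same_host, lambda c: c.flag in REJ_ERROR_FLAGS),
        "same_srv_rate": same_srv_rate,
        "diff_srv_rate": (1.0 - same_srv_rate) if same_host else 0.0,
        "srv_count": len(same_srv),
        "srv_serror_rate": _rate(same_srv, lambda c: c.flag in SYN_ERROR_FLAGS),
        "srv_rerror_rate": _rate(same_srv, lambda c: c.flag in REJ_ERROR_FLAGS),
        "srv_diff_host_rate": _rate(same_srv, lambda c: c.dst_host != current.dst_host),
    }

history = [
    Conn(0.0, "h1", "http", "SF"),   # older than two seconds, ignored
    Conn(8.5, "h1", "http", "S0"),
    Conn(9.0, "h1", "ftp", "REJ"),
    Conn(9.5, "h2", "http", "SF"),
]
features = time_window_features(history, Conn(10.0, "h1", "http", "SF"))
```

In this toy history two earlier connections reached the same host within the window, so count is 2, and one of them carried a SYN-error flag, giving a serror_rate of 0.5. A SYN flood would drive both values sharply upwards.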

Table 2.5 Traffic Features Computed Using a Hundred-Connection Window

* = computed over same-host connections; ** = computed over same-service connections

Feature name                    Description
dst_host_count*                 number of connections to the same host as the current connection in the past 100 connections
dst_host_serror_rate*           % of connections that have SYN errors
dst_host_rerror_rate*           % of connections that have REJ errors
dst_host_same_srv_rate*         % of connections to the same service
dst_host_diff_srv_rate*         % of connections to different services
dst_host_srv_count**            number of connections to the same service as the current connection in the past 100 connections
dst_host_srv_serror_rate**      % of connections that have SYN errors
dst_host_srv_rerror_rate**      % of connections that have REJ errors
dst_host_srv_diff_host_rate**   % of connections to different hosts
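The connection-based window differs from the time-based one only in what bounds the history: the last 100 connections rather than the last two seconds. A minimal sketch of the two counting features, using a fixed-length queue, is given below. The class and method names are hypothetical, and the exact bookkeeping used to build the KDDCUP99 features may differ.

```python
from collections import deque

class ConnectionWindow:
    """Keeps the most recent `size` connections, regardless of their timing."""

    def __init__(self, size=100):
        # A deque with maxlen discards the oldest entry once it is full.
        self.recent = deque(maxlen=size)

    def observe(self, dst_host, service):
        """Return (dst_host_count, dst_host_srv_count) for a new connection.

        The counts are taken over the connections already in the window;
        the new connection is added to the window afterwards.
        """
        host_count = sum(1 for host, _ in self.recent if host == dst_host)
        srv_count = sum(1 for _, srv in self.recent if srv == service)
        self.recent.append((dst_host, service))
        return host_count, srv_count

# A window of 3 is used here only to keep the example short.
window = ConnectionWindow(size=3)
results = [
    window.observe("a", "http"),
    window.observe("a", "ftp"),
    window.observe("b", "http"),
    window.observe("a", "http"),
    window.observe("c", "smtp"),
    window.observe("a", "ftp"),
]
```

Because the window is measured in connections, a probe that contacts a host once a minute still accumulates in the counts, whereas it would never appear twice inside a two-second time window.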

Figure 2.1 U-matrix for KDDCUP99 Data (Features 1 to 10 are shown)

The U-matrix visualizes the distances between neighbouring map units and thus shows the cluster structure of the map: high values of the U-matrix indicate a cluster border, while uniform areas of low values indicate the clusters themselves. Each component plane shows the values of one variable in each map unit. On top of these visualizations, additional information can be shown: labels, data histograms and trajectories. The U-matrix of the KDDCUP99 data is shown in Figure 2.1.

Continued use of the KDDCUP99 Data in current research reported from Columbia University (Pfahringer 2000; Elkan 2000; Levin 2000; Lee 1994-1999; Chimphlee et al 2006) confirms the uniqueness of this data set in offering a large volume of network audit data (originally from DARPA) with a wide variety of labelled intrusions. For these reasons, it was decided to use the KDDCUP99 Data set for the investigation carried out in this research work.
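The idea behind the U-matrix can be sketched in a few lines. The version below is a simplification made for illustration: it assumes a rectangular map whose units each hold a weight vector, and it assigns to every unit the mean Euclidean distance to its horizontal and vertical neighbours. Full U-matrix implementations usually also insert cells between the units and may use a hexagonal lattice; the function name and the toy map are assumptions.

```python
import math

def u_matrix(weights):
    """Mean distance from each map unit to its 4-connected neighbours.

    weights: list of rows, each row a list of weight vectors (tuples).
    Returns a list of rows of floats with the same grid shape.
    """
    rows, cols = len(weights), len(weights[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            result[r][c] = sum(dists) / len(dists) if dists else 0.0
    return result

# Toy 2 x 4 map: the two left columns form one cluster, the two right another.
toy_map = [
    [(0.0,), (0.0,), (10.0,), (10.0,)],
    [(0.0,), (0.0,), (10.0,), (10.0,)],
]
u = u_matrix(toy_map)
```

In the toy map the units in the outer columns have a value of 0, while the units in the two middle columns, which face the other cluster, have a value of 10/3. That ridge of high values is the cluster border described above.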