Application of the Generic Feature Selection Measure in Detection of Web Attacks

Application of the Generic Feature Selection Measure in Detection of Web Attacks Hai Thanh Nguyen 1, Carmen Torrano-Gimenez 2, Gonzalo Alvarez 2 Slobodan Petrović 1, and Katrin Franke 1 1 Norwegian Information Security Laboratory Gjøvik University College, Norway {hai.nguyen, katrin.franke, slobodan.petrovic}@hig.no 2 Instituto de Física Aplicada, Consejo Superior de Investigaciones Científicas {carmen.torrano,gonzalo}@iec.csic.es Abstract. Feature selection for filtering HTTP-traffic in Web application firewalls (WAFs) is an important task. We focus on the Generic- Feature-Selection (GeFS) measure [4], which was successfully tested on low-level package filters, i.e., the KDD CUP 99 dataset. However, the performance of the GeFS measure in analyzing high-level HTTP-traffic is still unknown. In this paper we study the GeFS measure for WAFs. We conduct experiments on the publicly available ECML/PKDD-2007 dataset. Since this dataset does not target any real Web application, we additionally generate our new CSIC-2010 dataset. We analyze the statistical properties of both two datasets to provide more insides of their nature and quality. Subsequently we determine appropriate instances of the GeFS measure for feature selection. We use different classifiers to test the detection accuracies. The experiments show that we can remove 63% of irrelevant and redundant features from the original dataset, while reducing only 0.12% the detection accuracy of WAFs. Key words: Web attack detection, Web application firewall, intrusion detection systems, feature selection, machine learning algorithms. 1 Introduction Web attacks pose many serious threats to modern Internet. The number of Web attacks is steadily increasing, consequently Web application firewalls (WAFs) [8] need to be more and more effective. One of the approaches for improving the effectiveness of WAFs is to apply the feature selection methods. Achieving reduction of the number of relevant traffic features without negative effect on detection accuracy is a goal that greatly increases the available processing time of WAFs and reduces the required system resources. As there exist many feature selection algorithms (see, for example [2,3]), the question that arises is which ones could be applied in intrusion detection in general and in Web attack detection in particular. The most of the feature selection work in intrusion detection practice is still done manually and the quality of selected features depends

2 Authors Suppressed Due to Excessive Length strongly on expert knowledge. For automatic feature selection, the wrapper and the filter models from machine learning are frequently applied [2,3]. The wrapper model assesses the selected features by learning algorithm s performance. Therefore, the wrapper method requires a lot of time and computational resources to find the best feature subsets. The filter model considers statistical characteristics of a dataset directly without involving any learning algorithms. Due to the computational efficiency, the filter method is usually used to select features from high-dimensional datasets, such as intrusion detection systems. Moreover, this method allows to estimate feature subsets not only by their relevance, but also by the relationships between features that make certain features redundant. A major challenge in the IDS feature selection process is to choose appropriate measures that can precisely determine the relevance and the relationship between features of a given dataset. Since the relevance and the relationship are usually characterized in terms of correlation or mutual information [2,3], we focus on the recently proposed generic feature selection (GeFS) measure for intrusion detection [4]. This measure consists of two instances that belong to the filter model from machine learning: the correlation feature selection (CFS) measure and the minimal-redundancymaximal-relevance (mrmr) measure. In a given dataset, if there are many features that are linearly correlated to each other, then the CFS measure is recommended for selecting features. Otherwise, the mrmr measure is alternatively chosen as it considers non-linear relations through the analysis of mutual information between the features. The GeFS measure was successfully tested on the KDD CUP 1999 benchmarking dataset for IDS [9]. However, this dataset is out of date and it was heavily criticized by the IDS community (see, for example [7]). Moreover, the KDD CUP 1999 dataset does not contain enough HTTP traffic for training and testing WAFs and the Web attacks of this dataset are not representative for currently existing Web attacks. Therefore, the question about the performance of the GeFS measure perform in Web attack detection is still open. In this paper, we propose to use the GeFS measure for selecting features in Web attack detection. We conducted experiments on the ECML/PKDD 2007 dataset, which was generated for the ECML/PKDD 2007 Discovery Challenge [6]. However, the attack requests of this dataset were constructed blindly [6] and did not target any real Web application. Therefore, we additionally generated our new CSIC 2010 dataset, which contains the traffic directed to an e-commerce Web application. From our expert knowledge about Web attacks, we listed 30 features that we considered relevant for the detection process. Then, we extracted the values of these 30 relevant features from the datasets. By applying the GeFS measure, we wanted to know within the particular datasets which features among the 30 extracted features are the most important for the Web attack detection process. In order to do that, we analyzed the statistical properties of the datasets to see whether they had linear correlation or non-linear relations between features. To do that, the data points of the datasets were visualized in the two-dimensional space and the correlation coefficients were computed. We

Generic Feature Selection Measure in Web Attack Detection 3 then chose the CFS measure for feature selection from the CSIC 2010 dataset and the mrmr measure for the ECML/PKDD 2007 dataset. The detection accuracies obtained after the feature selection by means of four different classifiers were tested. The experiments show that by using appropriate instances of the GeFS measure, we could remove 63% of irrelevant and redundant features from the original dataset, while reducing only 0.12% the detection accuracy of WAFs. The paper is organized as follows. Section 2 describes the generic feature selection (GeFS) measure for intrusion detection and its instances in more detail. Section 3 shows our experiments on the CSIC 2010 dataset and the ECML/PKDD 2007 dataset. The last section summarizes the findings. 2 Generic Feature Selection for Intrusion Detection In this subsection, we give an overview of the generic feature selection (GeF S) measure together with two instances applied in intrusion detection: the correlationfeature-selection (CFS) measure and the minimal-redundancy-maximal-relevance (mrmr) measure [4]. Definition: The feature selection problem by means of the generic feature selection (GeFS) measure is to find x {0, 1} n that maximizes the function GeF S(x): max GeF S(x) = a n 0 + i=1 A i(x)x i x {0,1} n b 0 + n i=1 B (1) i(x)x i In this definition, binary values of the variable x i indicate the appearance (x i = 1) or the absence (x i = 0) of the feature f i ; a 0, b 0 are constants; A i (x), B i (x) are linear functions of variables x 1,..., x n. The Correlation Feature Selection (CFS) Measure: This measure characterizes the relevance of features and their relationships in terms of the linear correlation. For a given dataset, if there are many features that are linearly correlated to each other, the CFS measure is recommended for selecting features. In [4], it was shown that the CFS measure is an instance of the GeFS measure. The minimal-redundancy-maximal-relevance (mrmr) Measure: The mrmr measure considers non-linear relations through the analysis of mutual information between the features. Therefore, it was recommended for selecting features from datasets that have many non-linearly correlated features. It was also shown that the mrmr measure belongs to the generic feature selection (GeFS) measure [4]. The feature selection problem (1) can be solved by means of the optimization approach proposed in [4]. The main idea is that the problem (1) is transformed into a mixed 0-1 linear programming problem, which can be solved by using the branch and bound algorithm. The search strategy for obtaining subsets of relevant features by means of the GeFS measure is:

4 Authors Suppressed Due to Excessive Length -Step 1 : Analyze the statistical properties of the given dataset in order to choose the appropriate feature selection instance (CFS or mrmr) from the generic feature selection measure GeFS. We choose the CFS measure if the dataset has many features that are linearly correlated to the class label and to each other. Otherwise, the mrmr measure is chosen. -Step 2 : According to the choice from Step 1, construct the optimization problem (1) for the CFS measure or for the mrmr measure. In this step, we can use expert knowledge by assigning the value 1 to the variable if the feature is relevant and the value 0 otherwise. -Step 3 : Transform the optimization problem of the GeFS measure to a mixed 0-1 linear programming (M01LP) problem, which is to be solved by means of the branch-and-bound algorithm. A non-zero integer value of x i from the optimal solution x indicates the relevance of the feature f i regarding the GeFS measure. 3 Experiment In this section, we show the application of the generic feature selection (GeFS) measure in Web attack detection. We first describe two datasets, on which the experiments were conducted: the ECML/PKDD 2007 dataset [6] and our new CSIC 2010 dataset. We then discuss the 30 features that we consider relevant for Web attack detection. We analyze the statistical properties of these datasets containing the 30 extracted features to choose appropriate instances from the GeFS measure. Since there is no standard Web application firewall (WAF), we apply four different machine learning algorithms to evaluate the detection accuracy on datasets containing the selected features. 3.1 Data Sets We conducted experiments on the ECML/PKDD 2007 dataset, which was generated for the ECML/PKDD 2007 Discovery Challenge [6]. In fact, we used the training set, which is composed of 50,000 samples including 20% of attacks (i.e. 10,000 attacks and 40,000 normal requests). The requests are labeled with specifications of attack classes or normal traffic. The classes of attacks in this dataset are: Cross-Site Scripting, SQL Injection, LDAP Injection, XPATH Injection, Path traversal, Command Execution and SSI attacks. However, the attack requests of this dataset were constructed blindly and did not target any real Web application. Therefore, we additionally generated our new CSIC 2010 dataset for experiments. The CSIC 2010 dataset contains the generated traffic targeted to an e- commerce Web application developed at our department. In this web application, users can buy items using a shopping cart and register by providing some personal information. The dataset was generated automatically and contains 36,000 normal requests and more than 25,000 anomalous requests. In this data set the requests are labeled as normal or anomalous. We included attacks such as SQL injection, buffer overflow, information gathering, files disclosure, CRLF

Generic Feature Selection Measure in Web Attack Detection 5 injection, XSS, server side include, parameter tampering and so on. In order to generate the traffic, we collected thousands of normal and anomalous values for the parameters of the web application. Then, we generated requests for every web-page and the values of the parameters, if any, were filled with the values collected (the normal values for the normal traffic and the anomalous ones for the anomalous traffic). Further details can be found in [5]. Table 1. Names of 30 features that are considered relevant for the detection of Web attacks. refers to features selected by the CFS from the CSIC-2010 dataset; refers to features selected by the mrmr from the CSIC 2010 dataset; refers to features selected by the CFS from the ECML/PKDD 2007 dataset; and refers to features selected by the mrmr from the ECML/PKDD 2007 dataset. Feature Name Feature Name Length of the request Length of the path Length of the arguments Length of the header Accept Length of the header Accept-Encoding Length of the header Accept-Charset Length of the header Accept-Language Length of the header Cookie Length of the header Content-Length Length of the header Content-Type Length of the Host Length of the header Referer Length of the header User-Agent Method identifier Number of arguments Number of letters in the arguments Number of digits in the arguments Number of special char in the arguments Number of other char in the arguments Number of letters char in the path Number of digits in the path Number of special char in the path Number of other char in path Number of cookies Minimum byte value in the request Maximum byte value in the request Number of distinct bytes Entropy Number of keywords in the path Number of keywords in the arguments 3.2 Experimental Settings From our expert knowledge about Web attacks, we listed 30 features that we considered relevant for the detection process (see Table 1). Some features refer to the length of the request, the length of the path or the headers, as length is important for detecting buffer-overflow attacks. From our expert knowledge, we observed that the non-alphanumeric characters were present in many injection attacks. Therefore, we considered four types of characters: letters, digits, nonalphanumeric characters which have an special meaning in a set of programming languages (in Table 1 we refer to them as special char) and other characters. We analyzed their appearances in the path and in the argument s values. We also studied the entropy of the bytes in the request. Additionally, we collected the keywords of several programming languages that were often used in the injection attacks and counted the number of their appearances in different parts of the request as a feature.

6 Authors Suppressed Due to Excessive Length 504 193 252 98 0 0 108.5 217 3 122 1269.5 2417 Fig. 1. Sample distribution of data points of the CSIC 2010 dataset. Fig. 2. Sample distribution of data points of the ECML/PKDD 2007 dataset. Then, we extracted the values of these 30 relevant features from the CSIC 2010 and from the ECML/PKDD 2007 datasets and analyzed their statistical properties to see whether they had linear or non-linear relations between features. From this analysis, the appropriate feature selection instance from the GeFS measure was chosen for each dataset according to the Step 1 of the search method described above. In order to do that, we first visualized the whole datasets in the two-dimensional space to get a plot matrix, in which each element was the distribution of data points depending on the values of a feature and the class label or the values of two features. For instance, Fig. 1 and Fig. 2 show the sample distributions of data points of the CSIC 2010 dataset and the ECML/PKDD 2007 dataset, respectively. We then calculated the correlation coefficients between the features. From these, we observed that the CSIC 2010 dataset had many features that were linearly correlated to each other, whereas in the ECML/PKDD 2007 dataset the non-linear relations between features were more representative. In fact, in the CSIC 2010 dataset, more than 63 % of the correlation coefficients were greater than 0.5, whereas in the ECML/PKDD 2007 dataset more than 83% of the correlation coefficients were less than 0.09. Therefore, we chose the CFS measure for selecting features from the CSIC 2010 dataset, and the mrmr measure for selecting features from the ECML/PKDD 2007 dataset. Moreover, the CFS and the mrmr measures were also applied to the ECML/PKDD 2007 and to the CSIC 2010 datasets, respectively, to see how the wrong choice of feature selection methods would negatively affect the detection performance. We applied the optimization algorithm proposed in [4] to find globally optimal feature subsets by means of the CFS and the mrmr measures. Four classifiers with 10-fold cross validation were used to evaluate detection performances before and after feature selection: C4.5, CART, RandomTree and RandomForest [1]. All the obtained results are given in the Tables 2, 3 and 4. 3.3 Experimental Results Table 2 shows the number of full-set features before feature selection and the number of features selected by the CFS measure and the mrmr measure (Table

Generic Feature Selection Measure in Web Attack Detection 7 1 shows which features were selected). Table 3 and Table 4 summarize the detection rate as well as the false positive rate of four different classifiers performed on the CSIC 2010 dataset and the ECML/PKDD 2007 dataset, respectively. Table 2. Full-set features and the number of selected features. Data Set Full-set CFS mrmr CSIC 2010 30 11 14 ECML/PKDD 2007 30 2 6 Table 3. Experimental results on the CSIC 2010 dataset. Classifiers Detection Rate False Positive Rate Full-set CFS mrmr Full-set CFS mrmr C4.5 94.49 94.06 79.80 5.9 6.8 25.7 CART 94.12 93.71 79.85 6.2 6.8 25.3 RandomTree 92.30 92.70 71.36 8.3 7.8 30.6 RandomForest 93.71 93.68 71.70 7.2 7.2 30.5 Average 93.65 93.53 75.67 6.9 7.1 28 It can be observed from Table 2 and Table 3 that the CFS measure performed well on the CSIC 2010 dataset and gave better results than the mrmr measure. In fact, the CFS measure removed the number of irrelevant and redundant features from the dataset by more than 63%, while reducing very slightly (only 0.12%) the detection accuracy. In this case, the mrmr measure gave much worse results in comparison with the full-set features. From Table 2 and Table 4, it can be seen that the mrmr measure removed 80% of irrelevant and redundant features from the ECML/PKDD 2007 dataset, whereas the detection accuracies were a bit lower than the ones obtained with the full-set feature. The CFS measure did not work well in this case. Therefore, based on all these experiments we can say that the effectiveness of WAFs would be improved by choosing and using appropriate feature selection methods of the GeFS measure. 4 Conclusions We have proposed to use the generic feature selection (GeFS) measure for Web attack detection. We analyzed statistical properties of the new generated CSIC 2010 dataset and the ECML/PKDD 2007 dataset. Based on this analysis, the CFS measure and the mrmr measure were chosen for selecting features from the CSIC 2010 dataset and the ECML/PKDD 2007 dataset, respectively. The

8 Authors Suppressed Due to Excessive Length Table 4. Experimental results on the ECML/PKDD 20007 dataset. Classifiers Detection Rate False Positive Rate Full-set CFS mrmr Full-set CFS mrmr C4.5 96.37 86.45 91.62 3.7 17.6 9.9 CART 96.11 86.45 91.54 4.3 17.6 10 RandomTree 96.89 86.39 93.41 2.6 17.7 6.4 RandomForest 98.80 86.39 95.18 1.2 17.7 5.0 Average 97.04 86.42 92.93 2.95 17.6 7.8 detection accuracies obtained after the feature selection by means of four different classifiers were tested. The experiments show that by choosing appropriate instances of the GeFS measure, we could remove 63% of irrelevant and redundant features from the original dataset, while reducing only 0.12% the detection accuracy of WAFs. Acknowledgements We would like to thank Gjøvik University College, Norwegian Information Security Laboratory, project CUCO (MTM2008-02194) from the Ministerio de Ciencia e Innovación, program JAE/I3P from the Consejo Superior de Investigaciones Científicas. References 1. R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley& Sons, USA, 2001. 2. I. Guyon, S. Gunn, M. Nikravesh and L.A. Zadeh. Feature Extraction: Foundations and Applications. Series Studies in Fuzziness and Soft Computing, 2005, Springer, 3. H. Liu, H. Motoda. Computational Methods of Feature Selection. Chapman & Hall/CRC, 2008. 4. H. Nguyen, K. Franke, and S. Petrovic. Towards a Generic Feature-Selection Measure for Intrusion Detection. In 20th International Conference on Pattern Recognition, Istanbul, Turkey, August 2010, pp. 1529-1532. 5. C. Torrano-Gimenez, A. Perez-Villegas, G. Alvarez. A Self-Learning Anomaly- Based Web Application Firewall. In Proceedings of Computational Intelligence In Security For Information Systems 2009 (CISIS09). pp. 85-92. 6. C. Rassi, J. Brissaud, G. Dray, P. Poncelet, M. Roche and M. Teisseire. Web Analyzing Traffic Challenge: Description and Results. In Proceedings of the Discovery Challenge ECML/PKDD 2007, pp. 47-52 (2007). 7. J. McHugh. Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory. Proc. ACM Transactions on Information and System Security (TISSEC) 3(4), pp. 262-294, 2000. 8. M. Becher. Web Application Firewalls. VDM Verlag Dr. Mueller e.k., February 1, 2007. ISBN-10: 383640446X, ISBN-13: 978-3836404464. 9. W. Lee. (1999). A data mining framework for building intrusion detection models. In IEEE Symposium on Security and Privacy, pages 120-132, Berkeley, California.