UNDERSTANDING AND EVALUATING THE IMPACT OF SAMPLING ON ANOMALY DETECTION TECHNIQUES


Georgios Androulidakis, Vasilis Chatzigiannakis, Symeon Papavassiliou, Mary Grammatikou and Vasilis Maglaris
Network Management & Optimal Design Lab (NETMODE), School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, Athens, Greece

ABSTRACT

In this paper, the emphasis is placed on evaluating the impact of various packet sampling techniques proposed in the PSAMP IETF draft on two widely used anomaly detection approaches. More specifically, we evaluate the behavior of a sequential nonparametric change-point detection method and of an algorithm based on Principal Component Analysis (PCA), with the use of different metrics, under different traffic and measurement sampling methodologies. One of the key objectives of our study is to gain some insight into the feasibility and scalability of the anomaly detection process, by analyzing and understanding the tradeoff of reducing the volume of collected data while still maintaining accuracy and effectiveness in the anomaly detection.

1. INTRODUCTION

With continuously increasing network traffic and high-speed Internet links, the problem of traffic measurement and analysis becomes more complicated, and as a result sampling becomes an essential component of scalable Internet monitoring. Sampling is the process of making partial observations of a system of interest and drawing conclusions about the full behavior of the system from these limited observations. The observation problem is concerned with minimizing information loss while reducing the volume of collected data, in order to make this process feasible and scalable. The way in which the partial information is transformed into knowledge of the system is of high research and practical importance for developing efficient and effective anomaly detection techniques.

Network anomaly detection techniques [1][2][3] rely on the analysis of network traffic and the characterization of the dynamic statistical properties of traffic normality, in order to accurately and timely detect network anomalies. Anomaly detection is based on the concept that perturbations of normal behavior suggest the presence of anomalies, faults, attacks, etc. Obviously, the problem of anomaly detection becomes more complex when the network traffic data is sampled.

In this paper we study the impact of basic packet sampling techniques, proposed by the PSAMP IETF working group [4], on two anomaly detection approaches that represent a wide set of commonly used anomaly detection strategies, covering cases that range from single-link/single-metric to multi-link/multi-metric anomaly detection. Among the key objectives of our study is to investigate the possibility and corresponding degree of reducing the volume of collected data while still maintaining the accuracy and effectiveness of the anomaly detection. Such results and observations can facilitate the design of feasible and scalable anomaly detection techniques. It should be noted that our study is based on realistic data collected from a real operational university campus network.

The remainder of this paper is organized as follows. In section 2 we present some related work with respect to packet sampling, as well as the objectives and contributions of our work.
In section 3, the two anomaly detection algorithms under consideration in this paper are described, while in section 4 the different packet sampling techniques are presented. In section 5 we study and evaluate the impact of the various packet sampling strategies on the overall performance of the anomaly detection process using real network traces, and finally section 6 concludes the paper.

2. RELATED WORK AND PAPER OBJECTIVES

The application of packet sampling to network traffic measurements was first studied in [5], using real network traces from the NSFNET backbone. The results revealed that the time-driven techniques did not perform as well as the count-driven ones, while the performance differences within each class (count- or time-driven) were small. Furthermore, the problem of estimating flow distributions using packet sampling has been studied in [6] and [7].

(Corresponding author: Prof. Symeon Papavassiliou, papavass@mail.ntua.gr)

Researchers have also proposed schemes that follow an adaptive packet sampling approach in order to achieve more accurate measurements of network traffic. Choi et al. [8] proposed an adaptive packet sampling technique for flow-level traffic measurement with a stratification approach, which provides unbiased estimation of flow size (in terms of packet and byte counts) for large flows. In [9], Adaptive NetFlow was proposed, which addresses several shortcomings of NetFlow [10] by dynamically adapting the sampling rate to the traffic mix, in order to achieve robustness without sacrificing accuracy.

However, the specific problem of studying and understanding the impact of sampling on the anomaly detection process is quite different from, and far more complicated than, the corresponding problems regarding other network management processes. This is mainly due to the fact that anomaly detection may operate under abnormal conditions/attacks, while by its nature it simultaneously involves several factors, such as normal traffic, abnormal traffic and various detection metrics, whose statistical characteristics and behavior may be affected in quite diverse ways by the sampling process. Therefore, as we will see later in section 5, observations and common practices that have been used so far regarding the deployment and implementation of packet sampling techniques in network devices are not necessarily appropriate choices for the efficient and effective operation of anomaly detection strategies, since they may significantly affect their detection capability. Hence, understanding and providing a qualitative and quantitative evaluation of the impact of various sampling techniques on the anomaly detection process is of high research and practical importance, and is the main focus of our work.

3. ANOMALY DETECTION TECHNIQUES

In this section we present two different anomaly detection techniques that represent a wide class of commonly used anomaly detection strategies. Firstly, we present a sequential nonparametric change-point detection method, and then an algorithm based on Principal Component Analysis (PCA) using multiple metrics. The first method utilizes one of the most popular algorithms used in the single-metric/single-link anomaly detection approach, while the second one is mainly used for multi-metric/multi-link anomaly detection. Both techniques are independent of the network topology and traffic characteristics, and can be applied to monitor any type of network.

3.1 Change-Point Detection Method

The objective of Change-Point Detection (CPD) is to determine whether the observed time series is statistically homogeneous and, if not, to find the point in time when the change happens [11][12]. The attack detection algorithm described below belongs to the sequential category of change-point detection, in which tests are done online, with the data presented sequentially and the decisions made on the fly. Since non-parametric methods are not model-specific, they are more suitable for analyzing systems like the Internet, which are dynamic and complex. The nonparametric CUSUM (Cumulative Sum) method is applied for the detection of attacks. The main idea of the non-parametric CUSUM algorithm is that the mean value of a random sequence {X_n} is negative during normal operation and becomes positive when a change occurs. Thus, we can consider {X_n} as a stationary random process which, under normal conditions, has mean E(X_n) = c.
A parameter α is chosen as an upper bound of c, i.e., α > c, and another random process {Z_n} is defined as Z_n = X_n − α, which has a negative mean during normal operation. The purpose of introducing α is to offset the possible positive mean in {X_n} caused by small network anomalies, so that the test statistic y_n, described below, is reset to zero frequently and does not accumulate with time. When an attack takes place, Z_n suddenly increases and becomes a large positive number. Suppose that, during an attack, the increase in the mean of Z_n can be lower-bounded by h. The change detection is based on the observation that h >> c. More specifically, let

    y_n = (y_{n-1} + Z_n)^+ ,   y_0 = 0,

where x^+ equals x if x > 0 and 0 otherwise. The decision function can be described as follows:

    d_N(y_n) = 0, if y_n <= N;   d_N(y_n) = 1, if y_n > N,

where d_N(y_n) is the decision at time n: 0 stands for normal operation and 1 for an attack (a change has occurred), while N represents the attack threshold. This anomaly detection technique has been used with different types of metrics, like the SYN/FIN packet ratio [11] or the percentage of new source IP addresses in a time bin [12], in order to detect Denial of Service attacks.
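To make the recursion concrete, the following minimal Python sketch implements the nonparametric CUSUM test described above. The metric values, the offset alpha and the threshold are illustrative assumptions for a toy example, not values from the paper.

    # Minimal sketch of the nonparametric CUSUM test described above.
    # alpha and threshold are illustrative assumptions.

    def cusum_detect(xs, alpha, threshold):
        """Return a list of (y_n, decision) pairs for the metric sequence xs."""
        y = 0.0
        out = []
        for x in xs:
            z = x - alpha              # Z_n = X_n - alpha, negative mean when normal
            y = max(0.0, y + z)        # y_n = (y_{n-1} + Z_n)^+
            out.append((y, 1 if y > threshold else 0))  # d_N(y_n)
        return out

    # Toy usage: a metric that jumps during an "attack" interval.
    normal = [0.1, 0.05, 0.2, 0.0, 0.15]
    attack = [2.0, 2.5, 1.8]
    for y, d in cusum_detect(normal + attack + normal, alpha=0.5, threshold=2.0):
        print(f"y_n = {y:.2f}  decision = {d}")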

3.2 PCA-based Anomaly Detection

The objective of the multi-metric/multi-link PCA-based method [13] is to provide a methodology for fusing and combining data from heterogeneous monitors spread throughout the network, in order to provide a generalized framework capable of detecting a wide range of classes of anomalies, such as those that may result in alterations of the traffic composition or the traffic paths. This is achieved by applying a PCA-based approach simultaneously on several metrics of one or more links. In general, for every network link there is a number of metrics that describe the traffic that passes through it. In order to better model and represent this, a set of virtual links is created for every real link, with each virtual link corresponding to a different metric.

Principal Component Analysis aims at reducing the dimensionality of a data set in which there is a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. The extracted non-correlated components are called Principal Components (PCs) and are estimated from the eigenvectors of the covariance matrix or the correlation matrix of the original variables.

The overall procedure of this method may be divided into two different parts, as displayed in Figure 1: the offline analysis, which creates a model of the normal traffic, and the real-time analysis, which detects anomalies by comparing the current (actual) traffic patterns with the modeled ones. The input of the offline analysis is a data set that contains only normal traffic. During the offline analysis, PCA is applied on this data set and then the first few most important derived Principal Components (PCs) are selected. Their number depends on the network and on the number of virtual links, and it represents the number of PCs required for capturing the percentage of variance that the system needs in order to model normal traffic. The output of the offline analysis is the PCs to be used in the Subspace Method.

[Figure 1. High-Level Methodology Representation]

The goal of the Subspace Method is to divide the current traffic data into two different spaces: one containing traffic that is considered normal (y_norm) and resembles the modeled traffic patterns, and one containing the residual (y_res). In general, anomalies tend to result in great variations in the residual, since they present characteristics different from the modeled traffic. During the real-time analysis, the current traffic vector is projected onto the two different subspaces with the use of the PCs estimated in the offline analysis (Subspace Method). When an anomaly occurs, the residual vector presents great variation in some of its variables, and the system detects the network path containing the anomaly by selecting these variables. In general, the occurrence of an anomaly tends to result in a large change to y_res: a change in variable correlation increases the projection of y onto the residual subspace. Within such a framework, a typical statistic for detecting abnormal conditions is the squared prediction error (SPE) [14]:

    SPE = ||y_res||^2

When an anomaly occurs, the SPE exceeds the normal thresholds and the system detects the network path(s) containing the anomaly by selecting the variables that contribute the most to the large change of the SPE. This may be realized by selecting the virtual links in the residual vector whose variation is significantly larger than the corresponding one under normal conditions.
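As a rough illustration of the offline/online split described above, the following sketch builds the principal subspace from normal traffic and computes the SPE of new measurement vectors. It assumes NumPy; the number of retained components k, the synthetic data and the usage lines are arbitrary choices for illustration, not the paper's implementation.

    import numpy as np

    def fit_normal_subspace(X_normal, k):
        """Offline analysis: PCA on normal traffic (rows = time bins,
        columns = virtual links); keep the k leading PCs."""
        mu = X_normal.mean(axis=0)
        cov = np.cov(X_normal - mu, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        P = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k eigenvectors
        return mu, P

    def spe(y, mu, P):
        """Real-time analysis: project y onto the residual subspace and
        return the squared prediction error SPE = ||y_res||^2."""
        yc = y - mu
        y_norm = P @ (P.T @ yc)      # projection onto the principal subspace
        y_res = yc - y_norm          # residual component
        return float(y_res @ y_res)

    # Toy usage: 200 normal bins of 7 metrics, then one shifted (anomalous) vector.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 7))
    mu, P = fit_normal_subspace(X, k=3)
    print(spe(rng.normal(size=7), mu, P))          # small: normal-like
    print(spe(rng.normal(size=7) + 10.0, mu, P))   # large: anomalous
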
4. SAMPLING TECHNIQUES

The deployment of sampling techniques aims at providing information about a specific characteristic of the parent population at a lower cost than a full census would demand. Several sampling techniques are used in order to reduce the amount of data being processed. Sampling techniques can be divided into two major categories: count-based and time-based. Figure 2 illustrates three basic count-based sampling techniques that are defined in the PSAMP IETF draft [4] and are briefly presented in the following.

[Figure 2. Schematic of three basic sampling techniques]

Systematic Sampling

Systematic Sampling describes the process of selecting the start points and the durations of the selection intervals according to a deterministic function. Here only equally spaced schemes are considered, where the triggers for sampling are periodic; all packets occurring in a selection interval, in packet count, beyond the trigger are selected. In our experiments we assume the periodic selection of every k-th packet.

Random n-out-of-N Sampling

In Random n-out-of-N Sampling, the parent population is divided into bins of N packets each, and n packets are randomly selected out of each bin. One example would be to generate n different random numbers in the range [1, N] and select all packets whose packet position equals one of these random numbers. In our study we assume that n = 1. This is also known as stratified random sampling.

Uniform Probabilistic Sampling

In probabilistic sampling, the decision whether a packet is selected or not is made according to a pre-defined selection probability. In Uniform Probabilistic Sampling, each packet is selected independently with the same probability p. This sampling is sometimes referred to as geometric random sampling, since the differences in count between successive selected packets are independent random variables with a geometric distribution of mean 1/p.

The PSAMP IETF draft also defines time-driven sampling methods. More specifically, it defines systematic time-based sampling, in which packets are selected within specific start and stop intervals, as well as the time-driven analog of uniform probabilistic sampling, which has the times between the start and stop sampling points exponentially distributed. In this paper we focus only on count-based sampling methods; time-driven sampling techniques are not examined since, as explained in [5], they tend to miss bursty periods with many packets of relatively small interarrival times, which makes them inappropriate choices for use in the field of network anomaly detection.
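The three count-based schemes can be sketched in Python as follows. This is an illustrative implementation over packet indices; the function names, bin sizes and probabilities are our own assumptions, not the paper's measurement code.

    import random

    def systematic(packets, k):
        """Select every k-th packet (periodic, count-based trigger)."""
        return [p for i, p in enumerate(packets) if i % k == 0]

    def n_out_of_N(packets, n, N):
        """Split the stream into bins of N packets and pick n at random
        from each bin (stratified random sampling)."""
        out = []
        for start in range(0, len(packets), N):
            bin_ = packets[start:start + N]
            out.extend(random.sample(bin_, min(n, len(bin_))))
        return out

    def uniform_probabilistic(packets, p):
        """Select each packet independently with probability p."""
        return [pkt for pkt in packets if random.random() < p]

    # Toy usage on a stream of 1000 packet IDs at a nominal 1/100 rate.
    stream = list(range(1000))
    print(len(systematic(stream, 100)))              # exactly 10
    print(len(n_out_of_N(stream, 1, 100)))           # exactly 10
    print(len(uniform_probabilistic(stream, 0.01)))  # about 10 on average
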
5. PERFORMANCE EVALUATION

In this section we evaluate the impact of the above packet sampling methods on the anomaly detection techniques described in section 3. The results and the corresponding observations presented in this section are based on real network data collected from an operational campus network. More specifically, we monitored the link between the National Technical University of Athens (NTUA) and the Greek Research and Technology Network (GRNET), which connects the university campus with the Internet. This link has an average traffic of 7 Mbit/sec and packets/sec, carrying a rich network traffic mix with standard network services like web, mail and ftp, as well as p2p application traffic. In the following evaluation, a distributed Denial of Service attack (a TCP SYN attack) against a single host inside the NTUA campus is studied in detail. Table 1 presents the various attack ratios used in this study (expressed in packets and bytes) relative to the normal traffic.

Table 1. Attack Ratios
Attack ratio in packets/sec  |  Attack ratio in bytes/sec
packets/sec (%)              |  .7 Mbit/sec (.5%)
packets/sec (%)              |  kbit/sec (.5%)
packets/sec (5%)             |  kbit/sec (.%)
packets/sec (%)              |  7 kbit/sec (.5%)
packets/sec (%)              |  . kbit/sec (.%)

As mentioned before, our study aims at evaluating the impact of packet sampling on the two network anomaly detection techniques described in section 3. To better demonstrate the results and the corresponding observations, we study two scenarios for the Change Point Detection (CPD) method and one scenario for the PCA-based method. In all scenarios we apply all the types of packet sampling described in the previous section at different sampling rates. In our experiments, we used sampling rates of /, /5 and / for the systematic and the random n-out-of-N sampling, and probabilities equal to ., . and . respectively for the uniform probabilistic sampling.

5.1 Change Point Detection (CPD) Method

5.1.1 CPD with SYN/FIN

This anomaly detection approach [11] uses as a metric (the variable X_n) the difference between the numbers of SYN and FIN packets, divided by the number of FIN packets. The SYN packets are those that have the TCP SYN flag set, while the number of FIN packets used here is actually the count of packets that have the FIN or RST flag set. Under normal conditions the variable X_n has a positive mean value near zero, while the variable Z_n has a negative one. When a SYN attack occurs, X_n becomes a large positive number, which causes Z_n to become positive as well. This leads to an increase of y_n, which indicates an attack if its value exceeds a certain threshold. The average normal SYN traffic is packets/sec.
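For illustration, the short sketch below computes this metric from hypothetical per-bin SYN and FIN/RST counts; its output could be fed to the cusum_detect routine sketched in section 3.1. The bin values are invented, not measured data.

    # Illustrative computation of the SYN/FIN CPD metric: for each time bin,
    # X_n = (#SYN - #FIN) / #FIN, where #FIN also counts RST packets.

    def syn_fin_metric(bins):
        xs = []
        for syn, fin_rst in bins:
            xs.append((syn - fin_rst) / fin_rst if fin_rst else 0.0)
        return xs

    bins = [(105, 100), (98, 100), (110, 100),   # normal: X_n near zero
            (900, 100), (950, 100)]              # SYN flood: X_n large
    print(syn_fin_metric(bins))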

As the first three attack ratios are very large relative to the SYN/FIN ratio, we present here only the last two attack ratios (the % and % packet ratios). The SYN attack occurs during the interval -sec. The results depicted in Figure 3 correspond to the % attack ratio ( SYN packets/sec) at the sampling rate of /. As we can observe, all three types of sampling present similar behavior, and their curves resemble the corresponding curve of the unsampled case. Similar results are obtained for the sampling rate of /5.

[Figure 3. Attack ratio % - Sampling rate /]

Figure 4 presents the corresponding results for the % attack ratio ( SYN packets/sec) at the sampling rate of /. Here we observe that systematic sampling gives some false positives, which however do not exceed the attack threshold.

[Figure 4. Attack ratio % - Sampling rate /]

When we reduce the attack ratio to % ( SYN packets/sec) at the sampling rate of /, we do not observe any false positives, as shown in Figure 5. However, if we change the sampling rate to /5, systematic sampling deviates from the unsampled case (Figure 6), while the other two types of sampling have almost the same behavior as the unsampled one.

[Figure 5. Attack ratio % - Sampling rate /]

[Figure 6. Attack ratio % - Sampling rate /5]

This is due to the fact that this technique is based on the detection of packets that have the SYN or FIN (or RST) flag set. These packets are not evenly distributed across the network packet sequence, which results in inefficient sampling when systematic sampling is used. On the contrary, random n-out-of-N and uniform probabilistic sampling, which make a more random selection of packets, appear to be more effective. Thus, as demonstrated here, systematic sampling is the worst choice of packet sampling for the detection of attacks that depend on certain packet characteristics (like the TCP flags). It should be noted, however, that most of the routers deployed on the Internet implement systematic packet sampling, due to its simplicity and low overhead [15].

5.1.2 CPD with source IP addresses

The second scenario uses the same method as the previous one, but with a different metric. In this case, X_n is the ratio of new source IP addresses over the frequent source IP addresses observed in a specific time bin [12]. With the term frequent we denote the top-m (in packets) source IP addresses in a time bin, and with the term new we denote the set of source IP addresses that do not belong to the IP source database containing the frequent IP addresses during normal network operation.

Figure 7 presents the corresponding results for an attack ratio of % ( packets/sec) at a sampling rate of /. As depicted in this figure, sampling does not significantly affect the detection effectiveness. Similar results are obtained for the sampling rate of /5. Figure 8 presents the % attack ratio at the sampling rate of /. It is obvious that all three types of sampling exhibit the same behavior, as they result in a significant increase of false positives.

[Figure 7. Attack ratio % - Sampling rate /]

[Figure 8. Attack ratio % - Sampling rate /]

This fact can be explained by the distortion of the packet distribution of source IP addresses when packet sampling is applied. More specifically, based on the methodology presented in [16] with respect to the statistical determination of the appropriate sample size, in the following we calculate the appropriate sample size (the number of packets that need to be selected from a time bin) for a given confidence level on the packet distribution of source IP addresses. We specify an accuracy of r = 5% and a confidence level of 95%, which implies a z-value of 1.96, in the following expression for the appropriate sample size n:

    n = (z σ / (r μ))^2    (1)

where μ is the mean and σ is the standard deviation of the distribution. In our experiment, the packet distribution of source IP addresses had mean μ = .5 and standard deviation σ = 7.. Therefore, according to expression (1), the appropriate size of the sample in a time bin should be 7 packets. As the number of packets in a time bin is about 95, the sampling rate should be about /7. Considering the use of a sampling rate equal to /, one can realize the significant distortion of the packet distribution. Consequently, the reduction of the sampling rate results in an increase of the percentage of new source IP addresses over the frequent source IP addresses, even during normal operation, as depicted in Figure 8.
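As a worked illustration of expression (1), the snippet below evaluates n for assumed values of μ and σ; the measured values are not reproduced here, so the inputs are placeholders.

    import math

    def sample_size(z, sigma, r, mu):
        """Cochran-style sample size: n = (z * sigma / (r * mu))^2."""
        return math.ceil((z * sigma / (r * mu)) ** 2)

    # 95% confidence -> z = 1.96; accuracy r = 5%; placeholder mu and sigma.
    print(sample_size(z=1.96, sigma=7.0, r=0.05, mu=0.5))  # 301182 with these placeholders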

5.2 PCA-based method using multiple metrics

As mentioned before, the PCA-based anomaly detection technique can simultaneously take into account multiple metrics coming from a single link or even multiple links. In our experiment we implemented a single-link/multi-metric PCA-based approach and used seven metrics: the numbers of flows, packets, TCP flows, TCP packets, short flows (flows with a small number of packets), SYN packets, and FIN packets. With respect to this technique, our experiments considered the %, % and 5% attack ratios. The results demonstrate almost the same behavior under all the above attack ratios; therefore, due to space limitations, we present only the corresponding results for the 5% attack ratio.

Figure 9 presents the 5% attack ratio case ( packets/sec) at the sampling rate of / for the PCA-based method. As we can observe, all three types of sampling manage to detect the attack, with only a small variation in their values.

[Figure 9. Attack ratio 5% - Sampling rate / (y-axis: norm of y_res)]

[Figure 10. Attack ratio 5% - Sampling rate /5 (y-axis: norm of y_res)]

Figure 10 displays the case of the 5% attack ratio with sampling rate /5, in which all three types of sampling introduce a large number of false positives. The same behavior appears for the sampling rate of /. The residual vector y_res contains the correlation coefficients that were not present in the training data (normal operation). During normal network operation y_res tends to zero; an increase in y_res corresponds to a large variance of the correlation coefficients of the metrics, which is usually attributed to an ongoing attack. As depicted in Figure 10, the PCA-based anomaly detection method captures the corresponding attack, which occurred during the interval -sec, for all the sampling methods. The large number of false positives observed in Figure 10, before the start and after the end of the attack, is due to the significant differences of the correlation coefficients of the metrics at the sampling rate /5 (when compared to the unsampled case). Indicatively, Table 2 presents the standard deviation of two metrics (the total number of packets and the total number of flows) of the residual vector y_res for different sampling rates during normal network operation. As we can observe from Table 2, the difference in standard deviation between the sampling rates / and /5 is significantly large: both the packet-based and the flow-based metric at the sampling rate /5 deviate significantly from the unsampled case.

Table 2. Standard Deviation of y_res in normal traffic
Metric   | No Sampling | Rate / | Rate /5
Packets  | .           | .      | .
Flows    | .5          | .      | .

The experimental results of this PCA-based method, which uses both packet-based and flow-based metrics, show that the efficiency of the method is independent of the sampling method used and relies only on the sampling rate.

6. CONCLUSIONS

Our results revealed that systematic sampling does not perform well at low sampling rates when the detection of the attack depends on certain packet characteristics (e.g. TCP flags). It is worth mentioning that systematic sampling is still the most commonly used sampling procedure in network devices. Although such a sampling methodology has low complexity and may provide useful results for simple management functions, based on our results and observations it becomes apparent that it is inadequate for the support of anomaly detection, and more enhanced sampling techniques are required. Furthermore, when flow-based metrics, like source IP addresses or the number of flows, are used, the performance of the anomaly detection algorithm relies mainly on the sampling rate applied during the detection. With respect to the PCA-based method, which uses both packet-based and flow-based metrics, we observed that its effectiveness relies only on the sampling rate and not on the sampling methodology.

REFERENCES

[1] P. Barford, J. Kline, D. Plonka, and A. Ron, "A Signal Analysis of Network Traffic Anomalies," in Proc. of the 2nd ACM SIGCOMM Workshop on Internet Measurement (IMW), 2002.
[2] N. Ye, S. Emran, Q. Chen, and S. Vilbert, "Multivariate Statistical Analysis of Audit Trails for Host-Based Intrusion Detection," IEEE Transactions on Computers, Vol. 51, No. 7, July 2002.
[3] W. Lee and D. Xiang, "Information-Theoretic Measures for Anomaly Detection," in Proc. of the IEEE Symposium on Security and Privacy (S&P), 2001.
[4] Packet Sampling (PSAMP) IETF Working Group Charter, http://www.ietf.org/html.charters/psamp-charter.html
[5] K.C. Claffy, G.C. Polyzos, and H.-W. Braun, "Application of Sampling Methodologies to Network Traffic Characterization," in Proc. of ACM SIGCOMM, San Francisco, CA, September 1993.
[6] N. Duffield, C. Lund, and M. Thorup, "Estimating Flow Distributions From Sampled Flow Statistics," IEEE/ACM Transactions on Networking, Vol. 13, No. 5, October 2005.
[7] N. Hohn and D. Veitch, "Inverting Sampled Traffic," in Proc. of the 3rd ACM SIGCOMM Conference on Internet Measurement (IMC), Miami, FL, USA, 2003.
[8] B.Y. Choi, J. Park, and Z.L. Zhang, "Adaptive Packet Sampling for Accurate and Scalable Flow Measurement," in Proc. of the IEEE Global Telecommunications Conference (GLOBECOM).
[9] C. Estan, K. Keys, D. Moore, and G. Varghese, "Building a Better NetFlow," in Proc. of ACM SIGCOMM, Portland, Oregon, USA, August 2004.
[10] Cisco NetFlow, http://www.cisco.com/warp/public/7/netflow/index.html
[11] H. Wang, D. Zhang, and K.G. Shin, "Change-Point Monitoring for the Detection of DoS Attacks," IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 4, 2004.
[12] T. Peng, C. Leckie, and K. Ramamohanarao, "Proactively Detecting Distributed Denial of Service Attacks Using Source IP Address Monitoring," in Proc. of the Third International IFIP-TC6 Networking Conference, Athens, Greece, May 2004.
[13] V. Chatzigiannakis, S. Papavassiliou, G. Androulidakis, and B. Maglaris, "On the Realization of a Generalized Data Fusion and Network Anomaly Detection Framework," in Proc. of the Fifth International Symposium on Communication Systems, Networks and Digital Signal Processing (CSNDSP), Patras, Greece, July 2006.
[14] J.E. Jackson and G.S. Mudholkar, "Control Procedures for Residuals Associated with Principal Component Analysis," Technometrics, 1979.
[15] B.-Y. Choi and S. Bhattacharyya, "On the Accuracy and Overhead of Cisco Sampled NetFlow," in Proc. of the ACM SIGMETRICS Workshop on Large Scale Network Inference (LSNI), Banff, Canada, June 2005.
[16] W. Cochran, Sampling Techniques, John Wiley & Sons, 1977.