An Analysis of UDP Traffic Classification


Jing Cai (1,2,3), Zhibin Zhang (1,3), Xinbo Song (1,3)
(1) Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
(2) Graduate University of Chinese Academy of Sciences, Beijing, China
(3) National Engineering Laboratory for Information Security, Beijing, China
caijing@software.ict.ac.cn

Abstract

Accurate and timely classification of network applications is fundamental to numerous network activities. Traditional methods based on well-known ports and packet payload analysis can no longer identify IP traffic accurately, so classification based on machine learning techniques has received increasing attention. A large body of work exists in this field. However, earlier work generally assumed that TCP traffic dominates and that UDP traffic is negligible, and therefore ignored the classification of UDP traffic. With the increase of network bandwidth, more and more new applications with real-time requirements use UDP as their transport layer protocol, which directly increases UDP traffic. In view of this, we discuss the classification of UDP traffic. First, we divide the whole UDP traffic into five categories according to their specific characteristics. Second, we use four machine learning techniques {Naive Bayes, SVMs, C4.5, K-Means} to classify the UDP traffic of these five categories. Through comparison and analysis, we find that the supervised techniques achieve higher accuracy than the unsupervised clustering technique in most cases. Among the four techniques, Naive Bayes always performs worst and C4.5 always performs best. Simple K-Means always lies between Naive Bayes and the other supervised learning techniques, and it outperforms the Naive Bayes classifier by more than 17 percentage points.

I. INTRODUCTION

Accurate and timely classification of network applications is fundamental to numerous network activities, from security monitoring to accounting, and from Quality of Service to providing operators with useful forecasts for long-term provisioning. Because of this importance, different techniques have been used to classify IP traffic.

In the past, the commonly used technique was based on the well-known TCP or UDP port numbers (visible in the TCP or UDP headers). However, as growing numbers of network applications are port-agile (allocating dynamic ports as needed), as end users deliberately use non-standard ports to hide their traffic, and as network address port translation becomes widespread in peer-to-peer file sharing systems, this technique is becoming increasingly less effective. Moore et al. [1] found that this traditional technique for traffic/flow classification is no more than 50-70% accurate. Another well-researched approach, based on packet payload analysis, cannot deal with proprietary protocols or encrypted traffic, and it may pose privacy and security concerns.

Therefore, a promising approach that has recently received more and more attention is traffic classification using machine learning (ML) techniques. This approach is based on statistical features that are independent of the application protocol (payload), such as packet length and inter-arrival times. Each traffic flow is characterized by the same set of features but with different feature values. An ML classifier is built by training on a representative set of flow instances whose network applications are known. The resulting classifier can then be used to determine the class of unknown flows.

However, earlier work generally assumed that TCP traffic makes up the main body of network traffic and that UDP traffic is negligible, and therefore ignored the classification of UDP traffic. This situation has since changed considerably.
With the increase of network bandwidth, traditional networking services based on images and text can no longer satisfy people's needs. Audio, video, and online games have gradually become the main body of network traffic. These applications mostly use UDP as their transport layer protocol [2], which directly increases UDP traffic. CAIDA [3] analyzed traces collected on several backbone links located in the US and Sweden and found that the ratio of UDP to TCP in packets, bytes, and flows has increased greatly.

Compared with TCP, UDP differs in at least two important ways. First, TCP is a connection-oriented protocol with control flags such as FIN and RST that explicitly mark the end of a flow, whereas UDP is connectionless, so the main way to terminate a UDP flow is a timeout strategy. Second, the composition of UDP traffic is more complicated: the characteristics of different applications often differ significantly. Because of these two differences, the study of UDP traffic classification is still at a very early stage. In view of this, we discuss the classification of UDP flows in this paper. To the best of our knowledge, we are the first to do so.

There are two main contributions in our paper. First, we discuss the classification of UDP traffic and divide the whole UDP traffic into five categories according to their specific characteristics: {SERVICE, IM, DOWNLOAD, STREAMING, OTHER}. Second, we apply unsupervised and supervised machine learning techniques to UDP traffic identification. Our unsupervised approach uses Simple K-Means, and the supervised approaches use Naive Bayes, C4.5, and SVMs. Through qualitative and quantitative analysis, our results show that the supervised techniques achieve higher accuracy than the unsupervised clustering technique in most cases. Among the four techniques, Naive Bayes always performs worst and C4.5 always performs best, while Simple K-Means always lies between Naive Bayes and the other supervised learning techniques.

The remainder of this paper is organized as follows. Section II presents related work. Section III introduces the data traces used in our work and presents the flow definition, the feature selection, and the evaluation criteria. Section IV gives the results and our analysis. Finally, Section V concludes the paper.

II. RELATED WORK

Because traffic classification is fundamental and underpins many other techniques, the field has attracted continuous interest, and much recent work exists. This section surveys the different techniques presented in the literature.

A. Port Number Analysis

Historically, traffic classification techniques used well-known port numbers to identify Internet traffic. This was successful because many traditional applications use fixed port numbers assigned by IANA; for example, DNS commonly uses port 53. Karagiannis et al. [4] showed that this technique is ineffective for some applications, such as the current generation of P2P applications, which intentionally try to disguise their traffic by using dynamic port numbers or by masquerading as well-known applications. In addition, only applications whose port numbers are known in advance can be identified.

B. Payload-based Analysis

Another well-researched approach is the analysis of packet payloads [5].
In this approach, the packet payloads are analyzed to see whether they contain characteristic signatures of known applications. Although payload-based inspection avoids reliance on fixed port numbers, it imposes significant complexity and processing load on the traffic identification device. The approach can be difficult or impossible to apply to proprietary protocols or encrypted traffic. Furthermore, direct analysis of session and application layer content may pose privacy and security concerns. Finally, these techniques identify only traffic for which signatures are available and cannot classify previously unknown traffic.

C. Machine Learning Approaches

Newer approaches rely on the statistical characteristics of traffic to identify the application. The assumption underlying such methods is that traffic at the network layer has statistical properties that are unique to certain classes of applications and allow different source applications to be distinguished from each other. The need to deal with traffic patterns, large data sets, and multi-dimensional spaces of flow and packet attributes is one of the reasons for introducing ML (Machine Learning) techniques in this field.

Machine learning techniques generally consist of two parts: model building and classification. A model is first built using training data. This model is then input to a classifier, which classifies a data set. Machine learning techniques can be divided into unsupervised and supervised categories.

Bernaille et al. [6] used a Simple K-Means clustering algorithm to perform classification using only the first five packets of a flow. McGregor et al. [7] used the Expectation Maximization (EM) algorithm to classify IP traffic; the approach clusters traffic with similar observable properties into different application types.
Zander et al. [8] extended this work by using an EM algorithm called AutoClass and by finding the optimal set of attributes for building the classification model.

Some supervised machine learning techniques also use connection-level statistics to classify traffic. Roughan et al. [9] used Nearest Neighbour (NN), Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA) to map different applications to different QoS classes. Moore et al. [10][11] used Naive Bayes to build a classifier and showed that the Naive Bayes approach also classifies traffic with high accuracy. Auld et al. [12] applied a Bayesian neural network to classify Internet traffic. Este et al. [13] and Park et al. [14] used SVMs (Support Vector Machines) and a genetic algorithm, respectively, to classify Internet traffic. In addition, Crotti et al. [15] proposed a flow classification mechanism based on protocol fingerprints that contain packet length, inter-arrival time, and packet arrival order. These works have demonstrated that supervised ML algorithms are also able to separate traffic into classes with encouraging accuracy.

III. EXPERIMENTAL SETUP

A. Data Traces and Traffic Class

We collected the experiment traces from a backbone router in China. The basic information of these traces is given in Table I. We do not use the CAIDA data set because its payload information has been encrypted, and in this paper we must use the payload information to label the different applications.

Compared with TCP, the composition of UDP traffic is more complicated and contains many different kinds of data. UDP carries streaming media protocols such as ppstream and pplive, which seem well suited to flow profiling, but it also carries protocols such as DNS, which involve only a question and an answer and seem poorly suited to flow profiling. Because of these great differences among applications, it is difficult to treat UDP traffic as a whole.
Therefore, we divided the whole UDP traffic into five categories according to their specific characteristics, named {SERVICE, IM, DOWNLOAD, STREAMING, OTHER}. Table II shows example applications for each category.

TABLE I
THE BASIC INFORMATION OF THE TRACE

Id   Begin time      End time      Bytes   Packets
I    2009,5.5,14:    ,5.6,00:30    275G    2805 (million)

TABLE II
UDP TRAFFIC ALLOCATED TO EACH CATEGORY

Category    Example applications
SERVICE     {dns, ntp, messengerservice}
IM          {qq, msn}
DOWNLOAD    {bittorrent, edonkey, xunlei, gnutella, kazaa}
STREAMING   {pplive, ppstream, sopcast, qqlive}
OTHER       {unknown}

As noted earlier, traditional traffic identification based on well-known ports leads to inaccurate results. Fortunately, open source software such as L7-filter [16] and OpenDPI provides signature information for some application protocols. We draw on this information to analyze the traffic of the most important UDP applications.

B. Flow and Feature Definitions

We formally define a UDP flow as a series of packets that is consistent with a specific flow specification and timeout constraint. At present, the most widely used flow specification is the five-tuple (source address, destination address, source port, destination port, transport layer protocol). The timeout constraint treats a flow that has been inactive for longer than a specific timeout as ended. In this paper, we set the timeout value to 64 s. Flows are bidirectional, and the first packet seen by the classifier determines the client-server direction.

Each flow has a number of unique properties (e.g., the source and destination ports) and a number of characteristics that parameterize its behavior. Together these values form the input discriminators for our classification work. The flow features we use for classification are as follows:

- Flow duration
- Flow volume in bytes
- Flow volume in packets
- Packet length (minimum, average, maximum, and standard deviation)
- Inter-arrival time (minimum, average, maximum, and standard deviation)

Packet lengths are based on the IP length, excluding link layer overhead. Inter-arrival times have at least microsecond precision and accuracy.
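The flow definition and feature list above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `Packet` record, the function names, the use of the population standard deviation, and the zero defaults for empty inputs are all assumptions introduced here. Only the five-tuple key, the 64 s timeout, the rule that the first packet fixes the direction, and the list of features come from the text.

```python
import math
from collections import namedtuple

# Hypothetical packet record: timestamp in seconds, IP length in bytes.
Packet = namedtuple("Packet", "ts src dst sport dport proto length")

TIMEOUT = 64.0  # inactivity timeout stated in the text, in seconds


def flow_key(p):
    """Direction-independent five-tuple, so both directions share one key."""
    a, b = (p.src, p.sport), (p.dst, p.dport)
    return (min(a, b), max(a, b), p.proto)


def assemble_flows(packets):
    """Group time-ordered packets into bidirectional flows split by TIMEOUT."""
    active, finished = {}, []
    for p in packets:
        key = flow_key(p)
        flow = active.get(key)
        if flow is not None and p.ts - flow[-1].ts > TIMEOUT:
            finished.append(flow)  # idle for too long: close the old flow
            flow = None
        if flow is None:
            flow = active[key] = []
        flow.append(p)
    return finished + list(active.values())


def split_directions(flow):
    """The first packet of the flow defines the client-to-server direction."""
    client = (flow[0].src, flow[0].sport)
    forward = [p for p in flow if (p.src, p.sport) == client]
    backward = [p for p in flow if (p.src, p.sport) != client]
    return forward, backward


def _stats(values):
    """Minimum, mean, maximum and (population) standard deviation."""
    if not values:
        return (0.0, 0.0, 0.0, 0.0)
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return (min(values), mean, max(values), math.sqrt(var))


def direction_features(pkts):
    """Duration, bytes, packets, length stats and inter-arrival stats."""
    lengths = [p.length for p in pkts]
    gaps = [b.ts - a.ts for a, b in zip(pkts, pkts[1:])]
    duration = pkts[-1].ts - pkts[0].ts if pkts else 0.0
    return (duration, sum(lengths), len(pkts)) + _stats(lengths) + _stats(gaps)
```

Calling `direction_features` on each list returned by `split_directions` yields 11 values per direction. To restrict the features to the start of a flow, one could pass only a prefix of the flow's packet list.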
As the traces contain both directions of each flow, the features were calculated in both directions. This produces a total of 22 flow features, which we refer to as the full feature set. Our features are simple and well understood within the networking community. They represent a reasonable benchmark feature set to which more complex features might be added in the future.

C. Evaluation Criteria

To measure the effectiveness of the algorithms, three metrics were used: precision, recall, and overall accuracy. These measures have been widely used in the data mining literature. For a given class, the number of correctly classified objects is referred to as the True Positives (TP). The number of objects falsely identified as belonging to the class is referred to as the False Positives (FP). The number of objects from the class that are falsely labeled as another class is referred to as the False Negatives (FN).

Precision: the number of class members classified correctly over the total number of instances classified as class members. It is the ratio of True Positives to the sum of True Positives and False Positives. This determines how many identified objects were correct.

\[ \mathrm{precision} = \frac{TP}{TP + FP} \tag{1} \]

Recall (or true positive rate): the number of class members classified correctly over the total number of class members. It is the ratio of True Positives to the sum of True Positives and False Negatives. This determines how many objects in a class are misclassified as something else.

\[ \mathrm{recall} = \frac{TP}{TP + FN} \tag{2} \]

Overall accuracy: the percentage of correctly classified instances over the total number of instances. It is the ratio of the sum of all True Positives to the sum of all True and False Positives over all n classes. This measures the overall accuracy of the classifier.

\[ \mathrm{overall\ accuracy} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FP_i)} \tag{3} \]

IV. RESULTS AND ANALYSIS

In the experiment, we do not use the complete flow characteristic information to classify the traffic of different applications. Instead, we use only the information of the first ten packets: when the number of packets in a flow reaches ten, we collect the flow features and use them to classify the UDP traffic.

A. Supervised Learning Approaches

Each supervised learning classifier is first trained with a training set containing 10,000 random samples from the whole flow set. Among the 10,000 flows, 5,000 are from the DOWNLOAD category, 2,500 from STREAMING, 2,000 from SERVICE, and 500 from IM. Once training is complete, the classifier is tested on 10 different test sets, each containing 10,000 (different) random samples. The composition of each test set is the same as that of the training set, that is, 10,000 flows from these four categories. The initial class labels are determined by means of payload-based analysis.
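Scoring a classifier on one test set with the metrics of Section III-C can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `evaluate` and the choice to report 0.0 when a denominator is zero are introduced here for illustration and are not taken from the paper.

```python
from collections import Counter


def evaluate(true_labels, predicted_labels):
    """Return ({class: (precision, recall)}, overall_accuracy)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1      # correctly classified member of the class
        else:
            fp[guess] += 1      # falsely identified as class `guess`
            fn[truth] += 1      # member of `truth` labeled as another class

    per_class = {}
    for c in sorted(set(true_labels) | set(predicted_labels)):
        p_den = tp[c] + fp[c]
        r_den = tp[c] + fn[c]
        precision = tp[c] / p_den if p_den else 0.0   # equation (1)
        recall = tp[c] / r_den if r_den else 0.0      # equation (2)
        per_class[c] = (precision, recall)

    total_tp = sum(tp.values())
    total = total_tp + sum(fp.values())
    accuracy = total_tp / total if total else 0.0     # equation (3)
    return per_class, accuracy
```

Because every instance is either a True Positive or a False Positive for exactly one class, the denominator of equation (3) equals the number of instances, so the overall accuracy is simply the fraction of correctly classified flows.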

[Fig. 1. Accuracy using K-Means. Axes: overall accuracy versus number of clusters.]

B. Unsupervised Learning Approaches

The Simple K-Means algorithm has an input parameter K, the number of disjoint partitions that the algorithm produces. In our data set, we expect at least one cluster for each traffic class. In addition, because of the diversity of the traffic in some classes, we expect even more clusters to be formed. Therefore, the Simple K-Means algorithm was evaluated with K initially set to 10 and incremented by 10 for each subsequent clustering.

The average overall accuracy of the K-Means clustering algorithm on the test sets is shown in Fig. 1. Initially, when the number of clusters is small, the overall accuracy of K-Means is approximately 63%. The overall accuracy steadily improves as the number of clusters increases, reaching 82% when K is around 100. Beyond this point the improvement is much more gradual: the overall accuracy improves by only an additional 3.0% when K is 150. When K is larger than 150, the improvement diminishes further, with the overall accuracy reaching about 87% when K is 300. However, large values of K increase the likelihood of over-fitting. In view of this, we consider 150 clusters the best trade-off between behavior separation and complexity.

Once an acceptable clustering has been found using the flow samples in a training data set, the clustering is transformed into a classifier by using a transductive classifier. In this approach, the clusters are labeled, and a new object is classified with the label of the cluster to which it is most similar. We label a cluster with the most common traffic category among the flow samples in it. If two or more categories are tied, a label is chosen randomly among the tied category labels.
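The cluster-labeling step described above might look like the following sketch. The majority vote and the random tie-break follow the text; the function names, the representation of a clustering as per-sample cluster ids plus a list of centroids, and the use of Euclidean distance to the centroid as the similarity measure are assumptions made here, since the paper does not state them.

```python
import math
import random
from collections import Counter


def label_clusters(assignments, labels, rng=random):
    """Map each cluster id to its most common category; ties broken randomly."""
    counts = {}
    for cluster, label in zip(assignments, labels):
        counts.setdefault(cluster, Counter())[label] += 1
    mapping = {}
    for cluster, counter in counts.items():
        best = max(counter.values())
        tied = sorted(lab for lab, n in counter.items() if n == best)
        mapping[cluster] = rng.choice(tied)
    return mapping


def classify(sample, centroids, cluster_labels):
    """Label a new flow with the category of its nearest centroid.

    Assumes Euclidean distance and that every cluster received at least
    one training sample, so that it has a label.
    """
    nearest = min(range(len(centroids)),
                  key=lambda i: math.dist(sample, centroids[i]))
    return cluster_labels[nearest]
```

Passing a seeded `random.Random` instance as `rng` makes the tie-break reproducible across runs.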
A new flow sample is then classified with the traffic class label of the cluster to which it is most similar. In this way, the resulting classifier is used to predict the traffic class of each new flow in the same 10 test sets.

TABLE III
OVERALL ACCURACY OF EACH ALGORITHM

Algorithm     Average   Minimum   Maximum
Naive Bayes   67.12%    66.55%    67.46%
SVMs          95.39%    95.12%    95.58%
C4.5          96.16%    --        --
K-Means       84.97%    84.63%    85.30%

C. Experiment Result

Fig. 2 shows the average recall and precision for the four machine learning techniques {Naive Bayes, Support Vector Machines, C4.5 Decision Tree, Simple K-Means}.

For the SVMs and C4.5 classifiers, all classes have precision and recall values above 90%. For the SVMs classifier, two of the four classes have average recall values over 95%, and two have average precision values above 90%. The situation is the same for the C4.5 classifier. Both approaches therefore perform well on the data sets, with precision and recall values averaging around 95%.

For the K-Means classifier, all classes have recall and precision values above 70%. For the DOWNLOAD and STREAMING classes, it reaches 91% and 81%, respectively. The two worst classified classes, SERVICE and IM, still have recall and precision over 70%. SERVICE and IM have this lower recall and precision because approximately 10% of the SERVICE and IM flow samples are incorrectly classified as DOWNLOAD and STREAMING flow samples.

The Naive Bayes classifier performs worst. It performs best on DOWNLOAD flows, with 70% recall and 87% precision, followed by 88% recall and 50% precision for STREAMING. The two worst classified classes are again SERVICE and IM. For SERVICE flows, it reaches only 32% recall and 72% precision, whereas for IM flows it reaches 71% recall and 47% precision.
The reason for this poor performance is that many DOWNLOAD and STREAMING flows are falsely classified as SERVICE and IM, which contributes to their lower recall values.

Among the four categories {DOWNLOAD, STREAMING, SERVICE, IM}, the experimental results show that DOWNLOAD and STREAMING flows are easier to identify, and that SERVICE and IM flows are harder to classify. We think this is determined by the nature of these categories. A SERVICE flow involves only one request packet and one response packet per request, and successive requests appear to be independent of each other. IM flows usually have large inter-arrival times caused by intermittent chat.

[Fig. 2. Supervised and Unsupervised Learning Approaches Result. Panels: (a) Average Recall, (b) Average Precision, per category (DOWNLOAD, STREAMING, SERVICE, IM) for SVMs, C4.5, Naive Bayes, and K-Means.]

D. Overall Accuracy of Algorithms

Table III shows the minimum, maximum, and average overall accuracy over the 10 test sets. As Table III shows, the Naive Bayes classifier has an average overall accuracy of 67.12%, which is the lowest. In comparison, the C4.5 classifier has an average overall accuracy of 96.16%, which is the highest. The SVMs classifier also achieves an average overall accuracy of 95.39%. The unsupervised learning algorithm (K-Means) has an average overall accuracy of 84.97%, which lies between the Naive Bayes classifier and the other supervised learning algorithms. Thus, we find that the supervised techniques achieve higher accuracy than the unsupervised clustering technique in most cases. The only exception is that K-Means outperforms the Naive Bayes classifier by more than 17 percentage points.

V. CONCLUSION

In this paper, we discuss the classification of UDP traffic. First, we divided the whole UDP traffic into five categories according to their specific characteristics. Next, we applied unsupervised and supervised machine learning techniques to UDP traffic identification. Our unsupervised approach uses Simple K-Means, and the supervised approaches use Naive Bayes, C4.5, and SVMs. Through qualitative and quantitative analysis, our results show that the supervised techniques achieve higher accuracy than the unsupervised clustering technique in most cases. Among the four techniques, Naive Bayes always performs worst and C4.5 always performs best. Simple K-Means always lies between Naive Bayes and the other supervised learning techniques, and it outperforms the Naive Bayes classifier by more than 17 percentage points.

ACKNOWLEDGMENT

Our work is supported in part by the National Basic Research Program 973 of China (Grant No. 2007CB311100).

REFERENCES

[1] A. Moore and K. Papagiannaki, Toward the Accurate Identification of Network Applications. In PAM 2005, Boston, USA, March 31-April 1.
[2] K. Sripanidkulchai, B. Maggs, and H. Zhang, Analysis of Live Streaming Workloads on the Internet. In Proc. of IMC 04, October 2004.
[3] CAIDA.
[4] T. Karagiannis, A. Broido, M. Faloutsos, and K. claffy, Transport Layer Identification of P2P Traffic. In Proc. of IMC 04, Taormina, Italy, October 25-27.
[5] S. Sen, O. Spatscheck, and D. Wang, Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures. In WWW2005, New York, USA, May 17-22.
[6] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, Traffic classification on the fly. ACM Special Interest Group on Data Communication (SIGCOMM) Computer Communication Review, vol. 36, no. 2.
[7] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, Flow clustering using machine learning techniques. In Proc. Passive and Active Measurement Workshop (PAM2004), Antibes Juan-les-Pins, France, April.
[8] S. Zander, T. Nguyen, and G. Armitage, Automated traffic classification and application identification using machine learning. In IEEE 30th Conference on Local Computer Networks (LCN 2005), Sydney, Australia, November.
[9] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification. In IMC04, Taormina, Italy, October 25-27.
[10] A. Moore and D. Zuev, Internet Traffic Classification Using Bayesian Analysis Techniques. In SIGMETRICS05, Banff, Canada, June 6-10.
[11] A. Moore and D. Zuev, Discriminators for use in flow-based classification. Intel Research, Technical Report.
[12] T. Auld, A. W. Moore, and S. F. Gull, Bayesian neural networks for Internet traffic classification. IEEE Transactions on Neural Networks, no. 1, pp. 223-239, January.
[13] A. Este, F. Gringoli, and L. Salgarelli, Support Vector Machines for TCP Traffic Classification. Elsevier Computer Networks (COMNET), vol. 53, no. 14, Sep.
[14] J. Park, H. R. Tyan, and K. C. C. J, GA-Based Internet Traffic Classification Technique for QoS Provisioning. In Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Pasadena, California, December.
[15] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli, Traffic classification through simple statistical fingerprinting. SIGCOMM Comput. Commun. Rev., vol. 37, no. 1, pp. 5-16.
[16] l7-filter.
[17] G. H. John and P. Langley, Estimating Continuous Distributions in Bayesian Classifiers. In Proceedings of 11th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufman, San Mateo.
[18] R. Kohavi, J. R. Quinlan, W. Klosgen, and J. M. Zytkow, Decision-tree discovery. In Handbook of Data Mining and Knowledge Discovery, Oxford University Press.
[19] U. v. Luxburg, A Tutorial on Spectral Clustering. Stat Comput. 17. [See also Technical Report 149, Max Planck Institute for Biological Cybernetics, 2006.]
