Internet Traffic Classification using a Hidden Markov model


International Conference on Hybrid Intelligent Systems

José Everardo Bessa Maia
Department of Statistics and Computing, UECE - State University of Ceará, Fortaleza - Ceará - Brazil
jmaia@uece.br

Raimir Holanda Filho
Masters Course in Applied Computer Sciences, UNIFOR - University of Fortaleza, Fortaleza - Ceará - Brazil
raimir@unifor.br

Abstract

This paper examines the performance of a new Hidden Markov Model (HMM) structure used as the core of an Internet traffic classifier and compares the results against other models from the literature. Traffic modeling and classification are important in many areas, such as bandwidth management, traffic analysis, prediction and engineering, network planning, Quality of Service provisioning and anomalous traffic detection. The new HMM structure, which takes into account the sequences of packet payload sizes (PS) and inter-packet times (IPT), is obtained by concatenating a first part structured as a profile HMM with a second part structured as a fully-connected HMM. The first part captures the specific properties of the initial protocol packets, while the second part captures the statistical properties of the whole packet sequence in the flow. The resulting models are found to increase the accuracy of classifying the different traffic classes in the analysed dataset. The average accuracy obtained by the classifier is 62.5% after seeing only five packets, 80.0% after examining 13 packets and 95.5% after seeing the entire unidirectional flow.

Keywords: Internet Traffic Classification; Hidden Markov model.

I. INTRODUCTION

The ability to accurately classify and identify the network traffic associated with different applications is fundamental to numerous network activities, including network management and security monitoring, traffic modeling and network planning, accounting and Quality of Service provision [1].
It is at the basis of any modern network management platform. Despite the various approaches proposed for this task, no definitive answer has been found to date [2]. Real-time classification, performance that is independent of the network on which the algorithm was trained, and completeness and accuracy of the classification are still challenges to be overcome. Because of this, and because of its relevance, traffic classification has become one of the hottest research topics in computer science and telecommunications.

Internet traffic classification is a hard task for several reasons. The traditional and direct approaches of relying on transport-level protocol ports or on payload inspection have rapidly become unreliable [3] or infeasible [4]. Moreover, in several network scenarios it is quite unrealistic to assume that all the IP traffic classes are known a priori. In these cases some network protocols may be known, but novel protocols can appear, giving rise to unknown classes. Additionally, a platform for traffic classification and application identification must meet the time constraints of the particular use. For example, one management function may require the application protocol to be identified after only some of the first packets of the flow have been seen, while other functions can operate on the complete information of the flow. Moreover, the classification can be set at different levels of granularity, from a few very broad classes (e.g., interactive, transactional and bulk data classes) through intermediate degrees of discrimination (e.g., application protocol families) to the identification of the application itself (e.g., the application protocol). Together, these aspects make this a very difficult task, and this scenario is pushing the search for alternative techniques.
This paper analyzes the performance of a new type of Internet traffic classifier which combines the ideas of previous proposals [2], [5], [6], [7] in the hope of obtaining a model that has the best properties of the original models. This classifier is based on a Hidden Markov Model (HMM) with a new structure. The new HMM structure, which takes into account the sequences of packet payload sizes (PS) and inter-packet times (IPT), is obtained by concatenating a first part structured as a profile HMM with a second part structured as a fully-connected HMM. The first part captures the specific properties of the initial protocol packets, while the second part captures the statistical properties of the whole packet sequence in the flow. The new HMM structure is used as the core of an Internet traffic classifier, and the results of its evaluation are compared against other models in the literature.

This study is based on the following four broad application classes with their reference applications, commonly found in IP networks: the Interactive class, represented by the Telnet, CounterStrike (CS) game and HTTP protocols; the Bulk data transfer class, represented by the FTP-data protocol; the Transactional class, represented by the HTTPS protocol; and the Continuous-Media (CM) Streaming class, represented by RealMedia streaming [8]. These reference applications are clearly within one class [1], are widely used

and have server ports in the well-known port range.

The remainder of this paper is organized as follows. Section II presents the related work and places this work in the context of others. Section III presents the HMM model used for the classification. Section IV presents the real traffic traces used in the experiments and the measurement procedures. Section V presents the evaluation of the technique and a discussion of the results. Finally, Section VI concludes the paper.

II. RELATED WORK

The two main techniques used for traffic classification on an IP network, namely mapping transport-layer source and destination ports to applications and recognizing payload signatures, are becoming less effective or less practical every day. As a result, much of the research in this area is shifting to statistical and Machine Learning (ML) methods, which do not depend on fixed packet parameters or on inspection of packet contents. These classification techniques rely on the fact that different applications typically have distinct behavior patterns when communicating on a network. The traffic behavior patterns originate mainly from the specification of the protocol used, from the application type itself or, in the case of interactive applications, from user behavior.

The model used in this work to recognize these distinct behavior patterns is based on HMMs. HMMs are appropriate models for the approximate matching of families of sequences that have different lengths and may contain insertions or lack some of their elements. These are typical phenomena in the packet sequences that constitute a communication flow in IP networks when parameters such as PS or IPT are considered. This work builds on and is inspired by the work in [2], [5], [6], [7], which is based on statistical properties of the IPT or PS sequences (or of the joint sequence) present in the flows.
In [6], an approach based on profile HMMs is proposed in which a left-to-right structure is used for the state topology of the HMM. The authors present two classifiers working separately on IPTs or on PSs. In [7], the same authors extend the approach to account for IPT and PS jointly. The observable variables are still one-dimensional, and the joint IPT and PS information is taken into account via vector quantization based on K-means. Furthermore, a heuristic technique is used to account for different trace lengths. In these works the PS and IPT variables are discretized and the model considers packets in both directions.

In [2], the proposed model works directly on a two-dimensional continuous observable variable, and thus exploits the joint IPT and PS information without needing any preprocessing such as vector quantization. The approach uses a fully-connected HMM structure for the state topology, which allows a reduction of the number of states and avoids post-processing. Although it is much less structured than the profile HMMs with respect to the traffic characteristics, it is still able to achieve good classification results, as reported by the authors. The model considers packets in one direction only.

Moreover, in [5], based on the observation that applications have different packet sizes for control flows, a technique is proposed which, by applying unsupervised clustering (Simple K-Means) to the vector of sizes of the first k data packets of each TCP flow, provides more than 95% average accuracy in identifying traffic at the protocol level. Sequences made of only the first 4 to 10 packets were used to train HMMs and to attempt flow classification at an early stage. The model takes into account the initial packets in both directions, and only TCP traffic is considered.

The idea of the model presented in the next section is to use a profile HMM as the behavior memory of the first packets of a connection and a fully-connected HMM to recognize the global behavior of the flow.
The hypothesis is that this separates TCP and UDP traffic more efficiently and thus improves the classification results. The results reported in this paper test the simplifying assumption of using only one of the traffic directions, with the values of PS and IPT quantized using the k-means algorithm. The extension of the model to consider both traffic directions at the same time and a two-dimensional continuous observable variable is under investigation.

A large and varied range of other statistical techniques has been applied to the problem of traffic classification in IP networks, but these are not directly related to this work [9], [10], [11], [12], [13]. Among the latest are the Support Vector Machine (SVM) algorithms [2] and the classifier ensemble [2]. There are also many efforts that combine packet inspection and ML [14]. Surveys on this subject can be found in [4], [15].

III. THE MODEL

Consider sequences of observations $O = \{o_1, \ldots, o_N\}$, $N \geq 1$, defined over $S$, with $o_i \in S \subseteq \mathbb{R}^n$, and a set of admissible classes $\Omega = \{\omega_1, \ldots, \omega_c\}$, where $c = |\Omega|$. A sequence $O$ belongs to the space of sequences of length $N$, $S^N$. The sequence classification problem is to recognize the class $\omega_i$ of a sequence of observations.

A hidden Markov model (HMM) based classifier is a specific type of Bayesian classifier [16] in which the system being modeled is assumed to be a Markov process with unobserved state [17]. Each state has a probability distribution over the possible output symbols. Therefore the sequence of symbols generated by an HMM gives some information about the sequence of states. In a Bayes classifier the decision is based on the classification cost:

$c_j(O) = \sum_i c_{ij} \, P(\omega_i / O)$ (1)

where $O$ is the observed sequence, $c_{ij}$ is the cost of misclassifying an observation of class $i$ as class $j$ and $P(\omega_i / O)$

is the a posteriori probability of class $\omega_i$. The a posteriori probabilities can be calculated using the Bayesian inversion,

$P(\omega_i / O) = k \, P(O / \omega_i) \, P(\omega_i)$ (2)

which requires the probability distributions of the generated sequences for each class, which are generally unknown; $k$ is a normalization constant. In this work $c_{ij} = c$ (constant) $\forall i, j$ and $P(\omega_i) = p$ (constant) $\forall i$. In this Bayesian approach, the goal of the HMM (or of any other model one might wish to use) is to generate estimates of $P(O / \omega_i)$, learned from a labeled set of available observations.

HMM theory will not be covered in detail here; for a comprehensive tutorial, see [17]. Basically, an HMM $\lambda$ is a 4-tuple $\lambda = (S, A, \pi, B)$, where $S$ is the set of states, $A$ is the transition matrix (representing the probabilities of transition between states), $\pi$ is a vector of initial state probabilities, and $B$ is the emission model, which describes the probability (density or mass) function of symbol emission from each state.

The standard HMM-based approach to sequence classification, adopted here, consists in training one HMM for each class; these are subsequently used as class-conditional densities in a standard Bayes classification paradigm. For example, assuming a priori equiprobable classes, an unknown sequence is classified into the class whose model shows the highest probability (likelihood) of having generated this sequence (this is the well-known maximum-likelihood (ML) classification rule) [18]. Thus an unknown sequence $O$ is assigned to the class showing the highest likelihood, i.e.

$\mathrm{Class}(O) = \arg\max_i P(O / \lambda_i)$ (3)

where $\lambda_i$ is the HMM corresponding to the $i$th class. This requires training $C$ HMMs for a $C$-class problem. Training was performed using the standard Baum-Welch algorithm [17]. This algorithm is an iterative forward-backward procedure which searches for the model parameters that maximize the probability that the model itself generates the sequences used in the training.

The general structure of the HMM used in the classifier is shown in Fig. 1. It has a fixed structure for all applications, with five match states, four delete states and four insert states in the profile HMM part and five states in the fully-connected HMM part, resulting in a total of eighteen states. The classifier architecture for $C$ classes is composed of a bank of $C$ parallel HMMs and a decision block which selects, as the best estimate for the traffic class, the one whose HMM generated the greatest likelihood for the test sequence. The implementation used is discrete both in the states and in the observed symbols.

To keep this model simple, some decisions had to be taken. It is designed for one-dimensional observable variables. The two variables, PS and IPT, were quantized separately, on a non-linear scale, using the k-means algorithm [18]. Using previous work as a guide and after some experimental trials, PS and $10 \log_{10}(\mathrm{IPT} / 1\,\mu\mathrm{s})$ (called dBµs in [6]) were each quantized to eight values. The Cartesian product of the two sets generated an output alphabet of sixty-four one-dimensional observable symbols. The architecture of the classifier is shown in Fig. 2.

Even though it is conceptually composed of two parts, this model is operated as a single model both in the training phase and when calculating the likelihood of a sequence in the test phase. Note that no transitions are provided from the second part to the first part. Thus the first part of each model represents a memory of the profile of the first packets of each protocol family, which is not modified by the effect of training on the fully-connected part. During the classification of a test sequence, the likelihood of the first protocol packets is captured and retained in this part of the model, and it affects the likelihood of the whole sequence through its composition with the likelihood calculated by the second part.

Figure 1. Proposed HMM model structure (profile HMM part with match, insert and delete states, followed by the fully-connected HMM part).

Figure 2. Architecture of the proposed classifier (PS and IPT streams, k-means quantizers producing symbols 1..64, a bank of HMMs 1..C, and a class decision 1..C).
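The symbol construction and the maximum-likelihood decision described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the centroid lists (assumed to come from a previously run 1-D k-means) and the model container are hypothetical, and the likelihood is computed with a standard scaled forward recursion for a discrete HMM $(\pi, A, B)$. Inter-packet times are assumed to be strictly positive and given in seconds, and sequences are assumed to be non-empty.

```python
import math

N_LEVELS = 8  # eight quantization levels per variable, as stated in the text


def ipt_to_dbus(ipt_seconds):
    """Convert an inter-packet time (seconds, > 0) to 10*log10(IPT / 1 us)."""
    return 10.0 * math.log10(ipt_seconds / 1e-6)


def nearest(value, centroids):
    """Index of the centroid closest to value (1-D k-means assignment step)."""
    return min(range(len(centroids)), key=lambda k: abs(value - centroids[k]))


def to_symbol(ps_bytes, ipt_seconds, ps_centroids, ipt_centroids):
    """Map one (PS, IPT) pair to a single symbol in 0..63 (8 x 8 alphabet)."""
    ps_idx = nearest(ps_bytes, ps_centroids)
    ipt_idx = nearest(ipt_to_dbus(ipt_seconds), ipt_centroids)
    return ps_idx * N_LEVELS + ipt_idx


def log_likelihood(symbols, pi, A, B):
    """Scaled forward algorithm: log P(O / lambda) for a discrete HMM.

    pi[s] is the initial probability of state s, A[r][s] the transition
    probability from r to s, and B[s][k] the probability of emitting symbol k
    in state s.
    """
    n = len(pi)
    alpha = [pi[s] * B[s][symbols[0]] for s in range(n)]
    log_p = 0.0
    for t in range(len(symbols)):
        if t > 0:
            alpha = [
                sum(alpha[r] * A[r][s] for r in range(n)) * B[s][symbols[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot generate this sequence
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p


def classify(symbols, models):
    """Maximum-likelihood rule: the class whose HMM scores highest.

    `models` maps a class name to a (pi, A, B) tuple (hypothetical layout).
    """
    return max(models, key=lambda name: log_likelihood(symbols, *models[name]))
```

Working with log-likelihoods and per-step rescaling is a common way to keep the forward recursion numerically stable on long flows; with equal priors and costs, comparing log-likelihoods gives the same decision as Eq. (3).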

IV. DATA AND MEASUREMENTS

In this work the unit of analysis is the flow, defined by the 5-tuple (source IP, source port, destination IP, destination port, transport protocol) with a timeout of 60 seconds. Therefore, a model must be built for traffic in each direction. This study considered only traffic in one direction for each host (e.g., packets with port 25 or 80 for SMTP or HTTP, respectively). In Fig. 3, only traffic leaving the target hosts and reaching the client computer was taken into account.

To build the class models, each HMM in the bank was obtained via the Baum-Welch training algorithm [17] using only sequences from its class. Each model was trained with a trace containing 400 flows. The algorithm starts from an initial model in which all symbols are equally likely in every state. As is standard, all packets with empty payload and all flows with fewer than 10 packets have been excluded from both the training and test sets.

To evaluate the model, we use data from two packet traces, collected in the laboratory and in a campus network. Ground truth was established through a human-supervised data verification process. Using database tools, flows were filtered by 5-tuple and their contents were examined for labeling. All known information, such as well-known port numbers and packet payload contents, including some protocol signatures, was used to identify the application within the flows. Flows whose labeling was unreliable were simply discarded.

Fig. 3 describes the measurement setup used to acquire the traces. As shown in this figure, to facilitate the acquisition and labeling of the training traces, one client at a time was run for each reference application on a single computer running only that application. Then each set was inspected and labeled manually in the laboratory using database tools.
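The flow definition and filtering rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the packet record layout `(timestamp_s, five_tuple, payload_len)` and the function name are hypothetical, the input is assumed to be sorted by timestamp, and the sketch assumes that empty-payload packets are dropped before the 60-second idle timeout is evaluated, a detail the text does not specify.

```python
FLOW_TIMEOUT_S = 60.0  # flow timeout stated in the text
MIN_PACKETS = 10       # flows with fewer packets are discarded


def build_flows(packets):
    """Group packets into unidirectional flows keyed by 5-tuple.

    `packets` is an iterable of (timestamp_s, five_tuple, payload_len) records
    sorted by timestamp (hypothetical layout). Returns a list of
    (payload_sizes, inter_packet_times) pairs, one per retained flow.
    """
    open_flows = {}  # five_tuple -> [last_timestamp, sizes, ipts]
    finished = []
    for ts, key, payload_len in packets:
        if payload_len == 0:
            continue  # packets with empty payload are excluded
        flow = open_flows.get(key)
        if flow is not None and ts - flow[0] > FLOW_TIMEOUT_S:
            # Idle for longer than the timeout: close the old flow.
            finished.append((flow[1], flow[2]))
            flow = None
        if flow is None:
            open_flows[key] = [ts, [payload_len], []]
        else:
            flow[2].append(ts - flow[0])  # inter-packet time
            flow[1].append(payload_len)   # payload size
            flow[0] = ts
    finished.extend((f[1], f[2]) for f in open_flows.values())
    return [(sizes, ipts) for sizes, ipts in finished if len(sizes) >= MIN_PACKETS]
```

Because the 5-tuple is ordered (source before destination), the two directions of a connection produce different keys, which matches the unidirectional treatment described in the text.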
Note that although the acquisition point is an isolated router, the collected packets travel through the campus network, and therefore the (PS, IPT) sequences obtained are not artificial. In Fig. 3, the Internet router interfaces directly with the Internet. This set of traces is used to obtain the HMM models. A total of 2400 flows were collected, 400 flows for each application.

The validation traces, in turn, were collected from a campus network router under regular traffic. The trace file was pre-processed to separate the flows of the target applications. Again, each set was inspected and labeled manually in the laboratory. For the less frequent applications, the traffic was induced by a user. A total of 1200 flows were collected, 200 flows for each application, using tcpdump. Data were collected on various days and at various times over 3 months. Flows in the streaming class are made up of small video clips.

For performance comparison, the results also include 3 statistical classifiers operating at the flow level, namely the minimum-distance-to-centroid, 1-NN and Naive Bayes classifiers, in addition to the two models which preceded this proposal. For these, 8 features were extracted from the flows. They are: mean and standard deviation of PS, IPT, flow size in packets and flow size in bytes. Computations were performed with the support of WEKA [19] and Matlab [20].

Figure 3. Diagram of the measurement topology (client computer, routers, campus network and Internet; servers and clients: ftp, telnet, http, mail; data for training and classification are collected at the router).

V. RESULTS AND DISCUSSION

The validation results are presented in three tables. Accuracy is the percentage of test flows classified correctly by the models. Table I shows the results of the experimental validation of the proposed classifier against five other models used in the literature, including the two models which inspired this proposal, the profile HMM (5P-HMM) and the fully-connected HMM (5F-HMM).
The centroid, 1-NN and Naive Bayes classifiers are widely known, and a detailed description of each can be found in the references noted in the table. The table gives the classification accuracy when the classification decision is made after observing the first five packets of the flow, after the first thirteen packets, and after the whole flow. The decision point at five packets was chosen because it is the memory size of the profile HMM part, and the decision point at thirteen packets was determined experimentally (ad hoc) as the point at which a substantial improvement over the previous decision was achieved. The results show the superior performance of the proposed mixed model (5-Profile+5-Fully HMM). The last 3 rows of the table seem to confirm the working hypothesis that the behavior memory of the first packets improves the quality of the final decision. For comparison, note that with six classes the random hit rate would be 16.67%, while the 5P5F-HMM algorithm achieves 62.5% after seeing only 5 packets of the unidirectional flows.

Table II shows the performance of each classifier per traffic class. Note that the gain of the new model is consistent and that it outperforms the other classifiers in all classes. Only for the mail and telnet applications is the accuracy below 94%. Compared to 5P-HMM and 5F-HMM, the biggest advantage of the new model is 15.0%, for the streaming class, and the smallest advantage is 8.0%, for the http and telnet classes.

Table III is the confusion matrix produced by the new classifier for the test dataset. Note from the values on the diagonal

that the lowest rate of correct classification is 92.5%, higher than the rates obtained in [2], 90.23%, and in [6], 81.69%. The largest share of confusion (3.5%) is the one in which mail is classified as telnet. From this table, all other performance measures, such as false positives and false negatives, can be derived.

Table I. Classification results: average accuracy, percent correctly predicted (total: 1200 flows). Columns: classifier; accuracy (%) after 5 packets, after 13 packets and after the whole flow. Rows: Centroid Classifier (CC) [17]; 1-NN [5]; Naive Bayes (NB) [14]; 5-Fully-connected HMM (5F-HMM); 5-Profile HMM (5P-HMM); 5-Profile+5-Fully HMM (5P5F-HMM).

Table II. Classification results: accuracy per application, after seeing the whole flow, in percent (total: 1200 flows). Columns: http, telnet, ftp, mail, stream, CS, average. Rows: CC, 1-NN, NB, 5F, 5P, 5P5F.

Table III. Classification results: confusion matrix, after seeing the whole flow, in percent. Rows and columns: http, telnet, ftp, mail, stream, CS.

Tables I, II and III indicate clear advantages of the proposed model. The average accuracy obtained by the classifier is 62.5% after seeing five packets, 80.0% after examining 13 packets and 95.5% after seeing the entire unidirectional flow. Although standard deviations are not included, all results presented are average values obtained over 30 executions in a cross-validation fashion, with a randomly selected 20% of the flows used for validation and 80% for training, using the mix of all traces.

It is worth noting that the implementations of the 5P-HMM and 5F-HMM models used here for comparison are one-dimensional, use unidirectional traffic and adopt a set of sixty-four discrete symbols, which differentiates them from the similar models that inspired them in [2] and [6]. The superior results presented in Table III, when compared with those obtained in [2] and [6], may not be conclusive because of the different data and models used. Unfortunately, the scientific community does not yet have data repositories for benchmarking traffic classification algorithms.
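As an illustration of the evaluation protocol described above (30 runs with a random 80%/20% split) and of how false positives and false negatives follow from a confusion matrix, a minimal Python sketch is given below. The function names and the `evaluate` callback are hypothetical. The sketch also assumes a confusion matrix of counts with rows indexing the true class and columns the predicted class, which is an assumption about orientation rather than something stated in the text.

```python
import random


def repeated_holdout(samples, evaluate, runs=30, test_fraction=0.2, seed=0):
    """Average accuracy over `runs` random train/validation splits.

    `evaluate(train, test)` is a hypothetical callback that trains the
    classifier on `train` and returns its accuracy on `test` in [0, 1].
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        n_test = int(round(test_fraction * len(shuffled)))
        test, train = shuffled[:n_test], shuffled[n_test:]
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / len(accuracies)


def per_class_counts(confusion):
    """True positives, false negatives and false positives for each class.

    `confusion[i][j]` is the number of flows of true class i predicted as
    class j (assumed orientation).
    """
    n = len(confusion)
    counts = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # rest of row c
        fp = sum(confusion[r][c] for r in range(n)) - tp  # rest of column c
        counts.append({"tp": tp, "fn": fn, "fp": fp})
    return counts
```

Reporting the standard deviation of the per-run accuracies alongside the mean would be a natural extension of `repeated_holdout`.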
However, on the considered comparison basis, the results clearly show that the hybrid model proposed here provides gains with respect to the previously mentioned models.

VI. CONCLUSION

Internet traffic classification is important in many areas, such as bandwidth management, Quality of Service provisioning, network security and anomalous traffic detection. This paper proposed and analyzed a new type of Internet traffic classifier which combines the ideas of previous proposals [2], [6] in the hope of obtaining a model that has the best properties of the original models. The experimental validation of this new model shows an improvement in classification accuracy when compared with other methods, under the same conditions, for the analyzed dataset.

Despite the promising results already obtained with this model, this research continues in several directions. Evaluation using large traffic traces from different network locations is the next step of this work. In addition, variations that can optimize the performance of the classifier should also be tested. For example, considering the flows in both directions, using continuous probability distributions and investigating the best combination of the number of states in the two parts of the HMM (profile HMM part and fully-connected HMM part) for each application could significantly improve the results already obtained. These aspects are under investigation and will be presented in a future paper.

In addition, further work is in progress in two directions. First, we are implementing sequential decision, in which the classifier updates the classification decision on each new packet received from the stream. Second, we will include a third part in the classification model, based on statistical characteristics: a classifier based on aggregate statistics of the whole flow (Naive Bayes).
In this new classifier, the decision about the class to which the flow belongs can occur at three different times, depending on the degree of confidence previously reached by the classifier, which may also change its decision at the end of the flow. The first decision point occurs after the first five packets are seen and is based on the profile HMM, which captures the specific features of the first protocol packets. The second decision point occurs after seeing the whole flow and is based on the fully-connected HMM, which captures the properties of the whole sequence of packets in the flow. At the third decision point, the classifier calculates some aggregate statistics for the flow and re-evaluates the decision. As already anticipated, depending on what is advisable for the implementation, the decision taken in one stage could be announced and still be corrected by the

next stages. The evidence and confidence in the three stages are combined cumulatively for decision making, not in isolation.

Tests are also being carried out using the two parts as separate classifiers (in parallel) and calculating the sequence likelihood as a linear combination of the two. This classification mode also allows a balance between the likelihood of the first protocol packets and that of the whole sequence. This paper does not present results for this classification method.

REFERENCES

[1] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification, in Proc. of the 4th ACM SIGCOMM.
[2] A. Dainotti, W. Donato, A. Pescape, and P. S. Rossi, Classification of network traffic via packet-level hidden Markov models, in IEEE GLOBECOM 2008.
[3] A. W. Moore and K. Papagiannaki, Toward the accurate identification of network applications, in PAM 2005.
[4] H. Kim, K. Claffy, M. Fomenkov, D. Barman, M. Faloutsos, and K. Lee, Internet traffic classification demystified: Myths, caveats, and the best practices, in ACM CoNEXT 2008.
[5] L. Bernaille, R. Teixeira, and K. Salamatian, Early application identification, in ACM CoNEXT 2006.
[6] C. Wright, F. Monrose, and G. Masson, HMM profiles for network traffic classification, in VizSEC/DMSEC, pp. 9-15.
[7] C. Wright, F. Monrose, and G. Masson, Towards better protocol identification using profile HMMs, JHU Tech. Rep. JHU-SPAR051201.
[8] H. Sun, A. Vetro, and J. Xin, An overview of scalable video streaming: Research articles, Wireless Comm. and Mobile Computing, vol. 7.
[9] A. Moore and D. Zuev, Internet traffic classification using Bayesian analysis techniques, in ACM SIGMETRICS 2005.
[10] G. Szabo, D. Orincsay, S. Malomsoky, and I. Szabo, On the validation of traffic classification algorithms, in PAM 2008.
[11] B.-C. Park, Y. J. Win, M.-S. Kim, and J. W. Hong, Towards automated application signature generation for traffic identification, in NOMS 2008.
[12] J. Erman, M. Arlitt, and A. Mahanti, Traffic classification using clustering algorithms, in SIGCOMM 2006.
[13] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli, Traffic classification through simple statistical fingerprinting, SIGCOMM Comput. Commun. Rev., vol. 37, no. 1, pp. 5-16.
[14] W. Li, M. Canini, A. W. Moore, and R. Bolla, Efficient application identification and the temporal and spatial stability of classification schema, Computer Networks, vol. 56, no. 3.
[15] T. T. Nguyen and G. Armitage, A survey of techniques for internet traffic classification using machine learning, IEEE Communications Surveys and Tutorials.
[16] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Edition. New York: Academic Press.
[17] L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, vol. 77, no. 2.
[18] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Edition. New York: Wiley.
[19] Waikato, Weka: Data mining software in Java.
[20] MathWorks, Matlab.
