Network Traffic Classification by Common Subsequence Finding

Size: px

Start display at page:

Download "Network Traffic Classification by Common Subsequence Finding"

Ashley Hines
5 years ago
Views:

1 Network Traffic Classification by Common Subsequence Finding Krzysztof Fabjański and Tomasz Kruk NASK, The Research Division, Wąwozowa 18, Warszawa, Poland Abstract. The paper describes issues related to network traffic analysis. The scope of this article includes discussion regarding the problem of network traffic identification and classification. Furthermore, paper presents two bioinformatics methods: Clustal and Center Star. Both methods were precisely adapted to the network security purpose. In both methods, the concept of extraction of a common subsequence, based on multiple sequence alignment of more than two network attack signatures, was used. This concept was inspired by bioinformatics solutions for the problems related to finding similarities in a set of DNA, RNA or amino acids sequences. Additionally, the scope of the paper includes detailed description of test procedures and their results. At the end some relevant evaluations and conclusions regarding both methods are presented. Keywords: network traffic analysis, anomaly detection, network intrusion detection systems, common subsequence finding, bioinformatics algorithms, Clustal algorithm, Center Star method, automated generation of network attack signatures 1 Introduction The Internet became one of the most popular tools used by almost everyone. It is important to mention that the Internet and the World Wide Web (WWW) are not synonymous. The World Wide Web is one of the many services available in the Internet. The Internet consists of an enormous number of computer networks. Therefore, the issue regarding network security is so important. The network security issue is not only a set of security methods required for ensuring safety. It also consists of elements related to network security policy which should be obeyed. Different institutions and companies are introducing their private security policies. Often, security policies are performed according to some known standards. Unfortunately, this approach does not guarantee that the precious resources will remain unaffected. Other, more sophisticated methods should be introduced. One of the most recognized families of systems are network intrusion detection systems. This group of systems allows to alert about unwanted and malicious activity registered in the network flow. The process of identifying a malicious network flow involves comparing the network flow content with a predefined set of rules. The set of rules, sometimes known as M. Bubak et al. (Eds.): ICCS 28, Part I, LNCS 511, pp , 28. c Springer-Verlag Berlin Heidelberg 28

2 5 K. Fabjański and T. Kruk well as a set of network attack signatures, describes different Internet threats by mapping their content into the specific format. Despite of the malicious flow method identification, there are many new Internet threats which have not been discovered yet. Fortunately there are methods and heuristic approaches which allow to identify new Internet threats by following different network trends and statistics. Although those methods are very promising, still there is a huge requirement for new algorithms. Those algorithms should be capable of analysing a huge portion of attack signatures for network intrusion detection systems, produced in an automatic manner. In order to support this process, some new approaches were proposed. One of the ways allowing to analyse the attack signature collections is the bioinformatics approach. Multiple sequence alignment is a fundamental tool in bioinformatics analysis. This tool allows to find similarities embedded in a set of DNA, RNA or amino acids sequences. The bioinformatics approach can be adapted to the network traffic identification and classification problem. The second section of this article presents different systems for network traffic analysis. Section three develops briefly two bioinformatics methods: Center Star and Clustal. The fourth section includes various test results. The last section discusses algorithm complexity and their suitability in network traffic analysis. 2 Network Traffic Classification and Identification Problem Computer threats are often a reason of unwanted incidents, which might cause irreversible damage to the system. From the scientific point of view, computer threats use certain vulnerabilities, therefore threats, vulnerabilities and exposures should be considered as a disjointed issue. One of the most popular and widely present group of Internet threats is Internet Worm [1]. Intrusion detection systems (IDS) [2] detect mainly malicious network flow by analyzing its content. An example of IDS are network intrusion detection systems (NIDS). NIDS are able to detect many types of malicious network traffic including worms. One of the most popular NIDS is Snort. It is an open source program available for various architectures. It is equipped with the regular expression engine which enhance the network traffic analysis. It analyses the network flow by comparing its content (not only a payload) with a specific list of rules. During this process, Snort utilises the regular expression engine. As a result of this analysis, Snort makes a decision regarding a particular network flow, whether it is malicious or regular. An example of simple Snort rule is shown in the table (Table 1). Table 1. An exemplary Snort rule alert udp $EXTERNAL_NET 22 -> $HTTP_SERVERS 22 ( msg:"misc slapper worm admin traffic"; content:" E "; depth:1; reference:url,isc.incidents.org/analysis.html?id=167; reference:url, classtype:trojan-activity; sid:1889; rev:5;)

3 Network Traffic Classification by Common Subsequence Finding 51 Snort works as a one thread application. Its action is to receive, decode and analyse the incoming packets. Snort allows us to identify unwanted malicious network flows by generating appropriate alerts. The main problem is that if the set of rules for Snort has a poor quality, we can expect many false positive or false negative alerts. Therefore classification of a network attack signature as well as improving their quality is a matter of great importance. Very often NIDS are combined with systems for automated generationof network attack signatures, such as Honeycomb [3,4]. Tools which join functions of NIDS and automated signature creation system are known as network early warning systems (NEWS). An exemplary network early warning system is Arakis [5]. NEWS develop very sophisticated methods for network traffic classification in order to speed up the process of identification of potential new Internet threats. Main problem concerning classification and identification of network flows is related to extraction of common regions from the network attack signatures sets [6]. Many techniques were developed. One of the techniques allowing the network security specialists to distinguish the regular network flow from the suspicious one, is usage of honeypots [7]. Honeypot is a specially designed system which simulates some network resources in order to capture the malicious flow. Generally, it consists of a part of an isolated, unprotected and monitored network with some virtually simulated computers which seem to have a valuable resources or information. Therefore, flow which occurs inside the honeypot is assumed to be malicious by the definition. Protocol analysis and patterndetection techniques performed on flows collected by honeypots result in network attack signatures generation. Generation of network attack signatures is mainly based on the longest common substring extraction [4,8]. One of the tools allowing generation of network attack signatures is Honeycomb. 3 Sequence Alignment Sequence alignment is a tool that can be used for extraction of common regions from the set of network attack signatures [6]. Extraction of common regions is shown in the figure (Fig. 1). It is somewhat similar to the biologist task. The biologist identifies newly discovered genes by comparing them to the family of genes whose function is already known. Comparison is performed by assigning those newly discovered genes to the known families by common subsequence finding. Problem of extraction of common regions from the network attack signature set is actually the multiple sequence alignment (MSA) [9] problem. The MSA is a generalization of the pairwise alignment [1]. Insertion of gaps is performed into each string so that resulting strings have equal length. Although the problem of multiple sequence GET / HTTP GET /a/a.htm HTTP GET / HTTP/1.1 Fig. 1. Problem of the longest common subsequence finding

4 52 K. Fabjański and T. Kruk alignment is an NP-complete task, there are many heuristics, probabilistic and other approaches that cope with that issue. A specific classification of those methods was proposed in [1]. Among so many algorithms, two classical approaches where chosen. The first algorithm and probably the most basic is a Center Star method. It was chosen for network traffic identification purpose. The main goal of this adaptation was to check whether this method can be used for network attack signature common region extraction. The second algorithm that was required for classification of network attack signatures is Clustal. It is worth to mention that in both algorithms a global alignment was used. Global alignment was computed using Needleman-Wunsch [2] algorithm. 3.1 Center Star Method The Center Star method [11] is classified to the group of algorithms with some elements of approximation. As it was mentioned before, multiple sequence alignment is an NP-complete problem. Presented method, Center Star, is an approximation of multisequence alignment. Thus, expected results can provide, but do not have to provide optimal solutions. The Center Star method consists of three main steps. Detailed description of Center Star method can be found in [11]. 3.2 Clustal Algorithm Clustering is the method which classifies particular objects into appropriate groups (clusters). Classification is performed based on the defined distance measurement technique. Every object from a single cluster should share a common trait. Data clustering is widely used in many science fields. We can find it in data mining, pattern recognition or bioinformatics. Data clustering algorithms can be divided into two main categories: hierarchical methods. latexdeschierarchical methods assign objects to the particular clusters by measuring the distance between them. In partitioning approach, new clusters are generated and then recomputing of the new cluster centers is performed, partitioning algorithms. latexdescpartitioning algorithms start with an initial partition and then by iterative control strategy optimize an objective function. Every cluster is represented by the gravity center or by its center object. In hierarchical methods, in turn, we can distinguish two different types: agglomerative. we begin the clustering procedure with each element as a separate cluster. Merging them into larger clusters, we come to the point where all elements can be classified into one big cluster, divisive. starts the process from one big set and then, divides it into successively smaller subsets Clustal is an example of agglomerative algorithm, also know as bottom-up approach. During implementation of Clustal algorithm some modification were performed. Modification were introduced in order to adapt this method for network attack signature classification purpose. Instead of profile representation of internal nodes in the dendrogram, consensus sequence was used. This was caused mainly by the fact that so far the

5 Network Traffic Classification by Common Subsequence Finding 53 scoring scheme used for network traffic classification has a very basic structure. Assuming that network flow can be represented as a sequence of extended ASCII characters, we have 1 for each match and otherwise. Although the standard objective function is the only reasonable solution for this moment, there is some research [12], which may result in new scoring scheme proposition. 4 Tests and Results This section provides detailed description concerning efficiency tests of Center Star method and Clustal algorithm. Tests were executed on the Intel(R) Xeon(TM) CPU 3.GHz computer equipped with kb of the total memory. Compiler used for compilation was g++ (v4.1). In the test procedure, external data sets, extracted from Arakis database, were used. What is most interesting, data were extracted from Arakis database. Therefore, some tests results were confronted with the Arakis algorithm results. Data, used in the test, consisted of real network signatures, suspected to be malicious. Arakis algorithms were mainly based on DBSCAN [19] clustering mechanism and edit distance measurement. 4.1 Center Star Methods Tests In the figure (Fig. 2 (a)) we have experimental data set presented. The horizontal axis represents the total number of characters (counted as a total sum of network attack signatures lengths). The vertical axis, in turn, represents the actual number of processed signatures. This data set was used in Center Star method tests. (Fig. 2 (b)) shows the actual execution time of the Center Star method. Time was measured in seconds. Next figure (Fig. 2 (c)) reflects the relation between the length of the multiple sequence alignment (MSA) and the common subsequence extracted from MSA. This relation provides us information regarding total length of the extracted subsequence. In the next test, we measured average length of single division in one signature. Assuming that single network attack signature may consist of many parts, this test provided us an approximate information concerning the quality of the extracted common subsequence. The greater the average length of a single division in network attack signature, the lower the probability of false positive or false negative alerts. Center Star method algorithm was compared with Arakis algorithms. The results of the test are shown in the figure (Fig. 2 (d)). In some cases Arakis algorithm seems to obtain better results than Center Star algorithm. Those situations were precisely investigated and it turned out that the reason of that had a background in different interpretation of the clusters representatives. In some cases Arakis algorithm does not update the clusters representatives even if some very long network attack signatures have expired. This has a consequences in overestimating the average single division length of a common subsequence. 4.2 Clustal Algorithm Tests Most of the Clustal tests were performed in order to compare results with those of Arakis algorithm. Data used in the tests, are shown in the figure (Fig. 3 (a)).

6 54 K. Fabjański and T. Kruk Number of signatures vs Number of signatures database (a) Center Star: The no. of characters vs the number of signatures Time Number of characters vs Time Number of characters Center Star method (b) The Center Star method execution time 2 vs Pattern length 16 vs Average length of divs Pattern length Average length of divs LCS length MSA length Center Star method Arakis algorithm (c) Center Star method: the MSA and LCS relation (d) Average single division length of common subsequence: Arakis vs Center Star Fig. 2. Center Star method tests Next test (Fig. 3 (b)) investigates the Clustal algorithm execution time in respect to the total number of processed characters. Time was measured in seconds. The (Fig. 4) represents the comparison of the Arakis clustering algorithm with the Clustal method. Comparison of those two methods was made in order to show the main advantage of the Clustal algorithm. The main advantage of the Clustal algorithm is the possibility of adjustment. Two subfigures (a,c) present the number of clusters produced by the Arakis and Clustal algorithms. In the first subfigure (a), we can notice that Clustal algorithm produces smaller number of clusters than Arakis solution. However, the precision for that test was rather poor (b). Smaller number of clusters was achieved using EPS1 =.1. EPS1 is an epsilon which determines whether the investigated signature should be classified to particular cluster. The condition where distance 1 between two signatures is greater than EPS1 is consider as satisfied. Precision is expressed as ratio of MSA length to the LCS length. The closer the ratio to 1,the better precision we obtain. Better precision is obtained at the greater number of clusters cost. In the next subfigures (c,d), EPS1 was set to.9. For this values, Clustal algorithm produces much more clusters than Arakis algorithm. On the other hand, precision 1 Levenshtein distance [18].

7 Network Traffic Classification by Common Subsequence Finding 55 8 vs Number of signatures 25 Number of characters vs Time Number of signatures Time database (a) Clustal: number of characters vs number of signatures Fig. 3. Clustal algorithm tests Number of characters Clustal algorithm (b) The Clustal algorithm execution time gained in those two tests was very high. All four subfigures (a,b,c,d) were generated by computing the Clustal algorithm with the standard scoring scheme 2. Parameters MATCH, MISMATCH and GAP _PENALTY were set according to standard scoring scheme. The reason why gap penalty had the same value assigned as mismatch was straightforward. So far there is no scoring scheme for ASCII alphabet, therefore only trivial approach was presented. In this approach gap penalties were not considered. However in extended test procedure, different values for gap penalties were assigned. Those test results were preliminary and thus they were not published in this paper. 5 Evaluation and Conclusions In this section, detailed estimation of the main methods are given. Estimation was based on theoretical assumption and faced with empirical implementation of both methods. 5.1 Center Star Method Complexity The Center Star method consists of three main phases, after which multiple sequence alignment is found. In the first phase of this method, all pairwise alignment are formed (distance matrix calculation). The complexity of this phase, in the worst case, is O((N 2 + 3N) ( K 2 ) ),wherek is the number of input signatures. The second phase is related to finding the signature which is "the closest" to others. This step requires O(K). In the last step, multiple sequence alignment is formed. The last phase computational complexity of the Center Star method is O(2N ( K 2 ) ). The Center Star method provides essential functionality in common motif finding process. It allows us to extract the common subsequence from the multiple sequence alignment. This procedure requires O(KN). 2 1 for match and otherwise.

8 56 K. Fabjański and T. Kruk Number of cluster vs Number of cluster Clustal algorithm Arakis algorithm (a) dist = 1, MATCH = 1, MISMATCH =, GAP_PENALTY =, EPS1 =.1 Pattern length vs Pattern length LCS length MSA length (b) dist = 1, MATCH = 1, MISMATCH =, GAP_PENALTY =, EPS1 =.1 35 vs Number of cluster 8 vs Pattern length Number of cluster Clustal algorithm Arakis algorithm (c) dist = 1, MATCH = 1, MISMATCH =, GAP_PENALTY =, EPS1 =.9 Pattern length LCS length MSA length (d) dist = 1, MATCH = 1, MISMATCH =, GAP_PENALTY =, EPS1 =.9 Fig. 4. Number of clusters vs precision: comparison of the Arakis algorithm with Clustal algorithm 5.2 Clustal Algorithm Complexity In Clustal algorithm we have very complicated and time consuming procedures, including distance matrix calculation, dendrogram creation and clustering mechanism. All those three phases have the following computational complexities: 1. Distance matrix calculation - O((N 2 +3N) ( ) K 2 2. Dendrogram creation - O( ( ) ( K 2 +2(K 1) [ K ) 2 +N 2 +4(N + K)]) 3. Clustering (reading the dendrogram and writing clusters to the file) - O(K) All calculations regarding computational complexity were based on theoretical assumptions and source code analysis. Run-time dependencies shown in (Fig. 3 (b)) seem to confirm the results. Moreover, presented computational complexities do not compromise the theoretical assumptions regarding complexities of presented methods. To sum up, Clustal and Center Star algorithms have got some advantages and disadvantages. One of the biggest drawback of both algorithms is their high run-time complexity. On the other hand, the whole task is an NP-complete problem, so we cannot expect better run-time complexity. Clustal, as well as Center Star method, can be modified in order to decrease this complexity. In the Center Star method instead of finding

9 Network Traffic Classification by Common Subsequence Finding 57 all pairwise alignment, we can take a randomly selected sequence from the set of input signatures. After that, we can form the multiple sequence alignment by computing all pairwise alignments of the chosen sequence with the rest of sequences. As a result, we would omit the process of choosing the center sequence, which involves computation of all pairwise alignments in the set of network attack signatures. On the other hand, in Clustal algorithm, instead of using Neighbor-Joining algorithm [13][15][16][17] for dendrogram creation, we could have used an Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [14]. The UPGMA is faster than Neighbor-Joining algorithm at precision expense. This improvements lead to the better time complexity, but on the other hand, they result in worse common subsequence extraction and worse network traffic classification. In our case, better time complexity might occur to be more important than worse common subsequence extraction. Extraction of the common subsequence during preprocessing phase should be performed in online mode. On the other hand, clustering of already created signatures must be performed in offline mode. This paper involves the process of classification and identification of network attack signatures only. Therewere no other tests checking the influence on the number of false positive or false negative alerts. Such experiments will be performed after we finally prove that bioinformatics methods are suitable for suspicious network traffic analysis. In further work, it is expected that adapted methods will be constantly developed. Although, the results of tests performed on the real network traffic data are very promising, still there is an issue related to new scoring function proposition. Therefore, further work will focus on aspects related to statistics regarding the network traffic. Statistics will allow in future, to represent the particular families of Internet threats as a profile structures. The profile structure will allow to create scoring matrices, similar to those which can be met in bioinformatics. Moreover, profile structures will allow to deal with Internet threats such as polymorphic worms. Furthermore, profiles will allow us to identify those regions in the network traffic patterns, which remain unchanged even in case of polymorphic Internet threats. References 1. Nazario, J.: Defense and Detection Strategies against Internet Worms. Artech House, Boston & London (24) 2. Kreibich, C., Crowcroft, J.: Automated NIDS Signature Creation using Honeypots. University of Cambridge Computer Laborator (23) 3. Kreibich, C., Crowcroft, J.: Honeycomb - Creating Intrusion Detection Signatures Using Honeypots. In: Proceedings of the Second Workshop on Hot Topics in Networks (Hotnets II). Cambridge Massachusetts: ACM SIGCOMM, Boston (23) 4. Rzewuski, C.: Bachelor s Thesis: SigSearch - automated signature generation system (in Polish). Warsaw University of Technology, The Faculty of Electronics and Information Technology (25) 5. Kijewski, P., Kruk, T.: Arakis - a network early warning system (in Polish) (26) 6. Kreibich, C., Crowcroft, J.: Efficient sequence alignment of network traffic. In: IMC 26: Proceedings of the 6th ACM SIGCOMM on Internet measurement, isbn , pp ACM Press, Brazil (26)

10 58 K. Fabjański and T. Kruk 7. Bakos, G., Beale, J.: Honeypot Advantages & Disadvantages, LasVegas, pp. 7 8 (November 22) 8. Kreibich, C.: libstree A generic suffix tree library, 9. Gusfield, D.: Efficient method for multiple sequence alignment with guaranteed error bound. Report CSE-91-4, Computer Science Division, University of California, Davis (1991) 1. Reinert, K.: Introduction to Multiple Sequence Alignment. Algorithmische Bioinformatik WS 3, 1 3 (25) 11. Bioinformatics Multiple sequence alignment, Kharrazi, M., Shanmugasundaram, K., Memon, N.: Network Abuse Detection via Flow Content Characterization. In: IEEE Workshop on Information Assurance and Security United States Military Academy (24) 13. Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 2, (1987) 14. Tajima, F.: A Simple Graphic Method for Reconstructing Phylogenetic Trees from Molecular Data. In: Reconstruction of Phylogenetic Trees, Department of Population Genetics, National Institute of Genetics, Japan, pp (199) 15. The Neighbor-Joining Method, Weng, Z.: Protein and DNA Sequence Analysis BE561. Boston University (25) 17. Multiple alignment: heuristics, Levenshtein, V.: Binary codes capable of correcting insertions and reversals. Soviet Physics Doklady, (1966) 19. Ester, M., Kriegel, H., Sander, J., Xiaowei, X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD 1996), Institute for Computer Science, University of Munich (1996) 2. Fabjañski, K.: Master s Thesis: Network Traffic Classification by Common Subsequence Finding. Warsaw University of Technology, The Faculty of Electronics and Information Technology, Warsaw (27)

Polygraph: Automatically Generating Signatures for Polymorphic Worms

Polygraph: Automatically Generating Signatures for Polymorphic Worms James Newsome Brad Karp Dawn Song Presented by: Jeffrey Kirby Overview Motivation Polygraph Signature Generation Algorithm Evaluation