School of Engineering and Mathematical Sciences. Packet Pattern Matching for Intrusion Detection


School of Engineering and Mathematical Sciences
Packet Pattern Matching for Intrusion Detection
by Alireza Shams
Project for the Degree of MSc in Telecommunications and Networks
Supervisor: Prof Tom Chen
London 19 September /9/2014

Project Title: Packet Pattern Matching for Intrusion Detection
Student: Alireza Shams
Supervisor: Prof Tom Chen

Abstract

In today's networks, cyber-attacks cause damage or theft and disrupt services, with enormous economic and financial impact. Current methods of protecting end-user systems rely on users to install new system patches or upgrade security software, which happens slowly. In addition, firewalls and intrusion detection systems in the network attempt to detect and block attacks. These systems require accurate, up-to-date signatures and must operate in real time at high speeds. Software implementations alone cannot meet these performance goals; a combination of software and hardware can be more effective for high-performance pattern matching.

Many pattern matching and searching algorithms exist, and they have improved over the years. Students and experts have carried out extensive research in this widely used area, and companies have spent large budgets on research to acquire new methods or to find the searching algorithms that suit their purposes. Searching and pattern matching algorithms are used in a wide range of software, from simple dictionary lookups to intelligent searches over big data, and they play a fundamental role in information technology and computer science.

This research investigates algorithms for single-pattern and multi-pattern matching that can enable high-performance software or hardware implementations for network filtering and intrusion detection. The methods and their results are compared in order to gain a comprehensive understanding of the basic methods involved in packet pattern matching.

Table of Contents

Abstract
Introduction
Previous Work
Theory, Methodologies Developed/Used
Packet Pattern Matching Algorithms
Single Keyword matching algorithm
Boyer-Moore Algorithm
Boyer-Moore-Horspool Algorithm
Multi-Keyword Pattern Matching Algorithms
Aho-Corasick Algorithm
tuple Aho-Corasick Algorithm
Pattern Matching Algorithms and Softwares in Security
Wireshark
Snort
Pattern Matching in Other Applications
Antiviruses
Spam Detection Softwares
Hardware
Experiment and Work Done
Experiment on Aho-Corasick
Building The Searching Tree
Parallel Boyer-Moore Analysis
Theoretical Approaches
Practical Approaches
Experiment over Boyer-Moore and Boyer-Moore-Horspool
Comparison of Aho-Corasick and Boyer-Moore Algorithms
Discussion of Results
Discussion of Aho-Corasick
Discussion of Boyer-Moore
Boyer-Moore Performance
Comparison of Boyer-Moore and Boyer-Moore-Horspool Matching Algorithms
Discussion of Boyer-Moore in Comparison with Aho-Corasick
Acknowledgment
Appendix C
References

Chapter 1: Introduction

Security policy enforcement is most effective when it is an inherent component of the network. Packet content scanning at high speed has become extremely important because of its applications in network security, network monitoring, load balancing, and traffic management in general. Host-based solutions are useful but have drawbacks: they do not scale well to large enterprises, and they detect threats only after those threats have reached their targets. Continually updating antivirus software and installing patches is necessary but cumbersome for large numbers of enterprise network clients. Network-based defences can potentially block attacks before they reach hosts and protect large host populations more effectively. Firewalls and network intrusion detection systems (NIDS) are well suited to this purpose. They scan packets to detect malicious intrusions or denial-of-service (DoS) attacks [1].

For instance, Snort is a popular open-source NIDS with a rule set consisting of thousands of community-written rules. Note that Snort has no real GUI or easy-to-use administrative console. Wireshark is another well-known open-source network and packet analyser, available for Unix and Windows. It is used for network troubleshooting, analysis, software and communications protocol development, and education.

However, packet inspection and filtering at wire speed with decoding up to layer 7 remains challenging, especially when scanning for thousands of patterns. Another difficulty is that virus signature databases often contain correlated patterns. Snort, for example, implements its pattern matching algorithms in software. It can handle link rates of only up to 100 Mbps [2] under normal traffic conditions, and its worst-case performance is lower still. These rates are not sufficient for even medium-speed access or edge networks.

Nevertheless, many projects and research efforts continue to address different aspects of packet pattern matching for intrusion detection and prevention. High-performance packet inspection and filtering must be carried out in combined software and hardware with appropriate algorithms, which are the subject of current research.

Chapter 2: Previous Work

Packet pattern matching is one of the well-studied classical problems in computer science. The most notable algorithms include the Aho-Corasick and Commentz-Walter algorithms, which can be considered extensions of the well-known KMP and Boyer-Moore single-pattern matching algorithms respectively. Both algorithms are suitable only for software implementation and suffer from throughput limitations. The current version of Snort uses an optimized Aho-Corasick algorithm [3].

In the past few years, several interesting algorithms and techniques have been proposed for multi-pattern matching in the context of network intrusion detection. The hardware-based techniques make use of commodity search technologies such as TCAM [1] or reconfigurable logic/FPGAs [4] [5]. Some of the FPGA-based techniques use the on-chip logic resources to compile patterns into parallel state machines or combinatorial logic. Although very fast, these techniques are known to exhaust most of the chip resources with just a few thousand patterns and require bigger, more expensive chips. Scalability with pattern set size is therefore the primary concern with purely FPGA-based approaches [3].

Pattern matching is a computationally intensive part of network intrusion detection systems. A network packet pattern-matching algorithm based on graphics processing units (GPUs) has been proposed to use the computational power of GPUs to accelerate pattern-matching operations and so increase overall processing throughput [6]. According to the reported experimental results, the algorithm achieved a maximum traffic processing throughput of over 2 Gbit/s, which indicates that a GPU-based approach can effectively enhance the performance of network intrusion detection systems.

Packet content scanning (also known as layer-7 filtering or payload scanning) is crucial to network security and network monitoring applications.
It is useful for detecting and filtering packets that contain worms and other attack code, and also for newly emerging edge network services. Examples of these emerging edge network services include high-speed firewalls, which protect end hosts from security attacks; HTTP load balancing, which intelligently redirects packets to different servers based on their HTTP requests; and Extensible Markup Language (XML) processing, which facilitates the sharing of data across different systems [7].

Another approach is a memory-efficient parallel pattern matching scheme with heterogeneous bit-split string matchers for two subsets of short and long target patterns [8]. Several parameters are scaled to optimize the memory usage of the string matchers for each subset, so a memory-efficient pattern matching engine can be obtained by mapping short and long target patterns onto just two different types of string matcher.

The main objective of the proposed research is to develop pattern matching models for intrusion detection. As networks become more widely and publicly used, their security has to be treated as a serious issue. The general problem of pattern matching is considered fundamental in computer science and has been researched thoroughly over recent decades. Still, when applied to the network domain of recent years, the traditional algorithms fail to meet current challenges. The first challenge is the continual increase in Internet traffic rates, which requires a design that scales in terms of speed and memory usage. The second challenge arises from the increase in Web traffic compression, driven by the growing popularity of Web surfing on mobile devices. The security device is forced to decompress this traffic before inspection, which in turn incurs processing and space penalties. The third challenge is the requirement for a solution that is resilient to attacks that overload the security device [9].

At the heart of every intrusion detection system (IDS) there is a pattern matching algorithm that searches for specified patterns [10][11].
Pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern [12]. The first step in using any intrusion detection/prevention system is therefore pattern matching. Many different algorithms can search for and find specified patterns, but they have different attributes and traits that distinguish them. Comparing pattern matching algorithms reveals the pros and cons of each and helps in using them efficiently. Although some of these algorithms were designed long ago, comparing them still provides a deeper understanding of how they behave.

Intrusion detection/prevention is a broad area of network security that combines various methods from computer science, from artificial intelligence to big data clustering. Packet pattern matching is a worthwhile starting point for this complicated and widely deployed field.

An intrusion is defined as compromising a computer system by breaking its security or causing it to enter an insecure state. The act of intruding on, or gaining unauthorized access to, a system typically leaves traces that can be discovered by intrusion detection systems [13]. More specifically, the intelligence and information provided by an IDS depend on how the system is used, which is as important as the choice of IDS itself. Indeed, there are many ways to use IDSs. If and when an IDS discovers an intrusion, regardless of how it has been defined, it is common for the system to make a record or report of the intrusion, typically by logging it or by generating an alert that is sent to an appropriate party [14].

Chapter 3: Theory, Methodologies Developed/Used

This dissertation is based on research into two fundamental algorithms in packet pattern matching and on experiments measuring their efficiency, followed by practical conclusions. The main idea was to collect useful information about the Boyer-Moore and Aho-Corasick matching algorithms and to provide an appropriate practical comparison. Many applications, both commercial and freely available online, can capture and monitor data packets. Wireshark was used to capture data exchanged between my laptop, acting as the end user, and the internet provider. Regardless of the content of the captured data, I used Java and Matlab as the platforms for the matching process. In short, I captured data with Wireshark and ran experiments on pattern matching algorithms in Java and Matlab.

Chapter 4: Packet Pattern Matching Algorithms

Pattern matchers are programs that take a string X as input and produce as output the locations in X at which patterns, or keywords, appear as substrings. The simplest patterns are single keywords that match themselves. A broader class of patterns is a set of keywords. There are three other pattern-matching problems: approximate string matching, multidimensional pattern matching, and hardware matching algorithms [15]. These last three were not considered in this research. The most important algorithm from each of the first two categories is surveyed and compared in this dissertation:

1. Single-Keyword Pattern Matching Algorithms: Boyer-Moore
2. Multiple-Keyword Pattern Matching Algorithms: Aho-Corasick

Boyer-Moore and Aho-Corasick are two of the earliest pattern-matching algorithms and have been used in many network intrusion detection/prevention systems (NIDS/P) [15]. In this section I review a classification of intrusion detection systems and explain the two methods. This classification makes it easier to concentrate on comparing the methods and is helpful for understanding the rest of the thesis. I first explain the single-keyword pattern matching algorithm (Boyer-Moore) and then the multiple-keyword pattern matching algorithm (Aho-Corasick), followed by a comparison of the two.

4.1 Single Keyword matching algorithm

In single-keyword matching, a fixed, non-empty keyword and an input string are given, and the aim is to locate all occurrences of the keyword in the input text [15]. Any multi-keyword search can in principle be carried out by repeating a single-keyword algorithm for each keyword, although this is rarely sensible because several multi-pattern matching algorithms perform very well. In a network intrusion detection system (NIDS), data packets are the input strings, each with a finite number of characters, and the matching algorithm tries to find the given keywords in these packets. Packets are built from an alphabet of size 256 (quite large), and the keywords are very short compared with the volume of input (data packets). Furthermore, the set of keywords is known before the algorithm begins processing the input. When this is not the case, and efficient modifications to the keyword set are needed, the searching process is known as dynamic string matching [16].

For the single-keyword algorithms I assume that the keyword x has length m and the input string y has length n. Each keyword x of m characters drawn from our alphabet of size 256 is one of 256^m possible strings. We do not consider multiple input strings separately, since the algorithms are simply repeated for each input string; the pre-computation needs to be done only once, before the first input string is processed. Note that the pseudocode for the pattern matching algorithms in this thesis follows the convention of outputting every index of a character (byte) in y that matches the leftmost character of an occurrence of the keyword x (or of a keyword from the keyword set, in the multiple-pattern matching algorithms presented in the following chapters).

The next section considers the basic single-keyword matching algorithm, the Boyer-Moore searching algorithm.
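As a point of reference for the notation above, a brute-force matcher can be sketched in a few lines of Python. This is an illustrative baseline only, not one of the algorithms studied in this thesis; the function name naive_search and the sample input are assumptions made for the example. It follows the same output convention, reporting each index in y at which an occurrence of x begins, and it performs up to m comparisons at each of the n - m + 1 window positions.

```python
def naive_search(x: bytes, y: bytes) -> list:
    """Return every index j in y at which the keyword x starts."""
    m, n = len(x), len(y)
    matches = []
    for j in range(n - m + 1):      # every possible window start in the input
        if y[j:j + m] == x:         # compare the whole window with the keyword
            matches.append(j)
    return matches


# Hypothetical sample input, chosen only for illustration.
print(naive_search(b"virus", b"a virus in a virus file"))  # prints [2, 13]
```

The algorithms in the rest of this chapter produce the same output while trying to inspect fewer input characters.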

4.1.1 Boyer-Moore Algorithm

Boyer-Moore is a popular choice, and it has served as the basis for several other important algorithms. It uses a suffix-based approach and is both the most basic and among the most efficient single-keyword pattern matching algorithms. Although what counts as a large alphabet is subjective, in string matching 256 is fairly large. What matters here is that as the size of the alphabet (the number of characters in it) grows, so does the potential for larger shifts. The algorithm moves from left to right over the input but performs the character comparisons inside its sliding window of size m in reverse order (right to left). It uses two pre-computed functions of the keyword, which let it run in as little as O(n/m) time (the best case) by skipping parts of the input that need not be checked during matching [15]. The Boyer-Moore algorithm is a combination of two simple rules, which can work separately but are more effective together:

1. Bad character shift rule
2. Good suffix shift rule

First, the Boyer-Moore algorithm builds the bad character shift table to make the search more efficient (see Algorithm 4.1). The table holds, for every character in the alphabet, an integer that represents how far the algorithm may shift when a mismatch occurs (see Table 4.1). Every character that does not occur in the keyword gets a value equal to the length of the keyword. Because comparison runs in reverse order, if a character is encountered in the input that does not appear in the keyword, we can shift entirely past it, as in the example below (steps 1 through 3). Every character of the keyword gets a value equal to its distance from the rightmost character of the keyword. Therefore, if a character a starts the keyword and does not appear anywhere else in it, the value of the bad character shift table (call it bctable) at that position, bctable[a], is m - 1, where m is the keyword length (see Table 4.1).
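The table construction described above can be sketched in Python as follows. This is a minimal sketch that assumes the keyword is a bytes object over a 256-symbol alphabet; the function name bad_character_table is illustrative and does not come from the thesis code.

```python
def bad_character_table(x: bytes, alphabet_size: int = 256) -> list:
    """Build the Boyer-Moore bad character shift table for keyword x."""
    m = len(x)
    table = [m] * alphabet_size      # bytes absent from the keyword shift by m
    for i, byte in enumerate(x):
        table[byte] = m - i - 1      # distance from the rightmost keyword byte
    return table


table = bad_character_table(b"virus")
for ch in b"virus":
    print(chr(ch), table[ch])        # v 4, i 3, r 2, u 1, s 0
print("other", table[ord("e")])      # other 5
```

Because later occurrences overwrite earlier ones in the second loop, a character that appears more than once in the keyword ends up with the distance of its rightmost occurrence.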

When the character does exist in the keyword, the bad character shift lets the algorithm immediately realign the rightmost occurrence of the mismatched character in the keyword with the character in the input that caused the mismatch (see the steps below). Because this kind of realignment can result in a negative shift in some situations, there is a second, more complex pre-computed function, which is also looked up at match time; the maximum of the values returned by the two functions is used. The second pre-computation comes into play in the last step (see step 6), which in the example below finally aligns the keyword under the letters virus in the input, where a match is found [17] [18]. A simple example with the keyword virus is given below (see Example 4.1).

Example 4.1:

[Diagram: input text with the keyword virus aligned beneath it]

Character        V   I   R   U   S   All other characters
Integer (shift)  4   3   2   1   0   5

Table 4.1 Bad character table of the keyword virus

Step 1: Allocate shift values to the characters that exist in the keyword (all other characters have the value 5).

Step 2: At first, the input character under the end of the window is mismatched with the s at the end of the keyword.

[Diagram: input text with the keyword aligned beneath it]

Step 3: The pattern shifts by 5, the table value of the mismatched input character (every character that is not in the keyword has the value 5). Next, the character in the input (files) is compared with the rightmost character of the keyword.

[Diagram: input text with the keyword aligned beneath it]

Step 4: The pattern shifts by 5, the table value of e (every character that is not in the keyword has the value 5). Next, the character in the input is compared with the rightmost character of the keyword.

[Diagram: input text with the keyword aligned beneath it]

Step 5: The pattern shifts by 5, the table value of the mismatched input character (every character that is not in the keyword has the value 5). Next, the character ( ) in the input is compared with the s in the keyword.

[Diagram: input text with the keyword aligned beneath it]

Step 6: At this step the s in the input matches the s in the keyword, so we have found our first matching character, which is the last character of the keyword.

[Diagram: input text with the keyword aligned beneath it]

Algorithm 4.1 Boyer-Moore Bad Character Shift Pre-computation

procedure ComputeBadCharacterShiftsBM(x, m)
    Input: x ← array of m bytes representing the keyword
           m ← integer representing the keyword length
    alphabet_size ← 256
    table ← new Array[alphabet_size]        ▷ Bad Character Shift Table
    for i = 0 → alphabet_size - 1 do
        table[i] ← m
    end for
    for i = 0 → m - 1 do
        table[x[i]] ← m - i - 1
    end for
    return table
end procedure

The bad character shift rule described above is the first stage of the Boyer-Moore searching algorithm. On its own, however, it does not perform well enough when the keyword contains repeated segments. The second step of Boyer-Moore is to build the good suffix shift table, which improves performance when used together with the bad character shift table. This table is stored in an array of size m (see Algorithm 4.2). The algorithm creates an array holding information about how the keyword matches against shifts of itself. The good suffix shift table provides more efficient shifting rules in order to avoid needless comparisons [19].

Imagine that our keyword contains a couple of repeated segments that differ in just one character. If we avoid repeating the matching process for every repeated segment, by building a table that stores these special traits of the keyword, we can improve Boyer-Moore's performance. This is what the good suffix shift table offers.

As an example, suppose a mismatch occurs between the character x[i] = a of the keyword and the character y[j + i] = b of the input during an attempt at position j (of the keyword sliding window), and let u = x[i + 1 ... m - 1] = y[j + i + 1 ... j + m - 1] be the part already matched. The good suffix shift consists in aligning the segment u with its rightmost occurrence in x that is preceded by a character different from x[i], the mismatched character (see Figure 4.1 part i). If no such segment exists, the second form of the shift consists in aligning the longest suffix v of u with a matching prefix of x (see Figure 4.1 part ii) [15].

Fig 4.1 Boyer-Moore algorithm's good suffix shift examples [15]

Algorithm 4.2 Boyer-Moore Good Suffix Shift Pre-computation

procedure ComputeGoodSuffixShiftsBM(x, m)
    Input: x ← array of m bytes representing the keyword
           m ← integer representing the keyword length
    table ← new Array[m]                    ▷ Good Suffix Shift Table
    suf_table ← new Array[m]                ▷ Temporary Suffix Information Array
    suf_table[m - 1] ← m
    g ← m - 1
    j ← 0
    for i = m - 2 → 0 do                    ▷ Building suf_table
        if i > g and suf_table[i + m - 1 - j] < i - g then
            suf_table[i] ← suf_table[i + m - 1 - j]
        else
            if i < g then
                g ← i
            end if
            j ← i
            while g ≥ 0 and x[g] = x[g + m - 1 - j] do
                g ← g - 1
            end while
            suf_table[i] ← j - g
        end if
    end for
    for i = 0 → m - 1 do                    ▷ Building table
        table[i] ← m
    end for
    j ← 0
    for i = m - 1 → -1 do
        if i = -1 or suf_table[i] = i + 1 then
            while j < m - 1 - i do
                if table[j] = m then
                    table[j] ← m - 1 - i
                end if
                j ← j + 1
            end while
        end if
    end for
    for i = 0 → m - 2 do
        table[m - 1 - suf_table[i]] ← m - 1 - i
    end for
    return table
end procedure

Having explained the two basic rules and the tables that make up the Boyer-Moore algorithm, we can now state its performance: pre-computation takes O(m + |Σ|) time, where |Σ| is the size of the alphabet. With a large alphabet the algorithm is frequently expected to make large shifts of length m, bypassing large sections of the input. In the best case a running time of O(n/m) can be achieved, as mentioned above, but worst-case performance is still quadratic, O(nm) [16]. As a result, the performance of the Boyer-Moore searching algorithm depends heavily on keyword length: when searching for a long single keyword with non-repeated segments (varied characters), Boyer-Moore performs very well. In practice, Boyer-Moore matching can be very fast and efficient, and it has been used as a fundamental algorithm in much software and hardware involved in searching and matching, e.g. security software, big data and search engines.

As discussed before, Boyer-Moore is an exact string matching algorithm used for single-pattern matching. It uses two functions to move the sliding window to the right. Following the two pseudocode listings for the bad character shift rule and the good suffix shift rule, we now need the Boyer-Moore searching algorithm itself. Boyer-Moore matching is a combination of these two simple rules and only requires some pre-computation before matching begins. The Boyer-Moore pseudocode is given below (the pseudocode is shown only to aid understanding; for more information about Boyer-Moore code, see the references):

Algorithm 4.3 Boyer-Moore single-keyword Matching

procedure BM(x, m, y, n)
    Input: x ← array of m bytes representing the keyword
           m ← integer representing the keyword length
           y ← array of n bytes representing the text input
           n ← integer representing the text length
    good_table ← array of m elements        ▷ Algorithm 4.2
    bad_table ← array of 256 elements       ▷ Algorithm 4.1
    ▷ Pre-computation
    good_table ← ComputeGoodSuffixShiftsBM(x, m)
    bad_table ← ComputeBadCharacterShiftsBM(x, m)
    j ← 0
    while j ≤ n - m do                      ▷ Matching
        i ← m - 1
        while i ≥ 0 and x[i] = y[i + j] do
            i ← i - 1
        end while
        if i < 0 then
            output j
            j ← j + good_table[0]
        else
            j ← j + MAX(good_table[i], bad_table[y[i + j]] - m + 1 + i)
        end if
    end while
end procedure
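The pre-computation and matching procedures can be sketched together in Python as follows. This is a minimal sketch, not the Java or Matlab code used in the experiments: it assumes that the keyword and the input are bytes objects, and the function names are illustrative. The variable f plays the role of j in the suffix-building loop of the good suffix pre-computation.

```python
def bad_character_table(x, alphabet_size=256):
    m = len(x)
    table = [m] * alphabet_size
    for i, byte in enumerate(x):
        table[byte] = m - i - 1
    return table


def good_suffix_table(x):
    m = len(x)
    suf = [0] * m                    # suf[i]: longest suffix of x ending at i
    suf[m - 1] = m
    g, f = m - 1, 0
    for i in range(m - 2, -1, -1):
        if i > g and suf[i + m - 1 - f] < i - g:
            suf[i] = suf[i + m - 1 - f]
        else:
            if i < g:
                g = i
            f = i
            while g >= 0 and x[g] == x[g + m - 1 - f]:
                g -= 1
            suf[i] = f - g
    table = [m] * m
    j = 0
    for i in range(m - 1, -1, -1):   # case: a suffix of u matches a prefix of x
        if suf[i] == i + 1:
            while j < m - 1 - i:
                if table[j] == m:
                    table[j] = m - 1 - i
                j += 1
    for i in range(m - 1):           # case: u reoccurs inside x
        table[m - 1 - suf[i]] = m - 1 - i
    return table


def boyer_moore(x, y):
    """Return all start indices of keyword x in input y."""
    m, n = len(x), len(y)
    if m == 0 or m > n:
        return []
    good = good_suffix_table(x)
    bad = bad_character_table(x)
    matches = []
    j = 0
    while j <= n - m:
        i = m - 1
        while i >= 0 and x[i] == y[i + j]:   # compare right to left
            i -= 1
        if i < 0:
            matches.append(j)
            j += good[0]
        else:
            j += max(good[i], bad[y[i + j]] - m + 1 + i)
    return matches


print(boyer_moore(b"virus", b"a virus in a virus file"))  # prints [2, 13]
```

Taking the maximum of the two shifts is what keeps the window moving forward: the bad character term can be zero or negative, while the good suffix term is always at least 1.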

4.1.2 Boyer-Moore-Horspool Algorithm

Boyer-Moore is fast in practice, as discussed above, but several improvements have been made to it. It is useful to introduce the first of these, which was made by Horspool [20]. Horspool's algorithm changes the main Boyer-Moore matching algorithm only slightly, but in practice it gives very good results. Horspool noted that the bad character shift was usually the longest shift, and provided a simpler way to avoid some unnecessary comparisons, explained below. At each position of the window over the input we compare the rightmost character of the keyword with the input and continue until we either find the keyword or hit a mismatch. Then, regardless of the result (match or mismatch), we shift the window according to the input character aligned with the rightmost character of the keyword, moving to the next occurrence of that character in the keyword. In terms of the bad character shift table, the entry for the rightmost character of the keyword no longer holds 0 but the distance to the next occurrence of that character in the keyword. An example is given below for clarification.

Keyword: abca

Character        a   c   b   Other characters
Integer (shift)  0   1   2   4

Table 4.2 Boyer-Moore bad character table

Character        a   c   b   Other characters
Integer (shift)  3   1   2   4

Table 4.3 Boyer-Moore-Horspool bad character table

As you can see, it is a simple alteration with a worthwhile outcome. The pseudocode for the Boyer-Moore-Horspool matching algorithm is given in Algorithm 4.4 [15].

Horspool's was not the only improvement proposed for the Boyer-Moore matching algorithm. Many newer suffix-based matching algorithms follow the rules of Boyer-Moore. Improvements to the Boyer-Moore-Horspool algorithm for searching in English text are discussed by Baeza-Yates (1989) and Sunday (1990) [21].

Algorithm 4.4 Boyer-Moore-Horspool single-keyword Matching [15]

procedure BMH(x, m, y, n)
    Input: x ← array of m bytes representing the keyword
           m ← integer representing the keyword length
           y ← array of n bytes representing the text input
           n ← integer representing the text length
    alphabet_size ← 256
    ▷ Pre-computation
    table ← new Array[alphabet_size]        ▷ Horspool Shift Table
    for i = 0 → alphabet_size - 1 do
        table[i] ← m
    end for
    for j = 0 → m - 2 do
        table[x[j]] ← m - j - 1
    end for
    i ← 0
    ▷ Matching
    while i ≤ n - m do
        j ← m - 1
        while j ≥ 0 and x[j] = y[i + j] do
            j ← j - 1
        end while
        if j < 0 then
            output i
        end if
        i ← i + table[y[i + m - 1]]         ▷ Shift
    end while
end procedure
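The Horspool variant can be sketched in Python as follows. This is a minimal sketch that assumes bytes input; the function name horspool is illustrative. Two details are choices made here so that the sketch behaves as the text describes: the shift table is filled from the first m - 1 keyword bytes only, so the entry for the rightmost byte is never 0 and the window always advances, and the comparison loop runs down to index 0 so that the leftmost keyword byte is also checked.

```python
def horspool(x: bytes, y: bytes) -> list:
    """Return all start indices of keyword x in input y."""
    m, n = len(x), len(y)
    if m == 0 or m > n:
        return []
    table = [m] * 256
    for j in range(m - 1):               # rightmost keyword byte is excluded
        table[x[j]] = m - j - 1
    matches = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and x[j] == y[i + j]:   # compare right to left
            j -= 1
        if j < 0:
            matches.append(i)
        i += table[y[i + m - 1]]         # shift by the byte under the window end
    return matches


print(horspool(b"abca", b"abcabcabca"))  # prints [0, 3, 6]
```

With the keyword abca the table entry for a is 3, which is why consecutive overlapping matches in the sample input are found three positions apart.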

4.2 Multi-Keyword Pattern Matching Algorithms

Multi-keyword matching algorithms find occurrences of several keywords at once, where the keywords may have different lengths. We represent the keywords as the array x[0, 1, 2, ..., p - 1] of size p, and their lengths as the integer array m[0, 1, 2, ..., p - 1], also of size p. We let M be the sum of the lengths of all keywords, and use the same input y of length n as before. We could run a single-keyword matching algorithm several times to find our patterns, but on the one hand this is costly for the system and on the other hand the long runtime is undesirable. We therefore need methods and algorithms different from the single-keyword ones: algorithms that offer an acceptable runtime and efficient memory requirements. Because several keywords have to be found, memory requirements play an important role, but runtime must always be considered as well. Most of the experiments in this research assumed that enough memory was available, so the main concern is runtime.

As mentioned for the Boyer-Moore algorithm, sublinear-running-time algorithms are those that take cn steps, where c = 1/d and d ≥ 1. Here d is a constant: the keyword length in a single-keyword algorithm, or the length of the shortest keyword in a multiple-keyword algorithm. Elsewhere, a sublinear running time typically indicates o(n) [22] [23]. Sublinearity has been used in average-case analyses to assess the performance of new string matching methods [24] [25].

Researchers have proposed several algorithms and methods for multi-pattern matching, but the most fundamental and well known is Aho-Corasick [24]. Here I provide some information and experiments to build a good understanding of the Aho-Corasick matching algorithm.

4.2.1 Aho-Corasick Algorithm

There are many approaches to recognizing patterns that involve finite automata (also referred to as finite state machines). The Aho-Corasick algorithm is one such classic algorithm [24]. It is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick: a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text, matching all patterns simultaneously.

A finite-state machine (FSM) or finite-state automaton (plural: automata), or simply a state machine, is a mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It can change from one state to another when initiated by a triggering event or condition; this is called a transition. A particular FSM is defined by a list of its states and the triggering condition for each transition [25].

Aho-Corasick DFAs are widely used in matching and detection software. Snort's Aho-Corasick DFA [2] has 77,182 states for 31,094 patterns, raising the question of how to store it efficiently in memory. The alternatives naturally trade memory space against execution time. In addition, most security tools (including Snort) divide their patterns into several sets, according to the type of traffic [9].

To aid understanding of Aho-Corasick, I give some background on finite automata and the deterministic finite automaton (DFA), which are basic building blocks of pattern matching. A deterministic finite state machine is a finite state machine that accepts or rejects finite strings of symbols and produces a unique computation (or run) of the automaton for each input string [27]. 'Deterministic' refers to the uniqueness of the computation. In searching for the simplest models to capture finite state machines, McCulloch and Pitts were among the first researchers to introduce a concept similar to the finite automaton [28, 29].

Each automaton accepts a sequence of 0s and 1s as input. For each state in the diagram there is a transition arrow, labelled 0 or 1, that leads to the next state. Reading symbols one at a time, a DFA moves deterministically from one state to another by following the transition arrows. In our example we use a deterministic finite automaton (DFA) defined as a five-tuple (S, Σ, δ, s0, F), where:

S is a finite set of states,
Σ is a finite set of input symbols,
δ : S × Σ → S is a transition function that returns the next state, given the current state and any symbol from the input,
s0 ∈ S is the initial state,
F ⊆ S is a set of accepting states.

The Aho-Corasick algorithm differs slightly in that it constructs an automaton (the Aho-Corasick DFA) from a set of patterns. The idea is that a finite automaton is constructed from the set of keywords during the precomputation phase of the algorithm; matching then consists of the automaton scanning the input text string y, reading every character in y exactly once and taking constant time for each character read. [27] Given the DFA, the Aho-Corasick matching algorithm traverses the automaton symbol by symbol, starting from the initial state s0; a pattern is detected whenever a state in F is reached. Fig. 4.2 shows the Aho-Corasick DFA for the pattern set {E, BE, BD, BCD, CDBCAB, BCAA}. [28] The final states are s1, s3, s4, s6, s12 and s14, all of which are highlighted in the figure below.
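The state count for this pattern set can be checked with a minimal Python sketch. It builds only the keyword tree (no failure or full DFA transitions) and numbers states in insertion order; the function name `build_trie` and this numbering scheme are assumptions made for illustration, not taken from the cited sources.

```python
def build_trie(patterns):
    """Build a keyword tree; return (goto, accepting).

    goto[state] maps an input symbol to the next state.
    State 0 is the initial state; new states are numbered in insertion order.
    """
    goto = [{}]
    accepting = set()
    for pattern in patterns:
        state = 0
        for symbol in pattern:
            if symbol not in goto[state]:
                goto.append({})                      # allocate a new state
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        accepting.add(state)                         # end of a pattern
    return goto, accepting


PATTERNS = ["E", "BE", "BD", "BCD", "CDBCAB", "BCAA"]
goto, accepting = build_trie(PATTERNS)
print(len(goto))            # 15 states, including the initial state
print(sorted(accepting))    # [1, 3, 4, 6, 12, 14]
```

With the patterns inserted in the order listed, the accepting set coincides with the final states s1, s3, s4, s6, s12 and s14 given above.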

Fig 4.2 This deterministic finite automaton (DFA) has 15 states. It starts at s0, and the arrows are labelled A, B, C, D and E, which make up our alphabet.

Snort uses a full-matrix encoding of Aho-Corasick DFAs for packet pattern matching. Transitions are stored in an array with |S| rows and |Σ| columns, as shown below. [24] The entry at position (i, j) holds the value of δ(si, j), so each entry needs at least ⌈log2 |S|⌉ bits. Typically the input is read one byte at a time, so |Σ| = 256 and the overall memory footprint is 256 · |S| · ⌈log2 |S|⌉ bits. For Snort's Aho-Corasick DFAs, this translates to a combined footprint of 75.15MB. On the other hand, the main advantage of this encoding is that a transition consists of a single memory load operation that directly reveals the next state. [9]

[Table 4.4 Full-matrix Encoding. Columns: A, B, C, D, E; one row per state.]

7-tuple Aho-Corasick Algorithm

The Aho-Corasick algorithm uses a trie (keyword tree) to store the set of keywords in a special string matching automaton. A trie is a tree for storing strings in which there is one node for every common prefix. (The name comes from retrieval and is pronounced "tree.") The strings are stored in extra leaf nodes. [30] The figure below compares a normal trie with an Aho-Corasick automaton.
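Returning to the full-matrix encoding, the footprint formula can be written as a short Python sketch. The helper names `bits_per_entry` and `full_matrix_bits` are hypothetical, and the sketch assumes each entry uses the minimum number of bits needed to name a state. Real implementations commonly round entries up to a machine word, so measured footprints may be larger than this lower bound.

```python
def bits_per_entry(num_states):
    """Minimum bits needed to store one state index, i.e. ceil(log2 |S|)."""
    return max(1, (num_states - 1).bit_length())


def full_matrix_bits(num_states, alphabet_size=256):
    """Footprint in bits of a |S| x |Sigma| transition table."""
    return alphabet_size * num_states * bits_per_entry(num_states)


# The 15-state example DFA over the 5-symbol alphabet {A, B, C, D, E}:
print(full_matrix_bits(15, 5))   # 5 * 15 * 4 = 300 bits
```

A lookup in such a table is a single indexed read, `table[state][symbol]`, which is the one-memory-load property described above.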

Fig 4.2 Normal tree vs. Aho-Corasick automaton

Building the Aho-Corasick automaton takes running time linear in the combined length of the keywords. The process consists of building the keyword tree for our set of keywords and then converting the tree to an automaton (DFA) by defining the functions g and f and labelling the states in A with the keywords matched. The memory needed for the Aho-Corasick tree must be considered at the same time, because the automaton built during pre-computation is the structure required by the matching algorithm. With a large alphabet and a set of keywords (which can sometimes be a long set), the stored tree needs a large amount of memory. In the worst case the space required is O(M|Σ|), where M is the combined length of the keywords and |Σ| is the size of the alphabet Σ. Once the automaton is constructed, matching is straightforward: the algorithm steps through the input characters one at a time, changing the state of the automaton at each step. At every step we check for a match by testing whether the current state is an accepting state. The Aho-Corasick pattern matching algorithm always runs in O(n) time. [17] We describe the Aho-Corasick automaton as a 7-tuple (Q, q0, A, Σ, g, f, o), where:

Q is a finite set of states,
q0 ∈ Q is the initial state,
A ⊆ Q is the set of accepting states,
Σ is the input alphabet,
g is a function from Q × Σ into Q, called the good (or goto) transition function,
f is a function from Q into Q, called the fail (or failure) transition function,
o is a function on Q, called the output function, which gives the keywords matched at each state.

If the automaton is in state q and reads input character (byte) a, it moves to state g(q, a) if that transition is defined; otherwise it moves to state f(q). If the automaton is in a state q that belongs to the set A, then q is an accepting state. The output function o determines whether q ∈ A: the Aho-Corasick algorithm has a function named output that performs this test and returns the keywords matched at the accepting state. In this way the Aho-Corasick automaton indicates whether any of our keywords matches the input. The pseudocode for the Aho-Corasick matching algorithm is given below: [17]

Aho-Corasick Multiple-Keyword Matching (Algorithm 4.5)

procedure AC(y, n, q0)
    Input: y, array of n bytes representing the text input
           n, integer representing the text length
           q0, initial state
    state ← q0
    for i = 1 → n do                            ▷ matching
        while g(state, y[i]) = fail do          ▷ while g(state, y[i]) is undefined
            state ← f(state)                    ▷ use the failure function
        end while
        state ← g(state, y[i])
        if o(state) ≠ ∅ then
            output i                            ▷ this is an accepting state
        end if
    end for
end procedure
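The pseudocode above can be turned into a small runnable Python sketch. This is one possible implementation under stated assumptions, not the code used in [17] or in Snort: the names `build_automaton` and `search` are hypothetical, undefined goto transitions at the initial state are treated as loops back to it, and each match is reported by its start index together with the keyword (the pseudocode reports only the end position i).

```python
from collections import deque


def build_automaton(keywords):
    """Return (goto, fail, output) lists indexed by state number."""
    goto, fail, output = [{}], [0], [[]]
    for keyword in keywords:                  # phase 1: keyword tree
        state = 0
        for ch in keyword:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(keyword)
    queue = deque(goto[0].values())           # phase 2: failure links, by BFS
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:    # follow failures until ch fits
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(text, automaton):
    """Scan text once; return (start_index, keyword) for every match."""
    goto, fail, output = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]               # use the failure function
        state = goto[state].get(ch, 0)
        for keyword in output[state]:         # non-empty output: accepting
            matches.append((i - len(keyword) + 1, keyword))
    return matches


automaton = build_automaton(["E", "BE", "BD", "BCD", "CDBCAB", "BCAA"])
print(search("CDBCABE", automaton))
# [(0, 'CDBCAB'), (5, 'BE'), (6, 'E')]
```

Merging the output of each state with the output of its failure state is what lets overlapping keywords, such as "BE" and "E" here, be reported from the same input position.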

Having explained the Boyer-Moore and Aho-Corasick pattern matching algorithms, we now need to capture some data online and search it for experimental patterns using these two methods. In the next chapter I therefore first give some background on Wireshark (for capturing data) and on security software, and then run experiments with different inputs to find the keyword(s) in them.

Chapter 5: Pattern Matching Algorithms and Software in Security

This chapter gives background on Wireshark as a data capture tool and on other security software that relies on pattern matching algorithms as a substantial component. To gain a deeper understanding of packet pattern matching for intrusion detection, we need to collect information about detection software such as Snort, the network protocol analyser Wireshark, and other software that uses pattern matching methods.

5.1 Wireshark

Wireshark is the world's foremost network protocol analyser. It lets you see what's happening on your network at a microscopic level. It is the de facto (and often de jure) standard across many industries and educational institutions. Wireshark development thrives thanks to the contributions of networking experts across the globe. It is the continuation of an earlier project. [31]

Wireshark is very similar to tcpdump, but has a graphical front-end, plus some integrated sorting and filtering options. Wireshark allows the user to put network interface controllers that support promiscuous mode into that mode, in order to see all traffic visible on that interface, not just traffic addressed to one of the interface's configured addresses and broadcast/multicast traffic. However, when capturing with a packet analyser in promiscuous mode on a port on a network switch, not all of the traffic travelling through the switch will necessarily be sent to the port on which the capture is being done, so capturing in promiscuous mode will not necessarily be sufficient to see all traffic on the network. Port mirroring or various network taps extend capture to any point on the network. Simple passive taps are extremely resistant to tampering. On Linux, BSD, and OS X,

with a sufficiently recent version of libpcap, Wireshark 1.4 and later can also put wireless network interface controllers into monitor mode. [32]

To use Wireshark I downloaded the software and captured some data exchanged between my system and the network provider. The data I transferred to the network was secure data, and it served only as input to simulate an appropriate, realistic environment for my research. All data captured by Wireshark is available in Appendix C.

5.2 Snort

Snort is an open source intrusion prevention system capable of real-time traffic analysis and packet logging. With over 4 million downloads and nearly 500,000 registered users, it is the most widely deployed intrusion prevention system in the world. [2] Snort is now developed by Sourcefire, of which Roesch is the founder and CTO. [34] In 2009, Snort entered InfoWorld's Open Source Hall of Fame as one of the "greatest [pieces of] open source software of all time". [35]

Snort has used several packet pattern matching algorithms to match patterns in packets. It is configured to use a matching algorithm to find stored signatures in packet payloads. In the beginning, Snort simply used the Boyer-Moore searching algorithm (a single-keyword matching algorithm) to search for signatures in input packets. Aho-Corasick later gave Snort's performance a great boost, and in particular increased its speed. [36] Because Snort originally had no multiple-keyword pattern matching algorithm and relied only on Boyer-Moore, [37] it was slow, which made real-time searching of packet contents very hard; the Aho-Corasick searching algorithm vastly improved this. Some other algorithms have also been used by Snort for better performance; these have actually been

new approaches built on the Aho-Corasick algorithm (as it is the major algorithm, with great performance).

Several factors determine the suitability of a searching algorithm. For example, the Aho-Corasick algorithm works very well in general, but large alphabets and large keyword sets degrade its performance, because of the memory required to build the search tree. The table below compares Aho-Corasick with the Wu-Manber algorithm, which is helpful for understanding Snort's performance.

Table 5.1

Property                              Aho-Corasick
Linear worst-case running time        X
Supports many keyword lengths         X
Supports a large alphabet
Supports a large keyword set
Simple to understand & implement      X
Sublinear running time on average

Wu-Manber algorithm: X X X (marked for three of the properties above)

As Table 5.1 shows, Aho-Corasick performs very well, but combining it with Wu-Manber can improve performance further; a combination of both methods would be a beneficial approach to searching. In practice, Snort chooses its matching algorithm by weighing trade-offs among these aspects. It is becoming generally accepted that, for a network intrusion detection system, any pattern matching algorithm whose worst-case running time is not linearly bounded in the size of the search input is ruled out of consideration, because a network intrusion detection system with poor worst-case behaviour can easily fall prey to algorithmic complexity attacks. [17]

Snort works with a set of signatures and tries to match packet payloads against those rules (signatures). The problem is new signatures and new patterns that do not exist in the existing rules. So it can be

a little tricky, as there are always new attacks and new patterns that have not yet been recorded in the Snort database.

5.3 Pattern Matching in Other Applications

Multiple-pattern matching is widely used in many other applications, such as virus detection, spam detection, content scanning, filtering and even hardware. [17] It is therefore worth briefly explaining these applications and how they use matching algorithms.

5.3.1 Antiviruses

The best-known security application is antivirus software. The first generation of antiviruses was very similar to network intrusion detection systems like Snort. In string scanning, the antivirus tries to find patterns or sequences that are unlikely to appear in clean (non-virus) programs or files. [38] These patterns are stored as virus signatures to be detected in files: if a file contains one of the stored sequences, it is considered a virus. All virus signatures are stored in databases, as with the Snort signatures described above. Of course, antiviruses contain more advanced algorithms and techniques than just pattern matching and searching, but scanning methods remain widely used and strong techniques. [39] For fast-scanning antiviruses, simple matching algorithms alone are not efficient, because they are not fast enough to search through large volumes of data. Some software therefore uses single-keyword and multi-keyword matching at the same time for faster results, at the cost of more complicated algorithms or even combinations of sophisticated algorithms.

5.3.2 Spam Detection Software

Most anti-spam software vendors do not disclose their pattern matching algorithms, but judging from the source code of open-source programs, many use single-keyword pattern matching algorithms. We conjecture

that the reason is that processing latency for spam filtering does not demand high-speed scanning, in contrast with network intrusion detection systems or antiviruses, which definitely need real-time searching methods. [17] Even though the performance demands on spam software are low, some spam detection software could still benefit from pattern matching algorithms such as Aho-Corasick. Some spam detection software, such as SpamAssassin [40], applies pattern matching algorithms but in an entirely different way: the searching algorithms are used only as a first step. Spam detection software needs to consider many different attributes of an email, which is why it uses a scoring approach to flag spam. Each matched pattern contributes a score, and finding just a couple of patterns does not definitively mark a message as spam or as valid email. But the basics are the same, and pattern matching algorithms are still needed to find malicious patterns.

5.3.3 Hardware

In general, software-based network intrusion detection systems require a host platform and consume its system resources. Migrating a significant amount of pattern matching work to hardware engines would therefore be advantageous. Some companies use the Aho-Corasick matching algorithm in their hardware products because it is deterministic; in hardware, the deterministic behaviour of an implemented algorithm is of the utmost importance. But, as discussed, the Aho-Corasick algorithm requires a large amount of memory. There can often be a large number of transitions in one of these state machines, due to factors such as the variance in patterns and their lengths and the large size of the alphabet. For example, a single-character pattern can cause all states to have a transition to it. [41] Variants of the Aho-Corasick state machine have been shown to drastically reduce the space required for implementation, and implementing transitions with priorities reduces the memory as well. [42]

Because the state transitions can be kept in a memory table in hardware, finding the correct transition becomes easier and quicker using established hardware techniques such as ternary content addressable memory (TCAM) [44] and field programmable gate arrays (FPGA). [42, 43] In these hardware implementations, the processing speed of the whole system increases: the engine follows the correct transition and processes any pattern associated with the new state. In addition to the memory requirements of hardware, however, we need to consider the process of updating it with newly determined signatures. This can be complicated, for example when adding new signatures requires rebooting or reloading the whole system. By comparison, updating a software-based network intrusion detection system is much simpler, since it only requires adding several new rules to the database. Some network intrusion detection system providers do, however, support signature updates while the matching process is running.

Chapter 6: Experiment and Work Done

In this chapter I give the details of my experiments and their outcomes.

6.1 Experiment on Aho-Corasick

Having explained the Aho-Corasick algorithm for searching patterns, we now need to run some experiments to see how Aho-Corasick is used in practice. The Aho-Corasick algorithm first builds a finite state machine for the search: it constructs a search tree whose top nodes are the first characters of each keyword string. When the algorithm starts, it matches the first input character against the top nodes of the search tree. When a match is found, the next character in the input text is compared against the next nodes on that branch of the tree. When a terminal node (a final node of the tree) is reached, the matched string is output. If a non-matching character is found (the input character differs from the characters available at the node), that node is removed from the list of candidates and the process starts again. So we first need the code for building the search tree.

6.1.1 Building the Searching Tree

A tree node contains a character of a keyword, a collection of the child nodes that continue our keyword(s), and a mark indicating whether the string ending at this node is one of our keywords. We have to be careful: reaching the last character of a matched keyword does not mean we have finished with that branch of the tree. For example, if we matched "virus", there might still be nodes for "viruses", so we need to continue for two more nodes. During matching, the tree node must be able to say, given a character, whether that character results in a transition to a following node. This indicates whether the search should continue matching or drop the string from the previous state. The pseudocode is provided in Appendix C. [33]
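The node structure described above can be illustrated with a minimal Python sketch. It is not the Appendix C pseudocode: the class `TreeNode` and the helpers `insert` and `keywords_at_start` are hypothetical names chosen for this example, and the sketch only matches keywords that begin at the first character of the text.

```python
class TreeNode:
    """One node of the search tree."""

    def __init__(self):
        self.children = {}        # character -> following TreeNode
        self.is_keyword = False   # True if the path from the root is a keyword

    def transition(self, ch):
        """Return the following node for ch, or None if there is none."""
        return self.children.get(ch)


def insert(root, keyword):
    """Add one keyword to the tree, creating nodes as needed."""
    node = root
    for ch in keyword:
        node = node.children.setdefault(ch, TreeNode())
    node.is_keyword = True


def keywords_at_start(root, text):
    """Return every keyword that is a prefix of text."""
    node, found = root, []
    for i, ch in enumerate(text):
        node = node.transition(ch)
        if node is None:          # no transition: stop following this branch
            break
        if node.is_keyword:       # a match, but longer keywords may follow
            found.append(text[:i + 1])
    return found


root = TreeNode()
for word in ("virus", "viruses"):
    insert(root, word)
print(keywords_at_start(root, "viruses found"))   # ['virus', 'viruses']
```

The walk does not stop at the marked node for "virus"; it continues for two more nodes and also reports "viruses", which is the behaviour the paragraph above calls for.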
