MULTI-PATTERN STRING MATCHING ALGORITHMS


MULTI-PATTERN STRING MATCHING ALGORITHMS

By

XINYAN ZHA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010

© 2010 Xinyan Zha

I dedicate this Ph.D. dissertation to my parents. Thank you for your support over these years.

ACKNOWLEDGMENTS

Thanks to all who have helped in the writing of this dissertation. Special thanks to my advisor, Dr. Sahni, for all the research guidance over these years. Thanks to the National Science Foundation for its funding support.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    Overview of the Contributions
    Outline of the Dissertation

2 A HIGHLY COMPRESSED AHO-CORASICK AUTOMATA FOR EFFICIENT INTRUSION DETECTION
    The Aho-Corasick Automaton
    The Method of Tuck et al. [57] To Compress the Non-Optimized Automaton
    Popcounts With Fewer Additions
    Our Method to Compress the Non-Optimized Aho-Corasick Automaton
        Classification of Automaton States
        Node Types
            Bitmap
            Low degree node
            Path compressed node
        Memory Accesses
        Bitmap Node with Type I Summaries, W = 32
        Low Degree Node, W = 32
        Ol, 1 ≤ l ≤ 5, Nodes, W = 32
        O Nodes, W = 32 and 1024
        Path Compressed Node of Tuck, W = 32 and 1024
        Summary
        Mapping States to Nodes
    Experimental Results
        Number of Nodes
        Memory Requirement
        Popcount

3 COMPRESSED OBJECT ORIENTED NFA FOR MULTI-PATTERN MATCHING
    The Object Oriented NFA for Multi-pattern Matching
    Compressed OO NFA
    Experimental Results

4 THE COMPRESSED AHO-CORASICK AUTOMATA ON IBM CELL PROCESSOR
    The Cell/Broadband Engine Architecture
    Cell-oriented Algorithm Design
        Step (2): Branch Replacement and Hinting
        Step (3): Loop Unrolling, Data Alignment
        Step (4): Branch Removal, Select-bits Intrinsics
        Step (5): Strength Reduction
        Step (6): Horizontal Unrolling
    Experimental Results
    Related Work

5 FAST IN-PLACE FILE CARVING FOR DIGITAL FORENSICS
    In-place Carving Using Scalpel
    Multipattern Boyer-Moore Algorithm
    Multicore Searching
    Asynchronous Read
    Multicore In-place Carving
    Experimental Results
        Run Time of Scalpel
        Buffer Size
        Multipattern Matching
        Multicore Searching
        Asynchronous Read
        Multicore In-place Carving
        Scalpel 1.6 vs. FastScalpel

6 MULTI-PATTERN MATCHING ON MULTICORES AND GPUS
    The NVIDIA Tesla Architecture
    Multipattern Boyer-Moore Algorithm
    GPU-to-GPU Strategy
    Addressing the Deficiencies
        Deficiency D1: reading from device memory
        Deficiency D2: writing to device memory
    Host-to-Host Strategies
        Completion Time
        Completion Time Using Enhanced GPUs
    Experimental Results
        GPU-to-GPU
            Aho-Corasick algorithm
            Multipattern Boyer-Moore algorithm

            Comparison with multicore computing on host
        Host-to-Host

7 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Lookup table for 4-bit blocks
2-2 Distribution of states in a 3000 string Snort database
2-3 Memory accesses to process a node for W = 32 and W = 64
2-4 Memory accesses to process a node for W = 128 and W = 1024
2-5 Number of nodes of each type, Ol and O counts are for Type I summaries
2-6 Number of Ol O nodes for Type II and Type III summaries
2-7 Memory requirement for data set 1284
2-8 Memory requirement for data set 2430
    Number of popcount additions, data set 1284
    Number of popcount additions, data set 2430
    Search times (milliseconds) for English patterns
    Memory required (bytes) for English patterns
    Number of different type nodes for English patterns
    Compression ratio for English patterns
    Compression ratios obtained by our technique on two sample dictionaries of comparable uncompressed size
    The impact of the optimization steps on the performance of our compressed AC NFA algorithm
    Aggregate throughput on an IBM Cell chip with 8 SPUs (Gbps)
    Example headers and footers in Scalpel's configuration file
    Examples of in-place file carving output
    In-place carving time by Scalpel 1.6 for a 16GB flash disk
    In-place carving time by Scalpel 1.6 with different buffer sizes with 48 carving rules
    Search time for a 16GB flash drive
    Speedup in search time relative to Boyer-Moore

5-7 Time to search using dualcore strategy with 24 rules
    In-place carving time using Algorithm
    Asynchronous In-place carving time using SRMS
    In-place carving time using SRSS
    In-place carving time using MARS
    In-place carving time and speedup using FastScalpel and Scalpel
    Run time for AC versions
    Speedup of AC1, AC2, AC3, and AC4 relative to AC
    Run time for mbm versions
    Speedup of mbm1 relative to mbm
    Run time for multithreaded AC on quad-core host
    Run time for strategy A host-to-host code

LIST OF FIGURES

2-1 An example string set
2-2 Unoptimized Aho-Corasick automata for strings of Figure 2-1
2-3 Optimized Aho-Corasick automata for strings of Figure 2-1
2-4 A bitmap node of [57]
2-5 A path compressed node of [57]
2-6 Type I summaries
2-7 Our bitmap node
2-8 Our low degree node
2-9 Our path compressed node
    Normalized memory requirement
    Normalized additions for popcount
    Algorithm 1. Completion of the OO graph
    Algorithm 2. Add an edge to the graph
    Algorithm 3. Object Oriented NFA Search Function
    Aho-Corasick NFA with failure pointers (all failure pointers point to state 0)
    The Object Oriented NFA (state cards representation)
    OO NFA (all states that have no matched character will return back to state 0)
    The DFA for our set of patterns
    OO bitmap node
    OO path compressed node
    OO copy node
    Chip layout of the Cell/Broadband Engine Architecture
    Bitmap node layout
    Path-compressed node layout with packing factor equal to four
    How two automata overlap the computation part with their DMA transfer wait time

4-5 The number of cycles processed per character with different vertical unrolling factors
    How the throughput grows with each optimization step
    Utilization of clock cycles following each optimization step
    DMA inter-arrival transfer delay from main memory to local store when 8 SPEs are used concurrently
    Aggregate throughput of our algorithm on an IBM QS22 blade (16 SPEs)
    How the percentage of matched patterns affects the aggregate throughput
    The trade-off between the compression ratio and the throughput in a Pareto space
    Control flow Scalpel 1.6 (a)
    Control flow Scalpel 1.6 (b)
    Control flow for 2-threaded search
    In-place carving using asynchronous reads
    Control flow for single core read and single core search (SRSS)
    Control flow for multicore asynchronous read and search (MARS1)
    Another control flow for multicore asynchronous read and search (MARS2)
    Multi-Pattern Search Algorithms Speedup
    Speedup of FastScalpel relative to Scalpel
    NVIDIA GT200 Architecture [56]
    Reverse trie for cac, acbacc, cba, bbaca, and cbaca (shift1(node), shift2(node) values are shown beside each node)
    GPU-to-GPU notation
    Overall GPU-to-GPU strategy using AC
    T threads collectively read a block and save in shared memory
    Host-to-host strategy A
    Host-to-host strategy B
    Notation used in completion time analysis
    Strategy A, t_w ≥ t_p, s = 4 (cases 1a and 2)

6-10 Strategy A, t_w < t_p, T_w - t_w ≤ t_p, s = 4 (cases 4a and 3)
     Strategy A, t_w < t_p, T_w - t_w > t_p, s = 4 (cases 1b, 1c, and 4b)
     Strategy A, enhanced GPU, s =
     Graphical representation of speedup relative to AC

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

MULTI-PATTERN STRING MATCHING ALGORITHMS

By

Xinyan Zha

August 2010

Chair: Sartaj Sahni
Major: Computer Engineering

Multi-pattern string matching is widely used in applications such as network intrusion detection, digital forensics, and full-text search. In this dissertation, we focus on space-efficient multi-pattern string matching as well as on time-efficient multicore algorithms.

We develop a highly compressed Aho-Corasick automaton for efficient intrusion detection. Our method uses bitmaps with multiple levels of summaries as well as aggressive path compaction. Our compressed automata take 24% to 31% less memory than the compressed automata of Tuck et al. [57], and the number of additions required to compute popcounts is reduced by about 90%.

We propose a technique to perform high-performance exact multi-pattern string matching on the IBM Cell/Broadband Engine (Cell) architecture, which has 9 cores per chip: one control unit and eight computation units. Our technique achieves remarkable compression factors of 1:34 and 1:58, respectively, on the memory representation of English-language dictionaries and random binary string dictionaries. Our memory-based implementation delivers a sustained throughput between 0.90 and 2.35 Gbps per Cell blade, while supporting dictionary sizes of up to 9,260,000 average patterns per Gbyte of main memory.

We focus on Scalpel, a popular open source file recovery tool that performs file carving using the Boyer-Moore string search algorithm to locate headers and footers in a disk image. We show that the time required for file carving may be reduced significantly by employing multi-pattern search algorithms, such as the multipattern Boyer-Moore and Aho-Corasick algorithms, as well as asynchronous disk reads and multithreading as typically supported on multicore commodity PCs. Using these methods, we are able to do in-place file carving in essentially the time it takes to read the disk whose files are being carved. Since, using our methods, the limiting factor for performance is the disk read time, there is no advantage to using accelerators such as GPUs, as has been proposed by others. To further speed in-place file carving, we would need a mechanism to read the disk faster.

Furthermore, we develop GPU adaptations of the Aho-Corasick and multipattern Boyer-Moore string matching algorithms for the two cases GPU-to-GPU and host-to-host. For the GPU-to-GPU case, we consider several refinements to a base GPU implementation and measure the performance gain from each refinement. For the host-to-host case, we analyze two strategies to communicate between the host and the GPU and show that one is optimal with respect to run time while the other requires less device memory. Experiments conducted on an NVIDIA Tesla GT200 GPU that has 240 cores, running off of a Xeon 2.8GHz quad-core host CPU, show that, for the GPU-to-GPU case, our Aho-Corasick GPU adaptation achieves a speedup between 8.5 and 9.5 relative to a single-thread CPU implementation and between 2.4 and 3.2 relative to the best multithreaded implementation. For the host-to-host case, the GPU AC code achieves a speedup of 3.1 relative to a single-threaded CPU implementation. However, the GPU is unable to deliver any speedup relative to the best multithreaded code running on the quad-core host; the measured speedups for the latter case were below 1 (as low as 0.74). Early versions of our multipattern Boyer-Moore adaptations ran 7% to 10% slower than the corresponding versions of the AC adaptations, and we did not refine the multipattern Boyer-Moore codes further.

15 CHAPTER 1 INTRODUCTION Intrusion detection systems (IDS) monitor events within a network or computer system with the objective of detecting attempts to compromise the confidentiality, integrity, availability, or to bypass the security mechanisms of a computer or network [4]. The intrusion detected by an IDS may manifest itself as a denial of service, unauthorized login, a user performing tasks that he/she is not authorized to do (e.g., access secure files, create new accounts, etc), execution of malware such as viruses and worms, and so on. An IDS accomplishes its objective by analyzing data gathered from the network, host computer, or application that is being monitored. The analysis usually takes one of two forms misuse (or signature) detection and anomaly detection. In misuse detection, the IDS maintains a database of signatures (patterns of events) that correspond to known attacks and searches the gathered data for these signatures. In anomaly detection the IDS maintains statistics that describe normal usage and checks for deviations from these statistics in the monitored data. While misuse detection usually has a low rate of false positives, it is able to detect only known attacks. Anomaly detection usually has a higher rate of false positives (because users keep changing their usage pattern thereby invalidating the stored statistics) but is able to detect new attacks never seen before. Network intrusion detection systems (NIDS) examine network traffic (both in- and out-bound packets) looking for traffic patterns that indicate attempts to break into a target computer, port scans, denial of service attacks, and other malicious behavior. Host intrusion detection systems (HIDS) monitor the activity within a computing system looking for activity that violates the computing systems internal security policy (e.g., a program attempting to access an unauthorized resource). Application intrusion detection systems (AIDS) monitor the activity of a specific application while protocol intrusion detection systems (PIDS) ensure that specific protocols such as HTTP behave as they 15

16 should. Each type of IDS has its capabilities and limitations and attempts have been made to put together hybrid IDSs that combine the capabilities of the described base IDSs. The evolution of Web 2.0 applications and business analytics applications is showing a more and more prevalent production and use of unstructured data. Natural Language Processing (NLP) applications can determine the language in which a document is written. web applications extract semantically tagged information (dates, places, delivery tracking numbers, etc.) from messages. Business analytics applications can automatically detect business events like the merger of two companies. Digital Forensics tools can recover the raw disk images by detecting streams of bytes that match file headers and footers. In above applications (and many others), it is crucial to process huge amounts of sequential text to extract matches against a predetermined set of strings (the dictionary). Arguably, the most popular way to perform this exact, multi-pattern string matching task is the Aho-Corasick [1] (AC) algorithm. However, AC, especially in its optimized form based on a Deterministic Finite Automaton (DFA), is not space-efficient. In fact, the state-transition table that its DFAs use can be highly redundant. Uncompressed DFAs have a low transition cost (and therefore a high throughput) but also large footprint and, consequently, a low dictionary capacity per unit of memory. For example, a dictionary of 200,000 patterns with average length 15 bytes occupies 1 Gbyte of memory when encoded for an uncompressed AC DFA. Low space efficiency limits the algorithm s applicability to domains that require very large dictionaries like automatic language identification, which employ dictionaries with millions of entries, coming from hundreds of distinct natural languages. In this dissertation, we address the space inefficiency of the AC DFA by exploring a variant of AC that employs compressed paths. Our work is inspired by that of Tuck et al [57] and is based on the Non-deterministic Finite Automaton (NFA) version of AC and achieve significant memory reduction. We also provide our IBM cell adaptations 16

17 of the compressed Aho-Corasick string matching algorithm. Furthermore, we propose fast in-place carving techniques of a popular digital forensics tool Scalpel as well as our GPU adaptations of the Aho-Corasick and multipattern Boyer-Moore string matching algorithms for the two cases GPU-to-GPU and host-to-host. 1.1 Overview of the Contributions First, we develop a method to compress the unoptimized (also known as nondeterministic) Aho-Corasick automaton that is used widely in intrusion detection systems. Our method uses bitmaps with multiple levels of summaries as well as aggressive path compaction. By using multiple levels of summaries, we are able to determine a popcount with as few as 1 addition. On Snort string databases, our compressed automata take 24% to 31% less memory than is taken by the compressed automata of Tuck et al. [57]. and the number of additions required to compute popcounts is reduced by about 90%. Next, we choose an established multi-core architecture, the IBM Cell Broadband Engine (CBE) and explore its potential for string matching applications. The IBM Cell presents software designers with non-trivial challenges that are representative of the next generation of multi-core architectures. With its 9 cores per chip, the IBM Cell/Broadband Engine (Cell) can deliver an impressive amount of compute power and benefit the string-matching kernels of network security, business analytics and natural language processing applications. However, the available amount of main memory on the system limits the maximum size of the dictionary supported. To counter this, we propose a technique that employs compressed Aho-Corasick automata to perform fast, exact multi-pattern string matching with very large dictionaries. Our technique achieves the remarkable compression factors of 1:34 and 1:58, respectively, on the memory representation of English-language dictionaries and random binary string dictionaries. We demonstrate a parallel implementation for the Cell processor that delivers a sustained throughput between 0.90 and 2.35 Gbps per Cell blade, while supporting dictionary sizes up to 9.2 million average patterns per Gbyte of main 17

memory, and exhibiting resilience to content-based attacks. This high dictionary density enables natural language applications of an unprecedented scale to run on a single server blade.

Thirdly, we focus on Scalpel, a popular open source file recovery tool which performs file carving using the Boyer-Moore string search algorithm to locate headers and footers in a disk image. We show that the time required for file carving may be reduced significantly by employing multi-pattern search algorithms such as the multipattern Boyer-Moore and Aho-Corasick algorithms as well as asynchronous disk reads and multithreading as typically supported on multicore commodity PCs. Using these methods, we are able to do in-place file carving in essentially the time it takes to read the disk whose files are being carved. Since, using our methods, the limiting factor for performance is the disk read time, there is no advantage to using accelerators such as GPUs as has been proposed by others. To further speed in-place file carving, we would need a mechanism to read the disk faster.

Last, we develop GPU adaptations of the Aho-Corasick and multipattern Boyer-Moore string matching algorithms for the two cases GPU-to-GPU and host-to-host. Experiments conducted on an NVIDIA Tesla GT200 GPU that has 240 cores, running off of a Xeon 2.8GHz quad-core host CPU, show that, for the GPU-to-GPU case, our Aho-Corasick GPU adaptation achieves a speedup of almost 8.50 to 9.53 relative to its CPU counterpart; the corresponding speedups attained by our multipattern Boyer-Moore adaptation start at almost 3.19. For the host-to-host case, we achieve a speedup of almost 2.90 to 3.12 utilizing our Aho-Corasick GPU adaptation versus doing the multipattern matching on the CPU itself using a single-core CPU code for the Aho-Corasick algorithm. The Aho-Corasick algorithm runs faster than the multipattern Boyer-Moore algorithm both in its single-core CPU version and in its GPU adaptation.

1.2 Outline of the Dissertation

The remainder of this dissertation is organized as follows. Chapter 2 presents our highly compressed Aho-Corasick automata for efficient network intrusion detection. Chapter 3 develops a compressed version of the Object Oriented NFA for multi-pattern matching. Chapter 4 presents our compressed Aho-Corasick automata algorithm for high performance exact multi-pattern string matching on an IBM Cell Broadband Engine. Chapter 5 shows fast in-place file carving techniques for digital forensics. Chapter 6 describes our adaptations of multipattern string matching algorithms on Graphics Processing Units (GPUs). We conclude in Chapter 7.

CHAPTER 2
A HIGHLY COMPRESSED AHO-CORASICK AUTOMATA FOR EFFICIENT INTRUSION DETECTION

2.1 The Aho-Corasick Automaton

The Aho-Corasick finite state automaton [1] for multi-string matching is widely used in IDSs. There are two versions of this automaton: unoptimized and optimized. Both versions are finite state machines. In the unoptimized version, which we use in this chapter, each state has a failure pointer, while in the optimized version no state has a failure pointer. In both versions, each state has success pointers; each success pointer has a label, which is a character from the string alphabet, associated with it. Also, each state has a list of strings/rules (from the string database) that are matched when that state is reached by following a success pointer. This is the list of matched rules.

In the unoptimized version, the search starts with the automaton start state designated as the current state and the first character in the text string, S, that is being searched designated as the current character. At each step, a state transition is made by examining the current character of S. If the current state has a success pointer labeled by the current character, a transition to the state pointed at by this success pointer is made and the next character of S becomes the current character. When there is no corresponding success pointer, a transition to the state pointed at by the failure pointer is made and the current character is not changed. Whenever a state is reached by following a success pointer, the rules in the list of matched rules for the reached state are output along with the position in S of the current character. This output is sufficient to identify all occurrences, in S, of all database strings. Aho and Corasick [1] have shown that when their unoptimized automaton is used, the number of state transitions is 2n, where n is the length of S.

In the optimized version, each state has a success pointer for every character in the alphabet and so, there is no failure pointer. Aho and Corasick [1] show how to compute the success pointer for pairs of states and characters for which there is no success pointer in the unoptimized automaton, thereby transforming an unoptimized automaton into an optimized one. The number of state transitions made by an optimized automaton when searching for matches in a string of length n is n.

Figure 2-1 shows an example string set drawn from the 3-letter alphabet {a, b, c}. Figures 2-2 and 2-3, respectively, show its unoptimized and optimized Aho-Corasick automata.

Figure 2-1. An example string set: abcaabb, abcaabbcc, acb, acbccabb, ccabb, bccabc, bbccabca

Figure 2-2. Unoptimized Aho-Corasick automata for strings of Figure 2-1

Figure 2-3. Optimized Aho-Corasick automata for strings of Figure 2-1

It is important to note that when we remove the failure pointers from an unoptimized Aho-Corasick automaton, the resulting structure is a trie [42] rooted at the automaton start node. However, an optimized automaton has the structure of a graph that may not be a trie. This difference in the structure defined by the success pointers has a profound impact on our ability to compress unoptimized automata versus optimized automata.
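The search procedure just described can be summarized by the following sketch in C. It only illustrates the transition logic of the unoptimized automaton over an uncompressed node layout; the State structure, the RuleList type, and the report() function are hypothetical names and are not part of the compressed representation developed in the rest of this chapter.

    /* Hypothetical uncompressed state: 256 success pointers, a failure pointer, matched rules. */
    typedef struct RuleList RuleList;              /* list of matched rules (opaque here)       */
    void report(RuleList *rules, int position);    /* outputs the matched rules at a position   */

    typedef struct State {
        struct State *success[256];  /* success[c] is NULL if there is no success pointer for c */
        struct State *failure;       /* failure pointer                                         */
        RuleList     *rules;         /* rules matched when this state is reached via success    */
    } State;

    /* Report all occurrences in S[0..n-1] of all database strings; at most 2n transitions. */
    void ac_search(State *start, const unsigned char *S, int n) {
        State *cur = start;
        for (int i = 0; i < n; ) {
            State *next = cur->success[S[i]];
            if (next != NULL) {           /* success transition: advance to the next character  */
                cur = next;
                report(cur->rules, i);
                i++;
            } else if (cur == start) {    /* no transition from the start state: skip character */
                i++;
            } else {
                cur = cur->failure;       /* failure transition: re-examine the same character  */
            }
        }
    }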

2.2 The Method of Tuck et al. [57] To Compress the Non-Optimized Automaton

Assume that the alphabet size is 256 (e.g., ASCII characters). Although the development generalizes readily to any alphabet size, it is more convenient to do the development using a fixed and realistic alphabet size. A natural way to store the Aho-Corasick automaton, for a given database D of strings, is to represent each state of the unoptimized automaton by a node that has the following fields:

1. Success[0 : 255], where Success[i] gives the state to transition to when the ASCII code for the current character is i (Success[i] is null in case there is no success pointer for the current state when the current character is i).

2. RuleList ... a list of rules that are matched when this state is reached via a success pointer.

3. Failure ... the transition to make when there is no success transition, for the current character, from the current state.

Assume that each pointer requires 4 bytes. So, each node requires 1024 bytes for the Success array and 4 bytes for the failure pointer. In keeping with Tuck et al. [57], when accounting for the memory required for RuleList, we shall assume that only a 4-byte pointer to this list is stored in the node and ignore the memory required by the list itself. Hence, the size of a state node for an unoptimized automaton is 1032 bytes. Using bitmap and path compression, the size of a node becomes 52 bytes [57]. In the optimized version, the Failure field is omitted and the memory required by a node is 1028 bytes.

While each node of the optimized automaton requires 4 bytes less than required by each node of the unoptimized automaton, there is little opportunity to compress an optimized node as each of its 256 success pointers is non-null and the automaton does not have a tree structure. However, many of the success pointers in the nodes of an unoptimized automaton are null and the structure defined by the success pointers is a trie. Therefore, there is significant opportunity to compress these nodes. Following up on this observation, Tuck et al. [57] propose two transformations to compress the nodes in an unoptimized automaton:

1. Bitmap Compression. In its simplest form, bitmap compression replaces each 1032-byte node of an unoptimized automaton with a 44-byte node. Of these 44 bytes, 8 are used for the failure and rule list pointers. Another 32 bytes are used to maintain a 256-bit bitmap with the property that bit i of this map is 1 iff Success[i] ≠ null. The nodes corresponding to the non-null success pointers are stored in contiguous memory and a pointer (firstchild) to the first of these is stored in the 44-byte node. To make a state transition when the ASCII code for the current character is i, we first determine whether Success[i] is null by examining bit i of the map. In case this bit is 0, the failure pointer is used. When this bit is 1, we determine the number of bits (popcount or rank) in bitmap positions less than i that are 1 and, using this count, the size of a node (44 bytes), and the value of the first child pointer, determine the location of the node to transition to. Since determining the popcount involves examining up to 255 bits, this operation is quite expensive (at least in software). To reduce the cost of determining the popcount, Tuck et al. [57] propose the use of summaries that give the popcount for the first 32j, 1 ≤ j < 8, bits of the bitmap. Using these summaries, the popcount for any i may be determined by adding together a summary popcount and up to 31 bit values. Each summary needs to be 8 bits long (the maximum value is 255) and 7 summaries are needed. The size of a bit compressed node with summaries is, therefore, 51 bytes. We note that the notion of using bitmaps and summaries for the compact representation of data structures (in particular, trees) was first advanced by Jacobson [24, 32] and has been used frequently in the context of data structures for network applications (see [14, 16, 52, 62], for example). While Jacobson [24, 32] suggests using several levels of summaries, [14, 57] use a single level. Also, Munro [32] has proposed a scheme that uses 3 levels of summaries, requires o(m) space, where m is the size of the bitmap, and enables the computation of the popcount by adding three summaries, one from each level. The size of a bitmap node becomes 52 bytes when we add in the node type and failure pointer offset fields that are needed to support path compression (Figure 2-4).

Figure 2-4. A bitmap node of [57]: node type (1 bit), failure pointer offset (3 bits), L1 summaries S1 through S7 (8 bits × 7 = 56 bits), bitmap (256 bits), failure pointer (32 bits), rule pointer (32 bits), first child pointer (32 bits)

2. Path Compression. Path compression is similar to end-node optimization [16, 62]. An end-node sequence is a sequence of states at the bottom of the automaton (the start state is at the top of the automaton) that is comprised of states that have a single non-null success transition (except the last state in the sequence, which has no non-null success transition). States in the same end-node sequence are packed together into one or more path compressed nodes. The number of these states that may be packed into a compressed node is limited by the capacity of a path compressed node. So, for example, if there is an end-node sequence s1, s2, ..., s6 and if the capacity of a path compressed node is 4 states, then s1, ..., s4 are packed into one node (say A) and s5 and s6 into another (say B). For each si packed into a path compressed node in this way, we need to store the 1-byte character for the transition plus the failure and rule list pointers for si. Since several automaton states are packed into a single compressed node, a 4-byte failure pointer that points to a compressed node isn't sufficient. In addition, we need an offset value that tells us which state within the compressed node we need to transition to. Using 3 bits for the offset, we can handle nodes with capacity c ≤ 8. Note that now, ⌈3c/8⌉ bytes are needed for the offsets. Hence, a path compressed node whose capacity is c ≤ 8 needs 9c + ⌈3c/8⌉ bytes for the state information. Another 4 bytes are needed for a pointer to the next node (if any) in the sequence of path compressed nodes (i.e., a pointer from A to B). An additional byte is required to identify the node type (bitmap or compressed) and the size (number of states packed into this compressed node). So, the size of a compressed node is 9c + ⌈3c/8⌉ + 5 bytes. The node type bit is now required in bitmap nodes as well, as is an offset for the failure pointer. Accounting for these fields, the size of a bitmap node becomes 52 bytes. Since a compressed node may be a sibling (states/nodes reachable by following a single success pointer from any given state/node are siblings) of a bitmap node, we need to keep the size of both bitmap and path compressed nodes the same so that we can easily access the jth child of a bitmap node by performing arithmetic on the first child pointer. This requirement limits us to c = 5 and a path compressed node size of 52 bytes. Figure 2-5 shows a path compressed node.

Figure 2-5. A path compressed node of [57]: node type (1 bit), capacity (3 bits), first child pointer (32 bits), and, for each of up to 5 packed states, a character (8 bits), rule pointer (32 bits), failure pointer (32 bits), and failure pointer offset (3 bits)

On the 1533-string Snort database of 2003, the memory required by the bitmapped-path compressed automaton using 1 level of summaries is about 1/50 that required by the optimized automaton, about 1/27 that required by the Wu-Manber data structure, and about 10% less than that required by the SFK search data structure [57]. However, the average search time, using a software implementation, is increased by between 10% and 20% relative to that for the optimized automaton, by between 30% and 100% relative to the Wu-Manber algorithm, and is about the same as for SFK search. The real payoff from the Aho-Corasick automaton comes with respect to worst-case search time. The worst-case search time using the Aho-Corasick automaton is between 1/4 and 1/3 that when the Wu-Manber or SFK search algorithms are used. The worst-case search time for the bitmapped-path compressed unoptimized automaton is between 50% and 100% more than for the optimized automaton [57].
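As an illustration of the bitmap-node transition just described, the following sketch locates the child for a character using the bitmap, the single level of summaries of [57], and the fact that sibling nodes are laid out contiguously with a common size. The field names, the bit ordering within the 32-bit words, and the NODE_SIZE constant are illustrative assumptions, not the exact layout of Figure 2-4.

    enum { NODE_SIZE = 52 };            /* assumed common size of sibling nodes, in bytes        */

    typedef struct TuckBitmapNode {
        unsigned char summary[7];       /* summary[j-1] = popcount of the first 32*j bitmap bits */
        unsigned int  bitmap[8];        /* 256 bits; bit i of the map is bit (i & 31) of word i>>5 */
        void         *failure;          /* failure pointer (used when bit i is 0)                 */
        void         *rules;            /* rule list pointer                                      */
        char         *firstchild;       /* children stored contiguously, NODE_SIZE bytes apart   */
    } TuckBitmapNode;

    /* Return the address of the child for character c, or NULL if there is no success pointer. */
    char *bitmap_child(const TuckBitmapNode *b, unsigned char c) {
        if (!((b->bitmap[c >> 5] >> (c & 31)) & 1u))
            return NULL;                               /* caller follows b->failure instead      */
        /* popcount of bits 0..c-1 = one summary + up to 31 individual bits of the last block    */
        unsigned rank = (c >= 32) ? b->summary[(c >> 5) - 1] : 0;
        for (unsigned i = c & ~31u; i < c; i++)
            rank += (b->bitmap[i >> 5] >> (i & 31)) & 1u;
        return b->firstchild + rank * NODE_SIZE;
    }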

2.3 Popcounts With Fewer Additions

A serious deficiency of the compression method of [57] is the need to perform up to 31 additions at each bitmap node. This seriously degrades worst-case performance and increases the clamor for hardware support for a popcount in network processors [57]. Since popcounts are used in a variety of network algorithms ([14, 16, 52, 62], for example) in addition to those for intrusion detection, we consider, in this section, the problem of determining the popcount independent of the application. This problem has been studied extensively by the algorithms community ([24, 31, 32], for example). In the algorithms community, the popcount problem is referred to as the bit-vector-rank problem, where the terms bitmap and bit vector are synonyms and popcount and rank are synonyms. We recast the best result for the bit-vector-rank problem using the bitmap-popcount terminology.

Munro [31, 32] has proposed a method to determine the popcount for an m-bit bitmap using 3 levels of summaries that together take o(m) bits of space. The popcount is determined by adding together 3 O(log m)-bit numbers, one from each of the 3 levels of summaries. Munro's method is described below:

1. Level 1 Summaries. Partition the bitmap into blocks of s1 = (log₂ m)² bits. The number of such blocks is n1 = m/s1. Compute the level 1 summaries S1(1 : n1), where S1(i) is the number of 1s in blocks 0 through i-1, 1 ≤ i ≤ n1.

2. Level 2 Summaries. Each level 1 block j is partitioned into subblocks of s2 = (1/2) log₂ m bits. The number of such subblocks is n2 = s1/s2. S2(j, i) is the number of 1s in subblocks 0 through i-1 of block j, 0 ≤ j < n1, 1 ≤ i < n2.

3. Level 3 Summaries. For the level 3 summaries, a lookup table Ts2 that gives the popcount for every possible position in every possible subblock is computed. The number of possible subblocks is 2^s2 = O(√m) and there are s2 possible positions in a subblock. Also, each entry of the table has O(log s2) = O(log log m) bits. So, the size of the lookup table is O(√m log m log log m) bits. Table 2-1 gives the lookup table T4, which is for the case s2 = 4. T4(i, j) is the number of 1s in positions 0 through j-1 in the binary representation of i; positions are numbered left to right beginning with 0 and a 4-bit representation of i is used.

Table 2-1. Lookup table for 4-bit blocks

    i in binary   T4(i,0)   T4(i,1)   T4(i,2)   T4(i,3)
    0000             0         0         0         0
    0001             0         0         0         0
    0010             0         0         0         1
    0011             0         0         0         1
    0100             0         0         1         1
    0101             0         0         1         1
    0110             0         0         1         2
    0111             0         0         1         2
    1000             0         1         1         1
    1001             0         1         1         1
    1010             0         1         1         2
    1011             0         1         1         2
    1100             0         1         2         2
    1101             0         1         2         2
    1110             0         1         2         3
    1111             0         1         2         3

One may verify that the total space required by the summaries is o(m) bits and that a popcount may be determined by adding one summary from each of the three levels. For a 256-bit bitmap, using Munro's method [31, 32], the level-1 blocks are s1 = 64 bits long and there are n1 = 4 of these; each level-1 block is partitioned into n2 = 16 subblocks of size s2 = 4; and the lookup table Ts2 is T4.

Motivated by the work of Munro [31, 32], we propose 3 designs for summaries for a 256-bit bitmap. The first two of these use 3 levels of summaries and the third uses 2 levels.

1. Type I Summaries

Level 1 Summaries: For the level 1 summaries, the 256-bit bitmap is partitioned into 4 blocks of 64 bits each. S1(i) is the number of 1s in blocks 0 through i-1, 1 ≤ i ≤ 3.

Level 2 Summaries: For each block j of 64 bits, we keep a collection of level 2 summaries. For this purpose, the 64-bit block is partitioned into 16 4-bit subblocks.

S2(j, i) is the number of 1s in subblocks 0 through i-1 of block j, 0 ≤ j ≤ 3, 1 ≤ i ≤ 15.

Level 3 Summaries: Each 4-bit subblock is partitioned into 2 2-bit subsubblocks. S3(j, i, 1) is the number of 1s in subsubblock 0 of the ith 4-bit subblock of the jth 64-bit block, 0 ≤ j ≤ 3, 0 ≤ i ≤ 15.

Figure 2-6 shows the setup for Type I summaries.

Figure 2-6. Type I summaries: the 256-bit bitmap is divided into four 64-bit blocks B0 through B3, each 64-bit block into 16 4-bit subblocks SB0 through SB15, and each 4-bit subblock into two 2-bit subsubblocks SSB0 and SSB1

When Type I summaries are used, the popcount for position q (i.e., the number of 1s preceding position q), 0 ≤ q < 256, of the bitmap is obtained as follows. Position q is in subblock sb = ⌊(q mod 64)/4⌋ of block b = ⌊q/64⌋. The subsubblock ssb is 0 when q mod 4 < 2 and 1 otherwise. The popcount for position q is S1(b) + S2(b, sb) + S3(b, sb, ssb) + bit(q-1), where bit(q-1) is 0 if q mod 2 = 0 and is bit q-1 of the bitmap otherwise; S1(0), S2(b, 0) and S3(b, sb, 0) are all 0.

As an example, consider the case q = 203. This bit is in subblock sb = ⌊(203 mod 64)/4⌋ = ⌊11/4⌋ = 2 of block b = ⌊203/64⌋ = 3. Since 203 mod 4 = 3, the subsubblock ssb is 1. The popcount for bit 203 is the number of 1s in positions 0 through 191, plus the number in positions 192 through 199, plus the number in positions 200 through 201, plus the number in position 202; that is, S1(3) + S2(3, 2) + S3(3, 2, 1) + bit(202).

Since we do not store summaries for b, sb, and ssb equal to zero, the code to compute the popcount takes the form

    if (b) popcount = S1(b); else popcount = 0;
    if (sb) popcount += S2(b,sb);
    if (ssb) popcount += S3(b,sb,ssb);
    if (q) popcount += bit(q-1);

So, using Type I summaries, we can determine a popcount with at most 3 additions, whereas using only 1 level of summaries as in [57], up to 31 additions are required. This reduction in the number of additions comes at the expense of memory. An S1() value lies between 0 and 192 and so requires 8 bits; an S2 value requires 6 bits and an S3 value requires 2 bits. So, we need 8 × 3 = 24 bits for the level-1 summaries, 6 × 15 × 4 = 360 bits for the level-2 summaries, and 2 × 16 × 4 = 128 bits for the level-3 summaries. Therefore, 512 bits (or 64 bytes) are needed for the summaries. In contrast, the summaries of the 1-level scheme of [57] require only 56 bits (or 7 bytes).

2. Type II Summaries

These are exactly what is prescribed by Munro [31, 32]. S1 and S2 are as for Type I summaries. However, the S3 summaries are replaced by a summary table (Table 2-1) T4(0 : 15, 0 : 3) such that T4(i, j) is the number of 1s in positions 0 through j-1 of the binary representation of i. The popcount for position q of a bitmap is S1(b) + S2(b, sb) + T4(d, e), where d is the integer whose binary representation is the bits in subblock sb of block b of the bitmap and e is the position of q within this subblock; S1 and S2 are for the current state/bitmap. Since T4(i, j) ≤ 3, we need 2 bits for each entry of T4 for a total of 128 bits for the entire table. Recognizing that rows 2j and 2j + 1 are the same for every j, we may store only the even rows and reduce the storage cost to 64 bits. A further reduction in the storage cost for T4 is possible by noticing that all values in column 0 of this array are 0 and so we need not store this column explicitly. Actually, since only 1 copy of this table is needed, there seems to be little value (for our intrusion detection system application) to the suggested optimizations and we may store the entire table at a storage cost of 128 bits. The memory required for the level 1 and 2 summaries is 24 + 360 = 384 bits (48 bytes), a reduction of 16 bytes compared to Type I summaries. When Type II summaries are used, a popcount is determined with 2 additions, rather than 3 using Type I summaries and 31 using the 1-level summaries of [57].

3. Type III Summaries

These are 2-level summaries and, using these, the number of additions needed to compute a popcount is reduced to 1. Level-1 summaries are kept for the bitmap and a lookup table is used for the second level. For the level-1 summaries, we partition the bitmap into 16 blocks of 16 bits each. S1(i) is the number of 1s in blocks 0 through i-1, 1 ≤ i ≤ 15. The lookup table T16(i, j) gives the number of 1s in positions 0 through j-1 of the binary representation of i, 0 ≤ i < 65,536 = 2^16, 0 ≤ j < 16. The popcount for position q of the bitmap is S1(⌊q/16⌋) + T16(d, e), where d is the integer whose binary representation is the bits in block ⌊q/16⌋ of the bitmap and e is the position of q within this block; S1 is for the current state/bitmap. 15 × 8 = 120 bits (or 15 bytes) of memory are required for the level-1 summaries of a bitmap, compared to 7 bytes in [57]. The lookup table T16 requires 2^16 × 16 × 4 bits, as each table entry lies between 0 and 15 and so requires 4 bits. The total memory for T16 is 512KB. For a table of this size, it is worth considering the optimizations mentioned earlier in connection with T4. Since rows 2j and 2j + 1 are the same for all j, we may reduce the table size to 256KB by storing explicitly only the even rows of T16. Another 16KB may be saved by not storing column 0 explicitly. Yet another 16KB reduction is achieved by splitting the optimized table into 2; now, column 0 of one of the two tables is all 0 and that of the other is all 1, so column 0 may be eliminated. We note that optimization below 256KB may not be of much value as the increased complexity of using the table will outweigh the small reduction in storage.
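For concreteness, a minimal sketch of the Type III computation is given below. It assumes the 16-bit blocks of a node's bitmap are available as an array of unsigned 16-bit words, stores T16 entries as whole bytes (1 MB) rather than packing them into 4 bits (512KB) as discussed above, and uses illustrative names throughout.

    static unsigned char T16[1 << 16][16];   /* T16[d][e] = number of 1s in positions 0..e-1 of d */

    /* Build the shared lookup table once; position 0 is the leftmost bit of the 16-bit value. */
    void build_T16(void) {
        for (unsigned d = 0; d < (1u << 16); d++) {
            unsigned char ones = 0;
            for (int e = 0; e < 16; e++) {
                T16[d][e] = ones;                 /* 1s strictly before position e              */
                ones += (d >> (15 - e)) & 1u;     /* position e is bit 15-e of the word         */
            }
        }
    }

    /* Popcount for position q, 0 <= q < 256, of a 256-bit bitmap: a single addition. */
    int popcount_type3(const unsigned short bitmap[16], const unsigned char S1[16], int q) {
        int block = q / 16;                       /* S1[0] is 0; S1[i] = 1s in blocks 0..i-1    */
        return S1[block] + T16[bitmap[block]][q % 16];
    }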

Table 2-2. Distribution of states in a 3000 string Snort database

2.4 Our Method to Compress the Non-Optimized Aho-Corasick Automaton

2.4.1 Classification of Automaton States

The Snort database had 3,578 strings in April. Table 2-2 profiles the states in the corresponding unoptimized Aho-Corasick automaton by degree (i.e., number of non-null success pointers in a state). As can be seen, there are only 36 states whose degree is more than 8 and the number of states whose degree is between 2 and 8 is 869. An overwhelming number of states (24,417) have a degree that is less than 2. However, 1639 of these 24,417 states are not in end-node sequences. This profile motivated us to classify the states into 3 categories: B (states whose degree is more than 8), L (states whose degree is between 2 and 8), and O (all other states). B states are those that will be represented using a bitmap, L states are low degree states, and O states are states whose degree is one or zero. In case the distribution of states in future string databases changes significantly, we can use a different classification of states.

Next, a finer (2-letter) state classification is done as below and in the stated order.

33 BB All B states are reclassified as BB states. BL All L states that have a sibling BB state are reclassified as a BL states. BO All O states that have a BB sibling are reclassified as BO states. LL All remaining L states are reclassified as LL states. LO All remaining O states that have an LL sibling are reclassified as LO states. OO All remaining O states are reclassified as OO states Node Types Our compressed representation uses three node types bitmap, low degree, and path compressed. These are described below Bitmap A bitmap node has a 256-bit bitmap together with summaries; any of the three summary types described in Section 2.3 may be used. We note that when Type II or Type III summaries are used, only one copy of the lookup table (T 4 or T 16) is needed for the entire automaton. All bitmap nodes may share this single copy of the lookup table. When Type II summaries are used, the 128 bits needed by the unoptimized T 4 are insignificant compared to the storage required by the remainder of the automaton. For Type III summaries, however, using a 512KB unoptimized T 16 is quite wasteful of memory and it is desirable to go down to at least the 256KB version. The memory required for a bitmap node depends on the summary type that is used. When Type I summaries are used, each bitmap node (Figure 2-7) is 110 bytes (we need 57 extra bytes compared to the 52-byte nodes of [57] for the larger summaries and an additional extra byte because we use larger failure pointer offsets). When Type II summaries are used, each bitmap node is 94 bytes and the node size is 61 bytes when Type III summaries are used Low degree node Low degree nodes are used for states that have between 2 and 8 success transitions. Figure 2-8 shows the format of such a node. In addition to fields for the 33

34 node type 3bits firstchild type 3bits L1(B0,..B2) 8bit*3=24bits L2(SB0,..SB14) 6bit*4*15=360bits L3(SSB0) 2bit*16*4*1=128bits bitmap 256bits failptr offset 8bits failptr 32bits ruleptr 32bits firstchildptr 32bits Figure 2-7. Our bitmap node node type 3bits firstchild type 3bits size 3bits char_1 8bits... char_8 8bits failptroff 8bits failptr 32bits ruleptr 32bits firstchildptr 32bits Figure 2-8. Our low degree node node type, failure pointer, failure pointer offset, rule list pointer, and first child pointer, a low degree node has the fields char1,..., char8 for the up to 8 characters for which the state has a non-null success transition and size, which gives us the number of these characters stored in the node. Since this number is between 2 and 8, 3 bits are sufficient for the size field. Although it is sufficient to allocate 22 bytes to a low degree node, we allocate 25 bytes as this allows us to pack a path compressed node with up to 2 characters (i.e., an O2 node as described later) into a low degree node Path compressed node Unlike [57], we do not limit path compression to end-node sequences. Instead, we path compress any sequence of states whose degree is either 1 or 0. Further, we use variable-size path compressed nodes so that both short and long sequences may be compressed into a single node with no waste. In the path compression scheme of [57] an end-node sequence with 31 states will use 7 nodes and in one of these the capacity utilization is only 20% (only one of the available 5 slots is used). Additionally, the overhead of the type, next node, and size fields is incurred for each of the path compressed nodes. By using variable-size path compressed nodes, all the space in such a node is utilized and the node overhead is paid just once. In our implementation, we limit the capacity of a path compressed node to 256 states. This requires that the failure pointer offsets in all nodes be at least 8 bits. A path compressed node whose 34

35 node type 3bits firstchild type 3bits firstchildptr 32bits capacity(c) 8bits char_1 8bits failptr_1 32bits ruleptr_1 32bits failptroff_1 8bits... char_c 8bits failptr_c 32bits ruleptr_c 32bits failptroff_c 8bits Figure 2-9. Our path compressed node capacity is c, c 256, has c character fields, c failure pointers, c failure pointer offsets, c rule list pointers, 1 type field, 1 size field, and 1 next node field (Figure 2-9). We refer to the path compressed node of Figure 2-9 as an O node. Five special types of O nodes O1 through O5 also are used by us. An Ol node, 1 l 5, is simply an O node whose capacity is exactly l characters. For these special O-node types, we may dispense with the capacity field as the capacity may be inferred from the node type. The type fields (node type and first child type) are 3 bits. We use Type = 000 for a bitmap node, Type = 111 for a low degree node and Type = 110 for an O node. The remaining 5 values for Type are assigned to Ol nodes. Since the capacity of an O node must be at least 6, we actually store the node s true capacity minus 6 in its capacity field. As a result, an 8-bit capacity field suffices for capacities up to 261. However, since failure pointer offsets are 8 bits, using an O node with capacity between 257 and 261 isn t possible. So, the limit on O node capacity is 256. The total size of a path compressed node O is 10c + 6 bytes, where c is the capacity of the O node. The size of an Ol node is 10l + 5 as we do not need the capacity field in such a node Memory Accesses The number of memory accesses needed to process a node depends on the memory bandwidth W, how the node s fields are mapped to memory, and whether or not we get a match at the node. We provide the access analysis primarily for the case W = 32 bits. 35
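To make the three node formats of Figures 2-7 through 2-9 concrete, the following C declarations sketch their contents. They are illustrative only: the field names are ours, and the actual nodes are bit-packed to the byte counts given above (110, 94, or 61 bytes for a bitmap node depending on summary type, 25 bytes allocated for a low degree node, and 10c + 6 bytes for a path compressed node of capacity c), which C bit-fields and alignment rules would not reproduce exactly.

    /* Bitmap node with Type I summaries (Figure 2-7); bit-packed size is 110 bytes. */
    struct BitmapNode {
        unsigned node_type : 3, firstchild_type : 3;
        unsigned char L1[3];            /* S1(1..3), 8 bits each                               */
        unsigned char L2[4][15];        /* S2(block, 1..15), 6 bits each when packed           */
        unsigned char L3[4][16];        /* S3(block, subblock, 1), 2 bits each when packed     */
        unsigned char bitmap[32];       /* 256-bit bitmap                                      */
        unsigned char failptr_offset;   /* position within a path compressed failure target   */
        unsigned int  failptr, ruleptr, firstchildptr;
    };

    /* Low degree node, 2 to 8 success transitions (Figure 2-8); 25 bytes are allocated. */
    struct LowDegreeNode {
        unsigned node_type : 3, firstchild_type : 3, size : 3;  /* size = number of characters */
        unsigned char chars[8];         /* labels of the non-null success transitions          */
        unsigned char failptr_offset;
        unsigned int  failptr, ruleptr, firstchildptr;
    };

    /* Path compressed (O) node of capacity c <= 256 (Figure 2-9); 10c + 6 bytes when packed. */
    struct PathCompressedNode {
        unsigned node_type : 3, firstchild_type : 3;
        unsigned char capacity_minus_6; /* the true capacity minus 6 is stored                 */
        unsigned int  firstchildptr;
        struct {
            unsigned char ch;           /* transition character for this packed state          */
            unsigned char failptr_offset;
            unsigned int  failptr, ruleptr;
        } state[1];                     /* actually c entries, laid out consecutively          */
    };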

36 2.4.4 Bitmap Node with Type I Summaries, W = 32 We map our bitmap node into memory by packing the node type, first child type, failure pointer offset fields as well as 2 of the 3 L1 summaries into a 32-bit block; 2 bits of this block are unused. The remaining L1 summary (S1(3)) together with S2(0, ) are placed into another 32-bit block. The remaining L2 summaries are packed into 32-bit blocks; 5 summaries per block; 2 bits per block are unused. The L3 summaries occupy 4 memory blocks; the bitmap takes 8 blocks; and each of the 3 pointers takes a block. When a bitmap node is reached, the memory block with type fields is accessed to determine the node s actual type. The rule pointer is accessed so we can list all matching rules. A bitmap block is accessed to determine whether we have a match with the input string character. If the examined bit is 0, the failure pointer is accessed and we proceed to the node pointed by this pointer; the failure pointer offset, which was retrieved from memory when the block with type fields was accessed, is used to position us at the proper place in the node pointed at by the failure pointer in case this node is a path compressed node. So, the total number of memory accesses when we do not have a match is 4. When the examined bit of the bitmap is 1, we compute a popcount. This may require between 0 and 3 memory accesses (for example, 0 are needed when bit 0 of the bitmap is examined or when the only summary required is S1(1) or S1(2)). Using the computed popcount, the first child pointer (another memory access) and the first child type (cannot be that of an O node), we move to the next node in our data structure. A total of 4 to 7 memory accesses are made Low Degree Node, W = 32 Next consider the case of a low degree node. We pack the type fields, size field, failure pointer offset field, and the char 1 field into a memory block; 7 bits are unused. The remaining 7 char fields are packed into 2 blocks leaving 8 bits unused. Each of the pointer fields occupies a memory block. When a low degree node is reached, we must access the memory block with type fields as well as the rule pointer. To determine 36

37 whether we have a match at this node, we do an ordered sequential search of the up to 8 characters stored in the node. Let i denote the number of characters examined. For i = 1, no additional memory access is required, one additional access is required when 2 i 5, and 2 accesses are required when 6 i 8. In case of no match we need to access also the failure pointer; the first child pointer is retrieved in case of a match. The total number of memory accesses to process a low degree node is 3 to 5 regardless of whether there is a match Ol, 1 l 5, Nodes, W = 32 For an O1 node, we place the type, failure pointer offset, and char 1 fields into a memory block; the rule, failure and first child pointers are placed into individual memory block. To process an O1 node, we first retrieve the type block and then the rule pointer. The rule pointer is used to list the matching rules. Then, we compare with char 1 that is the retrieved type block. If there is a match, we retrieve the first child pointer and proceed to the node pointed at. In case of no match, we retrieve the failure pointer, which together with the offset in the type block leads us to the next node. So, 3 accesses are needed when an O1 node is reached. The mapping for an O2 is similar to that used for an O1 node. This time, the type block contains char 1 and char 2, the additional rule pointer and failure offset pointers are placed in separate blocks. The number of memory accesses needed to process such a node is 3 when only char 1 is examined (this happens when there is a mismatch at char 1). When char 2 also is examined an additional rule pointer is retrieved. For a mismatch, we must retrieve the second failure pointer as well as its failure pointer offset. So, 5 accesses are needed. For a match, 4 accesses are required. So, in case of a mismatch in an O2 node, 3 or 5 accesses are needed; otherwise, 4 are needed. For O3 nodes, we place char 3 and its associated failure pointer offset into the memory block of O2 that contains the second failure pointer offset. The associated rule and failure pointers are placed in separate memory blocks. When all 3 characters are 37

38 matched, we need 6 memory accesses. When a mismatch occurs at char 1, there are 3 accesses; at char 2, there are 5 accesses; and at char 3, there are 6 accesses. An alternative mapping for an O3 node places the data fields into memory in the following order: node and first child type fields (1 byte total), pairs of character and rule pointer fields ((char j, rule pointer j), 5 bytes per pair), first child pointer (4 bytes), pairs of failure pointer and failure pointer offsets (5 bytes per pair). When i characters are examined, we retrieve (1 + 5i)/4 blocks to process the characters and their rule pointers. In case of a mismatch at character i, 2 additional accesses are needed to retrieve the corresponding failure pointer and its offset. In case of a match, a single additional memory access gets us the first child pointer. So, the total number of memory accesses is (1 + 5i)/4 + 2 when there is a mismatch and (1 + 5i)/4 + 1 when all characters in the nodes are matched. When this alternative matching is used, a mismatch at character i, 1 i 3 takes 4, 5, and 6 memory accesses, respectively. When there is no mismatch, 5 memory accesses are required. For an O4 node, we extend the original O3 mapping by placing char 3, char 4, and offset pointers 3 and 4 in one memory block; and offset pointer 2 in another. Rule and failure pointers occupy one block each. When all 4 characters are matched, we need 7 memory accesses. A mismatch at character i, 1 i 4, results in 3, 5, 6, and 7 accesses, respectively. An O5 node is mapped with chars 3, 4, 5 and offset pointer 3 in a memory block and offset pointers 2, 4, and 5 in another. When all 5 characters in an O5 node are matched, there are 8 memory accesses. When there is a mismatch at character i, 1 i 5, the number of memory accesses is 3, 5, 6, 8, and 9, respectively O Nodes, W = 32 and 1024 For simplicity, we extend the alternative mapping described above for O3 nodes. Fields are mapped to memory in the order: node type, first child type, and capacity fields (2 bytes total), pairs of character and rule pointer fields ((char j, rule pointer j), 5 bytes 38

39 per pair), first child pointer (4 bytes), pairs of failure pointer and failure pointer offsets (5 bytes per pair). The memory access analysis is similar to that for O3 nodes and the total number of memory accesses, when W = 32, is (2 + 5i)/4 + 2 when there is a mismatch and (2 + 5i)/4 + 1 when all characters in the nodes are matched. When W = 1024, an O node fits into a single memory block provided its capacity, c, is no more than 12. Hence, for c 12, a single memory access suffices to process this node. When c > 12, the memory access count using the above mapping is is (2 + 5i)/ Since i c 256, at most 12 memory access are need to process an O node when W = Path Compressed Node of Tuck, W = 32 and 1024 When W = 32, the type, size, failure offset 1, and char 1 through 3 fields of the path compressed node of [57] may be mapped into a single memory block. The char 4 and 5 fields together with the 4 remaining failure pointer offset fields may be mapped into another memory block. For a mismatch at char 1, we need to access block 1, rule pointer 1, and failure pointer 1 for a total of 3 memory accesses. For a failure at char i, 2 i size, we must access also block 2 and an additional i 1 rule pointers. The memory access count is 3 + i.notice that since [57] path compresses end-node sequences only,a failure must occur whenever we process a path compressed node whose size is less than 5 as the last state in such a node has no success transition (i.e., its degree is 0 in the Aho-Corasick automaton). Hence, for a match at this node, we may assume that the size is 5. The two blocks, 5 rule pointers, and the first child pointer are accessed. The total number of memory accesses is 8. When W = 1024, all 52 bytes of the path compressed node fit in a memory block. So, only 1 memory access is needed to process the node. Note that for an end-node sequence with 256 states, 53 path compressed nodes are used. The worst-case accesses to go through this end-node sequence is 53. Using our O node, 12 memory accesses are made in the worst case. 39
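Under the field ordering used above for O nodes (type and capacity bytes, then (character, rule pointer) pairs, then the first child pointer, then (failure pointer, offset) pairs), the W = 32 access counts can be evaluated as in the following sketch; i is the number of characters examined, and the helper function is illustrative rather than part of the implementation.

    /* 32-bit memory blocks touched when processing an O node in which i characters are examined.
       The 2 bytes of type/capacity and the i (char, rule pointer) pairs of 5 bytes each occupy
       ceil((2 + 5*i)/4) blocks; a match then needs the first child pointer, while a mismatch
       needs the corresponding failure pointer and its offset.                                   */
    int o_node_accesses_w32(int i, int match) {
        int prefix_blocks = (2 + 5 * i + 3) / 4;     /* ceil((2 + 5i)/4) */
        return prefix_blocks + (match ? 1 : 2);
    }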

40 Table 2-3. Memory accesses to process a node for W = 32 and W = 64 W =32 W =64 Match Mismatch Match Mismatch B(I) 4 to to 6 3 B(II) 4 to to 5 3 B(III) 4 to to 4 3 L 3 to 5 3 to 5 2 to 3 2 to 3 O O2 4 3 or or 3 O3 6 3, 5, or or 3 O4 7 3 or 5 to to 4 O5 8 3, 5, 6, 4 2, 8, or 9 3 to 5 O 3, 3, 2, 2, 2+5i 2+5i 2+5i 2+5i TB[32] 4 to TO[32] 1 + i, 3, 1 + i, 2+ i, 2 6, i Summary Using a similar analysis, we can derive the memory access counts for different values of the memory bandwidth W, other summary types, and other node types. Table 2-3 and 2-4 give the access counts for the different node and summary types for a few sample values of W. The rows labeled B (bitmap), L (low degree), Ol (O1 through O5), and O refer to node types for our structure while those labeled TB (bitmap) and TO (one degree) refer to node types in the structure of Tuck et al. [57]. We note that the counts of Table 2-3 and 2-4 are specific to a certain mapping of the fields of a node to memory. Using a different mapping will change the memory access count. However, we believe that the mappings used in our analysis are quite reasonable and that using alternative mappings will not improve these counts in any significant manner Mapping States to Nodes We map states to nodes as follows and in the stated order. 1. Category BX, X {B, L, O}, states are mapped to 1 bitmap node each; sibling states are mapped to nodes that are contiguous in memory. Note that in the case of BL and BO states, only a portion of a bitmap node is used. 40

41 Table 2-4. Memory accesses to process a node for W = 128 and W = 1024 W =128 W =1024 Match Mismatch Match Mismatch B(I) 2 to B(II) 2 to B(III) 2 to L 1 to 2 1 to O O O O O5 2 2 or O 1, 1, 1, 1, 2+5i i 2+5i TB[32] TO[32] i Maximal sets of LX, X {L, O}, states that are siblings are packed into unused space in a bitmap node created in (1) using 25 bytes per LX state and the low degree structure of Figure 2-8. By this, we mean that if there are (say) 3 LX states that are siblings and there is a bitmap node with at least 75 bytes of unused space, all 3 siblings are packed into this unused space. If there is no bitmap node with this much unutilized space, none of the 3 siblings is packed into a bitmap node. The packing of sibling LX nodes is done in non-increasing order of the number of siblings. Note that by packing all siblings into a single bitmap node, we make it possible to access any child of a bitmap node using its first child pointer, the child s rank (i.e., index in the layout of contiguous siblings), and the size of the first child (this is determined by the type of the first child). Note that when an LO state whose child is an OO state is mapped in this way, it is mapped together with its lone OO-state child into a single 25-byte O2 node, which is the same size as a low degree node. 3. The remaining LX states are mapped into low degree nodes (LL states) or O2 nodes (LO states). LL states are mapped one state per low degree node. As before, when an LO state whose child is an OO state is mapped in this way, it is mapped together with its lone OO-state child into a single 25-byte O2 node. Sibling states are mapped to nodes that are contiguous in memory. 4. The chains of remaining OO states are handled in groups where a group is comprised of chains whose first nodes are siblings. In each group, we find the length, l, of the shortest chain. If l > 5, set l = 5. Each chain is mapped to an Ol node followed by an O node. The Ol nodes for the group are in contiguous memory. Note that an O node can only be the child of an Ol node or another O node. 41
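To make the child-addressing rule noted in step 2 of the mapping above concrete, the following C++ fragment is a minimal sketch (not the dissertation's code; the type tag, field packing, and node sizes are hypothetical and only illustrate the idea) of how a child of a bitmap node is located from the first child pointer, the child's rank, and the size of the first child:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical node kinds; real sizes depend on the summary type used.
    enum NodeKind : std::uint8_t { BITMAP_NODE, LOW_DEGREE_NODE, O_NODE };

    inline std::size_t nodeSize(NodeKind kind) {
        switch (kind) {
            case BITMAP_NODE:     return 110;  // e.g., Type I summary bitmap node
            case LOW_DEGREE_NODE: return 25;
            default:              return 25;   // O2/Ol nodes in this sketch
        }
    }

    // firstChild points at the first of a block of contiguous sibling nodes,
    // each of which has the same size as the first child. rank is the child's
    // index among the siblings, obtained from the popcount of the bits set
    // before the input character in the parent's bitmap.
    inline const std::uint8_t* childAt(const std::uint8_t* firstChild, unsigned rank) {
        NodeKind kind = static_cast<NodeKind>(firstChild[0] & 0x3);  // hypothetical 2-bit type field
        return firstChild + rank * nodeSize(kind);
    }

Because all siblings are laid out contiguously and share the size of the first child, no per-child pointers are stored in the parent.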

42 Table 2-5. Number of nodes of each type, Ol and O counts are for Type I summaries Node Type B L Ol O TB TO DataSet DataSet Table 2-6. Number of Ol O nodes for Type II and Type III summaries Node Type Ol(II) O(II) Ol(III) O(III) DataSet DataSet Experimental Results We benchmarked our compression method of Section 4.2 against that proposed by Tuck et al. [57] using two data sets of strings extracted from Snort [51] rule sets. The first data set has 1284 strings and the second has 2430 strings. We name each data set by the number of strings in the data set Number of Nodes Table 2-5 and 2-6 give the number of nodes of type I, type II,type III and Tuck et al. [57] in the compressed Aho-Corasick structure for each of our string sets. The maximum capacity of an allocated O node was 141 for data set 1284 and 256 for data set Memory Requirement Although the total number of nodes used by us is less than that used by Tuck et al. [57], our nodes are larger and so the potential remains that we actually use more memory than used by the structure of Tuck et al. [57]. Table 2-7and 2-8 give the number of bytes of memory used by the structure of [57] as well as that used by our structure for each of the different summary types of Section 2.3. Recall that the size of a B node depends on the summary type that is used. As stated in Section 4.2, the B node size is 110 bytes for Type I summaries, 94 bytes for Type II summaries, and 61 bytes for Type III summaries. The memory numbers given in Table 2-7 and 2-8 do not include the 16 bytes (or less) needed for the single T 4 table used by Type II summaries or the 256KB needed by the T 16 table used by Type I summaries. In the case of Type II 42

43 Table 2-7. Memory requirement for data set 1284 Data set 1284 Methods [57] Type I Type II Type III Memory (bytes) Normalized *Excludes memory for T 4 and T 16 Table 2-8. Memory requirement for data set 2430 Data set 2430 Methods [57] Type I Type II Type III Memory(bytes) Normalized *Excludes memory for T 4 and T 16 summaries, adding in the 16 bytes needed by T 4 doesn t materially affect the numbers reported in Table 2-7 and 2-8. For Type III summaries, the 256KB needed for T 16 is more than what is needed for the rest of the data structure. However, as the data set size increases, this 256KB remains unchanged and fixed at 256KB. The row labeled Normalized gives the memory required normalized by that required by the structure of Tuck et al. [57]. The normalized values are plotted in Figure As can be seen, our structures take between 24% and 31% less memory than is required by the structure of [57]. With the 256KB required by T 16 added in for Type III summaries, the Type III representation takes twice as much memory as does [57] for the 1284 data set and 75% more for the 2430 data set. As the size of the data set increases, we expect Type II summaries to be more competitive than [57] on total memory required Popcount Table 2-9 and 2-10 give the total number of additions required to compute popcounts when using each of the data structures. For this experiment, we used 3 query strings obtained by concatenating a differing number of real s that were classified as spam by our spam filter. The string lengths varied from 1MB to 3MB and we counted the number of additions needed to report all occurrences of all strings in the Snort data sets (1284 or 2430) in each of the query strings. The last row of each figure 43

Figure. Normalized memory requirement (memory size vs. rule set size) for Tuck et al. [31] and for Type I, Type II, and Type III summaries.

Figure. Normalized additions for popcount (adds for popcount vs. rule set size) for Tuck et al. [31] and for Type I, Type II, and Type III summaries.

45 Table 2-9. Number of popcount additions, data set 1284 Methods [57] Type I Type II Type III strlen= M 1.37M 1.25M 0.76M strlen= M 4.15M 3.79M 2.29M strlen= M 8.25M 7.51M 4.55M strlen= M 13.74M 12.49M 7.56M strlen= M 20.75M 18.82M 11.37M Normalized Table Number of popcount additions, data set 2430 Methods [57] Type I Type II Type III strlen= M 1.46M 1.33M 0.79M strlen= M 4.43M 4.02M 2.42M strlen= M 8.78M 7.96M 4.80M strlen= M 14.67M 13.28M 8.00M strlen= M 22.25M 20.09M 12.08M Normalized is the total number of adds for all 3 query strings normalized by the total for the structure of [57]. The normalized values are plotted in Figure When Type III summaries are used, the number of popcount additions is only 7% that used by the structure of [57]. Type I and Type II summaries require about 13% and 12%, respectively, of the number of additions required by [57]. 45
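To illustrate where these additions come from, the following C++ fragment is a minimal sketch (not the dissertation's code) of a popcount computed from a per-node summary plus a small lookup table for 4-bit blocks, in the spirit of Table 2-1. The summary layout and bit ordering assumed here are only one plausible arrangement; the field names are placeholders.

    #include <cstdint>

    // T4[b] = number of 1 bits in the 4-bit value b (cf. Table 2-1).
    static const std::uint8_t T4[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};

    // Popcount of the bits that precede position pos in a 256-bit bitmap,
    // given summary[w] = number of 1 bits in all 32-bit blocks before block w
    // (so summary[0] == 0). Bit i of a block is assumed to correspond to
    // character (block*32 + i), least significant bit first.
    static unsigned popcountBefore(const std::uint32_t bitmap[8],
                                   const std::uint8_t  summary[8],
                                   unsigned pos) {
        unsigned block = pos >> 5;            // which 32-bit block holds pos
        unsigned count = summary[block];      // ones in all earlier blocks
        unsigned within = pos & 31;           // bits before pos inside the block
        std::uint32_t bits = within ? (bitmap[block] & ((1u << within) - 1u)) : 0u;
        while (bits) {                        // at most 8 table lookups/additions
            count += T4[bits & 0xF];
            bits >>= 4;
        }
        return count;
    }

A coarser summary (fewer blocks covered per summary entry) trades table lookups for additions, which is the effect the different summary types quantify above.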

CHAPTER 3
COMPRESSED OBJECT ORIENTED NFA FOR MULTI-PATTERN MATCHING

3.1 The Object Oriented NFA for Multi-pattern Matching

The construction of an Object Oriented Nondeterministic Finite Automaton (NFA) for multi-pattern matching starts with the Aho-Corasick NFA. Failure transitions in the AC NFA are eliminated and null transitions are processed as described below. The following discussion closely follows the development in [33]. Figures 3-1 through 3-3 show the three functions: completion of the OO graph, adding an edge to the graph, and the Object Oriented NFA search function.

To complete the NFA graph, we perform the following steps for each pattern. Starting from the root, move to the next state according to the pattern's first character. Move to the next state using the pattern's second character; call it the current state. Repeat the following until the end of the pattern is reached. Check the root state to see if there is a transition on the current (ith) character. If so, we place a state card representing that transition next to the current state and copy all the transitions from that state to the current state (except any existing transitions of the current state). Move to the next state using the pattern's next character. If we have placed state cards next to our previous state, check if there is a transition on our current character. If so, place the state card for that transition next to our current state card and copy all the transitions from that state to the current state (except any existing transitions of the current state).

The AC NFA (with failure transitions) for the pattern set {hers, his, she} is shown in Figure 3-4. To complete the OO NFA, we start with the NFA of Figure 3-4 and process the patterns one by one using the algorithm of Figure 3-1. When processing the first pattern hers, we start from root state 0 and move to state 1 using the pattern's first character h. Then we move to state 2 using character e. We check the root state to see whether there is a transition on our current character e. There is none. So we move to the next state 3 using character r. Since there is

Algorithm 1. Completion of the OO graph
Input: list of patterns, P
Output: completion of the OO graph with starting node root
begin
  queue <- empty
  currentstate = root->next[p[0]]
  for i <- 1 until strlen(p)
  begin
    currentstate = currentstate->next[p[i]]
    while queue <> empty do
    begin
      let temp be the next state in queue
      queue <- queue - {temp}
      temp = temp->next[p[i]]
      if (temp <> NULL)
      begin
        queue <- queue U {temp}
        addedges(currentstate, temp)
      end
    end while
    temp = root->next[p[i]]
    if (temp <> NULL)
    begin
      queue <- queue U {temp}
      addedges(currentstate, temp)
    end
  end for
end

Figure 3-1. Algorithm 1. Completion of the OO graph

Algorithm 2. Add an edge to the graph
Input: startstate, endstate
begin
  for i <- 0 until 255
  begin
    temp = endstate->next[i]
    if temp <> NULL
      startstate->next[i] = temp
  end for
end

Figure 3-2. Algorithm 2. Add an edge to the graph

Algorithm 3. Object Oriented NFA Search Function
Input: input text string
begin
  for i <- 1 until strlen(text)
  begin
    if (currentstate <> NULL)
    begin
      currentstate = currentstate->next[text[i]]
      if (currentstate <> NULL) checkmatchedrule
      else currentstate = root->next[text[i]]
    end
    else currentstate = root->next[text[i]]
  end for
end

Figure 3-3. Algorithm 3. Object Oriented NFA Search Function

The search function of the OO NFA is quite simple: it makes at most 2 transitions per input text character, and whenever there is no match it goes back to the transition starting from the root state.

no transition on r, we move to the next state 4 using the last character s. Now there exists a transition from the root state 0 to state 7 on s. So we copy the state 7 card over next to state 4. For the second pattern his, when we move from state 5 to state 6 on s, we find there exists a transition from the root state 0 to state 7 on s. So we copy the state 7 card over next to state 6. For the third pattern she, when we move from state 7 to state 8 on character h, we copy state card 1 next to state 8 because there is a transition from the root state 0 to state 1 on h. Then we move to the next state 9 on character e. Because we placed state card 1 next to our previous state 8, we need to check whether there exists a transition from that card 1 on our current character e. So we copy state card 2 next to state 9. Figure 3-5 demonstrates what is described above.
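A C++ rendering of the search function of Figure 3-3 makes the at-most-two-transitions behavior explicit. This is a minimal sketch, not the dissertation's implementation: it assumes an uncompressed 256-way next[] array per state and a simplified rule field instead of rule pointers.

    #include <cstddef>

    struct State {
        State* next[256];   // transitions; nullptr where no transition exists
        int    rule;        // >= 0 if a pattern ends at this state (simplified)
    };

    // Scan text[0..n-1] with the OO NFA rooted at root, reporting matches.
    void ooSearch(State* root, const unsigned char* text, std::size_t n) {
        if (n == 0) return;
        State* current = root->next[text[0]];
        for (std::size_t i = 1; i < n; ++i) {
            if (current != nullptr) {
                current = current->next[text[i]];
                if (current != nullptr) {
                    if (current->rule >= 0) { /* checkmatchedrule: record a match */ }
                } else {
                    current = root->next[text[i]];   // restart from the root
                }
            } else {
                current = root->next[text[i]];
            }
        }
    }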

Figure 3-4. Aho-Corasick NFA with failure pointers (all failure pointers point to state 0)

Figure 3-6 converts this state card representation to a Nondeterministic Finite Automaton (NFA). As we can see, four additional transitions have been added to complete the initial NFA of Figure 3-4, the Aho-Corasick NFA with failure pointers for the same set of patterns. Figure 3-7 is the DFA obtained from both Figures 3-4 and 3-6 for our set of patterns. From these figures, one observation we can make is that the OO NFA is a partially transformed automaton, intermediate between the Aho-Corasick NFA and the final DFA: it keeps all the DFA transitions that start from states of depth 2 or greater and end at states of depth 2 or greater.

3.2 Compressed OO NFA

The Object Oriented NFA may be compressed to obtain a compressed OO trie using the method of Tuck et al. [57]. Three types of nodes are employed in a compressed OO trie. Besides bitmap and path compressed nodes, we add COPY nodes that simply act as a soft link to an original node that has been copied over next to the current node. Figure 3-10 shows the format of a COPY node. Bitmap and path compressed nodes in compressed OO tries use the same format as in [57] except that the failure pointer and failure pointer offset fields are omitted in Figures 3-8 and 3-9.
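As a rough illustration of the three node formats just described, the C++ structs below follow the field widths of Figures 3-8 through 3-10; the exact bit packing is only a sketch (fields that occupy 2 bits in the figures are given whole bytes here) and is not the layout used in our implementation.

    #include <cstdint>

    enum OOKind : std::uint8_t { OO_BITMAP = 0, OO_PATH = 1, OO_COPY = 2 };

    struct OOBitmapNode {            // cf. Figure 3-8
        std::uint8_t  type;          // 2 bits used
        std::uint8_t  summary[7];    // L1 summaries S1..S7
        std::uint32_t bitmap[8];     // 256-bit bitmap
        std::uint32_t rulePtr;
        std::uint32_t firstChildPtr;
    };

    struct OOPathNode {              // cf. Figure 3-9
        std::uint8_t  type;          // 2 bits type (capacity is a further 2 bits)
        std::uint8_t  capacity;
        std::uint32_t firstChildPtr;
        std::uint8_t  ch[4];         // up to 4 path characters
        std::uint32_t rulePtr[4];    // one rule pointer per character
    };

    struct OOCopyNode {              // cf. Figure 3-10: a soft link to the copied-over node
        std::uint8_t  type;
        std::uint32_t originalStart;     // start position of the original node
        std::uint8_t  originalOffset;    // 2-bit offset within that position
    };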

Figure 3-5. The Object Oriented NFA (state cards representation)

3.3 Experimental Results

We benchmarked the OO method of Section 3.1 and our compression method of Section 3.2 against the method proposed by Tuck et al. [57] and the Aho-Corasick automaton [1] using six data sets of 1000, 2000, 3000, 4000, 5000, and 6000 English words. The experiments were performed on a Linux (Fedora 7) system and all programs are written in C++. Table 3-1 gives the search times for English patterns. Tables 3-2 and 3-3 give, respectively, the memory required by the different multi-pattern data structures and the node distributions.

Figure 3-6. OO NFA (all states that have no matched character will return back to state 0)

Figure 3-7. The DFA for our set of patterns

Table 3-4 gives the compression ratio, relative to the original uncompressed data structure, achieved by each compressed data structure. Although the OO method of Section 3.1 is faster than the Aho-Corasick automaton [1] by 25% to 29% for search, the compressed OO trie method is slower than the compressed Aho-Corasick trie [57] by 8% to 21%. Also, the compression ratio for the OO method is not as good as that of the Aho-Corasick compression method [57]: the OO compression ratio is much smaller than the Aho-Corasick compression ratio (about 40.8).

Figure 3-8. OO bitmap node. Fields: node type (2 bits), L1 summaries S1-S7 (8 bits × 7 = 56 bits), bitmap (256 bits), rule ptr (32 bits), firstchild ptr (32 bits).

Figure 3-9. OO path compressed node. Fields: node type (2 bits), capacity (2 bits), firstchild ptr (32 bits), and four pairs of char (8 bits) and rule ptr (32 bits).

Figure 3-10. OO copy node. Fields: node type (2 bits), start position of original node (32 bits), start position offset of original node (2 bits).

The reason for this is that the OO structure has more non-null next-node pointers. These extra non-null pointers point to a large number of COPY nodes, which do not exist in the Aho-Corasick compression method [57]. While the OO structure is preferred over the other structures considered in this chapter in applications that are not memory constrained, the compressed Aho-Corasick trie is preferred when memory is severely limited.

53 Table 3-1. Search times (milliseconds) for English patterns Length OO AC OOC ACC Table 3-2. Memory required (bytes) for English patterns Num of Patterns OOC OO ACC AC ,691,460 5,642, ,772 5,664, ,794,312 11,313, ,380 11,357, ,944,576 15,733, ,540 15,795, ,499,232 19,766, ,764 19,843, ,989,200 23,769, ,524 23,861, ,587,328 28,075, ,716 28,184,676 Table 3-3. Number of different type nodes for English patterns Num of Patterns OO-Bitmap OO-Compressed OO-COPY AC-Bitmap AC-Compressed , , , , ,389 1,820 3, , ,650 2,759 4, , ,841 3,848 5, , ,816 4,927 6, , ,575 6,031 7,446 Table 3-4. Compression ratio for English patterns Num of Patterns OO/OOC AC/ACC OOC/ACC OO= OO method of Section 3.1, AC= Aho-Corasick automata[1] OOC= our compression method of Section 3.2, AAC= Aho-Corasick compression method [57] 53

54 CHAPTER 4 THE COMPRESSED AHO-CORASICK AUTOMATA ON IBM CELL PROCESSOR In this chapter, we develop a multicore algorithm for multi-pattern matching. Specifically, we choose an established multi-core architecture, the IBM Cell/Broadband Engine (Cell) for our work because it is a prominent architecture in the high-performance computing community, it has shown potential in string matching applications, and it presents software designers with non-trivial challenges that are representative of the next generations of multi-core architectures. With our proposed algorithm, we achieve an average compression ratio of 1:34 for English words and 1:58 for random binary patterns. Our implementation provides a sustained throughput between 0.90 and 2.35 Gbps per Cell blade in different application scenarios, while supporting dictionary densities up to 9.26 million average patterns per Gbyte of main memory. 4.1 The Cell/Broadband Engine Architecture The Cell processor [10] contains 9 heterogeneous cores on a silicon die. One of them is a traditional 64-bit processor with cache memories and 2-way simultaneous multi-threading, called Power Processor Element (PPE), and capable of running a full-featured operating system and traditional PowerPC applications. The other 8 cores are called Synergistic Processor Elements (SPEs). They have no caches, but rather a small amount of scratch-pad memory (256 kbyte) that the programmer must manage explicitly, by issuing DMA transfer from and to the main memory. The cores are connected with each other via the Element Interconnect Bus (EIB), a fast double ring on-chip network. Figure 4-1 shows the chip layout of the Cell architecture. The Cell delivers its best performance when the SPEs are kept highly utilized by streaming tasks that load data from main memory, process data locally and commit the results back to main memory. These tasks exhibit a regular, predictable memory access 54

55 Table 4-1. Compression Ratios obtained by our technique on two sample dictionaries of comparable uncompressed size. Dictionary(1) contains the 20,000 most common words in the English language.dictionary(2) contains 8,000 random binary patterns of same average length as in Dictionary (1). Dictionary Original Packing Compressed Compression AC Size Factor AC Size Ratio (1) English Mbytes Mbytes Mbytes (2) Binary Mbytes Mbytes Mbytes Mbytes pattern that the programmer can exploit to implement double buffering, and overlap computation and data-transfer over time. * Achieving high performance on the Cell with non-streaming applications is all but trivial, and algorithms based on DFAs like ours are arguably the most difficult to port. In fact, these algorithms exhibit unpredictable memory access patterns and a complex latency interaction between compute code and data-transfer code. These circumstances make it difficult to determine what represents the critical path in the code, and how to optimize it. Figure 4-3 shows a path-compressed node. 4.2 Cell-oriented Algorithm Design This section describes the implementation choices we made to adapt our AC NFA algorithm to the Cell processor. To compute popcounts efficiently, we employ the CNTB and SUMB instructions (available at the C level via the spu cntb() and spu sumb() intrinsics). These reduce the number of operations to compute the popcount from 31 additions (summary+bit0+bit1+...+bit30) to two spu instructions plus one summary addition. Sample code to compute the popcount for childnode i (0 i 255) of a compressed Aho-Corasick node is given below. popcount=get_summary(i); 55

bitblock = get_bitmapblock(i);
charvector = spu_promote(bitblock, 0);
countbyteones = spu_cntb(charvector);
countblockones = spu_sumb(countbyteones, countbyteones);
popcount = popcount + spu_extract(countblockones, 0);

Also, we employ vector comparison instructions to get the longest match between the input and the compressed paths. For alignment reasons, we only consider path-compressed nodes with packing factors (c) of 4, 8, and 12. Table 4-1 shows the corresponding compression ratios. Note that 4 is the best choice for the English dictionary and 8 is best for random binary patterns. For simplicity, we consider a packing factor of 4 in the experiments that follow; the difference in compression gain obtained with a packing factor of 8 is not significant enough to justify the increase in algorithm complexity. Using this compressed automaton, we can compress dictionaries with an average compression ratio of 1:34 for English dictionaries and 1:58 for random binary patterns.

We now describe the optimizations we employed to map our compressed AC algorithm to the Cell architecture and their impact. Results were obtained with the IBM Cell SDK 3.0 on IBM QS22 blades. Table 4-2 shows the impact of the optimization steps on the performance and quality of the code. We started from a naïve compressed AC implementation and applied branch hinting, branch replacement with conditional expressions, vertical unrolling, data structure realignment, branch removal, arithmetic strength reduction, and horizontal unrolling. The aggregate effect of these optimizations is to increase the throughput (by reducing the number of cycles spent per character), reduce the cycles per instruction (CPI), reduce stalls, and increase the dual issue rate (i.e., the fraction of clock cycles in which both pipelines in an SPE issue a new instruction). These techniques help to decrease the CPI, the branch stall cycle rate, and the dependency stall cycle rate. They also decrease the single instruction issue rate and

57 Table 4-2. The impact of the optimization steps on the performance of our compressed AC NFA algorithm when evaluated in the four application scenarios presented in Section 6.5. Packing factor for compressed-path node is 4. Typical Cycles/ CPI Insts Used NOP Branch Dep. Single Dual Optimization Step Throughput char per Regs Stall Stall Issue Issue Speedup (Gbps) (1 SPE) char Rate Rate Rate Rate Rate Scenario A: Full Text Search (0) Unoptimized PPE baseline implementation =1.0 (1) Naïve implementation on 8 SPEs % 10.4% 27.9% 47.3% 11.5% 17.1 (2) 1 Engine, branch hints, conditional expr % 23.7% 25.2% 38.0% 11.0% 18.5 (3) 4 Engines, loops unrolling, alignment % 19.5% 29.0% 37.7% 11.7% 19.7 (4) 1 Engine, branch removal % 2.7% 26.1% 44.9% 24.0% 21.5 (5) 1 Engine, cheaper pointer arithmetics % 1.8% 26.0% 46.3% 23.4% 21.6 (6) 4 Engines, horizontal unrolling % 3.2% 17.4% 43.8% 33.7% 25.1 Scenario B: Network Content Monitoring (0) Unoptimized PPE baseline implementation =1.0 (1) Naïve implementation on 8 SPEs % 15.0% 20.7% 52.0% 10.0% 8.0 (2) 1 Engine, branch hints, conditional expr % 8.6% 27.6% 48.7% 12.6% 10.8 (3) 4 Engines, loops unrolling, alignment % 20.0% 20.0% 40.9% 16.3% 12.1 (4) 1 Engine, branch removal % 1.0% 27.8% 48.5% 20.7% 12.4 (5) 1 Engine, cheaper pointer arithmetics % 1.4% 23.2% 44.6% 28.4% 17.9 (6) 4 Engines, horizontal unrolling % 2.4% 22.6% 44.9% 27.9% 25.3 Scenario C: Network Intrusion Detection (0) Unoptimized PPE baseline implementation =1.0 (1) Naïve implementation on 8 SPEs % 15.4% 27.7% 49.1% 5.3% 6.2 (2) 1 Engine, branch hints, conditional expr % 14.1% 21.8% 43.4% 17.8% 7.0 (3) 4 Engines, loops unrolling, alignment % 16.2% 25.3% 44.8% 11.6% 7.8 (4) 1 Engine, branch removal % 1.2 % 27.6% 48.5% 20.8% 7.9 (5) 1 Engine, cheaper pointer arithmetics % 3.4% 22.9% 42.9% 28.4% 9.8 (6) 4 Engines, horizontal unrolling % 2.5% 22.6% 44.7% 28.0% 16.1 Scenario D: Anti-Virus Scanning (0) Unoptimized PPE baseline implementation =1.0 (1) Naïve implementation on 8 SPEs % 12.8% 26.0% 46.5% 12.2% 4.9 (2) 1 Engine, branch hints, conditional expr % 14.8% 22.0% 42.7% 17.4% 6.1 (3) 4 Engines, loops unrolling, alignment % 16.4% 25.4% 44.5% 11.7% 7.6 4) 1 Engine, branch removal % 7.6% 24.7% 45.4% 19.9% 17.3 (5) 1 Engine, cheaper pointer arithmetics % 1.2% 23.1% 44.9% 28.3% 18.4 (6) 4 Engines, horizontal unrolling % 7.9% 20.6% 41.8% 27.3%

58 increase the dual instruction issue rate. Overall, the optimization effort results in a 16 to 25 times throughput speedup against the unoptimized PPE baseline implementation Step (2): Branch Replacement and Hinting Whenever possible, we restructure the control flow so to replace if statements with conditional expressions. We inspect the assembly output to make sure that the compiler renders conditional expression with select bits instructions rather than branches. A major if statement in the compressed AC NFA kernel does not benefit from this strategy, i.e., the one that branches depending on whether the node type is bitmap or path-compressed. The two branches are too different to reduce to conditional expressions. We reduce the misprediction penalty associated with this branch by hinting to mark the bitmap case as the more likely, as suggested by our profiling on realistic data Step (3): Loop Unrolling, Data Alignment We apply unrolling to a few relevant bounded innermost loops, and we apply data structure alignment. Our algorithm consists of two major parts: a compute part and a memory access part. Since the compressed AC is too large to fit entirely in the SPEs local stores, we store it in main memory. We safely ignore the impact of memory accesses required to load input text from main memory to local store and write back matches in the opposite direction. In fact, we implement both transfers in a double-buffered way, overlapping computation and data transfer in time. The below pseudo code shows the major part of the vertical unrolling method in the algorithm. while () { //Automaton1: waitdma update current node if (type==bitmap) Process BITMAP NODE 58

59 else //type==pathcompressed Process PATHCOMPRESSED NODE DMA transfer request //Automaton2: waitdma update current node if (type==bitmap) Process BITMAP NODE else //type==pathcompressed Process PATHCOMPRESSED NODE DMA transfer request... } When a single instance of an AC NFA runs, it computes its next-iteration node pointer and then fetches this node via a DMA transfer from main memory. DMA transfers have round-trip time of hundreds of clock cycles. To utilize these cycles, we run multiple concurrent automata, each checking matches in different segments of the input, unrolling their code together vertically. Multiple automata can pipeline memory accesses, overlapping the DMA transfer delays. Figure 4-4 shows how two automata overlap their computation part with their DMA transfer wait time. Figure 4-5 illustrates how different vertical unrolling factors affect the performance. We choose vertical unrolling factor 8 in our implementation as it gives the minimal DMA transfer delay. We also performed an experiment to find out the best DMA transfer size to make full use of the bandwidth and minimize the DMA transfer delay. The psudocode below shows how to measure the DMA transfer time with different DMA transfer size. i=0 59

DMA transfer request(transfer_size)
record time1
while (i < n) {
  waitdma
  DMA transfer request(transfer_size)
  i = i + 1
}
record time2
single_dma_transfer_time = (time2 - time1)/n

Figure 4-8 shows the optimal transfer size is 64 bytes over the eight SPUs.

Step (4): Branch Removal, Select-bits Intrinsics

After replacing if statements with conditional expressions, the branch miss stalls still account for about one fifth of the total compute cycles. We use IBM asmvis [3] to inspect the static timing analysis of our code at the assembly level. It helps us get a clear view of what the compiler is doing, instruction by instruction. The inspection reveals that conditional expressions are often translated by the compiler into expensive branch instructions. In this case, our code still suffers from expensive branch miss penalties, which can cost as much as 26 clock cycles each. To eliminate branches, we manually replace conditional expressions with the spu_sel intrinsic [9].

c = (a > b) ? a : b;  <==>  select = spu_cmpgt(a, b);
c_plus_1 = spu_add(c, 1);
a_plus_b = spu_add(a, b);
c = spu_sel(c, c_plus_1, select);
d = spu_sel(a_plus_b, d, select);

The basic idea is to compute the two possible results of both branches and select one of them using a select-bits instruction. For example, the transformation reduces branch miss stalls from 19.5% to 2.7% of the cycle count for the full-text search scenario.

Step (5): Strength Reduction

We manually apply operator strength reduction (i.e., replacing multiplications and divisions with shifts and additions) where the compiler did not. In addition, we use cheap pointer arithmetic to load four adjacent integer elements into a 128-bit vector, which reduces the load overhead. Manual strength reduction reduces the overall clock cycles by 3% for the full-text search scenario.

Step (6): Horizontal Unrolling

After Steps 1-4, dependency stalls occupy about 25% of the computation time. Within the NFA compute code, one branch handles bitmap nodes, while the other one handles path-compressed nodes. In the code of both cases, there are frequent read-after-write data dependencies. To reduce the dependency stalls, we interleave the code of multiple, distinct automata; we call this operation horizontal unrolling. These multiple automata process independent input streams against the same dictionary. They have distinct states and input/output buffers, and they require multiple, distinct DMA operations to perform the associated streamed double buffering. The buffer size is 4096 bytes in our experiments. The horizontal unroll factor must be chosen carefully to reflect the trade-off between the decreased dependency stalls and the potentially increased branch stalls. Our experiments show that unrolling 2 NFAs achieves the highest performance improvement, 10%. For example, for the full-text search scenario, dependency stalls decreased from 26.0% to 17.4%, while branch stalls increased from 1.8% to 3.2%.
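The vertical unrolling of Step (3) and the horizontal unrolling of Step (6) both rest on the same mechanism: each automaton issues a DMA request for its next node and other automata do useful work while that transfer is in flight. The fragment below is a minimal sketch of one such non-blocking node fetch on an SPE using the standard MFC intrinsics; the node size, local-store buffers, tag assignment, and helper names are illustrative only, not our actual code.

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define NODE_BYTES 128   /* illustrative: one compressed node per fetch */

    static char node_ls[2][NODE_BYTES] __attribute__((aligned(128)));  /* local store */

    /* Start fetching the node at effective address ea into slot 0 or 1. */
    static void node_fetch_start(uint64_t ea, int slot) {
        mfc_get(node_ls[slot], ea, NODE_BYTES, slot /* DMA tag */, 0, 0);
    }

    /* Block until the fetch issued on the given tag/slot has completed. */
    static void node_fetch_wait(int slot) {
        mfc_write_tag_mask(1 << slot);
        mfc_read_tag_status_all();
    }

    /* Two interleaved automata: while automaton A's node is being processed,
     * the DMA for automaton B's next node is already in flight, and vice versa.
     * Only the steady state is shown; computing the next node addresses is elided. */
    void run_two_automata(uint64_t nextA, uint64_t nextB, int steps) {
        node_fetch_start(nextA, 0);
        node_fetch_start(nextB, 1);
        for (int i = 0; i < steps; ++i) {
            node_fetch_wait(0);
            /* process automaton A's node in node_ls[0]; update nextA (elided) */
            node_fetch_start(nextA, 0);

            node_fetch_wait(1);
            /* process automaton B's node in node_ls[1]; update nextB (elided) */
            node_fetch_start(nextB, 1);
        }
    }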

62 Table 4-3. Aggregate throughput on an IBM Cell chip with 8 SPUs (Gbps). Scenario Throughput (Gbps) Full-text search 1.14 Network content monitoring 1.43 Network intrusion detection 0.90 Anti-Virus scanning 1.25 Full-text search (100% match) 1.69 Anti-Virus scanning (100% match) Experimental Results In this section, we benchmark our software design in a set of representative scenarios. We use two dictionaries to generate compressed AC automata: Dictionary 1 contains the 20,000 most common words in the English language, while Dictionary 2 contains 8000 random binary patterns. We benchmark the algorithm on three input files: the King James Bible, a tcpdump stream of captured network traffic and a randomly generated binary file. Figure 4-9 and Table 4-3 show the aggregate throughput of our algorithm on a dual-chip blade (16 SPEs) in the six scenarios described below. Scenario A (Dictionary 1 against the Bible) is representative of full-text search systems. Scenario B (Dictionary 1 against the network dump) is representative of content monitoring systems. Scenario C (Dictionary 2 against the network dump) is representative of Network Intrusion Detection Systems (NIDSs). Scenario D (Dictionary 2 against binary patterns) is representative of anti-virus scanners. The last two scenarios in the figure are representative of systems (with Dictionary 1 and 2, respectively) under a malicious, content-based attack. In fact, a system whose performance degrades dramatically when the input exhibits frequent matches with the dictionary is subject to content-based attacks. An attacker that gains partial or full knowledge of the dictionary could provide the system with traffic specifically designed to overflow it. In scenarios five and six we provide our system with inputs entirely 62

63 composed of words from the dictionary. Our experiments show a desirable property of our algorithm: its performance actually increases in case of frequent hitting. The reason is that our NFA spends a similar amount of time to process a bitmap or a path-compressed node. For this reason, a mismatch takes a comparable amount of time to the match of an entire path. For this reason, the cycles spent per input character decrease when more input characters match the dictionary. Path-compressed nodes pack as many as 4 or 8 original AC nodes, and allow multi-character match at one time. Figure 4-10 shows how the percentage of matched patterns affects the aggregate throughput on the IBM cell blade with 16 SPUs for the virus scanning scenario. As the percentage of the matched patterns increases, the aggregate throughput increases as well. We explore the trade-offs between the AC compression ratio and the throughput in a Pareto space. We choose the English dictionary as the compression object and choose packing factors of 4, 8, 12 for path compressed nodes. As shown in Figure 4-11, the compression ratio decreases with increase in the packing factor. However, the throughput is better with a packing factor of 8 than with one of 4. The reason for that is the input data is a English input which has 100% match against the dictionary. So instead of matching 4 nodes in the path compressed node at one time, matching 8 nodes at one time gives better performance. However, a packing factor of 12 has some throughput degradation compared to a packing factor of 8. One conclusion we draw from this Pareto chart is the compression ratio affects the throughput, in order to get a better compression ratio, we have to sacrifice throughput. 4.4 Related Work Snort [50] and Bro [35 39] are two of the more popular public domain Network Intrusion Detection Systems (NIDSs). The current implementation of Snort uses the optimized version of the AC automaton [1]. Snort also uses SFK search and the Wu-Manber [66] multi-string search algorithm. 63

64 To reduce the memory requirement of the AC automaton, Tuck et al. [57] have proposed starting with the non-deterministic AC automaton and using bitmaps and path compression. In the network security domain, bitmaps have been used also in the tree bitmap scheme [16] and in shape shifting and hybrid shape-shifting tries 1 [52, 62]. Path compression has been used in several IP address lookup structures including tree bitmap [16] and hybrid shape-shifting tries [62]. These compression methods reduce the memory required to about 1/30 1/50 of that required by an AC DFA or a Wu-Manber structure, and to slightly less than what required by SFK search [57]. However, lookups on path-compressed data require more computation at search time, e.g., more additions at each node to compute popcounts, thus requiring hardware support to achieve competitive performance. Zha and Sahni [68] have suggested a compressed AC trie inspired by the work of Tuck et al. [57]: they use bitmaps with multiple levels of summaries, as well as an aggressive path compaction. Zha and Sahni s technique requires 90% fewer additions to compute popcounts than Tuck et al [57] s, and occupies 24% 31% less memory. Scarpazza et al. [46] propose a memory-based implementation of the deterministic AC algorithm that is capable of supporting dictionaries as large as the available main memory, and achieves a search performance of Gbps per Cell chip. Scarpazza et al. [47] also propose regular expression matching against small rule sets (which suits the needs of the search engine tokenizers) delivering 8-14 Gbps per Cell chip. 1 A trie is a tree-based data structure frequently used represents dictionaries and associative arrays that have strings as a key. 64

Figure 4-1. Chip layout of the Cell/Broadband Engine Architecture.

Figure 4-2. Bitmap node layout. Fields: node type (1 bit), failptr offset (3 bits), L1 summaries S1-S7 (8 bits × 7 = 56 bits), bitmap (256 bits), failure ptr (32 bits), rule ptr (32 bits), firstchild ptr (32 bits).

Figure 4-3. Path-compressed node layout with packing factor equal to four. Fields: node type (1 bit), capacity (3 bits), failptroff1-failptroff4, firstchild ptr (32 bits), char1-char4 (32 bits), failptr1-failptr4 (32 bits × 4), ruleptr1-ruleptr4 (32 bits × 4).

Figure 4-4. How two automata overlap the computation part with their DMA transfer wait time.

Figure 4-5. The number of cycles processed per character (cycles/char vs. number of NFAs) with different vertical unrolling factors (full-text search scenario).

Figure 4-6. How the throughput (Gbps) grows with each optimization step (full-text search scenario).

Figure 4-7. Utilization of clock cycles (on each SPE) per input character following each optimization step (full-text search scenario): NOPs, branch stalls, dependency stalls, single-issue cycles, and dual-issue cycles.

Figure 4-8. DMA inter-arrival transfer delay (ns) from main memory to local store vs. number of SPEs, when 8 SPEs are used concurrently, for transfer sizes of 64, 128, 256, 512, and 1024 bytes.

Figure 4-9. Aggregate throughput (Gbps) of our algorithm on an IBM QS22 blade (16 SPEs) for full-text search, network content monitoring, intrusion detection, virus scanning, full-text search (100% match), and virus scanning (100% match).

Figure 4-10. How the percentage of matched patterns (up to 100%) affects the aggregate throughput (Gbps). The input here is English input data, with the English dictionary.

Figure 4-11. The trade-off between the compression ratio and the throughput (Gbps) in a Pareto space, for path-compressed node packing factors of 4, 8, and 12. The input here is English input data, with the English dictionary.

CHAPTER 5
FAST IN-PLACE FILE CARVING FOR DIGITAL FORENSICS

In this chapter, we focus on a popular open source file recovery tool, Scalpel, which performs file carving using the Boyer-Moore string search algorithm to locate headers and footers in a disk image. We show that the time required for file carving may be reduced significantly by employing multi-pattern search algorithms such as the multipattern Boyer-Moore and Aho-Corasick algorithms, as well as asynchronous disk reads and the multithreading typically supported on multicore commodity PCs. Using these methods, we are able to do in-place file carving in essentially the time it takes to read the disk whose files are being carved. Since, using our methods, the limiting factor for performance is the disk read time, there is no advantage to using accelerators such as GPUs, as has been proposed by others. To further speed in-place file carving, we would need a mechanism to read the disk faster.

5.1 In-place Carving Using Scalpel 1.6

The normal way to retrieve a file from a disk is to search the disk directory, obtain the file's metadata (e.g., location on disk) from the directory, and then use this information to fetch the file from the disk. Often, even when a file has been deleted, it is possible to retrieve the file using this method because, typically, when a file is deleted, a delete flag is set in the disk directory and the remainder of the directory metadata associated with the deleted file is left unaltered. Of course, the creation of new files or changes to remaining files following a delete may make it impossible to retrieve the deleted file using the disk directory, as the new files' metadata may overwrite the deleted file's metadata in the directory and changes to the remaining files may use the disk blocks previously used by the deleted file. In file carving, we attempt to recover files from a target disk whose directory entries have been corrupted. In the extreme case the entire directory is corrupted and all files on the disk are to be recovered using no metadata. The recovery of disk files in the

71 Table 5-1. Example headers and footers in Scalpel s configuration file File type Header Footer gif \x47\x49\x46\x38\x37\x61 \x00\x3b gif \x47\x49\x46\x38\x39\x61 \x00\x3b jpg \xff\xd8\xff\xe0\x00\x10 \xff\xd9 htm <html </html> txt BEGIN\040PGP zip PK\x03\x04 \x3c\xac absence of directory metadata is done using header and footer information for the file types we wish to recover. Table 5-1 gives the header and footer for a few popular file types. This information was obtained from the Scalpel configuration file [40]. \x[0-f][0-f] denotes a hexadecimal value while \[0-3][0-7][0-7] is an octal value. So, for example, \x4f\123\i\scci decodes to OSI CCI. In file carving, we view a disk as being serial storage (the serialization being done by sequentializing disk blocks) and extract all disk segments that lie between a header and its corresponding footer as being candidates for the files to be recovered. For example, a disk segment that begins with the string <html and ends with the string </html> is carved into an htm file. Since a file may not actually reside in a consecutive sequence of disk blocks, the recovery process employed in file carving is clearly prone to error. Nonetheless, file carving recovers disk segments delimited by a header and its corresponding footer that potentially represent a file. These recovered segments may be analyzed later using some other process to eliminate false positives. Notice that some file types may have no associated footer (e.g., txt files have a header specified in Table 5-1 but no footer). Additionally, even when a file type has a specified header and a footer one of these may be absent in the disk because of disk corruption (for example). So, additional information (such as maximum length of file to be carved for each file type) is used in the file carving process. See [34] for a review of file carving methods. Scalpel [40] is an improved version of the file carver Foremost [19]. At present, Scalpel is the most popular open source file carver available. Scalpel carves files in two phases. In the first phase, Scalpel searches the disk image to determine the location 71

72 Table 5-2. Examples of in-place file carving output Filename Start Truncated Length Image gif/ gif NO 2746 /tmp/linux-image gif/ gif NO 4234 /tmp/linux-image jpg/ jpg NO 675 /tmp/linux-image htm/ htm NO 823 /tmp/linux-image txt/ txt NO 56 /tmp/linux-image zip/ zip NO /tmp/linux-image of headers and footers. This phase results in a database with entries such as those shown in Table 5-2. This database contains the metadata (i.e., start location of file, file length, file type, etc.) for the files to be carved. Since the names of the files cannot be recovered (as these are typically stored only in the disk directory, which is presumed to be unavailable), synthetic names are assigned to the carved files in the generated metadata database. The second phase of Scalpel uses the metadata database created in the first phase to carve files from the corrupted disk and write these carved files to a new disk. Even with maximum file length limits placed on the size of files to be recovered, a very large amount of disk space may be needed to store the carved files. For example, Richard et al. [41] reports a recovery case in which carving a wide range of file types for a modest 8GB target yielded over 1.1 million files, with a total size exceeding the capacity of one of our 250GB drives. As observed by Richard et al. [41], because of the very large number of false positives generated by the file carving process, file carving can be very expensive both in terms of the time taken and the amount of disk space required to store the carved files. To overcome these deficiencies of file carving, Richard et al. [41] propose in-place file carving, which essentially generates only the metadata database of Table 5-2. The metadata database can be examined by an expert and many of the false positives eliminated. The remaining entries in the metadata database may be examined further to recover only desired files. Since the runtime of a file carver is typically dominated 72

by the time for phase 2, in-place file carvers take much less time than do file carvers. Additionally, the size of even a 1 million entry metadata database is less than 60MB [41]. So, in-place carving requires less disk space as well.

Although in-place file carving is considerably faster than file carving, it still takes a large amount of time. For example, in-place file carving of a 16GB flash drive with a set of 48 rules (header and footer combinations) using the first phase of Scalpel 1.6 takes more than 30 minutes on an AMD Athlon PC equipped with a 2.6GHz Core2Duo processor and 2GB RAM. Marziale et al. [29] have proposed the use of massive threading, as supported by a GPU, to improve the performance of an in-place file carver. In this chapter, we demonstrate that hardware accelerators such as GPUs are of little benefit for in-place file carving. Specifically, by replacing the search algorithm used in Scalpel 1.6 with a multipattern search algorithm such as the multipattern Boyer-Moore [7, 30, 66] or Aho-Corasick [1] algorithm and doing disk reads asynchronously, the overall time for in-place file carving using Scalpel 1.6 becomes very comparable to the time taken to just read the target disk that is being carved. So, the limiting factor is disk I/O and not CPU processing. Further reduction in the time spent searching the target disk for footers and headers, as possibly attainable using a GPU, cannot reduce the overall time below the time needed to just read the target disk. To get further improvement in performance, we need improvement in disk I/O.

There are essentially two tasks associated with in-place carving: (a) identify the locations of the specified headers and footers in the target disk and (b) pair headers with corresponding footers while respecting the additional constraints (e.g., maximum file length) specified by the user. The time required for (b) is insignificant compared to that required for (a). So, we focus on (a). Scalpel 1.6 locates headers and footers by searching the target disk using a buffer of size 10MB. Figure 5-1 gives the high-level control flow of Scalpel 1.6. A 10MB buffer is filled from disk and then searched for headers and footers. This process is repeated

74 read buffer search buffer Figure 5-1. Control flow Scalpel 1.6 (a) for (i=1;i<p;i++) search for headeri for (i=1;i<p;i++) if (headeri found && footeri <> empty && currentpos-headeripos <maxcarvesize) search footeri Figure 5-2. Control flow Scalpel 1.6 (b) until the entire disk has been searched. When the search moves from one buffer to the next, care is exercised to ensure that headers/footers that span a buffer boundary are detected. Searching within a buffer is done using the algorithm of Figure 5-2. In each buffer, we first search for headers. The search for headers is followed by a search for footers. Only non-null footers that are within the maximum carving length of an already found header are searched for. To search a buffer for an individual header of footer, Scalpel 1.6 uses the Boyer-Moore pattern matching algorithm [8], which was developed to find all occurrences of a pattern P in a string S.. This algorithm begins by positioning the first character of P at the first 74

75 character of S. This results in a pairing of the first P characters of S with characters of P. The characters in each pair are compared beginning with those in the rightmost pair. If all pairs of characters match, we have found an occurrence of P in S and P is shifted right by 1 character (or by P if only non-overlapping matches are to be found). Otherwise, we stop at the rightmost pair (or first pair since we compare right to left) where there is a mismatch and use the bad character function for P to determine how many characters to shift P right before re-examining pairs of characters from P and S for a match. More specifically, the bad character function for P gives the distance from the end of P of the last occurrence of each possible character that may appear in S. So, for example, if the characters of S are drawn from the alphabet {a, b, c, d}, the bad character function, B, for P = abcabcd has B(a) = 4, B(b) = 3, B(c)= 2, and B(d) = 1. In practice, many of the shifts in the bad character function of a pattern are close to the length, P, of the pattern P making the Boyer-Moore algorithm a very fast search algorithm. In fact, when the alphabet size is large, the average run time of the Boyer-Moore algorithm is O( S / P ). Galil [20] has proposed a variation for which the worst-case run time is O( S ). Horspool [21] proposes a simplification to the Boyer-Moore algorithm whose performance is about the same as that of the Boyer-Moore algorithm. Even though the Boyer-Moore algorithm is a very fast way to find all occurrences of a pattern in a string, using it in our in-place carving application isn t optimal because we must use the algorithm once for each pattern (header/footer) to be searched. So, the time to search for all patterns grows linearly in the number of patterns. Locating headers and footers using the Boyer-Moore algorithm, as is done in Scalpel 1.6, takes O(mn) time where m is the number of file types being searched and n is the size of the target disk. Consequently, the run time for in-place carving grows linearly with both the number of file types and the size of the target disk. Doubling either the number of file types or the disk size will double the expected run time; doubling both will quadruple the 75

76 run time. However, when a multipattern search algorithm is used, the run time is O(n) (both expected and worst case). That is, the time is independent of the number of file types. Whether we are searching for 20 file types or 40, the time to find the locations of all headers and footers is the same! 5.2 Multipattern Boyer-Moore Algorithm Several multipattern extensions to the Boyer-Moore search algorithm have been proposed [5, 7, 30, 66]. All of these multipattern search algorithms extend the basic bad character function employed by the Boyer-Moore algorithm to a bad character function for a set of patterns. This is done by combining the bad character functions for the individual patterns to be searched into a single bad character function for the entire set of patterns. The combined bad character function B for a set of p patterns has B(c) = min{b i (c), 1 i p} for each character c in the alphabet. Here B i is the bad character function for the ith pattern. The Set-wise Boyer-Moore algorithm of [30] performs multipattern matching using this combined bad function. The multipattern search algorithms of [5, 7, 66] employ additional techniques to speed the search further. The average run time of the algorithms of [5, 7, 66] is O( S /minl), where minl is the length of the shortest pattern. Baeza and Gonnet [6] extend multipattern matching to allow for don t cares and complements in patterns. This extension isn t required for our in-place file carving application. 5.3 Multicore Searching Contemporary commodity PCs have either a dualcore or quadcore processor. We may exploit the availability of more than one core to speed the search for headers and footers. This is done by creating as many threads as the number of cores (experiments indicate that there is no performance gain when we use more threads than the number of cores). Each thread searches a portion of the string S. So, if the number of threads is 76

77 read buffer search left half buffer search right half buffer Figure 5-3. Control flow for 2-threaded search t, each thread searches a substring of size S /t plus the length of the longest pattern minus 1. Figure 5-3 shows the control flow when two threads are used to do the search. 5.4 Asynchronous Read Scalpel 1.6 fills its search buffer using synchronous (or blocking) reads of the target disk. In a synchronous read, the CPU is unable to do any computing while the read is in progress. Contemporary PCs, however, permit asynchronous (or non-blocking) reads of disk. When an asynchronous read is done, the CPU is able to perform computations that do not involve the data being read from disk while the disk read is in progress. When asynchronous reads are used, we need two buffers active and inactive. In the steady state, our computer is doing an asynchronous read into the inactive buffer while simultaneously searching the active buffer. When the search of the active buffer completes, we wait for the ongoing asynchronous read to complete, swap the roles of the active and inactive buffers, initiate a new asynchronous read into the current inactive buffer, and proceed to search the current active buffer. This is stated more formally in Figure 5-4. Let T read be the time needed to read the target disk and let T search be the time needed to search for headers and footers (exclusive of the time to read from disk). When synchronous reads are used as in Figures 5-1 and 5-2, the total time for in-place 77

78 Algorithm Asynchronous begin read activebuffer repeat if there is more input asynchronous read inactivebuffer search activebuffer wait for asynchronous read (if any) to complete swap the roles of the 2 buffers until done end Figure 5-4. In-place carving using asynchronous reads carving is approximately T read + T search (note that the time required for task (b) of in-place carving is relatively small). When asynchronous reads are used, all but the first buffer is read concurrently with the search of another buffer. So, the time for each iteration of the repeat-until loop is the larger of the time to read a buffer and that to search the buffer. When the buffer read time is consistently larger than the buffer search time or when the buffer search time is consistently larger than the buffer read time, the total in-place carving time using asynchronous reads is approximately max{t read, T search }. Therefore, using asynchronous reads rather than synchronous reads has the potential to reduce run time by as much as 50%. The search algorithms of Sections 5.1 and 6.2, other than the Aho-Corasick algorithm, employ heuristics whose effectiveness depends on both the rule set and the actual contents of the buffer being searched. As a result, it is entirely possible that when we search one buffer, the read time exceeds the search time while when another buffer is searched, the read time exceeds the search time. So, when these search methods are used, it is possible that the in-place carving time is somewhat more than max{t read, T search }. 5.5 Multicore In-place Carving In Section 5.3 we saw how to use multiple cores to speed the search for headers and footers. Task (a) of in-place carving, however, needs to both read data from disk and search the data that is read. There are several ways in which we can utilize the 78

79 read activebuffer read inactivebuffer search activebuffer swap active & inactive buffer roles Figure 5-5. Control flow for single core read and single core search (SRSS) available cores to perform both these tasks. The first is to use synchronous reads followed by multicore searching as described in Section 5.3. We refer to this strategy as SRMS (synchronous read multicore search). Extension to a larger number of cores is straightforward. The second possibility is to use one thread to read a buffer using a synchronous read and the second to do the search (Figure 5-5). We refer to this strategy as SRSS (single core read and single core search). A third possibility is to use 4 buffers and have each thread run the asynchronous read algorithm of Figure 5-4 as shown in Figures 5-6 and 5-7. In Figure 5-6 the threads are synchronized for every pair of buffers searched while in Figure 5-7, the synchronization is done only when the entire disk has been searched. So, using the strategy of Figure 5-6, each thread processes the same number of buffers (except when the number of buffers of data is odd). When the time to fill a buffer from disk consistently exceeds the time to search that buffer, the strategy of Figure 5-7 also processes the same number of buffers per thread. However, when the buffer fill time is less than the search time and there is sufficient variability in the time to search a buffer, it is possible, using the strategy of Figure 5-7, for one thread to process many more buffers than 79

read activebuffer1, activebuffer2
if there is more input
  asynchronous read inactivebuffer1
search activebuffer1
wait for asynchronous read (if any) to complete
swap the roles of the 2 buffers
if there is more input
  asynchronous read inactivebuffer2
search activebuffer2
wait for asynchronous read (if any) to complete
swap the roles of the 2 buffers

Figure 5-6. Control flow for multicore asynchronous read and search (MARS1)

read activebuffer1, activebuffer2
repeat
  if there is more input
    asynchronous read inactivebuffer1
  search activebuffer1
  wait for asynchronous read (if any) to complete
  swap the roles of the 2 buffers
until done
repeat
  if there is more input
    asynchronous read inactivebuffer2
  search activebuffer2
  wait for asynchronous read (if any) to complete
  swap the roles of the 2 buffers
until done

Figure 5-7. Another control flow for multicore asynchronous read and search (MARS2)

processed by the other thread. In this case, the strategy of Figure 5-7 will outperform that of Figure 5-6. For our application, the time to fill a buffer exceeds the time to search it except when the number of rules is large (more than 30) and the search is done using an algorithm such as Boyer-Moore (as is the case in Scalpel 1.6), which is not designed for multipattern search. Hence, we expect both strategies to have similar performance. We refer to these strategies as MARS1 (multicore asynchronous read and search) and MARS2, respectively.
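For concreteness, the double-buffered read of Figure 5-4 can be realized with POSIX asynchronous I/O. The sketch below is illustrative only and is not the code inside FastScalpel: error handling and the carry-over of bytes at buffer boundaries (needed so headers and footers spanning two buffers are not missed) are omitted, and searchBuffer() stands for whichever multipattern search is in use.

    #include <aio.h>
    #include <unistd.h>
    #include <cstring>

    static const size_t BUF_SIZE = 10 * 1024 * 1024;   // same 10MB unit as Scalpel 1.6

    void searchBuffer(const char* buf, ssize_t len);    // header/footer search (defined elsewhere)

    void carveInPlace(int fd) {
        static char buffer[2][BUF_SIZE];
        off_t offset = 0;

        // Synchronously read the first buffer.
        ssize_t active_len = pread(fd, buffer[0], BUF_SIZE, offset);
        offset += active_len;

        int active = 0;
        while (active_len > 0) {
            // Start an asynchronous read into the inactive buffer.
            struct aiocb cb;
            std::memset(&cb, 0, sizeof(cb));
            cb.aio_fildes = fd;
            cb.aio_buf    = buffer[1 - active];
            cb.aio_nbytes = BUF_SIZE;
            cb.aio_offset = offset;
            aio_read(&cb);

            searchBuffer(buffer[active], active_len);   // overlap the search with the read

            // Wait for the asynchronous read, then swap the buffer roles.
            const struct aiocb* list[1] = { &cb };
            aio_suspend(list, 1, 0);
            ssize_t got = aio_return(&cb);
            offset += (got > 0 ? got : 0);
            active = 1 - active;
            active_len = got;
        }
    }

On Linux this may need to be linked against librt (-lrt). The loop's steady-state cost per buffer is max(read time, search time), which is the behavior assumed in the analysis of Section 5.4.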

Table 5-3. In-place carving time by Scalpel 1.6 for a 16GB flash disk
    Number of Carving Rules
    Total Time    967s    1069s   1532s   1788s   1905s
    Disk Read     833s     833s    833s    833s    833s
    Search        133s     232s    693s    947s   1063s
    Other           1s       4s      6s      8s      9s

5.6 Experimental Results

We evaluated the strategies for in-place carving proposed in this paper using a dual-processor, dual-core AMD Athlon (2.6GHz Core2Duo processor, 2GB RAM). We started with Scalpel 1.6 and shut off its second phase so that it stopped as soon as the metadata database of carved files was created. All our experiments used pattern/rule sets derived from the 48 rules in the configuration file in [44]. From this rule set we generated rule sets of smaller size by selecting the desired number of rules randomly from this set of 48 rules. We used the following search strategies: Boyer-Moore as used in Scalpel 1.6 (BM); SBM-S (set-wise Boyer-Moore-simple), which uses the combined bad character function given in Section 6.2 and the search algorithm employed in [30]; SBM-C (set-wise Boyer-Moore-complex) [7]; WuM [66]; and Aho-Corasick (AC). Our experiments were designed to first measure the impact of each strategy proposed in the paper. These experiments were done using as our target disk a 16GB flash drive. All times reported in this paper are the average from repeating the experiment five times. A final experiment was conducted by coupling several strategies to obtain a new best-performing Scalpel in-place carving program. This program is called FastScalpel. For this final experiment, we used flash drives and hard disks of varying capacity.

5.6.1 Run Time of Scalpel 1.6

Our first experiment analyzed the run time of in-place carving. Table 5-3 shows the overall time to do an in-place carve of our 16GB flash drive as well as the time spent to read the disk and that spent to search the disk for headers and footers. The time spent on other tasks (this is the difference between the total time and the sum of the read

and search times) also is shown. As can be seen, the search time increases with the number of rules. However, the increase in search time isn't quite linear in the number of rules because the effectiveness of the bad character function varies from one rule to the next. For small rule sets (approximately 30 or less), the input time (time to read from disk) exceeds the search time, while for larger rule sets, the search time exceeds the input time. The time spent on activities other than input and search is very small compared to that spent on search and input for all rule sets. So, to reduce overall time, we need to focus on reducing the time spent reading data from the disk and the time spent searching for headers and footers.

5.6.2 Buffer Size

Scalpel 1.6 spends almost all of its time reading the disk and searching for headers and footers (Table 5-3). The time to read the disk is independent of the size of the processing buffer as this time depends on the disk block size used rather than the number of blocks per buffer. The search time too is relatively insensitive to the buffer size as changing the buffer size affects only the number of times the overhead of processing buffer boundaries is incurred. For large buffer sizes (say 100KB and more), this overhead is negligible. Although the time spent on other tasks is relatively small when the buffer size is 10MB (as used in Scalpel 1.6), this time increases as the buffer size is reduced. For example, Scalpel 1.6 refreshes the progress bar following the processing of each buffer load. When the buffer size is reduced from 10MB to 100KB, this refresh is done 100 times as often. The variation in time spent on other activities results in a variation in the run time of Scalpel 1.6 with changing buffer size. Table 5-4 shows the in-place carving time by Scalpel 1.6 with different buffer sizes and 48 carving rules. This variation may be virtually eliminated by altering the code for the other components to (say) refresh the progress bar after every (say) 10MB of data has been processed, thereby eliminating the dependency on buffer size. So, we can get the same performance using a much smaller buffer size.
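One way to decouple the progress display from the buffer size, as suggested above, is to accumulate the number of bytes processed and refresh only once a fixed quantum has been crossed. The snippet below is a hypothetical illustration; refresh_progress_bar() stands for whatever display routine Scalpel actually uses.

    #include <stddef.h>

    #define PROGRESS_QUANTUM (10 * 1024 * 1024)   /* refresh every 10MB processed */

    extern void refresh_progress_bar(void);       /* assumed existing display routine */

    static size_t bytes_since_refresh = 0;

    /* Called once per buffer load, whatever the buffer size happens to be. */
    void account_progress(size_t buffer_len)
    {
        bytes_since_refresh += buffer_len;
        while (bytes_since_refresh >= PROGRESS_QUANTUM) {
            refresh_progress_bar();
            bytes_since_refresh -= PROGRESS_QUANTUM;
        }
    }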

Table 5-4. In-place carving time by Scalpel 1.6 with different buffer sizes and 48 carving rules
    Buffer Size    100KB    1MB      10MB     20MB
    Time           2030s    1895s    1905s    1916s

Table 5-5. Search time for a 16GB flash drive
    Number of Carving Rules
    BM       133s    232s    693s    947s   1063s
    SBM-S     99s    108s    124s    132s    158s
    SBM-C    107s    117s    142s    155s    178s
    WuM      206s    205s    201s    219s    212s
    AC        63s     62s     64s     65s     64s

5.6.3 Multipattern Matching

Table 5-5 shows the time required to search our 16GB flash drive for headers and footers using different search methods. This time does not include the time needed to read from disk to buffer or the time to do other activities (see Table 5-3). Table 5-6 and Figure 5-8 give the speedup achieved by the various multipattern search algorithms relative to the Boyer-Moore search algorithm that is used in Scalpel 1.6. As can be seen, the run time is fairly independent of the number of rules when the Aho-Corasick (AC) multipattern search algorithm is used. Although the theoretical expected run time of the remaining multipattern search algorithms (SBM-S, SBM-C, and WuM) is independent of the number of search patterns, the observed run time shows some increase with the increase in the number of patterns. This is because of the variability in the effectiveness of the heuristics employed by these methods and the fact that our experiment is limited to a single rule set for each rule set size. Employing a large number of rule sets for each rule set size and searching over many different disks should result in an average time that does not increase with rule set size. The Aho-Corasick multipattern search algorithm is the clear winner for all rule set sizes. The speedup in search time when this method is used ranges from a low of 2.1 when we have 6 rules to a high of 17 when we have 48 rules.

Table 5-6. Speedup in search time relative to Boyer-Moore
    Number of Carving Rules
    SBM-S
    SBM-C
    WuM
    AC

Figure 5-8. Multi-pattern search algorithms speedup (speedup vs. number of file rules for AC, SBM-C, SBM-S, and WuM)

5.6.4 Multicore Searching

Table 5-7 gives the time to search our 16GB flash drive (exclusive of the time to read from the drive to the buffer and exclusive of the time spent on other activities) using 24 rules and the dualcore search strategy of Section 5.3. The column labeled unthreaded is the same as that labeled 24 in Table 5-5. Although the search task is easily partitioned into 2 or more threads with little extra work required to ensure that matches that cross partition boundaries are not missed (see the sketch below), the observed speedup from using 2 threads on a dualcore processor is quite a bit less than 2. This is due to the overhead associated with spawning and synchronizing threads.
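A sketch of the partitioning just mentioned is given below. It is illustrative only: the search() interface, which scans a region but reports only matches starting before a given limit, is an assumption introduced here so that a header or footer beginning in the first partition and ending in the second is found exactly once; maxpat denotes the length of the longest header or footer.

    #include <pthread.h>
    #include <stddef.h>

    /* Placeholder matcher: scans buf[0..scan_len) but reports only matches
     * whose starting offset is less than report_len.  A real multipattern
     * matcher (e.g., Aho-Corasick) would go here. */
    static void search(const char *buf, size_t scan_len, size_t report_len)
    {
        (void)buf; (void)scan_len; (void)report_len;
    }

    struct part { const char *start; size_t scan_len; size_t report_len; };

    static void *search_part(void *arg)
    {
        struct part *p = (struct part *)arg;
        search(p->start, p->scan_len, p->report_len);
        return NULL;
    }

    /* Split one buffer of n bytes between two threads.  The first thread
     * scans maxpat - 1 bytes past the midpoint so that a pattern straddling
     * the boundary is still seen, but it reports only matches that start in
     * the first half; the second thread owns everything from the midpoint on. */
    void dualcore_search(const char *buf, size_t n, size_t maxpat)
    {
        size_t mid = n / 2;
        size_t ext = mid + maxpat - 1;
        if (ext > n)
            ext = n;

        struct part p[2] = {
            { buf,       ext,     mid     },
            { buf + mid, n - mid, n - mid }
        };
        pthread_t t[2];
        pthread_create(&t[0], NULL, search_part, &p[0]);
        pthread_create(&t[1], NULL, search_part, &p[1]);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
    }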

The impact of this overhead is very noticeable when the search time for each thread launch is relatively small, as in the case of AC, and less noticeable when this search time is large, as in the case of BM. In the case of AC, we get virtually no speedup in total search time using a dualcore search, while for BM the speedup is 1.82.

Table 5-7. Time to search using dualcore strategy with 24 rules
    Algorithms    Unthreaded    2 threads    Speedup
    BM            693s          380s         1.82
    SBM-S         124s           88s         1.41
    SBM-C         142s           99s         1.43
    WuM           201s          149s         1.35
    AC             64s           58s         1.10

5.6.5 Asynchronous Read

Table 5-8 gives the time taken to do an in-place carving of our 16GB disk using Algorithm Asynchronous (Figure 5-4). The measured time is generally quite close to the expected time of max{T_read, T_search}. A notable exception is the time for BM with 24 rules, where the in-place carving time is substantially more than max{833, 693} = 833 (see Table 5-3). This discrepancy has to do with variation in the effectiveness of the bad character heuristic used in BM from one buffer to the next, as explained at the end of Section 5.4. Although using asynchronous reads we are able to speed up Scalpel 1.6 by a factor of almost 2 when the number of rules is 48, this isn't sufficient to overcome the inherent inefficiency of using the Boyer-Moore search algorithm in this application over using one of the stated multipattern search algorithms.

Table 5-8. In-place carving time using Algorithm Asynchronous
    Number of Carving Rules
    BM       843s    855s    968s    966s   1100s
    SBM-S    838s    837s    839s    888s    847s
    SBM-C    832s    843s    837s    829s    847s
    WuM      840s    841s    840s    843s    842s
    AC       832s    834s    828s    833s    828s

Table 5-9. In-place carving time using SRMS
    Number of Carving Rules
    BM       961s    987s   1217s   1338s   1393s
    SBM-S    942s    944s    953s    958s    944s
    SBM-C    948s    937s    928s    935s    979s
    WuM      978s    977s    975s    987s   1042s
    AC       924s    925s    929s    927s    973s

Table 5-10. In-place carving time using SRSS
    Number of Carving Rules
    BM                               932s   1006s
    SBM-S    849s    850s    849s    844s    881s
    SBM-C    852s    847s    844s    854s    845s
    WuM      843s    837s    870s    843s    833s
    AC       850s    852s    852s    852s    849s

Table 5-11. In-place carving time using MARS2
    Number of Carving Rules
    BM       909s    912s    943s    938s   1011s
    SBM-S    907s    907s    908s    908s    909s
    SBM-C    904s    906s    905s    907s    917s
    WuM      906s    906s    907s    908s    908s
    AC       904s    903s    902s    904s    904s

5.6.6 Multicore In-place Carving

Tables 5-9 through 5-11, respectively, give the time taken by the multicore carving strategies SRMS, SRSS, and MARS2 of Section 5.5. When the Boyer-Moore search algorithm is used, a multicore strategy results in some improvement over Algorithm Asynchronous only when we have a large number of rules (in our experiments, 24 or more rules) because, when the number of rules is small, the search time is dominated by the read time and by the overhead of spawning and synchronizing threads. When a multipattern search algorithm is used, no performance improvement results from the use of multiple cores. Although we experimented only with a dualcore, this conclusion applies to a large number of cores, GPUs, and other accelerators, as the bottleneck is the read time from disk and not the time spent searching for headers and footers.

5.6.7 Scalpel 1.6 vs. FastScalpel

Based on our preliminary experiments, we modified the first phase of Scalpel 1.6 in the following way:

1. Replace the synchronous buffer reads of Scalpel 1.6 by asynchronous reads.

2. Replace the Boyer-Moore search algorithm used in Scalpel 1.6 by the Aho-Corasick multipattern search algorithm.

We refer to this modified version as FastScalpel. Although FastScalpel uses the same buffer size (10MB) as used by Scalpel 1.6, we can reduce the buffer size to tens of KBs without impacting performance provided we modify the code for the other components of Scalpel 1.6 as described in Section 5.6.2. The performance of FastScalpel relative to Scalpel 1.6 was measured using a variety of target disks. Table 5-12 gives the measured in-place carving time as well as the speedup achieved by FastScalpel relative to Scalpel 1.6. Figure 5-9 plots the measured speedup. The 16GB disk used in these experiments is a flash disk while the 32GB and 75GB disks are hard drives. While the speedup increases as we increase the size of the rule set, the speedup is relatively independent of the disk size and type. The speedup ranged from about 1.1 when the rule set size is 6 to about 2.4 when the rule set size is 48. For larger rule sets, we expect even greater speedup. Since the total time taken by FastScalpel is approximately equal to the time to read the disk being carved, further speedup is possible only by reducing the time to read the disk. This would require a higher bandwidth between the disk and buffer.

Table 5-12. In-place carving time and speedup using FastScalpel and Scalpel 1.6
    Number of Carving Rules
    Scalpel 1.6 (16GB)     967s    1069s   1532s   1788s   1905s
    FastScalpel (16GB)     832s     834s    828s    833s    828s
    Speedup (16GB)
    Scalpel 1.6 (32GB)    1581s    1737s   2573s   3263s   3386s
    FastScalpel (32GB)    1443s    1460s   1448s   1447s   1438s
    Speedup (32GB)
    Scalpel 1.6 (75GB)    3766s    4150s   6348s   7801s   8307s
    FastScalpel (75GB)    3376s    3393s   3386s   3375s   3396s
    Speedup (75GB)

Figure 5-9. Speedup of FastScalpel relative to Scalpel 1.6 (speedup vs. number of file rules for the 16GB flash disk, 32GB hard disk, and 75GB hard disk)

CHAPTER 6
MULTI-PATTERN MATCHING ON MULTICORES AND GPUS

Our focus in this chapter is accelerating the Aho-Corasick and Boyer-Moore multipattern string matching algorithms through the use of a GPU. A GPU operates in traditional master-slave fashion (see [43], for example) in which the GPU is a slave that is attached to a master or host processor under whose direction it operates. Algorithm development for master-slave systems is affected by the location of the input data and where the results are to be left. Generally, four cases arise [63-65], as below.

1. Slave-to-slave. In this case the inputs and outputs for the algorithm are in the slave memory. This case arises, for example, when an earlier computation produced results that were left in slave memory and these results are the inputs to the algorithm being developed; further, the results from the algorithm being developed are to be used for subsequent computation by the slave.

2. Host-to-host. Here the inputs to the algorithm are on the host and the results are to be left on the host. So, the algorithm needs to account for the time it takes to move the inputs to the slave and that to bring the results back to the host.

3. Host-to-slave. The inputs are in the host but the results are to be left in the slave.

4. Slave-to-host. The inputs are in the slave and the results are to be left in the host.

In this chapter, we address the first two cases only. In our context, we refer to the first case (slave-to-slave) as GPU-to-GPU.

6.1 The NVIDIA Tesla Architecture

Figure 6-1 gives the architecture of the NVIDIA GT200 Tesla GPU, which is an example of NVIDIA's general purpose parallel computing architecture CUDA (Compute Unified Device Architecture) [13]. This GPU comprises 240 scalar processors (SP) or cores that are organized into 30 streaming multiprocessors (SM), each comprised of 8 SPs. Each SM has 16KB of on-chip shared memory, 16,384 32-bit registers, and constant and texture cache. Each SM supports up to 1024 active threads. There also is 4GB of global or device memory that is accessible to all 240 SPs. The Tesla, like other

GPUs, operates as a slave processor to an attached host. In our experimental setup, the host is a 2.8GHz Xeon quad-core processor with 16GB of memory.

Figure 6-1. NVIDIA GT200 Architecture [56]

A CUDA program typically is a C program written for the host. C extensions supported by the CUDA programming environment allow the host to send and receive data to/from the GPU's device memory as well as to invoke C functions (called kernels) that run on the GPU cores. The GPU programming model is Single Instruction Multiple Thread (SIMT). When a kernel is invoked, the user must specify the number of threads to be invoked. This is done by specifying explicitly the number of thread blocks and the number of threads per block. CUDA further organizes the threads of a block into warps of 32 threads each; each block of threads is assigned to a single SM, and thread warps are executed synchronously on SMs. While thread divergence within a warp is permitted, when the threads of a warp diverge, the divergent paths are executed serially until they converge. A CUDA kernel may access different types of memory, each having different capacity, latency, and caching properties. We summarize the memory hierarchy below.
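Before turning to the memory hierarchy, the mechanics just described (moving data between host and device memory and launching a kernel over a grid of thread blocks) are illustrated by the minimal host-to-host sketch below. The kernel is deliberately trivial and only marks bytes equal to an assumed one-byte header value; it is not the matching kernel developed in this chapter, and the 256-threads-per-block choice is merely illustrative.

    #include <cuda_runtime.h>

    /* Illustrative kernel: each thread inspects one input byte and records
     * whether it equals an assumed one-byte header value. */
    __global__ void mark_candidates(const char *in, char *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
        if (i < n)
            out[i] = (in[i] == (char)0xFF);
    }

    /* Host-to-host case: the input starts on the host and the results must be
     * returned to the host, so both transfer times are part of the cost. */
    void host_to_host(const char *host_in, char *host_out, int n)
    {
        char *dev_in, *dev_out;
        cudaMalloc((void **)&dev_in, n);
        cudaMalloc((void **)&dev_out, n);

        cudaMemcpy(dev_in, host_in, n, cudaMemcpyHostToDevice);

        int threadsPerBlock = 256;                       /* illustrative choice */
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        mark_candidates<<<blocks, threadsPerBlock>>>(dev_in, dev_out, n);

        cudaMemcpy(host_out, dev_out, n, cudaMemcpyDeviceToHost);

        cudaFree(dev_in);
        cudaFree(dev_out);
    }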
