IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 3, MARCH 2012

Aho-Corasick String Matching on Shared and Distributed-Memory Parallel Architectures

Antonino Tumeo, Member, IEEE, Oreste Villa, Member, IEEE, and Daniel G. Chavarría-Miranda

Abstract: String matching requires a combination of (sometimes all) the following characteristics: high and/or predictable performance, support for large data sets, and flexibility of integration and customization. This paper compares several software-based implementations of the Aho-Corasick algorithm for high-performance systems. We focus on the matching of unknown inputs streamed from a single source, typical of security applications and difficult to manage, since the input cannot be preprocessed to obtain locality. We consider shared-memory architectures (Niagara 2, x86 multiprocessors, and Cray XMT) and distributed-memory architectures with homogeneous (InfiniBand cluster of x86 multicores) or heterogeneous processing elements (InfiniBand cluster of x86 multicores with NVIDIA Tesla C1060 GPUs). We describe how each solution achieves the objectives of supporting large dictionaries, sustaining high performance, and enabling customization and flexibility using various data sets.

Index Terms: Aho-Corasick, string matching, GPGPU, Cray XMT, multithreaded architectures, high-performance computing.

The authors are with the High Performance Computing Group, Pacific Northwest National Laboratory (PNNL), Richland, WA. E-mail: {antonino.tumeo, oreste.villa, daniel.chavarria}@pnl.gov. Manuscript received 19 July 2010; revised 21 Feb. 2011; accepted 17 May 2011; published online 16 June 2011. Recommended for acceptance by S. Aluru.

1 INTRODUCTION

String matching algorithms check and detect the presence of one or more known symbol sequences inside a data set. Besides their well-known application to databases and text processing, they are the basis of several critical real-world applications. String matching algorithms are key components of DNA and protein sequence analysis, data mining, security systems, such as Intrusion Detection Systems (IDS) for Networks (NIDS), Applications (APIDS), Protocols (PIDS), or Systems (Host-based IDS, HIDS), antivirus software, and machine learning problems [1], [2]. All these applications process large quantities of textual data and require extremely high performance to produce meaningful results in an acceptable time. Among all string matching algorithms, one of the most studied, especially for text processing and security applications, is the Aho-Corasick (AC) algorithm [3], due to its exact, multipattern approach and its ability to perform the search in time linearly proportional to the length of the input stream. A large amount of research has been done to design efficient implementations of string matching algorithms using Field Programmable Gate Arrays (FPGAs) [4], [5], highly multithreaded solutions like the Cray XMT [6], multicore processors [7], or heterogeneous processors like the Cell Broadband Engine [8], [9]. Recently, Graphic Processing Units (GPUs) have been demonstrated to be a suitable platform for some classes of string matching algorithms for NIDS such as SNORT [10], [11], [12]. Most previous approaches mainly focused on speed. Only in the last few years have aspects such as performance stability, independent of the size of the input and of the number of patterns to search, started to gain interest. In fact, string matching applications require not only high performance, but also the ability to deal with very large dictionaries. For example, NIDS must be able to scan inputs from modern Ethernet links at 10 Gbps, while considering a number of malicious threats already well over 1 million and growing exponentially [13]. They should do so in real time, without hindering the overall performance of the system, reducing the available bandwidth, or raising latencies. In general, hardware solutions support only small dictionary sizes due to lack of memory and are difficult to customize, while platforms such as the Cell/BE are extremely complex to program. With the emergence of multicore and multithreaded architectures and of general-purpose computation on GPUs, software-based solutions have started to become a feasible platform for high-throughput string matching applications. String matching is a good candidate for execution on these parallel architectures, since the search can be parallelized by dividing the input data set into smaller subsets, each one processed by a single thread or core. However, obtaining the maximum performance on all of them still requires a significant effort to efficiently map the algorithm to their features, and it may not even be sufficient to reach the desired throughput. Furthermore, performance variability when dealing with different sizes of inputs and dictionaries has always been a significant limit for software-based solutions. This is particularly true on cache-based architectures: if the matching patterns are resident in the cache, the matching algorithm performs very well; however, if they are not in the cache and have to be retrieved from main memory, the matching algorithm performs poorly. In many applications, when the input is not known and the data cannot be adequately preprocessed to guarantee some locality, such as data streamed from a single input source like a network, the algorithm accesses data in unpredictable locations of the main memory, leading to highly variable performance.

In this paper, we present and compare several software-based implementations of the AC algorithm for high-performance systems. We focus our attention on streaming applications with unknown inputs, a situation typical of security systems and often problematic for the search process. We look carefully at how each solution achieves the objectives of supporting large dictionaries (up to 190,000 patterns), obtaining high performance, enabling flexibility and customization, and limiting performance variability. We present optimized implementations of the algorithm on a range of high-performance architectures, with shared or distributed memory, and with homogeneous or heterogeneous processing elements. For the shared-memory solutions, we consider a Cray XMT with up to 128 processors (128 threads per processor), a dual-socket Niagara 2 (eight cores per processor, eight threads per core), and a dual-socket Intel Xeon 5560 (Nehalem architecture, four cores per processor, two threads per core). For the distributed-memory systems, we evaluate a homogeneous cluster of Xeon 5560 processors (10 nodes, two processors per node) interconnected through InfiniBand QDR, and a heterogeneous cluster where the Xeon 5560 processors are accelerated with NVIDIA Tesla C1060 GPUs (10 nodes, two GPUs per node). This paper extends the work in [6] by introducing and comparing new platforms with various architectural features. To the best of our knowledge, no previous work on software-based string matching algorithms included such a broad evaluation in terms of high-performance systems, input sets, and algorithm optimizations. The paper is organized as follows. Section 2 presents the algorithmic design on the various machines. Section 3 discusses our experimental results and the comparison among all the machines. Section 4 presents our conclusions. A comprehensive survey of the related work on string matching, and background material on the AC algorithm and on the systems evaluated in this paper, are included in the supplemental material, which can be found on the Computer Society Digital Library.

2 ALGORITHM DESIGN AND OPTIMIZATION

In this section, we introduce the overall algorithmic design and then focus on the specific modifications and optimizations for the various platforms. Starting from the same basis, we designed shared-memory implementations using pthreads for the Niagara 2 and the dual Xeon (referred to as x86 SMP, Shared-memory Multiprocessor, in the rest of the paper), and using the proprietary programming model of the Cray XMT for that machine. For the GPU kernel, we used NVIDIA CUDA. For the heterogeneous and homogeneous clusters, instead, we designed an MPI load balancer, which can be used alone or integrated with pthreads or CUDA. For all the implementations, our algorithm design is based on the following cornerstones:

- minimize the number of memory references;
- reduce memory contention.

We represent the patterns in a Deterministic Finite-state Automaton (DFA). The supplemental material discusses how the automaton is generated; a sketch of the standard construction follows this paragraph. For each possible input symbol, there is always a valid transition to another node in the automaton. This key feature guarantees that, for each input symbol, there is always the same amount of work to perform.
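As an illustration of this property (our sketch of the standard Aho-Corasick construction, not the authors' code; array sizes and names are assumptions), the following builds a fully specified automaton by turning missing trie edges into copies of the failure state's edges during a breadth-first traversal:

```c
#include <string.h>

#define ALPHA     256   /* ASCII alphabet */
#define MAX_NODES 1024  /* assumption: large enough for the demo patterns */

static int trie[MAX_NODES][ALPHA];  /* -1 marks a missing edge during build */
static int fail_link[MAX_NODES];
static int is_final[MAX_NODES];
static int n_nodes;

static void ac_init(void) {
    n_nodes = 1;                              /* node 0 is the root */
    memset(is_final, 0, sizeof is_final);
    memset(trie[0], -1, sizeof trie[0]);
}

static void ac_add_pattern(const unsigned char *p) {
    int s = 0;
    for (; *p; p++) {
        if (trie[s][*p] == -1) {              /* allocate a new trie node */
            memset(trie[n_nodes], -1, sizeof trie[n_nodes]);
            trie[s][*p] = n_nodes++;
        }
        s = trie[s][*p];
    }
    is_final[s] = 1;                          /* a pattern ends here */
}

/* After this BFS, every (state, symbol) pair has a valid transition, so
   every input symbol causes exactly the same amount of work. */
static void ac_build_dfa(void) {
    int queue[MAX_NODES], head = 0, tail = 0;
    for (int c = 0; c < ALPHA; c++) {
        if (trie[0][c] == -1) trie[0][c] = 0; /* unused symbols go to root */
        else { fail_link[trie[0][c]] = 0; queue[tail++] = trie[0][c]; }
    }
    while (head < tail) {
        int s = queue[head++];
        is_final[s] |= is_final[fail_link[s]];    /* inherit suffix matches */
        for (int c = 0; c < ALPHA; c++) {
            int t = trie[s][c];
            if (t == -1) trie[s][c] = trie[fail_link[s]][c];
            else { fail_link[t] = trie[fail_link[s]][c]; queue[tail++] = t; }
        }
    }
}
```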
For a given dictionary, the data structures in main memory (DFA and input symbols) are read-only. For all the implementations, the common parallelization strategy is to use multiple threads or processes that concurrently execute the algorithm. Each thread or process has a current_node and operates on a distinct section of the input. The threads of the shared-memory implementations and the threads running on a specific GPU access the same DFA, while different MPI processes and different GPUs access their own copy of the DFA. At runtime, the input stream is buffered and split into chunks, which are then assigned to each processing element. The size of the chunks depends on the granularity of the parallelization. Eventually, they are chunked again in a hierarchical way. This is, for example, what happens in the MPI with CUDA implementation, where each GPU receives a buffer and then partitions it among its own threads, or in the MPI with pthreads implementation, where each MPI process running on a node gets its own buffer that is then partitioned among the threads executed by the different cores. The chunks overlap partially to allow matching of those patterns that cross a boundary. The overlap is equal to the length of the longest pattern in the dictionary minus 1 symbol. The inefficiency of the overlapping (replicated work) is measured as (longest_pattern - 1)/chunk_size. As already discussed in [6], we represent the DFA graph as a State Transition Table (STT). The STT is a table composed of as many rows as there are nodes in the DFA and as many columns as there are symbols in the alphabet. Each STT line represents a node of the original DFA. Each entry (cell) of an STT line (indexed by a current_node and a symbol) stores the address of the beginning of the STT line that represents the next_node for that transition in the DFA. For the Cray XMT and pthreads pattern matchers, the STT lines are 256-byte aligned, such that the least-significant byte of the address is equal to zero. This property allows us to store, in the least-significant bit of each STT cell, the Boolean information indicating whether that transition is final. Since we want to retrieve the next_node address by dereferencing the current_node + symbol pointer, STT lines must always have the same size. Alphabet symbols not used in the dictionary have to be explicitly represented as transitions to the root node. The supplemental material discusses in deeper detail this basic design, which is at the base of the Cray XMT and pthreads implementations; a minimal sketch of the resulting inner loop follows.
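With this layout, the inner loop reduces to one load per input symbol: the cell read at current[symbol] yields both the address of the next STT line and, in its least-significant bit, the final-transition flag. This is an illustrative sketch with names of our choosing, not the authors' code:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t stt_cell;  /* address of the next STT line | final bit */

/* Scan one input chunk against the STT rooted at 'stt_root'. */
static size_t match_chunk(const stt_cell *stt_root,
                          const unsigned char *input, size_t len)
{
    const stt_cell *current = stt_root;  /* current_node */
    size_t matches = 0;
    for (size_t i = 0; i < len; i++) {
        stt_cell cell = current[input[i]];   /* one memory reference */
        matches += (size_t)(cell & 1u);      /* LSB: final transition */
        current = (const stt_cell *)(uintptr_t)(cell & ~(stt_cell)1);
    }
    return matches;
}
```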

2.1 Cray XMT Implementation

The approach for exploiting the highly multithreaded architecture of the Cray XMT focuses mostly on reducing latency variability. If the latency is constant or slowly varying, the system is able to schedule a sufficient number of threads to hide it. On the Cray XMT, the main cause of variability in the memory access time is the presence of hotspots, that is, memory regions frequently accessed by multiple threads simultaneously. The XMT employs a hardware hashing mechanism which spreads data across all the system's memory banks with a granularity of 64 bytes (one block) [14] (see the supplemental material). However, if different blocks corresponding to different memory banks have different access ratios, the pressure on the memory banks is not equally balanced, producing variability in the access time. In our implementation, there are two reasons why this can happen.

- Each STT cell is 8 bytes wide (the address of the STT line plus the Boolean flag for final transitions). Therefore, each STT line (representing a DFA node) has a size of 8 bytes multiplied by the alphabet size. If we consider the 256-symbol ASCII alphabet, each STT line requires 2,048 bytes, or 32 blocks (64 bytes per block). If the scanned input has a particular symbol frequency (e.g., English text, decimal numbers, etc.), the input symbols will only be a subset of the ASCII alphabet, and concurrent threads will access only a subset of the 32 blocks per STT line, producing hotspots.
- Typically, a few states in the first levels of a Breadth-First Search (BFS) of the DFA graph are responsible for the majority of accesses. As a result, the memory blocks containing those states form hotspots. For inputs that are similar to the dictionary, instead, the transitions tend to be distributed more equally across the levels, leading to reduced or absent hotspots.

The supplemental material includes the analysis of the access patterns for a dictionary of 20,000 English words when scanned against different inputs. To alleviate the above problems, as shown in [6], we propose the following solutions:

- Alphabet shuffling. The alphabet symbols in an STT line can be shuffled using a relatively simple linear transformation, ensuring that contiguous symbols in the alphabet are spread out over multiple memory blocks. The shuffling function can be inexpensively and effectively computed as symbol' = (symbol + fixed_offset) × 8. This transformation, in conjunction with the hardware hashing mechanism of the XMT's memory, guarantees that accesses are spread out over distinct memory blocks for those inputs (e.g., English text, decimal numbers, etc.) that have characters located relatively close to each other in the ASCII ordering. The hardware memory hashing allows us to design a shuffling function that does not depend on the state number (line in the STT). In contrast, systems where the memory hierarchy is not hashed, like the Niagara 2 or the x86 SMP, require more complex shuffling functions; a code sketch follows at the end of this subsection.
- State replication. We replicate the STT states corresponding to the first levels of the BFS exploration of the DFA. The addresses of the different replicas of the same logical state (STT line) are randomly stored, during the creation of the STT, in the STT cells pointing to that state. This ensures that the memory pressure is equally balanced when different threads access the blocks for that state. This mechanism is greatly simplified by the underlying hardware hashing, since different replicas are spread across different memory banks.

On cache-based architectures (x86 SMP and Niagara 2), these optimizations would not bring significant benefits, but would rather reduce the spatial locality of the cache.
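A sketch of how these two optimizations could look in code. The rotation below is a stand-in permutation of ours that spreads neighboring symbols eight cells (64 bytes, one block) apart; the paper's exact shuffling function and replication bookkeeping may differ, and pick_replica is likewise an assumption. The essential point is that the same shuffle is applied when the STT columns are filled and when they are indexed during the scan.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in shuffle: an 8-bit rotate-left by 3 is a permutation that places
   contiguous symbols 8 cells (= 64 bytes, one block) apart in the line. */
static inline unsigned shuffle(unsigned symbol) {
    return ((symbol << 3) | (symbol >> 5)) & 0xFF;
}

/* Scan-time lookup: identical to the basic loop, but with a shuffled
   column index. The STT must have been built with the same function. */
static inline uint64_t stt_lookup(const uint64_t *line, unsigned char sym) {
    return line[shuffle(sym)];
}

/* State replication: at STT-build time, every cell that points to a hot
   state receives the address of a randomly chosen replica of that state,
   spreading the threads' accesses across memory banks. */
static uint64_t pick_replica(const uint64_t *replica_addr, int n_replicas) {
    return replica_addr[rand() % n_replicas];
}
```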
2.2 GPU Implementation

The GPU implementation starts from the same principles as the basic design, but requires some specific adaptations. Each CUDA thread independently performs the matching on a chunk of the input text. This allows using reasonably sized chunks, while maintaining a high utilization of each thread. The main differences reside in the STT. The rows still represent the states, and the columns still represent the symbols. However, states are addressed by indices rather than by pointers. Each cell of the table, which in our GPU code is 32 bits wide, thus contains the index of the next state in the first 31 bits and the flag that tags final states in the last bit. This removes the requirement for the 256-byte alignment of the lines, reducing the memory usage, which is critical for GPUs that can address only up to 4 GB. Nevertheless, the organization of the memory controllers in the CUDA architecture still requires further optimization of the STT layout. As in the Cray XMT implementation, memory hotspots may occur. The reason is that each memory controller manages a 256-byte wide memory partition and, depending on the number of memory controllers and partitions, on the alphabet, on the input streams, and on the number of matches, the accesses may concentrate on a single partition, saturating its memory controller. This problem is known in the GPGPU community as partition camping [15]. On the Tesla C1060 there are eight partitions; thus, with rows of 256 cells (ASCII alphabet) of 32 bits each (1,024 bytes in total), a partition holds the cells for the same 64 symbols on odd or even rows. So, with inputs that have particular symbol frequencies, the accesses may concentrate on only a few partitions. To solve this problem, we added padding to each line of the STT corresponding to the size of a partition, generating a more even access pattern. The unpredictable nature of the loads on the STT makes it impossible to coalesce the accesses of half-warps in a single memory transaction: with high probability, they are neither sequential nor aligned. Thus, we decided to bind the STT to the texture memory. Binding data to texture memory makes the GPU fetch the data through the texture units, which do not require coalescing (but still suffer from partition camping), and exploit the texture caches, which are optimized for 2D locality. The texture cache has an active set of around 8 KB and, when there is a hit, the latency for loading the data is only a few cycles (the exact figure is not disclosed by NVIDIA, but empirical evaluation puts it around 6-7 cycles). The benefits of using the texture cache derive mainly from the hits on the first levels of the STT. There is a limit on the size of textures in CUDA: when bound to linear memory, they can address up to 2^27 elements. So, when big STTs are used, we bind them to texture only partially. This allows us to retain the benefits of caching for the majority of light matching cases, without significantly influencing the performance in heavy matching cases, where there are many cache misses.
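A hedged sketch of the resulting device loop (kernel and variable names are ours; chunk overlap and the per-line padding described above are omitted for brevity). Note that the input is still read in its untransposed layout here, which the input transposition discussed below addresses:

```cuda
texture<unsigned int, 1, cudaReadModeElementType> stt_tex; /* STT bound here */

__global__ void match_kernel(const unsigned char *input, int chunk_len,
                             long total_len, unsigned int *match_count)
{
    long tid   = blockIdx.x * blockDim.x + threadIdx.x;
    long start = tid * chunk_len;
    if (start >= total_len) return;

    unsigned int state = 0, matches = 0;
    for (int i = 0; i < chunk_len && start + i < total_len; i++) {
        unsigned char sym = input[start + i];        /* uncoalesced here */
        /* one texture fetch per symbol; row pitch is 256 cells here */
        unsigned int cell = tex1Dfetch(stt_tex, (int)(state * 256u + sym));
        matches += cell & 1u;   /* last bit tags final transitions */
        state    = cell >> 1;   /* first 31 bits: next-state index  */
    }
    match_count[tid] = matches;
}
```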

Using shared memory for caching STT lines is not effective for large dictionaries: with the maximum block size of 512 threads, only eight lines (each line has a size of 4 bytes × 256 symbols = 1,024 bytes) can be stored (replicated for each block) in the 16 KB of shared memory. However, for the smallest dictionary in our benchmark set (20k-pattern), the first two levels alone already have 54 lines (0.11 percent of 49,849). Another important optimization is performed on the input text. In applications streaming data from a single source, the input text is sequentially buffered in the host memory. Thus, if the input text is directly moved to the global memory of the graphics card, each CUDA thread would start loading it with a stride corresponding to the chunk size. Consequently, the memory accesses of a half-warp would be uncoalesced. This can be thought of as having the input text organized in a matrix, where each chunk corresponds to a row. Fig. 1a shows the situation. To solve this problem, we apply a transposition after copying the input text to the global memory. However, the input text is transposed in groups of four symbols. This is due to the optimal size for loads on the Tesla C1060 GPU (T10 architecture). Since we use the ASCII alphabet, each symbol is an 8-bit character. The minimum granularity for load transactions of a half-warp (a group of 16 threads simultaneously performing memory operations) is 32 bytes (16 bits per thread), while the optimal size is 64 bytes (32 bits per thread). Thus, the input chunks are transposed in blocks of four symbols, which are then read with a single, coalesced load in the main loop of the pattern matching algorithm. The final result of the transposition, with chunks now on columns, is shown in Fig. 1b. When the number of chunks is not an integer multiple of the number of threads in a half-warp (16), we also add padding so that each row starts aligned in memory, to respect the coalescing rules. The transposition is performed directly on the GPU with a fast and optimized kernel, and it is transparent to the rest of the system; a naive version is sketched below.

Fig. 1. Optimizations on the input text for coalesced reads. The input text is transposed so that each thread reads four consecutive input symbols from its respective chunk with a single load.

The matching results are collected per chunk. A reduction is applied to gather all the results and send them back to the host through a single memory copy operation. Again, the reduction is performed on the GPU with an optimized kernel.
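The following is a naive sketch of such a transposition kernel (ours, not the paper's optimized version, which would use shared-memory tiles in the style of [15]); it assumes the number of chunks has already been padded to a multiple of 16:

```cuda
/* in:  chunk-major layout; chunk c occupies groups [c*groups, (c+1)*groups) */
/* out: transposed layout; group g of chunk c lands at out[g*n_chunks + c], */
/*      so 16 threads with consecutive c read 64 consecutive bytes at once. */
__global__ void transpose_by_4(const uchar4 *in, uchar4 *out,
                               int n_chunks, int groups_per_chunk)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;  /* chunk index    */
    int g = blockIdx.y * blockDim.y + threadIdx.y;  /* 4-symbol group */
    if (c < n_chunks && g < groups_per_chunk)
        out[g * n_chunks + c] = in[c * groups_per_chunk + g];
}
```

In the matching loop, thread c then fetches its next four symbols with the single coalesced load out[g * n_chunks + c], incrementing g once every four symbols consumed.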
2.3 Distributed-Memory Implementation

For distributed-memory architectures, we wrapped our string matching engine in an MPI load balancer. We developed a master/slave scheduler, where the master MPI process distributes the work and the slaves perform the actual computation. From the same scheduler, we derived three slightly different implementations for the various configurations of the clustered architectures:

- an MPI-only solution for the homogeneous x86 cluster, where each slave process is mapped to a core and wraps a single-threaded version of our string matching engine;
- an MPI with pthreads solution, where each slave process is mapped to an entire node of the cluster and wraps a multithreaded version of our string matching engine;
- an MPI with CUDA solution, where each slave process is mapped to a GPU and wraps a CUDA kernel.

The MPI load balancer uses a multibuffering scheme with a configurable number of buffers and a configurable buffer size; a minimal version is sketched after this paragraph. There are a few obvious reasons for this choice. As explained before, the performance of cache-based architectures in the matching process strictly depends on the number of matches in the input text. If there are few matches, only a few states of the STT are accessed and the procedure is fast. If there are many matches, instead, many states in different memory locations are accessed, generating many cache misses and lowering the speed. This is also true for our GPU implementation, which uses the cached texture memory for the STT. Consequently, statically dividing the input text into chunks among the MPI slave processes would generate an unbalanced execution, especially if the chunks exhibit substantially different matching behavior. The slowest chunk to process would, in fact, determine the overall performance. A more appropriate and balanced solution is to set a relatively small fixed buffer size for the slave processes and allow the master to send new data to a slave as soon as the previous data have been consumed. Furthermore, if the computational kernels are well optimized, communication bandwidth among the MPI processes may become the main bottleneck. Thus, the only possible approach to maximize performance is to overlap communication with computation. This is accomplished by using nonblocking communication together with multiple buffers.
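A minimal sketch of the master side of this scheme (the buffer count and the fill_next_block/next_idle_slave helpers are our stand-ins, not the paper's code): several nonblocking sends are kept in flight, and each buffer is refilled as soon as its send completes, overlapping communication with the slaves' computation.

```c
#include <mpi.h>

#define NBUF  4                   /* assumption: number of in-flight buffers */
#define BUFSZ (3 * 1024 * 1024)   /* 3 MB blocks, as in the experiments */

extern int fill_next_block(char *buf, int max_len);  /* returns 0 at end */
extern int next_idle_slave(void); /* rank of a slave ready for more data */

void master(void)
{
    static char buf[NBUF][BUFSZ];
    MPI_Request req[NBUF];
    int b, n;

    for (b = 0; b < NBUF; b++) {  /* prime all buffers */
        req[b] = MPI_REQUEST_NULL;
        n = fill_next_block(buf[b], BUFSZ);
        if (n > 0)
            MPI_Isend(buf[b], n, MPI_CHAR, next_idle_slave(), 0,
                      MPI_COMM_WORLD, &req[b]);
    }
    for (;;) {                    /* refill a buffer as soon as it is free */
        MPI_Waitany(NBUF, req, &b, MPI_STATUS_IGNORE);
        if (b == MPI_UNDEFINED) break;  /* nothing left in flight */
        n = fill_next_block(buf[b], BUFSZ);
        if (n == 0) break;              /* input stream exhausted */
        MPI_Isend(buf[b], n, MPI_CHAR, next_idle_slave(), 0,
                  MPI_COMM_WORLD, &req[b]);
    }
    MPI_Waitall(NBUF, req, MPI_STATUSES_IGNORE); /* drain remaining sends */
}
```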

3 EXPERIMENTAL RESULTS

We have implemented various versions of the Aho-Corasick algorithm, as described in Section 2, for the Cray XMT, the x86 SMP, the Niagara 2, the GPU cluster, and the x86 cluster. Our implementations have two main phases: building the state transition table and executing string matching against the built STT. The STT building phase is performed offline, and the resulting STT is stored in a file representation. We focus our experiments on the string matching phase, since it is the critical portion of the algorithm in real-world applications.

Our experiments utilize four different dictionaries:

- Dictionary 1. A 190,000-pattern data set with mostly text entries and an average length of 16 bytes.
- Dictionary 2. A 190,000-pattern data set with mixed text and binary entries and an average length of 16 bytes.
- English. A 20,000-pattern data set with the most common words from the English language and an average length of 8.5 bytes.
- Random. A 50,000-pattern data set with entries generated at random from the ASCII alphabet with a uniform distribution and an average length of 8 bytes.

Dictionaries with more text-like entries have higher frequencies of alphabetical ASCII symbols. We also use four different input streams for each dictionary:

- Text, which corresponds to the English text of the King James Bible.
- TCP, which corresponds to captured TCP/IP traffic.
- Random, which corresponds to a random sample of characters from the ASCII alphabet.
- Itself, which corresponds to feeding the dictionary itself as the input stream for string matching.

Using the dictionary itself as an input exhibits the heaviest matching behavior, thus significantly influencing the performance of the algorithm. We report all the results in Gigabits per second (Gbps). The machines used for the evaluation of the algorithm are configured as follows:

- Niagara 2: two Niagara 2 processors (1.165 GHz), 32 GB of memory.
- XMT: 128 nodes, with a total of 1 TB of memory, Seastar2 interconnect.
- x86 SMP: two Xeon 5560 processors (2.8 GHz), 24 GB of memory.
- x86 cluster: 10 nodes, each one configured as the x86 SMP, interconnected with InfiniBand QDR (24 Gbps).
- GPU cluster: 10 nodes with two Tesla C1060 boards (4 GB of memory each).

We consider a situation in which the input text is streamed from a single source and buffered in memory before being processed. This is analogous to the approach adopted in NIDS for real-time analysis, where some buffers are processed while, at the same time, others are filled with new data. Since we want to compare sustained performance, we size the buffers to minimize data starvation. For this reason, we use an input buffer of 100 MB. The input is kept in shared memory for the SMP architectures and on the master node for the MPI solutions. The input is then chunked for parallel processing. For pthreads, the input is equally divided among the threads. For the XMT and the GPU implementations, instead, each thread processes a fixed-size chunk, and we generate a sufficient number of threads to reach maximum utilization. For these platforms, we empirically evaluated the chunk size and found that 2 KB is a reasonable tradeoff between the number of threads generated and the utilization of each thread. With this chunk size, as discussed in Section 2, the inefficiency due to chunk overlapping is limited to 0.7 percent in the worst case (16-byte patterns: 15/2,048 ≈ 0.73 percent).

Fig. 2 shows the results of the optimized implementation for the Cray XMT, exploiting both alphabet shuffling and state replication. The performance is very close to linear scaling, with minimal differences among the various input sets for all the dictionaries. For comparison, the performance of the basic implementation on the Cray XMT (shown in the supplemental material) remains far from linear scaling.

Fig. 2. Scaling of the optimized implementation on the Cray XMT. The optimized code scales almost linearly.
With more than 48 processors, it even presents slowdowns, in particular with the text and TCP inputs, due to significant memory hotspotting. It is interesting to compare the XMT with the x86 SMP and the Niagara 2, both shared-memory architectures. Both machines run a pthreads version of the basic algorithm, compiled, respectively, with the Intel C Compiler (icc) 11.1 and the Sun C compiler 5.9. On the x86 SMP, we executed the benchmarks with HyperThreading enabled, increasing the number of threads from 1 to 16 (eight per processor). On the Niagara 2, we used from 1 to 128 threads (64 per processor), but we report only the most relevant runs. The results are reported in Figs. 3 and 4, respectively. For the x86 SMP, we see that the scaling is not linear and that, in a few low matching cases, increasing the number of threads can slightly reduce the performance. The main reason for this behavior is the complex cache hierarchy of Nehalem. In particular, the shared L3 may see different access patterns when the number of threads, and hence the size of the chunks processed by each thread, changes. In general, HyperThreading brings some performance benefits but, when the number of threads outgrows the number of cores, the scaling is progressively reduced, sometimes reaching saturation. This is evident in heavy matching cases, where there is a high probability of missing in the cache and of accessing random memory locations, thus becoming limited by the memory bandwidth. There is high variability in performance between low and heavy matching cases, with results differing significantly depending on the type of input stream matched against the dictionaries.

Fig. 3. Scaling on the x86 SMP. The variability for the various dictionary/input combinations is high.

Fig. 4. Scaling on the dual Niagara 2. Light and medium matching cases reach similarly high performance, while heavy matching cases are limited to lower throughputs.

Fig. 5. Comparison among all the shared-memory solutions for the various dictionary/input combinations. Only the Cray XMT shows very stable performance. A single Tesla C1060 performs, on average, like the x86 SMP and the dual Niagara 2.

With Niagara 2, we obtain significant speedups, albeit not linear ones, with up to 80 threads. At 80 threads, we start getting reduced speedups, and over 96 threads the speedups become marginal. Niagara 2 obtains stable performance (i.e., similar results for different dictionaries and input streams) in light and medium matching conditions. However, in heavy matching conditions it does not, due to the thrashing of the small second-level cache and to memory hotspots. We compare the results of the XMT, the x86 SMP, and the Niagara 2, all shared-memory platforms, with the throughput of the optimized GPU implementation on a single Tesla C1060. A Tesla C1060 board can be considered a shared-memory platform too, since all the processing units of the GPU share the same STT. We compiled the kernel with CUDA 3.0. Fig. 5 presents this comparison, and also shows the performance variability of these architectures with our data sets. Without the optimizations presented in Section 2.2, the GPU would hardly obtain speedups with respect to a sequential x86 implementation, in particular because of the noncoalesced accesses. In general, the Tesla is competitive with both the Niagara 2 and the x86 SMP. The GPU achieves high performance especially when there is light matching and the dictionaries are small. In heavy matching conditions, the GPU is fast only with the smallest dictionaries, the 20k-pattern English and the 50k-pattern random, which fit in a single texture, thus removing any noncoalesced access and divergence due to partial caching. All the implementations that exploit caching for the STTs show high variability. The GPU implementation's worst performance is 95 percent slower than its best, the x86 SMP implementation's is 92.2 percent slower than its best, and the Niagara 2's is 90.3 percent slower than its best. Even if the GPUs and the Niagara 2 exploit multithreading to hide memory latencies, they still remain limited by their memory subsystems. The XMT with 48 processors, instead, shows a performance variability of only 2.5 percent across the different combinations of dictionaries and input streams and, on average, is faster than the GPU, the x86 SMP, and the Niagara 2. On the XMT with 128 processors, the performance rises to 28 Gbps, the highest reported in the literature for a software solution with very large dictionaries, with a variability of 12 percent.

Fig. 6. Scaling on the x86 cluster. Only light and medium matching cases saturate the InfiniBand bandwidth.

On the homogeneous cluster, using an MPI-only implementation does not allow us to fully exploit each node. This solution, in fact, requires allocating, inside each node, an MPI process for each hardware thread. However, due to the distributed-memory abstraction of MPI, this means replicating all the STT data structures. With our largest dictionaries (Dictionaries 1 and 2), the 24 GB of memory of each node is soon consumed, and no more than two or three processes can be allocated. Furthermore, with small dictionaries, even if there is significant scaling up to seven processes, the replication of the STTs increases the memory traffic and significantly limits the peak performance with respect to the pthreads implementation. The supplemental material shows the behavior of the MPI-only implementation on the x86 SMP, equivalent to a single node of the homogeneous cluster. For these reasons, for the scaling tests with the full, 10-node x86 cluster, we only show the pthreads implementation with the MPI load balancer. The GPU cluster implementation, instead, uses an MPI process for each GPU; thus, each node runs two MPI processes. Fig. 6 shows the performance obtained while increasing the number of nodes of the x86 cluster from 1 to 10 (thus, from 2 to 20 CPUs). Fig. 7, instead, shows the performance of the GPU cluster, again increasing the number of nodes from 1 to 10 (from 2 to 20 GPUs).

Fig. 7. Scaling on the GPU cluster. The InfiniBand bandwidth is saturated with at most five nodes (10 GPUs).

In the cluster implementations, the master MPI process streams the input text to the other nodes in blocks of 3 MB each. We measured the maximum internode bandwidth of our cluster with the Ohio State University MPI MicroBenchmarks [16], and we verified a maximum performance of 3 GB/s (24 Gbps) with buffers over 1 MB. We also measured the PCI-Express 2.0 bandwidth between the host processors of a node and an attached GPU with the NVIDIA benchmark, and obtained a peak of 3.7 GB/s for both host-to-device and device-to-host transfers, which is expected, since an x16 link (8 GB/s) is shared between two GPUs. The GPU-accelerated cluster reaches a saturation point with all the data sets. With an adequate number of nodes, even the heavy matching benchmarks reach the same performance as the light matching benchmarks. This does not happen for the x86 cluster, where the heavy matching benchmarks continue to scale, but never reach the performance of the light matching tests. At most five nodes (10 GPUs) are required to reach saturation. The main bottleneck appears to be the InfiniBand bandwidth. However, saturation is reached around 19 Gbps, below the maximum verified bandwidth of 24 Gbps for the network connection. The reason is the overhead generated by the MPI layer, which becomes more significant as the number of nodes increases. The x86 cluster reaches a slightly higher saturated performance than the GPU cluster, mostly because the data blocks delivered to a GPU go through an additional hop on the PCI-Express bus. The supplemental material also briefly discusses the case of known inputs and the cost/performance tradeoffs of the different systems.

4 CONCLUSIONS

We have presented several software implementations of the Aho-Corasick pattern matching algorithm for high-performance systems, and we have carefully analyzed their performance. We considered the various tradeoffs in terms of peak performance, performance variability, and data set size. We presented optimized designs for the various architectures, discussing several algorithmic strategies for shared-memory solutions, GPU-accelerated systems, and distributed-memory systems.

We found that the absolute performance obtained on the Cray XMT is one of the highest reported in the literature, at 28 Gbps (using 128 processors) for a software solution with very large dictionaries. Through multithreading and memory hashing, the XMT is able to maintain stable performance across very different sets of dictionaries and input streams. A dual Niagara 2 obtains stable performance only in low and medium matching conditions, while a dual Xeon 5560 has more varied results, obtaining high peak rates in light matching conditions, but progressively reducing its performance as the number of matches increases. Our optimized GPU implementation, which exploits the texture cache, obtains varied results depending on the dictionaries and the input streams, but reaches, on average, the same performance as the dual Niagara 2 and the dual Xeon 5560 on a single Tesla C1060. A Cray XMT machine with 48 processors is able, on average, to outperform them, while maintaining substantially the same performance (2 percent variability) on all the dictionaries and all the input sets. On clustered architectures, our GPU implementation saturates the communication bandwidth with all the data sets when using 10 GPUs. Nevertheless, even with a higher InfiniBand bandwidth, it would still be limited by the PCI-Express bandwidth. On the x86 cluster, an MPI-only solution with large dictionaries is not practical, while a mixed solution with MPI for internode communication and pthreads for intranode computation does not reach the same performance as the GPU cluster on heavy matching benchmarks. Today, software approaches for pattern matching on high-performance systems can reach high throughputs, with moderate programming effort and simpler code structures with respect to custom solutions on FPGAs and multimedia processors such as the IBM Cell/BE. By covering such a wide range of machines, we think that our work may lay a foundation for a better understanding of the behavior of such irregular parallel algorithms on modern architectures.

REFERENCES

[1] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences. Cambridge Univ. Press, 2002.
[2] Pattern Recognition and String Matching, D. Chen and X. Chen, eds. Springer, 2002.
[3] A.V. Aho and M.J. Corasick, "Efficient String Matching: An Aid to Bibliographic Search," Comm. ACM, vol. 18, no. 6, pp. 333-340, 1975.
[4] Y.H. Cho and W.H. Mangione-Smith, "Deep Packet Filter with Dedicated Logic and Read Only Memories," Proc. IEEE 12th Ann. Symp. Field-Programmable Custom Computing Machines (FCCM), 2004.
[5] C.R. Clark and D.E. Schimmel, "Scalable Pattern Matching for High Speed Networks," Proc. IEEE 12th Ann. Symp. Field-Programmable Custom Computing Machines (FCCM), 2004.
[6] O. Villa, D. Chavarría-Miranda, and K. Maschhoff, "Input-Independent, Scalable and Fast String Matching on the Cray XMT," Proc. IEEE 23rd Int'l Symp. Parallel and Distributed Processing (IPDPS), pp. 1-12, 2009.
[7] D. Pasetto, F. Petrini, and V. Agarwal, "Tools for Very Fast Regular Expression Matching," Computer, vol. 43, 2010.
[8] O. Villa, D.P. Scarpazza, and F. Petrini, "Accelerating Real-Time String Searching with Multicore Processors," Computer, vol. 41, no. 4, 2008.
[9] D.P. Scarpazza, O. Villa, and F. Petrini, "Exact Multi-Pattern String Matching on the Cell/B.E. Processor," Proc. Fifth Conf. Computing Frontiers (CF), 2008.
[10] M. Roesch, "Snort: Lightweight Intrusion Detection for Networks," Proc. 13th USENIX Conf. System Administration (LISA), 1999.
[11] N. Jacob and C. Brodley, "Offloading IDS Computation to the GPU," Proc. 22nd Ann. Computer Security Applications Conf. (ACSAC), 2006.
[12] G. Vasiliadis, S. Antonatos, M. Polychronakis, E.P. Markatos, and S. Ioannidis, "Gnort: High Performance Network Intrusion Detection Using Graphics Processors," Proc. 11th Int'l Symp. Recent Advances in Intrusion Detection (RAID), 2008.
[13] Symantec Corporation, "Symantec Global Internet Security Threat Report," whitepaper, Apr. 2009.
[14] J. Feo, D. Harper, S. Kahan, and P. Konecny, "ELDORADO," Proc. Second Conf. Computing Frontiers (CF), pp. 28-34, 2005.
[15] G. Ruetsch and P. Micikevicius, "Optimizing Matrix Transpose in CUDA," NVIDIA whitepaper, 2009.
[16] Ohio State Univ., "MPI MicroBenchmarks," http://mvapich.cse.ohio-state.edu/benchmarks/.

Antonino Tumeo received the MS degree in informatic engineering in 2005 and the PhD degree in computer engineering in 2009, both from Politecnico di Milano, Italy. Since February 2011, he has been a research scientist at Pacific Northwest National Laboratory (PNNL). He joined PNNL in 2009 as a postdoctoral research associate. Previously, he was a postdoctoral researcher at Politecnico di Milano. His research interests include modeling and simulation of high-performance architectures, hardware-software codesign, FPGA prototyping, and GPGPU computing. He is a member of the IEEE.

Oreste Villa received the MS degree in electronic engineering from the University of Cagliari, Italy, in 2003, and the ME degree in embedded systems design from the University of Lugano, Switzerland, in 2004. He joined PNNL in May 2008, after receiving the PhD degree from Politecnico di Milano for his research on designing and programming advanced multicore architectures. While working toward the PhD degree, he was an intern student at PNNL, conducting research in programming techniques and algorithms for advanced multicore architectures, cluster fault tolerance, and virtualization techniques for HPC. He is a research scientist at Pacific Northwest National Laboratory, with a research focus on computer architectures and simulation, and on accelerators for scientific computing and irregular applications. He is a member of the IEEE.

Daniel G. Chavarría-Miranda received the MS and PhD degrees in computer science from Rice University. He is a senior scientist in the High-Performance Computing Group at Pacific Northwest National Laboratory. His expertise is in programming models, compilers, and languages for large-scale HPC systems. He served as a co-PI for the DoD-funded Center for Adaptive Supercomputing Software-Multithreaded Architectures (CASS-MT), with a focus on scalable, highly irregular applications and systems software. He has also served as the principal investigator for several PNNL-funded Laboratory Directed Research and Development projects on the application of reconfigurable and hybrid systems to scientific codes. He is a member of the ACM.


More information

Maximizing NFS Scalability

Maximizing NFS Scalability Maximizing NFS Scalability on Dell Servers and Storage in High-Performance Computing Environments Popular because of its maturity and ease of use, the Network File System (NFS) can be used in high-performance

More information

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA Gilad Shainer 1, Tong Liu 1, Pak Lui 1, Todd Wilde 1 1 Mellanox Technologies Abstract From concept to engineering, and from design to

More information

GrAVity: A Massively Parallel Antivirus Engine

GrAVity: A Massively Parallel Antivirus Engine GrAVity: A Massively Parallel Antivirus Engine Giorgos Vasiliadis and Sotiris Ioannidis Institute of Computer Science, Foundation for Research and Technology Hellas, N. Plastira 100, Vassilika Vouton,

More information

A Parallel Access Method for Spatial Data Using GPU

A Parallel Access Method for Spatial Data Using GPU A Parallel Access Method for Spatial Data Using GPU Byoung-Woo Oh Department of Computer Engineering Kumoh National Institute of Technology Gumi, Korea bwoh@kumoh.ac.kr Abstract Spatial access methods

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big

More information

Splotch: High Performance Visualization using MPI, OpenMP and CUDA

Splotch: High Performance Visualization using MPI, OpenMP and CUDA Splotch: High Performance Visualization using MPI, OpenMP and CUDA Klaus Dolag (Munich University Observatory) Martin Reinecke (MPA, Garching) Claudio Gheller (CSCS, Switzerland), Marzia Rivi (CINECA,

More information

Efficient Packet Pattern Matching for Gigabit Network Intrusion Detection using GPUs

Efficient Packet Pattern Matching for Gigabit Network Intrusion Detection using GPUs 2012 IEEE 14th International Conference on High Performance Computing and Communications Efficient Packet Pattern Matching for Gigabit Network Intrusion Detection using GPUs Che-Lun Hung Dept. of Computer

More information

Optimizing LS-DYNA Productivity in Cluster Environments

Optimizing LS-DYNA Productivity in Cluster Environments 10 th International LS-DYNA Users Conference Computing Technology Optimizing LS-DYNA Productivity in Cluster Environments Gilad Shainer and Swati Kher Mellanox Technologies Abstract Increasing demand for

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Implementing a Statically Adaptive Software RAID System

Implementing a Statically Adaptive Software RAID System Implementing a Statically Adaptive Software RAID System Matt McCormick mattmcc@cs.wisc.edu Master s Project Report Computer Sciences Department University of Wisconsin Madison Abstract Current RAID systems

More information

Chapter 2 Parallel Hardware

Chapter 2 Parallel Hardware Chapter 2 Parallel Hardware Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Analyzing Cache Bandwidth on the Intel Core 2 Architecture

Analyzing Cache Bandwidth on the Intel Core 2 Architecture John von Neumann Institute for Computing Analyzing Cache Bandwidth on the Intel Core 2 Architecture Robert Schöne, Wolfgang E. Nagel, Stefan Pflüger published in Parallel Computing: Architectures, Algorithms

More information

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Limiting the Number of Dirty Cache Lines

Limiting the Number of Dirty Cache Lines Limiting the Number of Dirty Cache Lines Pepijn de Langen and Ben Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology

More information

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment Contemporary Engineering Sciences, Vol. 7, 2014, no. 24, 1415-1423 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.49174 Improved Integral Histogram Algorithm for Big Sized Images in CUDA

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Using GPUs to Accelerate Synthetic Aperture Sonar Imaging via Backpropagation

Using GPUs to Accelerate Synthetic Aperture Sonar Imaging via Backpropagation Using GPUs to Accelerate Synthetic Aperture Sonar Imaging via Backpropagation GPU Technology Conference 2012 May 15, 2012 Thomas M. Benson, Daniel P. Campbell, Daniel A. Cook thomas.benson@gtri.gatech.edu

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING. 1. Introduction. 2. Associative Cache Scheme

AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING. 1. Introduction. 2. Associative Cache Scheme AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING James J. Rooney 1 José G. Delgado-Frias 2 Douglas H. Summerville 1 1 Dept. of Electrical and Computer Engineering. 2 School of Electrical Engr. and Computer

More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

Lecture 2. Memory locality optimizations Address space organization

Lecture 2. Memory locality optimizations Address space organization Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput

More information

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,

More information

Four-Socket Server Consolidation Using SQL Server 2008

Four-Socket Server Consolidation Using SQL Server 2008 Four-Socket Server Consolidation Using SQL Server 28 A Dell Technical White Paper Authors Raghunatha M Leena Basanthi K Executive Summary Businesses of all sizes often face challenges with legacy hardware

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information

Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures

Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures Simone Secchi Marco Ceriani Antonino Tumeo Oreste Villa Gianluca Palermo Luigi Raffo Universita degli

More information

Chapter 9 Memory Management

Chapter 9 Memory Management Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual

More information

Eldorado. Outline. John Feo. Cray Inc. Why multithreaded architectures. The Cray Eldorado. Programming environment.

Eldorado. Outline. John Feo. Cray Inc. Why multithreaded architectures. The Cray Eldorado. Programming environment. Eldorado John Feo Cray Inc Outline Why multithreaded architectures The Cray Eldorado Programming environment Program examples 2 1 Overview Eldorado is a peak in the North Cascades. Internal Cray project

More information