IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 3, MARCH 2012

Aho-Corasick String Matching on Shared and Distributed-Memory Parallel Architectures

Antonino Tumeo, Member, IEEE, Oreste Villa, Member, IEEE, and Daniel G. Chavarría-Miranda

Abstract: String matching requires a combination of (sometimes all) the following characteristics: high and/or predictable performance, support for large data sets, and flexibility of integration and customization. This paper compares several software-based implementations of the Aho-Corasick algorithm for high-performance systems. We focus on the matching of unknown inputs streamed from a single source, typical of security applications and difficult to manage, since the input cannot be preprocessed to obtain locality. We consider shared-memory architectures (Niagara 2, x86 multiprocessors, and Cray XMT) and distributed-memory architectures with homogeneous (InfiniBand cluster of x86 multicores) or heterogeneous processing elements (InfiniBand cluster of x86 multicores with NVIDIA Tesla C1060 GPUs). We describe how each solution achieves the objectives of supporting large dictionaries, sustaining high performance, and enabling customization and flexibility using various data sets.

Index Terms: Aho-Corasick, string matching, GPGPU, Cray XMT, multithreaded architectures, high-performance computing.

The authors are with the High Performance Computing Group, Pacific Northwest National Laboratory (PNNL), Richland, WA. E-mail: {antonino.tumeo, oreste.villa, daniel.chavarria}@pnl.gov. Manuscript received 19 July 2010; revised 21 Feb. 2011; accepted 17 May 2011; published online 16 June 2011. Recommended for acceptance by S. Aluru.

1 INTRODUCTION

String matching algorithms check and detect the presence of one or more known symbol sequences inside a data set. Besides their well-known application to databases and text processing, they are the basis of several critical real-world applications. String matching algorithms are key components of DNA and protein sequence analysis, data mining, security systems, such as Intrusion Detection Systems (IDS) for Networks (NIDS), Applications (APIDS), Protocols (PIDS), or Systems (Host-based IDS, HIDS), antivirus software, and machine learning problems [1], [2]. All these applications process large quantities of textual data and require extremely high performance to produce meaningful results in an acceptable time. Among all string matching algorithms, one of the most studied, especially for text processing and security applications, is the Aho-Corasick (AC) algorithm [3], due to its exact, multipattern approach and its ability to perform the search in time linearly proportional to the length of the input stream. A large amount of research has been done to design efficient implementations of string matching algorithms using Field Programmable Gate Arrays (FPGAs) [4], [5], highly multithreaded solutions like the Cray XMT [6], multicore processors [7], or heterogeneous processors like the Cell Broadband Engine [8], [9]. Recently, Graphic Processing Units (GPUs) have been demonstrated to be a suitable platform for some classes of string matching algorithms for NIDS such as SNORT [10], [11], [12]. Most previous approaches mainly focused on speed. Only in the last few years have aspects such as performance stability, independent of the size of the input and of the number of patterns to search, started to gain interest. In fact, string matching applications require not only high performance, but also the ability to deal with very large dictionaries. For example, NIDS must be able to scan inputs from modern Ethernet links at 10 Gbps, while considering a number of malicious threats already well over 1 million and growing exponentially [13]. They should do so in real time, without hindering the overall performance of the system, reducing the available bandwidth, or raising latencies. In general, hardware solutions support only small dictionary sizes due to lack of memory and are difficult to customize, while platforms such as the Cell/BE are extremely complex to program. With the emergence of multicore and multithreaded architectures and of general-purpose computation on GPUs, software-based solutions have started to become a feasible platform for high-throughput string matching applications. String matching is a good candidate for execution on these parallel architectures, since the search can be parallelized by dividing the input data set into smaller subsets, each one processed by a single thread or core. However, obtaining the maximum performance on all of them still requires a significant effort to efficiently map the algorithm to their features, and it may not even be sufficient to reach the desired throughput. Furthermore, performance variability when dealing with different sizes of inputs and dictionaries has always been a significant limit for software-based solutions. This is particularly true on cache-based architectures: if the matching patterns are resident in the cache, the matching algorithm performs very well; however, if they are not in the cache and have to be retrieved from main memory, the matching algorithm performs poorly. In many applications, when the input is not known and the data cannot be adequately preprocessed to guarantee some locality, such as data streamed from a single input source like a network, the algorithm accesses data in unpredictable locations of the main memory, leading to highly variable performance.

In this paper, we present and compare several software-based implementations of the AC algorithm for high-performance systems. We focus our attention on streaming applications with unknown inputs, a situation typical of security systems and often problematic for the search process. We look carefully at how each solution achieves the objectives of supporting large dictionaries (up to 190,000 patterns), obtaining high performance, enabling flexibility and customization, and limiting performance variability. We present optimized implementations of the algorithm on a range of high-performance architectures, with shared or distributed memory, and with homogeneous or heterogeneous processing elements. For the shared-memory solutions, we consider a Cray XMT with up to 128 processors (128 threads per processor), a dual-socket Niagara 2 (eight cores per processor, eight threads per core), and a dual-socket Intel Xeon 5560 (Nehalem architecture, four cores per processor, two threads per core). For the distributed-memory systems, we evaluate a homogeneous cluster of Xeon 5560 processors (10 nodes, two processors per node) interconnected through InfiniBand QDR, and a heterogeneous cluster where the Xeon 5560 processors are accelerated with NVIDIA Tesla C1060 GPUs (10 nodes, two GPUs per node). This paper extends the work in [6] by introducing and comparing new platforms with various architectural features. To the best of our knowledge, no previous work on software-based string matching algorithms included such a broad evaluation in terms of high-performance systems, input sets, and algorithm optimizations. The paper is organized as follows. Section 2 presents the algorithmic design on the various machines. Section 3 discusses our experimental results and the comparison among all the machines. Section 4 presents our conclusions. A comprehensive survey of the related work on string matching, and background material on the AC algorithm and on the systems evaluated in this paper, are included in the supplemental material, which can be found on the Computer Society Digital Library.

2 ALGORITHM DESIGN AND OPTIMIZATION

In this section, we introduce the overall algorithmic design and then focus on the specific modifications and optimizations for the various platforms. Starting from the same basis, we designed shared-memory implementations using pthreads for the Niagara 2 and the dual Xeon (referred to as x86 SMP, Shared-memory Multiprocessor, in the rest of the paper), and using the proprietary programming model of the Cray XMT for that machine. For the GPU kernel, we used NVIDIA CUDA. For the heterogeneous and homogeneous clusters, instead, we designed an MPI load balancer, which can be used alone or integrated with pthreads or CUDA. For all the implementations, our algorithm design is based on the following cornerstones:

- minimize the number of memory references;
- reduce memory contention.

We represent the patterns in a Deterministic Finite-state Automaton (DFA). The supplemental material discusses how the automaton is generated; a sketch of the standard construction follows this paragraph. For each possible input symbol, there is always a valid transition to another node in the automaton. This key feature guarantees that, for each input symbol, there is always the same amount of work to perform.
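As an illustration of this property (our sketch of the standard Aho-Corasick construction, not the authors' code; array sizes and names are assumptions), the following builds a fully specified automaton by turning missing trie edges into copies of the failure state's edges during a breadth-first traversal:

```c
#include <string.h>

#define ALPHA     256   /* ASCII alphabet */
#define MAX_NODES 1024  /* assumption: large enough for the demo patterns */

static int trie[MAX_NODES][ALPHA];  /* -1 marks a missing edge during build */
static int fail_link[MAX_NODES];
static int is_final[MAX_NODES];
static int n_nodes;

static void ac_init(void) {
    n_nodes = 1;                              /* node 0 is the root */
    memset(is_final, 0, sizeof is_final);
    memset(trie[0], -1, sizeof trie[0]);
}

static void ac_add_pattern(const unsigned char *p) {
    int s = 0;
    for (; *p; p++) {
        if (trie[s][*p] == -1) {              /* allocate a new trie node */
            memset(trie[n_nodes], -1, sizeof trie[n_nodes]);
            trie[s][*p] = n_nodes++;
        }
        s = trie[s][*p];
    }
    is_final[s] = 1;                          /* a pattern ends here */
}

/* After this BFS, every (state, symbol) pair has a valid transition, so
   every input symbol causes exactly the same amount of work. */
static void ac_build_dfa(void) {
    int queue[MAX_NODES], head = 0, tail = 0;
    for (int c = 0; c < ALPHA; c++) {
        if (trie[0][c] == -1) trie[0][c] = 0; /* unused symbols go to root */
        else { fail_link[trie[0][c]] = 0; queue[tail++] = trie[0][c]; }
    }
    while (head < tail) {
        int s = queue[head++];
        is_final[s] |= is_final[fail_link[s]];    /* inherit suffix matches */
        for (int c = 0; c < ALPHA; c++) {
            int t = trie[s][c];
            if (t == -1) trie[s][c] = trie[fail_link[s]][c];
            else { fail_link[t] = trie[fail_link[s]][c]; queue[tail++] = t; }
        }
    }
}
```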
For a given dictionary, the data structures in main memory (DFA and input symbols) are read-only. For all the implementations, the common parallelization strategy is to use multiple threads or processes that concurrently execute the algorithm. Each thread or process has a current_node and operates on a distinct section of the input. The threads of the shared-memory implementations and the threads running on a specific GPU access the same DFA, while different MPI processes and different GPUs access their own copy of the DFA. At runtime, the input stream is buffered and split into chunks, which are then assigned to each processing element. The size of the chunks depends on the granularity of the parallelization. Eventually, they are chunked again in a hierarchical way. This is, for example, what happens in the MPI with CUDA implementation, where each GPU receives a buffer and then partitions it among its own threads, or in the MPI with pthreads implementation, where each MPI process running on a node gets its own buffer that is then partitioned among the threads executed by the different cores. The chunks overlap partially to allow matching of those patterns that cross a boundary. The overlap is equal to the length of the longest pattern in the dictionary minus 1 symbol. The inefficiency of the overlapping (replicated work) is measured as (longest_pattern - 1)/chunk_size. As already discussed in [6], we represent the DFA graph as a State Transition Table (STT). The STT is a table composed of as many rows as there are nodes in the DFA and as many columns as there are symbols in the alphabet. Each STT line represents a node of the original DFA. Each entry (cell) of an STT line (indexed by a current_node and a symbol) stores the address of the beginning of the STT line that represents the next_node for that transition in the DFA. For the Cray XMT and pthreads pattern matchers, the STT lines are 256-byte aligned, such that the least-significant byte of the address is equal to zero. This property allows us to store, in the least-significant bit of each STT cell, the Boolean information indicating whether that transition is final. Since we want to retrieve the next_node address by dereferencing the current_node + symbol pointer, STT lines must always have the same size. Alphabet symbols not used in the dictionary have to be explicitly represented as transitions to the root node. The supplemental material discusses in deeper detail this basic design, which is at the base of the Cray XMT and pthreads implementations; a minimal sketch of the resulting inner loop follows.
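With this layout, the inner loop reduces to one load per input symbol: the cell read at current[symbol] yields both the address of the next STT line and, in its least-significant bit, the final-transition flag. This is an illustrative sketch with names of our choosing, not the authors' code:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t stt_cell;  /* address of the next STT line | final bit */

/* Scan one input chunk against the STT rooted at 'stt_root'. */
static size_t match_chunk(const stt_cell *stt_root,
                          const unsigned char *input, size_t len)
{
    const stt_cell *current = stt_root;  /* current_node */
    size_t matches = 0;
    for (size_t i = 0; i < len; i++) {
        stt_cell cell = current[input[i]];   /* one memory reference */
        matches += (size_t)(cell & 1u);      /* LSB: final transition */
        current = (const stt_cell *)(uintptr_t)(cell & ~(stt_cell)1);
    }
    return matches;
}
```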

2.1 Cray XMT Implementation

The approach for exploiting the highly multithreaded architecture of the Cray XMT focuses mostly on reducing latency variability. If the latency is constant or slowly varying, the system is able to schedule a sufficient number of threads to hide it. On the Cray XMT, the main cause of variability in the memory access time is the presence of hotspots, that is, memory regions frequently accessed by multiple threads simultaneously. The XMT employs a hardware hashing mechanism which spreads data across all the system's memory banks with a granularity of 64 bytes (one block) [14] (see the supplemental material). However, if different blocks corresponding to different memory banks have different access ratios, the pressure on the memory banks is not equally balanced, producing variability in the access time. In our implementation, there are two reasons why this can happen.

- Each STT cell is 8 bytes wide (the address of the STT line plus the Boolean flag for final transitions). Therefore, each STT line (representing a DFA node) has a size of 8 bytes multiplied by the alphabet size. If we consider the 256-symbol ASCII alphabet, each STT line requires 2,048 bytes, or 32 blocks (64 bytes per block). If the scanned input has a particular symbol frequency (e.g., English text, decimal numbers, etc.), the input symbols will only be a subset of the ASCII alphabet, and concurrent threads will access only a subset of the 32 blocks per STT line, producing hotspots.
- Typically, a few states in the first levels of a Breadth-First Search (BFS) of the DFA graph are responsible for the majority of accesses. As a result, the memory blocks containing those states form hotspots. For inputs that are similar to the dictionary, instead, the transitions tend to be distributed more equally across the levels, leading to reduced or absent hotspots.

The supplemental material includes the analysis of the access patterns for a dictionary of 20,000 English words when scanned against different inputs. To alleviate the above problems, as shown in [6], we propose the following solutions:

- Alphabet shuffling. The alphabet symbols in an STT line can be shuffled using a relatively simple linear transformation, ensuring that contiguous symbols in the alphabet are spread out over multiple memory blocks. The shuffling function can be inexpensively and effectively computed as symbol' = (symbol + fixed_offset) × 8. This transformation, in conjunction with the hardware hashing mechanism of the XMT's memory, guarantees that accesses are spread out over distinct memory blocks for those inputs (e.g., English text, decimal numbers, etc.) that have characters located relatively close to each other in the ASCII ordering. The hardware memory hashing allows us to design a shuffling function that does not depend on the state number (line in the STT). In contrast, systems where the memory hierarchy is not hashed, like the Niagara 2 or the x86 SMP, require more complex shuffling functions; a code sketch follows at the end of this subsection.
- State replication. We replicate the STT states corresponding to the first levels of the BFS exploration of the DFA. The addresses of the different replicas of the same logical state (STT line) are randomly stored, during the creation of the STT, in the STT cells pointing to that state. This ensures that the memory pressure is equally balanced when different threads access the blocks for that state. This mechanism is greatly simplified by the underlying hardware hashing, since different replicas are spread across different memory banks.

On cache-based architectures (x86 SMP and Niagara 2), these optimizations would not bring significant benefits, but would rather reduce the spatial locality of the cache.
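A sketch of how these two optimizations could look in code. The rotation below is a stand-in permutation of ours that spreads neighboring symbols eight cells (64 bytes, one block) apart; the paper's exact shuffling function and replication bookkeeping may differ, and pick_replica is likewise an assumption. The essential point is that the same shuffle is applied when the STT columns are filled and when they are indexed during the scan.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in shuffle: an 8-bit rotate-left by 3 is a permutation that places
   contiguous symbols 8 cells (= 64 bytes, one block) apart in the line. */
static inline unsigned shuffle(unsigned symbol) {
    return ((symbol << 3) | (symbol >> 5)) & 0xFF;
}

/* Scan-time lookup: identical to the basic loop, but with a shuffled
   column index. The STT must have been built with the same function. */
static inline uint64_t stt_lookup(const uint64_t *line, unsigned char sym) {
    return line[shuffle(sym)];
}

/* State replication: at STT-build time, every cell that points to a hot
   state receives the address of a randomly chosen replica of that state,
   spreading the threads' accesses across memory banks. */
static uint64_t pick_replica(const uint64_t *replica_addr, int n_replicas) {
    return replica_addr[rand() % n_replicas];
}
```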
2.2 GPU Implementation

The GPU implementation starts from the same principles as the basic design, but requires some specific adaptations. Each CUDA thread independently performs the matching on a chunk of the input text. This allows using reasonably sized chunks, while maintaining a high utilization of each thread. The main differences reside in the STT. The rows still represent the states, and the columns still represent the symbols. However, states are addressed by indices rather than by pointers. Each cell of the table, which in our GPU code is 32 bits wide, thus contains the index of the next state in the first 31 bits and the flag that tags final states in the last bit. This removes the requirement for the 256-byte alignment of the lines, reducing the memory usage, which is critical for GPUs that can address only up to 4 GB. Nevertheless, the organization of the memory controllers in the CUDA architecture still requires further optimization of the STT layout. As in the Cray XMT implementation, memory hotspots may occur. The reason is that each memory controller manages a 256-byte wide memory partition and, depending on the number of memory controllers and partitions, on the alphabet, on the input streams, and on the number of matches, the accesses may concentrate on a single partition, saturating its memory controller. This problem is known in the GPGPU community as partition camping [15]. On the Tesla C1060 there are eight partitions; thus, with rows of 256 cells (ASCII alphabet) of 32 bits each (1,024 bytes in total), a partition holds the cells for the same 64 symbols on odd or even rows. So, with inputs that have particular symbol frequencies, the accesses may concentrate on only a few partitions. To solve this problem, we added padding to each line of the STT corresponding to the size of a partition, generating a more even access pattern. The unpredictable nature of the loads on the STT makes it impossible to coalesce the accesses of half-warps in a single memory transaction: with high probability, they are neither sequential nor aligned. Thus, we decided to bind the STT to the texture memory. Binding data to texture memory makes the GPU fetch the data through the texture units, which do not require coalescing (but still suffer from partition camping), and exploit the texture caches, which are optimized for 2D locality. The texture cache has an active set of around 8 KB and, when there is a hit, the latency for loading the data is only a few cycles (the exact figure is not disclosed by NVIDIA, but empirical evaluation puts it around 6-7 cycles). The benefits of using the texture cache derive mainly from the hits on the first levels of the STT. There is a limit on the size of textures in CUDA: when bound to linear memory, they can address up to 2^27 elements. So, when big STTs are used, we bind them to texture only partially. This allows us to retain the benefits of caching for the majority of light matching cases, without significantly influencing the performance in heavy matching cases, where there are many cache misses.
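A hedged sketch of the resulting device loop (kernel and variable names are ours; chunk overlap and the per-line padding described above are omitted for brevity). Note that the input is still read in its untransposed layout here, which the input transposition discussed below addresses:

```cuda
texture<unsigned int, 1, cudaReadModeElementType> stt_tex; /* STT bound here */

__global__ void match_kernel(const unsigned char *input, int chunk_len,
                             long total_len, unsigned int *match_count)
{
    long tid   = blockIdx.x * blockDim.x + threadIdx.x;
    long start = tid * chunk_len;
    if (start >= total_len) return;

    unsigned int state = 0, matches = 0;
    for (int i = 0; i < chunk_len && start + i < total_len; i++) {
        unsigned char sym = input[start + i];        /* uncoalesced here */
        /* one texture fetch per symbol; row pitch is 256 cells here */
        unsigned int cell = tex1Dfetch(stt_tex, (int)(state * 256u + sym));
        matches += cell & 1u;   /* last bit tags final transitions */
        state    = cell >> 1;   /* first 31 bits: next-state index  */
    }
    match_count[tid] = matches;
}
```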

Using shared memory for caching STT lines is not effective for large dictionaries: with the maximum block size of 512 threads, only eight lines (each line has a size of 4 bytes × 256 symbols = 1,024 bytes) can be stored (replicated for each block) in the 16 KB of shared memory. However, for the smallest dictionary in our benchmark set (20k-pattern), the first two levels alone already have 54 lines (0.11 percent of 49,849). Another important optimization is performed on the input text. In applications streaming data from a single source, the input text is sequentially buffered in the host memory. Thus, if the input text is directly moved to the global memory of the graphics card, each CUDA thread would start loading it with a stride corresponding to the chunk size. Consequently, the memory accesses of a half-warp would be uncoalesced. This can be thought of as having the input text organized in a matrix, where each chunk corresponds to a row. Fig. 1a shows the situation. To solve this problem, we apply a transposition after copying the input text to the global memory. However, the input text is transposed in groups of four symbols. This is due to the optimal size for loads on the Tesla C1060 GPU (T10 architecture). Since we use the ASCII alphabet, each symbol is an 8-bit character. The minimum granularity for load transactions of a half-warp (a group of 16 threads simultaneously performing memory operations) is 32 bytes (16 bits per thread), while the optimal size is 64 bytes (32 bits per thread). Thus, the input chunks are transposed in blocks of four symbols, which are then read with a single, coalesced load in the main loop of the pattern matching algorithm. The final result of the transposition, with chunks now on columns, is shown in Fig. 1b. When the number of chunks is not an integer multiple of the number of threads in a half-warp (16), we also add padding so that each row starts aligned in memory, to respect the coalescing rules. The transposition is performed directly on the GPU with a fast and optimized kernel, and it is transparent to the rest of the system; a naive version is sketched below.

Fig. 1. Optimizations on the input text for coalesced reads. The input text is transposed so that each thread reads four consecutive input symbols from its respective chunk with a single load.

The matching results are collected per chunk. A reduction is applied to gather all the results and send them back to the host through a single memory copy operation. Again, the reduction is performed on the GPU with an optimized kernel.
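The following is a naive sketch of such a transposition kernel (ours, not the paper's optimized version, which would use shared-memory tiles in the style of [15]); it assumes the number of chunks has already been padded to a multiple of 16:

```cuda
/* in:  chunk-major layout; chunk c occupies groups [c*groups, (c+1)*groups) */
/* out: transposed layout; group g of chunk c lands at out[g*n_chunks + c], */
/*      so 16 threads with consecutive c read 64 consecutive bytes at once. */
__global__ void transpose_by_4(const uchar4 *in, uchar4 *out,
                               int n_chunks, int groups_per_chunk)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;  /* chunk index    */
    int g = blockIdx.y * blockDim.y + threadIdx.y;  /* 4-symbol group */
    if (c < n_chunks && g < groups_per_chunk)
        out[g * n_chunks + c] = in[c * groups_per_chunk + g];
}
```

In the matching loop, thread c then fetches its next four symbols with the single coalesced load out[g * n_chunks + c], incrementing g once every four symbols consumed.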
2.3 Distributed-Memory Implementation

For distributed-memory architectures, we wrapped our string matching engine in an MPI load balancer. We developed a master/slave scheduler, where the master MPI process distributes the work and the slaves perform the actual computation. From the same scheduler, we derived three slightly different implementations for the various configurations of the clustered architectures:

- an MPI-only solution for the homogeneous x86 cluster, where each slave process is mapped to a core and wraps a single-threaded version of our string matching engine;
- an MPI with pthreads solution, where each slave process is mapped to an entire node of the cluster and wraps a multithreaded version of our string matching engine;
- an MPI with CUDA solution, where each slave process is mapped to a GPU and wraps a CUDA kernel.

The MPI load balancer uses a multibuffering scheme with a configurable number of buffers and a configurable buffer size; a minimal version is sketched after this paragraph. There are a few obvious reasons for this choice. As explained before, the performance of cache-based architectures in the matching process strictly depends on the number of matches in the input text. If there are few matches, only a few states of the STT are accessed and the procedure is fast. If there are many matches, instead, many states in different memory locations are accessed, generating many cache misses and lowering the speed. This is also true for our GPU implementation, which uses the cached texture memory for the STT. Consequently, statically dividing the input text into chunks among the MPI slave processes would generate an unbalanced execution, especially if the chunks exhibit substantially different matching behavior. The slowest chunk to process would, in fact, determine the overall performance. A more appropriate and balanced solution is to set a relatively small fixed buffer size for the slave processes and allow the master to send new data to a slave as soon as the previous data have been consumed. Furthermore, if the computational kernels are well optimized, communication bandwidth among the MPI processes may become the main bottleneck. Thus, the only possible approach to maximize performance is to overlap communication with computation. This is accomplished by using nonblocking communication together with multiple buffers.
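A minimal sketch of the master side of this scheme (the buffer count and the fill_next_block/next_idle_slave helpers are our stand-ins, not the paper's code): several nonblocking sends are kept in flight, and each buffer is refilled as soon as its send completes, overlapping communication with the slaves' computation.

```c
#include <mpi.h>

#define NBUF  4                   /* assumption: number of in-flight buffers */
#define BUFSZ (3 * 1024 * 1024)   /* 3 MB blocks, as in the experiments */

extern int fill_next_block(char *buf, int max_len);  /* returns 0 at end */
extern int next_idle_slave(void); /* rank of a slave ready for more data */

void master(void)
{
    static char buf[NBUF][BUFSZ];
    MPI_Request req[NBUF];
    int b, n;

    for (b = 0; b < NBUF; b++) {  /* prime all buffers */
        req[b] = MPI_REQUEST_NULL;
        n = fill_next_block(buf[b], BUFSZ);
        if (n > 0)
            MPI_Isend(buf[b], n, MPI_CHAR, next_idle_slave(), 0,
                      MPI_COMM_WORLD, &req[b]);
    }
    for (;;) {                    /* refill a buffer as soon as it is free */
        MPI_Waitany(NBUF, req, &b, MPI_STATUS_IGNORE);
        if (b == MPI_UNDEFINED) break;  /* nothing left in flight */
        n = fill_next_block(buf[b], BUFSZ);
        if (n == 0) break;              /* input stream exhausted */
        MPI_Isend(buf[b], n, MPI_CHAR, next_idle_slave(), 0,
                  MPI_COMM_WORLD, &req[b]);
    }
    MPI_Waitall(NBUF, req, MPI_STATUSES_IGNORE); /* drain remaining sends */
}
```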

3 EXPERIMENTAL RESULTS

We have implemented various versions of the Aho-Corasick algorithm, as described in Section 2, for the Cray XMT, the x86 SMP, the Niagara 2, the GPU cluster, and the x86 cluster. Our implementations have two main phases: building the state transition table and executing string matching against the built STT. The STT building phase is performed offline, and the resulting STT is stored in a file representation. We focus our experiments on the string matching phase, since it is the critical portion of the algorithm in real-world applications.

Our experiments utilize four different dictionaries:

- Dictionary 1. A 190,000-pattern data set with mostly text entries and an average length of 16 bytes.
- Dictionary 2. A 190,000-pattern data set with mixed text and binary entries and an average length of 16 bytes.
- English. A 20,000-pattern data set with the most common words from the English language and an average length of 8.5 bytes.
- Random. A 50,000-pattern data set with entries generated at random from the ASCII alphabet with a uniform distribution and an average length of 8 bytes.

Dictionaries with more text-like entries have higher frequencies of alphabetical ASCII symbols. We also use four different input streams for each dictionary:

- Text, which corresponds to the English text of the King James Bible.
- TCP, which corresponds to captured TCP/IP traffic.
- Random, which corresponds to a random sample of characters from the ASCII alphabet.
- Itself, which corresponds to feeding the dictionary itself as the input stream for string matching.

Using the dictionary itself as an input exhibits the heaviest matching behavior, thus significantly influencing the performance of the algorithm. We report all the results in Gigabits per second (Gbps). The machines used for the evaluation of the algorithm are configured as follows:

- Niagara 2: two Niagara 2 processors (1.165 GHz), 32 GB of memory.
- XMT: 128 nodes, with a total of 1 TB of memory, Seastar2 interconnect.
- x86 SMP: two Xeon 5560 processors (2.8 GHz), 24 GB of memory.
- x86 cluster: 10 nodes, each one configured as the x86 SMP, interconnected with InfiniBand QDR (24 Gbps).
- GPU cluster: 10 nodes with two Tesla C1060 boards (4 GB of memory each).

We consider a situation in which the input text is streamed from a single source and buffered in memory before being processed. This is analogous to the approach adopted in NIDS for real-time analysis, where some buffers are processed while, at the same time, others are filled with new data. Since we want to compare sustained performance, we size the buffers to minimize data starvation. For this reason, we use an input buffer of 100 MB. The input is kept in shared memory for the SMP architectures and on the master node for the MPI solutions. The input is then chunked for parallel processing. For pthreads, the input is equally divided among the threads. For the XMT and the GPU implementations, instead, each thread processes a fixed-size chunk, and we generate a sufficient number of threads to reach maximum utilization. For these platforms, we empirically evaluated the chunk size and found that 2 KB is a reasonable tradeoff between the number of threads generated and the utilization of each thread. With this chunk size, as discussed in Section 2, the inefficiency due to chunk overlapping is limited to 0.7 percent in the worst case (16-byte patterns: 15/2,048 ≈ 0.73 percent).

Fig. 2 shows the results of the optimized implementation for the Cray XMT, exploiting both alphabet shuffling and state replication. The performance is very close to linear scaling, with minimal differences among the various input sets for all the dictionaries. For comparison, the performance of the basic implementation on the Cray XMT (shown in the supplemental material) remains far from linear scaling.

Fig. 2. Scaling of the optimized implementation on the Cray XMT. The optimized code scales almost linearly.
With more than 48 processors, it even presents slowdowns, in particular with the text and TCP inputs, due to significant memory hotspotting. It is interesting to compare the XMT with the x86 SMP and the Niagara 2, both shared-memory architectures. Both machines run a pthreads version of the basic algorithm, compiled, respectively, with the Intel C Compiler (icc) 11.1 and the Sun C compiler 5.9. On the x86 SMP, we executed the benchmarks with HyperThreading enabled, increasing the number of threads from 1 to 16 (eight per processor). On the Niagara 2, we used from 1 to 128 threads (64 per processor), but we report only the most relevant runs. The results are reported in Figs. 3 and 4, respectively. For the x86 SMP, we see that the scaling is not linear and that, in a few low matching cases, increasing the number of threads can slightly reduce the performance. The main reason for this behavior is the complex cache hierarchy of Nehalem. In particular, the shared L3 may see different access patterns when the number of threads, and hence the size of the chunks processed by each thread, changes. In general, HyperThreading brings some performance benefits but, when the number of threads outgrows the number of cores, the scaling is progressively reduced, sometimes reaching saturation. This is evident in heavy matching cases, where there is a high probability of missing in the cache and of accessing random memory locations, thus becoming limited by the memory bandwidth. There is high variability in performance between low and heavy matching cases, with results differing significantly depending on the type of input stream matched against the dictionaries.

Fig. 3. Scaling on the x86 SMP. The variability for the various dictionary/input combinations is high.

Fig. 4. Scaling on the dual Niagara 2. Light and medium matching cases reach similarly high performance, while heavy matching cases are limited to lower throughputs.

Fig. 5. Comparison among all the shared-memory solutions for the various dictionary/input combinations. Only the Cray XMT shows very stable performance. A single Tesla C1060 performs, on average, like the x86 SMP and the dual Niagara 2.

With Niagara 2, we obtain significant speedups, albeit not linear ones, with up to 80 threads. At 80 threads, we start getting reduced speedups, and over 96 threads the speedups become marginal. Niagara 2 obtains stable performance (i.e., similar results for different dictionaries and input streams) in light and medium matching conditions. However, in heavy matching conditions it does not, due to the thrashing of the small second-level cache and to memory hotspots. We compare the results of the XMT, the x86 SMP, and the Niagara 2, all shared-memory platforms, with the throughput of the optimized GPU implementation on a single Tesla C1060. A Tesla C1060 board can be considered a shared-memory platform too, since all the processing units of the GPU share the same STT. We compiled the kernel with CUDA 3.0. Fig. 5 presents this comparison, and also shows the performance variability of these architectures with our data sets. Without the optimizations presented in Section 2.2, the GPU would hardly obtain speedups with respect to a sequential x86 implementation, in particular because of the noncoalesced accesses. In general, the Tesla is competitive with both the Niagara 2 and the x86 SMP. The GPU achieves high performance especially when there is light matching and the dictionaries are small. In heavy matching conditions, the GPU is fast only with the smallest dictionaries, the 20k-pattern English and the 50k-pattern random, which fit in a single texture, thus removing any noncoalesced access and divergence due to partial caching. All the implementations that exploit caching for the STTs show high variability. The GPU implementation's worst performance is 95 percent slower than its best, the x86 SMP implementation's is 92.2 percent slower than its best, and the Niagara 2's is 90.3 percent slower than its best. Even if the GPUs and the Niagara 2 exploit multithreading to hide memory latencies, they still remain limited by their memory subsystems. The XMT with 48 processors, instead, shows a performance variability of only 2.5 percent across the different combinations of dictionaries and input streams and, on average, is faster than the GPU, the x86 SMP, and the Niagara 2. On the XMT with 128 processors, the performance rises to 28 Gbps, the highest reported in the literature for a software solution with very large dictionaries, with a variability of 12 percent.

Fig. 6. Scaling on the x86 cluster. Only light and medium matching cases saturate the InfiniBand bandwidth.

On the homogeneous cluster, using an MPI-only implementation does not allow us to fully exploit each node. This solution, in fact, requires allocating, inside each node, an MPI process for each hardware thread. However, due to the distributed-memory abstraction of MPI, this means replicating all the STT data structures. With our largest dictionaries (Dictionaries 1 and 2), the 24 GB of memory of each node is soon consumed, and no more than two or three processes can be allocated. Furthermore, with small dictionaries, even if there is significant scaling up to seven processes, the replication of the STTs increases the memory traffic and significantly limits the peak performance with respect to the pthreads implementation. The supplemental material shows the behavior of the MPI-only implementation on the x86 SMP, equivalent to a single node of the homogeneous cluster. For these reasons, for the scaling tests with the full, 10-node x86 cluster, we only show the pthreads implementation with the MPI load balancer. The GPU cluster implementation, instead, uses an MPI process for each GPU; thus, each node runs two MPI processes. Fig. 6 shows the performance obtained while increasing the number of nodes of the x86 cluster from 1 to 10 (thus, from 2 to 20 CPUs). Fig. 7, instead, shows the performance of the GPU cluster, again increasing the number of nodes from 1 to 10 (from 2 to 20 GPUs).

Fig. 7. Scaling on the GPU cluster. The InfiniBand bandwidth is saturated with at most five nodes (10 GPUs).

In the cluster implementations, the master MPI process streams the input text to the other nodes in blocks of 3 MB each. We measured the maximum internode bandwidth of our cluster with the Ohio State University MPI MicroBenchmarks [16], and we verified a maximum performance of 3 GB/s (24 Gbps) with buffers over 1 MB. We also measured the PCI-Express 2.0 bandwidth between the host processors of a node and an attached GPU with the NVIDIA benchmark, and obtained a peak of 3.7 GB/s for both host-to-device and device-to-host transfers, which is expected, since an x16 link (8 GB/s) is shared between two GPUs. The GPU-accelerated cluster reaches a saturation point with all the data sets. With an adequate number of nodes, even the heavy matching benchmarks reach the same performance as the light matching benchmarks. This does not happen for the x86 cluster, where the heavy matching benchmarks continue to scale, but never reach the performance of the light matching tests. At most five nodes (10 GPUs) are required to reach saturation. The main bottleneck appears to be the InfiniBand bandwidth. However, saturation is reached around 19 Gbps, below the maximum verified bandwidth of 24 Gbps for the network connection. The reason is the overhead generated by the MPI layer, which becomes more significant as the number of nodes increases. The x86 cluster reaches a slightly higher saturated performance than the GPU cluster, mostly because the data blocks delivered to a GPU go through an additional hop on the PCI-Express bus. The supplemental material also briefly discusses the case of known inputs and the cost/performance tradeoffs of the different systems.

4 CONCLUSIONS

We have presented several software implementations of the Aho-Corasick pattern matching algorithm for high-performance systems, and we have carefully analyzed their performance. We considered the various tradeoffs in terms of peak performance, performance variability, and data set size. We presented optimized designs for the various architectures, discussing several algorithmic strategies for shared-memory solutions, GPU-accelerated systems, and distributed-memory systems.

We found that the absolute performance obtained on the Cray XMT is one of the highest reported in the literature, at 28 Gbps (using 128 processors) for a software solution with very large dictionaries. Through multithreading and memory hashing, the XMT is able to maintain stable performance across very different sets of dictionaries and input streams. A dual Niagara 2 obtains stable performance only in low and medium matching conditions, while a dual Xeon 5560 has more varied results, obtaining high peak rates in light matching conditions, but progressively reducing its performance as the number of matches increases. Our optimized GPU implementation, which exploits the texture cache, obtains varied results depending on the dictionaries and the input streams, but reaches, on average, the same performance as the dual Niagara 2 and the dual Xeon 5560 on a single Tesla C1060. A Cray XMT machine with 48 processors is able, on average, to outperform them, while maintaining substantially the same performance (2 percent variability) on all the dictionaries and all the input sets. On clustered architectures, our GPU implementation saturates the communication bandwidth with all the data sets when using 10 GPUs. Nevertheless, even with a higher InfiniBand bandwidth, it would still be limited by the PCI-Express bandwidth. On the x86 cluster, an MPI-only solution with large dictionaries is not practical, while a mixed solution with MPI for internode communication and pthreads for intranode computation does not reach the same performance as the GPU cluster on heavy matching benchmarks. Today, software approaches for pattern matching on high-performance systems can reach high throughputs, with moderate programming effort and simpler code structures with respect to custom solutions on FPGAs and multimedia processors such as the IBM Cell/BE. By covering such a wide range of machines, we think that our work may lay a foundation for a better understanding of the behavior of such irregular parallel algorithms on modern architectures.

REFERENCES

[1] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences. Cambridge Univ. Press, 2002.
[2] Pattern Recognition and String Matching, D. Chen and X. Chen, eds. Springer, 2002.
[3] A.V. Aho and M.J. Corasick, "Efficient String Matching: An Aid to Bibliographic Search," Comm. ACM, vol. 18, no. 6, pp. 333-340, 1975.
[4] Y.H. Cho and W.H. Mangione-Smith, "Deep Packet Filter with Dedicated Logic and Read Only Memories," Proc. IEEE 12th Ann. Symp. Field-Programmable Custom Computing Machines (FCCM), 2004.
[5] C.R. Clark and D.E. Schimmel, "Scalable Pattern Matching for High Speed Networks," Proc. IEEE 12th Ann. Symp. Field-Programmable Custom Computing Machines (FCCM), 2004.
[6] O. Villa, D. Chavarría-Miranda, and K. Maschhoff, "Input-Independent, Scalable and Fast String Matching on the Cray XMT," Proc. IEEE 23rd Int'l Symp. Parallel and Distributed Processing (IPDPS), pp. 1-12, 2009.
[7] D. Pasetto, F. Petrini, and V. Agarwal, "Tools for Very Fast Regular Expression Matching," Computer, vol. 43, 2010.
[8] O. Villa, D.P. Scarpazza, and F. Petrini, "Accelerating Real-Time String Searching with Multicore Processors," Computer, vol. 41, no. 4, 2008.
[9] D.P. Scarpazza, O. Villa, and F. Petrini, "Exact Multi-Pattern String Matching on the Cell/B.E. Processor," Proc. Fifth Conf. Computing Frontiers (CF), 2008.
[10] M. Roesch, "Snort: Lightweight Intrusion Detection for Networks," Proc. 13th USENIX Conf. System Administration (LISA), 1999.
[11] N. Jacob and C. Brodley, "Offloading IDS Computation to the GPU," Proc. 22nd Ann. Computer Security Applications Conf. (ACSAC), 2006.
[12] G. Vasiliadis, S. Antonatos, M. Polychronakis, E.P. Markatos, and S. Ioannidis, "Gnort: High Performance Network Intrusion Detection Using Graphics Processors," Proc. 11th Int'l Symp. Recent Advances in Intrusion Detection (RAID), 2008.
[13] Symantec Corporation, "Symantec Global Internet Security Threat Report," whitepaper, Apr. 2009.
[14] J. Feo, D. Harper, S. Kahan, and P. Konecny, "ELDORADO," Proc. Second Conf. Computing Frontiers (CF), pp. 28-34, 2005.
[15] G. Ruetsch and P. Micikevicius, "Optimizing Matrix Transpose in CUDA," NVIDIA whitepaper, 2009.
[16] Ohio State Univ., "MPI MicroBenchmarks," http://mvapich.cse.ohio-state.edu/benchmarks/.

Antonino Tumeo received the MS degree in informatic engineering in 2005 and the PhD degree in computer engineering in 2009, both from Politecnico di Milano, Italy. Since February 2011, he has been a research scientist at Pacific Northwest National Laboratory (PNNL). He joined PNNL in 2009 as a postdoctoral research associate. Previously, he was a postdoctoral researcher at Politecnico di Milano. His research interests include modeling and simulation of high-performance architectures, hardware-software codesign, FPGA prototyping, and GPGPU computing. He is a member of the IEEE.

Oreste Villa received the MS degree in electronic engineering from the University of Cagliari, Italy, in 2003, and the ME degree in embedded systems design from the University of Lugano, Switzerland, in 2004. He joined PNNL in May 2008, after receiving the PhD degree from Politecnico di Milano for his research on designing and programming advanced multicore architectures. While working toward the PhD degree, he was an intern student at PNNL, conducting research in programming techniques and algorithms for advanced multicore architectures, cluster fault tolerance, and virtualization techniques for HPC. He is a research scientist at Pacific Northwest National Laboratory, with a research focus on computer architectures and simulation, and on accelerators for scientific computing and irregular applications. He is a member of the IEEE.

Daniel G. Chavarría-Miranda received the MS and PhD degrees in computer science from Rice University. He is a senior scientist in the High-Performance Computing Group at Pacific Northwest National Laboratory. His expertise is in programming models, compilers, and languages for large-scale HPC systems. He served as a co-PI for the DoD-funded Center for Adaptive Supercomputing Software-Multithreaded Architectures (CASS-MT), with a focus on scalable, highly irregular applications and systems software. He has also served as the principal investigator for several PNNL-funded Laboratory Directed Research and Development projects on the application of reconfigurable and hybrid systems to scientific codes. He is a member of the ACM.


More information

Maximizing NFS Scalability

Maximizing NFS Scalability Maximizing NFS Scalability on Dell Servers and Storage in High-Performance Computing Environments Popular because of its maturity and ease of use, the Network File System (NFS) can be used in high-performance

More information

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA Gilad Shainer 1, Tong Liu 1, Pak Lui 1, Todd Wilde 1 1 Mellanox Technologies Abstract From concept to engineering, and from design to

More information

GrAVity: A Massively Parallel Antivirus Engine

GrAVity: A Massively Parallel Antivirus Engine GrAVity: A Massively Parallel Antivirus Engine Giorgos Vasiliadis and Sotiris Ioannidis Institute of Computer Science, Foundation for Research and Technology Hellas, N. Plastira 100, Vassilika Vouton,

More information

A Parallel Access Method for Spatial Data Using GPU

A Parallel Access Method for Spatial Data Using GPU A Parallel Access Method for Spatial Data Using GPU Byoung-Woo Oh Department of Computer Engineering Kumoh National Institute of Technology Gumi, Korea bwoh@kumoh.ac.kr Abstract Spatial access methods

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big

More information

Splotch: High Performance Visualization using MPI, OpenMP and CUDA

Splotch: High Performance Visualization using MPI, OpenMP and CUDA Splotch: High Performance Visualization using MPI, OpenMP and CUDA Klaus Dolag (Munich University Observatory) Martin Reinecke (MPA, Garching) Claudio Gheller (CSCS, Switzerland), Marzia Rivi (CINECA,

More information

Efficient Packet Pattern Matching for Gigabit Network Intrusion Detection using GPUs

Efficient Packet Pattern Matching for Gigabit Network Intrusion Detection using GPUs 2012 IEEE 14th International Conference on High Performance Computing and Communications Efficient Packet Pattern Matching for Gigabit Network Intrusion Detection using GPUs Che-Lun Hung Dept. of Computer

More information

Optimizing LS-DYNA Productivity in Cluster Environments

Optimizing LS-DYNA Productivity in Cluster Environments 10 th International LS-DYNA Users Conference Computing Technology Optimizing LS-DYNA Productivity in Cluster Environments Gilad Shainer and Swati Kher Mellanox Technologies Abstract Increasing demand for

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Implementing a Statically Adaptive Software RAID System

Implementing a Statically Adaptive Software RAID System Implementing a Statically Adaptive Software RAID System Matt McCormick mattmcc@cs.wisc.edu Master s Project Report Computer Sciences Department University of Wisconsin Madison Abstract Current RAID systems

More information

Chapter 2 Parallel Hardware

Chapter 2 Parallel Hardware Chapter 2 Parallel Hardware Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Analyzing Cache Bandwidth on the Intel Core 2 Architecture

Analyzing Cache Bandwidth on the Intel Core 2 Architecture John von Neumann Institute for Computing Analyzing Cache Bandwidth on the Intel Core 2 Architecture Robert Schöne, Wolfgang E. Nagel, Stefan Pflüger published in Parallel Computing: Architectures, Algorithms

More information

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Limiting the Number of Dirty Cache Lines

Limiting the Number of Dirty Cache Lines Limiting the Number of Dirty Cache Lines Pepijn de Langen and Ben Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology

More information

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment Contemporary Engineering Sciences, Vol. 7, 2014, no. 24, 1415-1423 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.49174 Improved Integral Histogram Algorithm for Big Sized Images in CUDA

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Using GPUs to Accelerate Synthetic Aperture Sonar Imaging via Backpropagation

Using GPUs to Accelerate Synthetic Aperture Sonar Imaging via Backpropagation Using GPUs to Accelerate Synthetic Aperture Sonar Imaging via Backpropagation GPU Technology Conference 2012 May 15, 2012 Thomas M. Benson, Daniel P. Campbell, Daniel A. Cook thomas.benson@gtri.gatech.edu

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING. 1. Introduction. 2. Associative Cache Scheme

AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING. 1. Introduction. 2. Associative Cache Scheme AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING James J. Rooney 1 José G. Delgado-Frias 2 Douglas H. Summerville 1 1 Dept. of Electrical and Computer Engineering. 2 School of Electrical Engr. and Computer

More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

Lecture 2. Memory locality optimizations Address space organization

Lecture 2. Memory locality optimizations Address space organization Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput

More information

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,

More information

Four-Socket Server Consolidation Using SQL Server 2008

Four-Socket Server Consolidation Using SQL Server 2008 Four-Socket Server Consolidation Using SQL Server 28 A Dell Technical White Paper Authors Raghunatha M Leena Basanthi K Executive Summary Businesses of all sizes often face challenges with legacy hardware

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information

Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures

Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures Simone Secchi Marco Ceriani Antonino Tumeo Oreste Villa Gianluca Palermo Luigi Raffo Universita degli

More information

Chapter 9 Memory Management

Chapter 9 Memory Management Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual

More information

Eldorado. Outline. John Feo. Cray Inc. Why multithreaded architectures. The Cray Eldorado. Programming environment.

Eldorado. Outline. John Feo. Cray Inc. Why multithreaded architectures. The Cray Eldorado. Programming environment. Eldorado John Feo Cray Inc Outline Why multithreaded architectures The Cray Eldorado Programming environment Program examples 2 1 Overview Eldorado is a peak in the North Cascades. Internal Cray project

More information