436 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 3, MARCH 2012

Aho-Corasick String Matching on Shared and Distributed-Memory Parallel Architectures

Antonino Tumeo, Member, IEEE, Oreste Villa, Member, IEEE, and Daniel G. Chavarría-Miranda

Abstract: String matching requires a combination of (sometimes all of) the following characteristics: high and/or predictable performance, support for large data sets, and flexibility of integration and customization. This paper compares several software-based implementations of the Aho-Corasick algorithm for high-performance systems. We focus on the matching of unknown inputs streamed from a single source, typical of security applications and difficult to manage since the input cannot be preprocessed to obtain locality. We consider shared-memory architectures (Niagara 2, x86 multiprocessors, and Cray XMT) and distributed-memory architectures with homogeneous (InfiniBand cluster of x86 multicores) or heterogeneous processing elements (InfiniBand cluster of x86 multicores with NVIDIA Tesla C1060 GPUs). We describe how each solution achieves the objectives of supporting large dictionaries, sustaining high performance, and enabling customization and flexibility using various data sets.

Index Terms: Aho-Corasick, string matching, GPGPU, Cray XMT, multithreaded architectures, high-performance computing.

1 INTRODUCTION

String matching algorithms check and detect the presence of one or more known symbol sequences inside a data set. Besides their well-known application to databases and text processing, they are the basis of several critical real-world applications. String matching algorithms are key components of DNA and protein sequence analysis, data mining, security systems, such as Intrusion Detection Systems (IDS) for Networks (NIDS), Applications (APIDS), Protocols (PIDS), or Systems (Host-based IDS, HIDS), antivirus software, and machine learning problems [1], [2].
All these applications process large quantities of textual data and require extremely high performance to produce meaningful results in an acceptable time. Among all string matching algorithms, one of the most studied, especially for text processing and security applications, is the Aho-Corasick (AC) algorithm [3], due to its exact, multipattern approach and its ability to perform the search in time linearly proportional to the length of the input stream. A large amount of research has been done to design efficient implementations of string matching algorithms using Field Programmable Gate Arrays (FPGAs) [4], [5], highly multithreaded solutions like the Cray XMT [6], multicore processors [7], or heterogeneous processors like the Cell Broadband Engine [8], [9]. Recently, Graphics Processing Units (GPUs) have been demonstrated to be a suitable platform for some classes of string matching algorithms for NIDS such as SNORT [10], [11], [12]. Most previous approaches mainly focused on speed. Only in the last few years have aspects such as performance stability, independent of the size of the input and of the number of patterns to search, started to gain interest. String matching applications not only require high performance, but also the ability to deal with very large dictionaries. For example, NIDS must be able to scan inputs from modern Ethernet links at 10 Gbps, while considering a number of malicious threats already well over 1 million and growing exponentially [13].

The authors are with the High Performance Computing Group, Pacific Northwest National Laboratory (PNNL), Richland, WA. E-mail: {antonino.tumeo, oreste.villa, daniel.chavarria}@pnl.gov. Manuscript received 19 July 2010; revised 21 Feb. 2011; accepted 17 May 2011; published online 16 June. Recommended for acceptance by S. Aluru. For information on obtaining reprints of this article, please send e-mail to tpds@computer.org, and reference IEEECS Log Number TPDS . Digital Object Identifier no /TPDS .
They should do so in real time, without hindering the overall performance of the system, reducing the available bandwidth, or raising latencies. In general, hardware solutions support only small dictionary sizes due to lack of memory and are difficult to customize, while platforms such as the Cell/BE are extremely complex to program. With the emergence of multicore and multithreaded architectures and of general-purpose computation on GPUs, software-based solutions have started to become a feasible platform for high-throughput string matching applications. String matching is a good candidate for execution on these parallel architectures, since the search can be parallelized by dividing the input data set into smaller subsets, each one processed by a single thread or core. However, obtaining the maximum performance on all of them still requires a significant effort to efficiently map the algorithm to their features, and it may not even be sufficient to reach the desired throughput. Furthermore, performance variability when dealing with different sizes of inputs and dictionaries has always been a significant limit for software-based solutions. This is particularly true on cache-based architectures: if the matching patterns are resident in the cache, the matching algorithm performs very well; however, if they are not in the cache and have to be retrieved from main memory, the matching algorithm performs poorly. In many applications, when the input is not known and the data cannot be adequately preprocessed to guarantee some locality, such as data streamed from a single input source like a network, the algorithm accesses data in unpredictable locations of the main memory, leading to highly variable performance.
In this paper, we present and compare several software-based implementations of the AC algorithm for high-performance systems. We focus our attention on streaming applications with unknown inputs, a situation typical of security systems, and often problematic for the search process. We examine how each solution achieves the objectives of supporting large dictionaries (up to 190,000 patterns), obtaining high performance, enabling flexibility and customization, and limiting performance variability. We present optimized implementations of the algorithm on a range of high-performance architectures, with shared or distributed memory, and with homogeneous or heterogeneous processing elements. For the shared-memory solutions, we consider a Cray XMT with up to 128 processors (128 threads per processor), a dual-socket Niagara 2 (eight cores per processor, eight threads per core), and a dual-socket Intel Xeon 5560 (Nehalem architecture, four cores per processor, two threads per core). For the distributed-memory systems, we evaluate a homogeneous cluster of Xeon 5560 processors (10 nodes, two processors per node) interconnected through InfiniBand QDR and a heterogeneous cluster where the Xeon 5560 processors are accelerated with NVIDIA Tesla C1060 GPUs (10 nodes, two GPUs per node). This paper extends the work in [6] by introducing and comparing new platforms with various architectural features. To the best of our knowledge, no previous work on software-based string matching algorithms included such a broad evaluation in terms of high-performance systems, input sets, and algorithm optimizations. The paper is organized as follows. Section 2 presents the algorithmic design on the various machines. Section 3 discusses our experimental results and the comparison among all the machines. Section 4 presents our conclusions.
A comprehensive survey of the related work on string matching and background material on the AC algorithm and on the systems evaluated in this paper are included in the supplemental material, which can be found on the Computer Society Digital Library at doi.ieeecomputersociety.org/ /tpds .

2 ALGORITHM DESIGN AND OPTIMIZATION

In this section, we introduce the overall algorithmic design and then focus on the specific modifications and optimizations for the various platforms. Starting from the same basis, we designed shared-memory implementations using pthreads for the Niagara 2 and the dual Xeon (referred to as x86 SMP, Shared-memory Multiprocessor, in the rest of the paper), and using the proprietary programming model of the Cray XMT. For the GPU kernel, we used NVIDIA CUDA. For the heterogeneous and homogeneous clusters, instead, we designed an MPI load balancer, which can be used alone or integrated with pthreads or CUDA. For all the implementations, our algorithm design is based on two cornerstones: minimizing the number of memory references and reducing memory contention. We represent the patterns in a Deterministic Finite-state Automaton (DFA). The supplemental material, which can be found on the Computer Society Digital Library at doi.ieeecomputersociety.org/ /tpds , discusses how the automaton is generated. For each possible input symbol, there is always a valid transition to another node in the automaton. This key feature guarantees that, for each input symbol, there is always the same amount of work to perform. For a given dictionary, the data structures in main memory (DFA and input symbols) are read-only. For all the implementations, the common parallelization strategy is to use multiple threads or processes that concurrently execute the algorithm. Each thread or process has a current_node and operates on a distinct section of the input.
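The dense-automaton property described above (a valid transition for every state/symbol pair, so each input symbol costs exactly one table lookup) is what distinguishes this DFA form of Aho-Corasick from the classic goto/failure formulation. As a rough illustration, here is a minimal Python sketch (ours, not the authors' code; function names are invented) that builds a trie and collapses the failure links into a dense table via a BFS:

```python
from collections import deque

def build_dfa(patterns, alphabet_size=256):
    """Build a dense Aho-Corasick DFA: trie + failure links collapsed,
    so every (state, symbol) pair has a valid transition."""
    goto = [dict()]          # per-state map: symbol -> child
    final = [False]
    for pat in patterns:
        s = 0
        for ch in pat:
            nxt = goto[s].get(ch)
            if nxt is None:
                goto.append(dict())
                final.append(False)
                nxt = len(goto) - 1
                goto[s][ch] = nxt
            s = nxt
        final[s] = True
    dfa = [[0] * alphabet_size for _ in goto]
    fail = [0] * len(goto)
    q = deque()
    for c, s in goto[0].items():     # depth-1 states fail to the root
        dfa[0][c] = s
        q.append(s)
    while q:
        s = q.popleft()
        final[s] = final[s] or final[fail[s]]  # propagate the final flag
        for c in range(alphabet_size):
            t = goto[s].get(c)
            if t is not None:
                fail[t] = dfa[fail[s]][c]
                dfa[s][c] = t
                q.append(t)
            else:
                dfa[s][c] = dfa[fail[s]][c]    # always a valid transition

    return dfa, final

def match_count(dfa, final, text):
    """Scan the input once; constant work per input symbol."""
    s, hits = 0, 0
    for ch in text:
        s = dfa[s][ch]
        if final[s]:
            hits += 1
    return hits
```

Note that, like the paper's design, this stores a single boolean final flag per state rather than the full output set, so overlapping patterns ending at the same position count once.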
The threads of the shared-memory implementations and the threads running on a specific GPU access the same DFA, while different MPI processes and different GPUs access their own copy of the DFA. At runtime, the input stream is buffered and split into chunks, which are then assigned to each processing element. The size of the chunks depends on the granularity of the parallelization. Eventually, they are chunked again in a hierarchical way. This is, for example, what happens in the MPI with CUDA implementation, where each GPU receives a buffer and then partitions it among its own threads, or in the MPI with pthreads implementation, where each MPI process running on a node gets its own buffer, which is then partitioned among the threads executed by the different cores. The chunks overlap partially to allow matching of those patterns that cross a boundary. The overlap is equal to the length of the longest pattern in the dictionary minus 1 symbol. The inefficiency of the overlapping (replicated work) is measured as (longest pattern - 1)/(size of the chunk). As already discussed in [6], we represent the DFA graph as a State Transition Table (STT). The STT is a table with as many rows as there are nodes in the DFA and as many columns as there are symbols in the alphabet. Each STT line represents a node in the original DFA. Each entry (cell) of an STT line (indexed by a current_node and a symbol) stores the address of the beginning of the STT line that represents the next_node for that transition in the DFA. For the Cray XMT and pthreads pattern matchers, the STT lines are 256-byte aligned, such that the least-significant byte of the address is equal to zero. This property allows us to store in the least-significant bit of each STT cell the boolean information indicating whether that transition is final or not. Since we want to retrieve the next_node address by dereferencing the current_node + symbol pointer, STT lines must always have the same size.
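The pointer-tagging trick can be illustrated by treating the STT as a flat array indexed by byte offsets. The following is our Python sketch (not the paper's code): line offsets stand in for the 256-byte-aligned line addresses, so the least-significant bit is free to carry the final flag, and one lookup per symbol both advances the state and reports a match:

```python
ALPHABET = 256
CELL = 8                    # bytes per STT cell, as in the paper
LINE = ALPHABET * CELL      # 2,048 bytes per line: a multiple of 256

def build_stt(dfa, final):
    """Flatten a dense DFA (rows of next-state indices) into a
    byte-offset-addressed STT. Each cell holds the offset of the next
    state's line, with the final flag packed into the free LSB."""
    stt = []
    for row in dfa:
        for nxt in row:
            cell = nxt * LINE          # 256-byte-aligned "address"
            if final[nxt]:
                cell |= 1              # final-transition flag in the LSB
            stt.append(cell)
    return stt

def stt_match_count(stt, text):
    base, hits = 0, 0
    for ch in text:
        cell = stt[base // CELL + ch]  # dereference current_node + symbol
        hits += cell & 1               # test the final flag
        base = cell & ~1               # strip the flag: next line offset
    return hits
```

Each input symbol thus costs exactly one table load plus two bit operations, which is the constant-work-per-symbol property the design relies on.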
Alphabet symbols not used in the dictionary have to be explicitly represented as transitions to the root node. The supplemental material, which can be found on the Computer Society Digital Library at ieeecomputersociety.org/ /tpds , discusses in more detail the basic design, which is the basis of the Cray XMT and pthreads implementations.

2.1 Cray XMT Implementation

The approach for exploiting the highly multithreaded architecture of the Cray XMT focuses mostly on reducing latency variability. If the latency is constant or slowly variable, the system is able to schedule a sufficient number of threads to hide it. On the Cray XMT, the main cause of variability in the memory access time is the presence of hotspots. Hotspots are memory regions frequently accessed by multiple
threads simultaneously. The XMT employs a hardware hashing mechanism that spreads data across all the system's memory banks with a granularity of 64 bytes (one block) [14] (see the supplemental material, which can be found on the Computer Society Digital Library at ieeecomputersociety.org/ /tpds ). However, if different blocks corresponding to different memory banks have different access ratios, the pressure on the memory banks is not equally balanced, producing variability in the access time. In our implementation, there are two reasons why this can happen. First, each STT cell is 8 bytes (address of the STT line + boolean flag for final transitions); therefore, each STT line (representing a DFA node) has a size of 8 bytes multiplied by the alphabet size. If we consider the 256-symbol ASCII alphabet, each STT line requires 2,048 bytes, or 32 blocks (64 bytes per block). If the scanned input has a particular symbol frequency (e.g., English text, decimal numbers, etc.), the input symbols will only be a subset of the ASCII alphabet, and concurrent threads will access only a subset of the 32 blocks per STT line, producing hotspots. Second, typically a few states in the first levels of a Breadth-First Search (BFS) of the DFA graph are responsible for the majority of accesses. As a result, the memory blocks containing those states form hotspots. For inputs that are similar to the dictionary, instead, the transitions tend to be distributed more equally across the levels, leading to reduced or absent hotspots. The supplemental material, which can be found on the Computer Society Digital Library at TPDS , includes the analysis of the access patterns for a dictionary of 20,000 English words when scanned against different inputs. To alleviate the above problems, as shown in [6], we propose two solutions: alphabet shuffling and state replication.

Alphabet shuffling.
The alphabet symbols in an STT line can be shuffled using a relatively simple linear transformation, ensuring that contiguous symbols in the alphabet are spread out over multiple memory blocks. The shuffling function can be inexpensively and effectively computed as symbol' = (symbol + fixed_offset) * 8, taken modulo the alphabet size. This transformation, in conjunction with the hardware hashing mechanism of the XMT's memory, guarantees that accesses are spread out over distinct memory blocks for those inputs (e.g., English text, decimal numbers, etc.) that have characters located relatively close to each other in the ASCII ordering. The hardware memory hashing allows us to design a shuffling function that does not depend on the state number (the line in the STT). In contrast, systems where the memory hierarchy is not hashed, like the Niagara 2 or the x86 SMP, require more complex shuffling functions.

State replication. We replicate the STT states corresponding to the first levels of the BFS exploration of the DFA. The addresses of the different replicas of the same logical state (STT line) are randomly stored, during the creation of the STT, in the STT cells pointing to that state. This ensures that the memory pressure is equally balanced when different threads access the blocks for that state. This mechanism is greatly simplified by the underlying hardware hashing, since different replicas are spread across different memory banks. On cache-based architectures (x86 SMP and Niagara 2), these optimizations would not bring significant benefits, but would rather reduce the spatial locality of the cache.

2.2 GPU Implementation

The GPU implementation starts from the same principles as the basic design, but requires some specific adaptations. Each CUDA thread independently performs the matching on a chunk of the input text. This allows using reasonably sized chunks, while maintaining a high utilization of each thread. The main differences reside in the STT.
The rows still represent the states, and the columns still represent the symbols. However, states are addressed by indices rather than by pointers. Each cell of the table, which in our GPU code has a size of 32 bits, thus contains the index of the next state in the first 31 bits and the flag that tags final states in the last bit. In this way, we remove the requirement for the 256-byte alignment of the lines, reducing the memory usage, which is critical for GPUs, which can address only up to 4 GB. Nevertheless, the organization of the memory controllers in the CUDA architecture still requires further optimization of the STT layout. Similarly to the Cray XMT implementation, memory hotspots may occur. The reason is that each memory controller manages a 256-byte wide memory partition and, depending on the number of memory controllers and partitions, on the alphabet, on the input streams, and on the number of matches, the accesses may concentrate on a single partition, thus saturating its memory controller. This problem is known in the GPGPU community as partition camping [15]. On the Tesla C1060 there are eight partitions; thus, with rows of 256 cells (ASCII alphabet) of 32 bits each (1,024 bytes in total), a partition holds the cells for the same 64 symbols on odd or even rows. So, with inputs that have particular symbol frequencies, the accesses may concentrate on only a few partitions. To solve this problem, we added padding to each line of the STT corresponding to the size of a partition, generating a more even access pattern. The unpredictable nature of the loads on the STT makes it impossible to coalesce the accesses of half-warps into a single memory transaction: with high probability, they are neither sequential nor aligned. Thus, we decided to bind the STT to the texture memory.
Binding data to texture memory makes the GPU fetch the data through the texture units, which do not require coalescing (but still suffer from partition camping), and exploits the texture caches, which are optimized for 2D locality. The texture cache has an active set of around 8 KB and, when there is a hit, the latency for loading the data is only a few cycles (the exact performance is not disclosed by NVIDIA, but empirical evaluation puts it around 6-7 cycles). The benefits of using the texture cache derive mainly from the hits on the first levels of the STT. There is a limit on the size of textures in CUDA: when bound to linear memory, they can address up to 2^27 elements. So, when big STTs are used, we bind them to texture only partially.
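The 32-bit cell encoding and the partition-camping padding described above can be sketched as follows. This is our illustration, not the authors' CUDA code: we read "first 31 bits / last bit" as high bits for the index and the least-significant bit for the flag, and the helper names are invented.

```python
def pack_cell(next_state, is_final):
    """Pack a next-state index (31 bits) and a final flag into one
    32-bit cell: index in the high 31 bits, flag in the last bit
    (assumed bit assignment)."""
    assert 0 <= next_state < (1 << 31)
    return (next_state << 1) | int(is_final)

def unpack_cell(cell):
    return cell >> 1, bool(cell & 1)

def pad_rows(stt_rows, cell_bytes=4, partition_bytes=256):
    """Pad each 1,024-byte STT row by one 256-byte partition (64 cells of
    4 bytes) so that successive rows start in different partitions,
    spreading accesses and mitigating partition camping."""
    pad = partition_bytes // cell_bytes
    return [row + [0] * pad for row in stt_rows]
```

Dropping the pointer representation this way halves the cell size from 8 to 4 bytes, which matters on a device limited to 4 GB of addressable memory.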
Partial texture binding still allows us to retain the benefits of caching for the majority of light matching cases, without significantly influencing the performance in heavy matching cases, where there are many cache misses. Using shared memory for caching STT lines is not effective for large dictionaries: with the maximum block size of 512 threads, only eight lines (each line has a size of 4 bytes × 256 symbols = 1,024 bytes) can be stored (replicated for each block) in the 16 KB of memory. However, even for the smallest dictionary in our benchmark set (20k-pattern), the first two levels alone already have 54 lines (0.11 percent of 49,849).

Another important optimization is performed on the input text. In applications streaming data from a single source, the input text is sequentially buffered in the host memory. Thus, if the input text were directly moved to the global memory of the graphics card, each CUDA thread would start loading it with a stride corresponding to the chunk size. Consequently, the memory accesses of a half-warp would be uncoalesced. This can be thought of as having the input text organized in a matrix, where each chunk corresponds to a row. Fig. 1a shows the situation. To solve this problem, we apply a transposition after copying the input text to the global memory. The input text is transposed in groups of four symbols, due to the optimal size for loads on the Tesla C1060 GPU (T10 architecture). Since we use the ASCII alphabet, each symbol is an 8-bit character. The minimum granularity for load transactions of a half-warp (a group of 16 threads simultaneously performing memory operations) is 32 bytes (16 bits per thread), while the optimal size is 64 bytes (32 bits per thread). Thus, the input chunks are transposed in blocks of four symbols, which are then read with a single, coalesced load in the main loop of the pattern matching algorithm. The final result of the transposition, with chunks now on columns, is shown in Fig. 1b. When the number of chunks is not an integer multiple of the number of threads in a half-warp (16), we also add padding so that each row starts aligned in memory, respecting the coalescing rules. The transposition is performed directly on the GPU with a fast and optimized kernel, and it is transparent to the rest of the system. The matching results are collected per chunk. A reduction is applied to gather all the results and send them back to the host through a single memory copy operation. Again, the reduction is performed on the GPU with an optimized kernel.

Fig. 1. Optimizations on the input text for coalesced reads. The input text is transposed so that each thread reads four consecutive input symbols from its respective chunk with a single load.

2.3 Distributed-Memory Implementation

For distributed-memory architectures, we wrapped our string matching engine in an MPI load balancer. We developed a master/slave scheduler, where the master MPI process distributes the work and the slaves perform the effective computation. From the same scheduler, we developed three slightly different implementations for the various configurations of the clustered architectures:

- An MPI-only solution for the homogeneous x86 cluster, where each slave process is mapped to a core and wraps a single-threaded version of our string matching engine.
- An MPI with pthreads solution, where each slave process is mapped to an entire node of the cluster and wraps a multithreaded version of our string matching engine.
- An MPI with CUDA solution, where each slave process is mapped to a GPU and wraps a CUDA kernel.

The MPI load balancer uses a multibuffering scheme with a configurable number of buffers and buffer size. There are a few obvious reasons for this choice. As explained before, the performance of cache-based architectures in the matching process strictly depends on the number of matches in the input text.
If there are few matches, only a few states of the STT are accessed and the procedure is fast. If there are many matches, instead, many states in different memory locations are accessed, generating many cache misses and lowering the speed. This is also true for our GPU implementation, which uses the cached texture memory for the STT. Consequently, statically dividing the input text into chunks among the MPI slave processes would generate an unbalanced execution, especially if the chunks exhibit substantially different matching behavior. The slowest chunk to process would, in fact, determine the performance. A more appropriate and balanced solution is to set a relatively small fixed buffer size for the slave processes and allow the master to send new data to a slave as soon as the previous data have been consumed. Furthermore, if the computational kernels are well optimized, communication bandwidth among the MPI processes may become the main bottleneck. Thus, the only possible approach to maximize performance is to overlap communication with computation. This is accomplished by using nonblocking communication together with multiple buffers.

3 EXPERIMENTAL RESULTS

We have implemented various versions of the Aho-Corasick algorithm as described in Section 2 for the Cray XMT, the x86 SMP, the Niagara 2, the GPU cluster, and the x86 cluster. Our implementations have two main phases: building the state transition table and executing string matching against the built STT. The STT building phase is performed offline and the STT is stored in a file representation. We focus our experiments on the string matching phase, since it is the critical portion of the algorithm in real-world applications.
Fig. 2. Scaling of the optimized implementation on the Cray XMT. The optimized code scales almost linearly.

Our experiments utilize four different dictionaries:

- Dictionary 1. A 190,000-pattern data set with mostly text entries and an average length of 16 bytes.
- Dictionary 2. A 190,000-pattern data set with mixed text and binary entries and an average length of 16 bytes.
- English. A 20,000-pattern data set with the most common words from the English language and an average length of 8.5 bytes.
- Random. A 50,000-pattern data set with entries generated at random from the ASCII alphabet with a uniform distribution and an average length of 8 bytes.

Dictionaries with more text-like entries have higher frequencies of alphabetical ASCII symbols. We also use four different input streams for each dictionary:

- Text. The English text of the King James Bible.
- TCP. Captured TCP/IP traffic.
- Random. A random sample of characters from the ASCII alphabet.
- Itself. The dictionary itself, fed as an input stream for string matching.

Using the dictionary itself as an input exhibits the heaviest matching behavior, thus significantly influencing the performance of the algorithm. We report all the results in Gigabits per second (Gbps). The machines used for the evaluation of the algorithm are configured as follows:

- Niagara 2: two Niagara 2 processors (1.165 GHz), 32 GB of memory.
- XMT: 128 nodes, with a total of 1 TB of memory, Seastar2 interconnect.
- x86 SMP: two Xeon 5560 processors (2.8 GHz), 24 GB of memory.
- x86 cluster: 10 nodes, each one configured as the x86 SMP, interconnected with InfiniBand QDR (24 Gbps).
- GPU cluster: 10 nodes with two Tesla C1060 GPUs (4 GB of memory each).
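The chunk-overlap inefficiency formula from Section 2 can be checked directly against the experimental configuration (2 KB chunks, longest patterns of 16 bytes); this one-liner is our illustration:

```python
def overlap_inefficiency(longest_pattern, chunk_size):
    """Fraction of replicated work due to chunk overlap (Section 2):
    (longest pattern - 1) / (size of the chunk)."""
    return (longest_pattern - 1) / chunk_size

# Worst case in the experiments: 16-byte patterns, 2 KB (2,048-byte) chunks.
ineff = overlap_inefficiency(16, 2048)   # 15/2048, about 0.7 percent
```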
We consider a situation in which the input text is streamed from a single source and buffered in memory before being processed. This is analogous to the approach adopted in NIDS for real-time analysis, where some buffers are processed while, at the same time, others are filled with new data. Since we want to compare sustained performance, we size the buffers to minimize data starving. For this reason, we use an input buffer of 100 MB. The input is kept in the shared memory for the SMP architectures and on the master node for the MPI solutions. The input is then chunked for parallel processing. For pthreads, the input is equally divided among the threads. For the XMT and the GPU implementations, instead, each thread processes a fixed size, and we generate a sufficient number of threads to reach maximum utilization. For these platforms, we empirically evaluated the chunk size and found that 2 KB is a reasonable tradeoff in terms of the number of threads generated and the utilization of each thread. With this chunk size, as discussed in Section 2, the inefficiency due to chunk overlapping is limited to 0.7 percent in the worst case (patterns of 16 bytes). Fig. 2 shows the results of the optimized implementation for the Cray XMT, exploiting both alphabet shuffling and state replication. The performance is very close to linear scaling, with minimal differences among the various input sets for all the dictionaries. For comparison, the performance of the basic implementation on the Cray XMT (shown in the supplemental material, which can be found on the Computer Society Digital Library at ieeecomputersociety.org/ /tpds ) remains far from linear scaling. With more than 48 processors, it even presents slowdowns, in particular with the text and TCP inputs, due to significant memory hotspotting. It is interesting to compare the XMT with the x86 SMP and the Niagara 2, both shared-memory architectures.
Both machines run a pthreads version of the basic algorithm, compiled, respectively, with the Intel C Compiler (icc) 11.1 and the Sun C compiler 5.9. On the x86 SMP, we executed the benchmarks with HyperThreading enabled, increasing the number of threads from 1 to 16 (eight per processor). On the Niagara 2, we used from 1 to 128 threads (64 per processor), but we report only the most relevant runs. The results are reported in Figs. 3 and 4, respectively. For the x86 SMP, we see that the scaling is not linear and that, in a few low matching cases, increasing the number of threads can slightly reduce the performance. The main reason for this behavior is the complex cache hierarchy of Nehalem. In particular, the shared L3 may see different access patterns when changing the number of threads and the size of the chunks processed by each thread. In general, HyperThreading brings some performance benefits, but, when the number of threads outgrows the number of cores, the scaling is progressively reduced, sometimes reaching saturation. This is evident in heavy matching cases, where there is a high probability of missing in the cache and randomly accessing the memory, thus being limited by the memory bandwidth. There is high variability in performance between low and heavy matching cases, with
results significantly different depending on the type of input streams matched against the dictionaries.

Fig. 3. Scaling on the x86 SMP. The variability for the various dictionary/input combinations is high.

Fig. 4. Scaling on the dual Niagara 2. Light and medium matching cases reach similarly high performance, while heavy matching cases are limited to lower throughputs.

Fig. 5. Comparison among all the shared-memory solutions for the various dictionary/input combinations. Only the Cray XMT shows a very stable performance. A single Tesla C1060 performs, on average, like the x86 SMP and the dual Niagara 2.

With the Niagara 2, we obtain significant speedups, albeit not linear, with up to 80 threads. At 80 threads, we start getting reduced speedups, and over 96 threads the speedups become marginal. The Niagara 2 obtains stable performance (i.e., similar results for different dictionaries and input streams) in light and medium matching conditions. However, in heavy matching conditions it does not, due to the thrashing of the small second-level cache and to memory hotspots. We compare the results of the XMT, the x86 SMP, and the Niagara 2, all shared-memory platforms, to the throughput of the optimized GPU implementation on a single Tesla C1060. A Tesla C1060 board can be considered a shared-memory platform too, since all the processing units of the GPU share the same STT. We compiled the kernel with CUDA 3.0. Fig. 5 presents this comparison, and also shows the performance variability of these architectures with our data sets. Without the optimizations presented in Section 2.2, the GPU would hardly obtain speedups with respect to a sequential x86 implementation, in particular because of the noncoalesced accesses. In general, the Tesla is competitive with both the Niagara 2 and the x86 SMP.
The GPU achieves high performance especially when there is light matching and the dictionaries are small. In heavy matching conditions, the GPU is fast only with the smallest dictionaries, the 20k-pattern English and the 50k-pattern random, which fit in a single texture, thus removing any noncoalesced access and divergence due to partial caching. All the implementations that exploit caching for the STTs show high variability. The GPU implementation's worst performance is 95 percent slower than its best, the x86 SMP implementation's 92.2 percent slower than its best, and the Niagara 2's 90.3 percent slower than its best. Even if the GPUs and the Niagara 2 exploit multithreading to hide the memory latencies, they still remain limited by their memory subsystems. The XMT with 48 processors, instead, shows a performance variability of only 2.5 percent across the different combinations of dictionaries and input streams and, on average, is faster than the GPU, the x86 SMP, and the Niagara 2. On the XMT with 128 processors, the performance rises up to 28 Gbps, the highest reported in the literature for a software solution with very large dictionaries, with a variability of 12 percent.
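The "X percent slower than its best" figures above suggest a spread metric of the form (best - worst)/best; the exact formula used in the paper is not stated, so the following Python sketch and its sample numbers are our assumptions for illustration only:

```python
def variability(throughputs_gbps):
    """Spread between best- and worst-case throughput, as a fraction of
    the best case: (best - worst) / best. Assumed metric, matching the
    phrasing 'X percent slower than its best'."""
    best, worst = max(throughputs_gbps), min(throughputs_gbps)
    return (best - worst) / best

# Hypothetical throughputs whose worst case is 95 percent below the best.
gpu_like = [8.0, 5.0, 2.0, 0.4]
```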
Fig. 6. Scaling on the x86 cluster. Only light and medium matching cases saturate the InfiniBand bandwidth.
On the homogeneous cluster, an MPI-only implementation does not allow us to fully exploit each node. This solution, in fact, requires allocating, inside each node, an MPI process for each hardware thread. Due to the distributed-memory abstraction of MPI, however, this means replicating all the STT data structures. With our largest dictionaries (Dictionaries 1 and 2), the 24 GB of memory of each node is soon exhausted, and no more than two or three processes can be allocated. Furthermore, with small dictionaries, even if there is significant scaling up to seven processes, the replication of the STTs increases the memory traffic and significantly limits the peak performance with respect to the pthreads implementation. The supplemental material, which can be found on the Computer Society Digital Library at /TPDS , shows the behavior of the MPI-only implementation on the x86 SMP, equivalent to a single node of the homogeneous cluster. For these reasons, for the scaling tests with the full, 10-node x86 cluster, we only show the pthreads implementation with the MPI load balancer. The GPU cluster implementation, instead, uses an MPI process for each GPU; thus, each node runs two MPI processes. Fig. 6 shows the performance obtained while increasing the number of nodes of the x86 cluster from 1 to 10 (thus, from 2 to 20 CPUs). Fig. 7, instead, shows the performance of the GPU cluster, again increasing the number of nodes from 1 to 10 (from 2 to 20 GPUs). In the cluster implementations, the master MPI process streams the input text to the other nodes in blocks of 3 MB each. We measured the maximum internode bandwidth of our cluster with the Ohio State University MPI MicroBenchmarks [16], and verified a maximum performance of 3 GB/s (24 Gbps) with buffers over 1 MB.
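The master streams the input in fixed 3 MB blocks. For workers to match blocks independently, a pattern straddling a block boundary must still be seen whole by some worker; a common way to guarantee this (the paper does not spell out its boundary handling, so this is a hedged sketch) is to overlap consecutive blocks by one byte less than the longest pattern, with each worker reporting only matches that start inside its own block:

```python
def stream_blocks(data, block_size, max_pattern_len):
    """Split an input stream into (offset, block) work units.
    Consecutive blocks overlap by max_pattern_len - 1 bytes, so any match
    crossing a boundary is fully contained in the earlier block; a worker
    then reports only matches whose start offset falls in
    [offset, offset + block_size)."""
    blocks, start = [], 0
    while start < len(data):
        end = min(start + block_size + max_pattern_len - 1, len(data))
        blocks.append((start, data[start:end]))
        start += block_size
    return blocks
```

With the paper's parameters, `block_size` would be 3 * 2**20 and `max_pattern_len` the length of the longest dictionary entry.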
We also measured the PCI-Express 2.0 bandwidth between the host processors on a node and an attached GPU with the NVIDIA benchmark, obtaining a peak of 3.7 GB/s for both host-to-device and device-to-host transfers, which is expected since an x16 link (8 GB/s) is shared between two GPUs. The GPU-accelerated cluster reaches a saturation point with all the data sets. With an adequate number of nodes, even the heavy matching benchmarks reach the same performance as the light matching benchmarks. This does not happen on the x86 cluster, where the heavy matching benchmarks continue to scale but never reach the performance of the light matching tests. At most five nodes (10 GPUs) are required to reach saturation. The main bottleneck appears to be the InfiniBand bandwidth. However, saturation is reached around 19 Gbps, below the maximum verified bandwidth of 24 Gbps for the network connection. The reason is the overhead generated by the MPI layer, which becomes more significant as the number of nodes increases. The x86 cluster reaches a slightly higher saturated performance than the GPU cluster, mostly because the data blocks delivered to the GPUs go through an additional hop on the PCI-Express bus. The supplemental material, which can be found on the Computer Society Digital Library at ieeecomputersociety.org/ /tpds , also briefly discusses the case of known inputs and the cost/performance tradeoffs of the different systems.

4 CONCLUSIONS

We have presented several software implementations of the Aho-Corasick pattern matching algorithm for high-performance systems, and carefully analyzed their performance. We considered the various tradeoffs in terms of peak performance, performance variability, and data set size. We presented optimized designs for the various architectures, discussing several algorithmic strategies for shared-memory solutions, GPU-accelerated systems, and distributed-memory systems.
Fig. 7. Scaling on the GPU cluster.
The InfiniBand bandwidth is saturated with at most five nodes (10 GPUs).
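The bandwidth figures reported above mix units: network numbers are quoted in Gbps (gigabits per second) and bus numbers in GB/s. A quick sketch of the conversions behind them, assuming decimal units as the quoted figures suggest, and an even split of the x16 link between the two GPUs on a node:

```python
def gbps_from_gbytes(gb_per_s):
    """Convert GB/s to Gbps: 3 GB/s of InfiniBand payload is 24 Gbps."""
    return gb_per_s * 8

def per_gpu_pcie_ceiling(link_gb_per_s=8.0, gpus_per_link=2):
    """An x16 PCI-Express 2.0 link (~8 GB/s) shared by two GPUs leaves
    ~4 GB/s each, consistent with the measured 3.7 GB/s peak."""
    return link_gb_per_s / gpus_per_link
```

The gap between the 4 GB/s ceiling and the measured 3.7 GB/s, like the gap between 24 Gbps and the observed 19 Gbps saturation, reflects protocol and software overheads.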
We found that the absolute performance obtained on the Cray XMT is one of the highest reported in the literature, at 28 Gbps (using 128 processors) for a software solution with very large dictionaries. Through multithreading and memory hashing, the XMT is able to maintain stable performance across very different sets of dictionaries and input streams. A dual Niagara 2 obtains stable performance only in light and medium matching conditions, while a dual Xeon 5560 has more varied results, obtaining high peak rates in light matching conditions but progressively losing performance as the number of matches increases. Our optimized GPU implementation, which exploits the texture cache, obtains varied results depending on the dictionaries and the input streams, but on a single Tesla C1060 reaches, on average, the same performance as the dual Niagara 2 and the dual Xeon 5560. A Cray XMT machine with 48 processors is able, on average, to outperform them, while maintaining substantially the same performance (2 percent variability) on all the dictionaries and input sets. On clustered architectures, our GPU implementation saturates the communication bandwidth with all the data sets when using 10 GPUs. Nevertheless, even with a higher InfiniBand bandwidth, it would still be limited by the PCI-Express bandwidth. On the x86 cluster, an MPI-only solution with large dictionaries is not practical, while a mixed solution with MPI for internode communication and pthreads for intranode computation does not reach the same performance as the GPU cluster on heavy matching benchmarks. Today, software approaches for pattern matching on high-performance systems can reach high throughputs with moderate programming effort and simpler code structures with respect to custom solutions on FPGAs and multimedia processors such as the IBM Cell/B.E.
By covering such a wide range of machines, we think that our work may lay a foundation for a better understanding of the behavior of such irregular parallel algorithms on modern architectures.

REFERENCES

[1] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences. Cambridge Univ. Press.
[2] Pattern Recognition and String Matching, D. Chen and X. Chen, eds. Springer.
[3] A.V. Aho and M.J. Corasick, "Efficient String Matching: An Aid to Bibliographic Search," Comm. ACM, vol. 18, no. 6.
[4] Y.H. Cho and W.H. Mangione-Smith, "Deep Packet Filter with Dedicated Logic and Read Only Memories," Proc. IEEE 12th Ann. Symp. Field-Programmable Custom Computing Machines (FCCM).
[5] C.R. Clark and D.E. Schimmel, "Scalable Pattern Matching for High Speed Networks," Proc. IEEE 12th Ann. Symp. Field-Programmable Custom Computing Machines (FCCM).
[6] O. Villa, D. Chavarría-Miranda, and K. Maschhoff, "Input-Independent, Scalable and Fast String Matching on the Cray XMT," Proc. IEEE 23rd Int'l Symp. Parallel and Distributed Processing (IPDPS), pp. 1-12.
[7] D. Pasetto, F. Petrini, and V. Agarwal, "Tools for Very Fast Regular Expression Matching," Computer, vol. 43.
[8] O. Villa, D.P. Scarpazza, and F. Petrini, "Accelerating Real-Time String Searching with Multicore Processors," Computer, vol. 41, no. 4.
[9] D.P. Scarpazza, O. Villa, and F. Petrini, "Exact Multi-Pattern String Matching on the Cell/B.E. Processor," Proc. Fifth Conf. Computing Frontiers (CF).
[10] M. Roesch, "Snort: Lightweight Intrusion Detection for Networks," Proc. 13th USENIX Conf. System Administration (LISA).
[11] N. Jacob and C. Brodley, "Offloading IDS Computation to the GPU," Proc. 22nd Ann. Computer Security Applications Conf. (ACSAC).
[12] G. Vasiliadis, S. Antonatos, M. Polychronakis, E.P. Markatos, and S. Ioannidis, "Gnort: High Performance Network Intrusion Detection Using Graphics Processors," Proc. 11th Int'l Symp. Recent Advances in Intrusion Detection (RAID).
[13] Symantec Corporation, Symantec Global Internet Security Threat Report, whitepaper, Apr.
[14] J. Feo, D. Harper, S. Kahan, and P. Konecny, "ELDORADO," Proc. Second Conf. Computing Frontiers (CF).
[15] G. Ruetsch and P. Micikevicius, Optimizing Matrix Transpose in CUDA, NVIDIA whitepaper.
[16] Ohio State Univ. MPI MicroBenchmarks, ohio-state.edu/benchmarks/.

Antonino Tumeo received the MS degree in informatic engineering in 2005 and the PhD degree in computer engineering in 2009, both from Politecnico di Milano, Italy. Since February 2011, he has been a research scientist at Pacific Northwest National Laboratory (PNNL). He joined PNNL in 2009 as a postdoctoral research associate; previously, he was a postdoctoral researcher at Politecnico di Milano. His research interests include modeling and simulation of high-performance architectures, hardware-software codesign, FPGA prototyping, and GPGPU computing. He is a member of the IEEE.

Oreste Villa received the MS degree in electronic engineering in 2003 from the University of Cagliari, Italy, and the ME degree in embedded systems design in 2004 from the University of Lugano, Switzerland. He joined PNNL in May 2008, after receiving the PhD degree from Politecnico di Milano for his research on designing and programming advanced multicore architectures. While working toward the PhD degree, he was an intern at PNNL, conducting research in programming techniques and algorithms for advanced multicore architectures, cluster fault tolerance, and virtualization techniques for HPC. He is a research scientist at Pacific Northwest National Laboratory, with a research focus on computer architectures and simulation, and on accelerators for scientific computing and irregular applications. He is a member of the IEEE.

Daniel G. Chavarría-Miranda received the MS and PhD degrees in computer science from Rice University. He is a senior scientist in the High-Performance Computing Group at Pacific Northwest National Laboratory. His expertise is in programming models, compilers, and languages for large-scale HPC systems. He served as a co-PI for the DoD-funded Center for Adaptive Supercomputing Software-Multithreaded Architectures (CASS-MT), with a focus on scalable, highly irregular applications and systems software. He has also served as the principal investigator for several PNNL-funded Laboratory Directed Research and Development projects on the application of reconfigurable and hybrid systems to scientific codes. He is a member of the ACM.

For more information on this or any other computing topic, please visit our Digital Library at
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationUsing GPUs to Accelerate Synthetic Aperture Sonar Imaging via Backpropagation
Using GPUs to Accelerate Synthetic Aperture Sonar Imaging via Backpropagation GPU Technology Conference 2012 May 15, 2012 Thomas M. Benson, Daniel P. Campbell, Daniel A. Cook thomas.benson@gtri.gatech.edu
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationAN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING. 1. Introduction. 2. Associative Cache Scheme
AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING James J. Rooney 1 José G. Delgado-Frias 2 Douglas H. Summerville 1 1 Dept. of Electrical and Computer Engineering. 2 School of Electrical Engr. and Computer
More informationLS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance
11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton
More informationTo Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs
To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com
More informationThe Use of Cloud Computing Resources in an HPC Environment
The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes
More informationLecture 2. Memory locality optimizations Address space organization
Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput
More informationEffective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management
International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,
More informationFour-Socket Server Consolidation Using SQL Server 2008
Four-Socket Server Consolidation Using SQL Server 28 A Dell Technical White Paper Authors Raghunatha M Leena Basanthi K Executive Summary Businesses of all sizes often face challenges with legacy hardware
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationPerformance potential for simulating spin models on GPU
Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational
More informationAnalyzing the Performance of IWAVE on a Cluster using HPCToolkit
Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,
More informationIntra-MIC MPI Communication using MVAPICH2: Early Experience
Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University
More informationDistributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne
Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel
More informationExploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures
Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures Simone Secchi Marco Ceriani Antonino Tumeo Oreste Villa Gianluca Palermo Luigi Raffo Universita degli
More informationChapter 9 Memory Management
Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual
More informationEldorado. Outline. John Feo. Cray Inc. Why multithreaded architectures. The Cray Eldorado. Programming environment.
Eldorado John Feo Cray Inc Outline Why multithreaded architectures The Cray Eldorado Programming environment Program examples 2 1 Overview Eldorado is a peak in the North Cascades. Internal Cray project
More information