An Efficient AC Algorithm with GPU

Similar documents
The GPU-based Parallel Calculation of Gravity and Magnetic Anomalies for 3D Arbitrary Bodies

Speech Recognition Based on Efficient DTW Algorithm and Its DSP Implementation

Efficient Packet Pattern Matching for Gigabit Network Intrusion Detection using GPUs

Design of New Oscillograph based on FPGA

Design of three-dimensional reconstruction system for farm production

A Hand Gesture Recognition Method Based on Multi-Feature Fusion and Template Matching

Gregex: GPU based High Speed Regular Expression Matching Engine

DSP-Based Parallel Processing Model of Image Rotation

Research on Sine Dynamic Torque Measuring System

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Improved Parallel Rabin-Karp Algorithm Using Compute Unified Device Architecture

Parallelization of K-Means Clustering Algorithm for Data Mining

Development of a Rapid Design System for Aerial Work Truck Subframe with UG Secondary Development Framework

GrAVity: A Massively Parallel Antivirus Engine

Implementation of String Match Algorithm BMH on GPU Using CUDA

Implementation and Optimization of LZW Compression Algorithm Based on Bridge Vibration Data

Preliminary Research on Distributed Cluster Monitoring of G/S Model

The Design of Embedded MCU Network Measure and Control System

Research on monitoring technology of Iu-PS interface in WCDMA network

View-dependent fast real-time generating algorithm for large-scale terrain

A Batched GPU Algorithm for Set Intersection

A Fast Video Illumination Enhancement Method Based on Simplified VEC Model

Performance Analysis of Sobel Edge Detection Filter on GPU using CUDA & OpenGL

Optimization solutions for the segmented sum algorithmic function

GPU-based NFA Implementation for Memory Efficient High Speed Regular Expression Matching

Analysis of the Power Consumption for Wireless Sensor Network Node Based on Zigbee

A Capability-Based Hybrid CPU/GPU Pattern Matching Algorithm for Deep Packet Inspection

A Rapid Automatic Image Registration Method Based on Improved SIFT

Design of three-dimensional photoelectric stylus micro-displacement measuring system

Implementation and Improvement of DSR in Ipv6

An Improved Frequent Pattern-growth Algorithm Based on Decomposition of the Transaction Database

A Multicast Routing Algorithm for 3D Network-on-Chip in Chip Multi-Processors

Available online at ScienceDirect. Procedia Engineering 99 (2015 )

VFS Interceptor: Dynamically Tracing File System Operations in real. environments

An Improved PageRank Method based on Genetic Algorithm for Web Search

The Analysis and Detection of Double JPEG2000 Compression Based on Statistical Characterization of DWT Coefficients

Available online at ScienceDirect. Procedia Computer Science 98 (2016 )

Multipattern String Matching On A GPU

Research on software development platform based on SSH framework structure

Accelerating koblinger's method of compton scattering on GPU

Data Processing System to Network Supported Collaborative Design

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

Detecting Computer Viruses using GPUs

Accelerating String Matching Using Multi-threaded Algorithm

A Fast Speckle Reduction Algorithm based on GPU for Synthetic Aperture Sonar

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

Citation for the original published paper (version of record):

Open Access Research on the Data Pre-Processing in the Network Abnormal Intrusion Detection

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation

Configurable String Matching Hardware for Speeding up Intrusion Detection

Performance Study of GPUs in Real-Time Trigger Applications for HEP Experiments

Network Topology for the Application Research of Electrical Control System Fault Propagation

A GPU Based Memory Optimized Parallel Method For FFT Implementation

The IIC interface based on ATmega8 realizes the applications of PS/2 keyboard/mouse in the system

CUDA (Compute Unified Device Architecture)

Available online at ScienceDirect. Energy Procedia 69 (2015 )

A new Class of Priority-based Weighted Fair Scheduling Algorithm

THE DESIGN AND IMPLEMENTATION OF AN EXTENSIBLE FORMAT FORMEMORY DUMPS

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Advanced and parallel architectures

Optimization of Cone Beam CT Reconstruction Algorithm Based on CUDA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

Available online at ScienceDirect. Procedia Manufacturing 6 (2016 ) 33 38

The Community Library Anniance Based on Cloud Computing

Design of a Chinese Input Method on the Remote Controller Based on the Embedded System

FSRM Feedback Algorithm based on Learning Theory

State of the Art for String Analysis and Pattern Search Using CPU and GPU Based Programming

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

Unknown Malicious Code Detection Based on Bayesian

Main idea. Demonstrate how malware can increase its robustness against detection by taking advantage of the ubiquitous Graphics Processing Unit (GPU)

Efficient Lists Intersection by CPU- GPU Cooperative Computing

Design and Simulation Based on Pro/E for a Hydraulic Lift Platform in Scissors Type

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

Research on Embedded CNC Device Based on ARM and FPGA

GPU Programming Using NVIDIA CUDA

An Efficient Image Processing Method Based on Web Services for Mobile Devices

An Extended Byte Carry Labeling Scheme for Dynamic XML Data

Available online at ScienceDirect. Procedia Engineering 111 (2015 )

Hybrid ant colony optimization algorithm for two echelon vehicle routing problem

Recognition of Human Body Movements Trajectory Based on the Three-dimensional Depth Data

Metric and Identification of Spatial Objects Based on Data Fields

A GPU Implementation of Tiled Belief Propagation on Markov Random Fields. Hassan Eslami Theodoros Kasampalis Maria Kotsifakou

A paralleled algorithm based on multimedia retrieval

GPU 101. Mike Bailey. Oregon State University. Oregon State University. Computer Graphics gpu101.pptx. mjb April 23, 2017

GPU 101. Mike Bailey. Oregon State University

LUNAR TEMPERATURE CALCULATIONS ON A GPU

Research on the Application of Interactive Electronic Whiteboard in Network Teaching

Research on Load Balancing in Task Allocation Process in Heterogeneous Hadoop Cluster

Research on Technologies in Smart Substation

A Practical Distributed String Matching Algorithm Architecture and Implementation

Parallel 3D Images Surface Texture Editing

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

High Performance Remote Sensing Image Processing Using CUDA

Security Based Heuristic SAX for XML Parsing

GASPP: A GPU- Accelerated Stateful Packet Processing Framework

IP LOOK-UP WITH TIME OR MEMORY GUARANTEE AND LOW UPDATE TIME 1

New research on Key Technologies of unstructured data cloud storage

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

Transcription:

Available online at www.sciencedirect.com Procedia Engineering 29 (2012) 4249 4253 2012 International Workshop on Information and Electronics Engineering (IWIEE) An Efficient AC Algorithm with GPU Liang Hu, Zhen Wei, Feng Wang, Xiaolu Zhang, Kuo Zhao* Department of Computer Science and Technology, Jilin University, Changchun 130012, P.R.China Abstract Pattern matching algorithm is the basis of information biology as well as information retrieval research field. As far as the current computing capability of CPU is concerned, it may hardly satisfy the demand in terms of system realtime when faced by a large amount of data. Due to the enormous computing potential of GPU, various efforts have been made for the computing task migration to GPU with given application programs. In this paper, we present an AC Multi-pattern matching algorithm based on GPU with the parallelization of traditional algorithm based on CPU, and the matching efficiency may be significantly improved due to the high performance parallel processing capability of GPU. Experimental results show our proposed scheme achieved a better speed-up compared to the original one. 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Harbin University of Science and Technology Open access under CC BY-NC-ND license. Keywords: Pattern matching; CPU; AC 1. Introduction Pattern matching algorithm has been an active research field of computer science, it plays a key role in lots of applications, and it is a highly computing intensive process. While traditional pattern matching algorithms based on CPU have inherent limits, specific parallel processing hardware has been applied to the improvement of the performance of traditional pattern matching algorithms. In recent years, with the rapid development of Graphic Processing Unit (GPU), there are more and more research efforts for GPU including the research on pattern matching algorithm. Jacob and Brodley tried to employ the GPU as the analysis engine applied to single pattern matching for PixelSnort, and the pattern matching algorithm is Knuth-Morris-Pratt (KMP). However, the result isn't satisfied. Giorgos Vasiliadis and Spiros Antonatos[1] proposed a new intrusion detection system called Gnort, and the * Corresponding author. Tel.: +86-431-85168716; fax: +86-431-85166494. E-mail address: zhaokuo@jlu.edu.cn. 1877-7058 2011 Published by Elsevier Ltd. doi:10.1016/j.proeng.2012.01.652 Open access under CC BY-NC-ND license.

4250 Liang Hu et al. / Procedia Engineering 29 (2012) 4249 4253 detection engine is Aho-Corasick[2][3][4] multi-pattern matching algorithm based on GPU. However, with the increase of the pattern set, system performance decreases sharply. Giorgos Vasiliadis directly store the lookup table (DFA[1]) in the form of the two-dimensional array. With the increase of the pattern set, there are too much cache misses, and it affects matching efficiency. So we must optimize the lookup table when the pattern set's scale is increased. In this paper, we presents a GPU-CSAC algorithm (Compress Storage AC algorithm Based On GPU) in order to meet the needs of large-scale pattern set. 2. Algorithm implementation 2.1. AC pattern matching algorithm AC algorithm is proposed by the A.V.Aho and M.J.Corasick, which is a classic multi-pattern matching algorithm and firstly used in the Bell Labs library searching system and later has been widely used in other filed. AC algorithm is mainly composed of three functions: goto function, failure function and output function. AC pattern matching algorithm time complexity is O(n), have nothing to do with the number of pattern strings and the length of each pattern string. No matter pattern P appears in the text string T, each character in T must input into the state machine. So whether it is in the best case or worstcase, AC pattern matching algorithm time complexity is always O(n). Including pretreatment, AC time complexity is O (M + n), M is the length of string for all pattern. 2.2. Preprocess The main work in preprocessing stage is to read the pattern set and generated a state tree. we can assign the successor node adopting the principle of distribution according to need instead of assigning 256 successor node for each state according to the fan-out number which we can get from the state tree. This will greatly reduce the look-up table of storage. 2.3. Construction of the lookup table Firstly, we may construct the original look-up table, in which horizontal axis indicates that the current state, vertical axis indicates that input characters, the table element indicates the next state, such as State 0 input s turn to state 3. When the pattern is large, goto table need too much storage space, available memory can not be fully qualified, so it must be compressed storage. The value of property element 256 means not exist, because the number of fan-out (fan-out number indicates the input number) is 0, the state can not have any input characters, Similarly,-1 means not exist too. The index of 0-2 have 0 fan-out, 3-7 have one fan-out, 8-11 have 2 fan-out. Secondly, construct failure table: This step of was the same as the original algorithm implementation, so we will not repeat the construction method here in detail, please refer to reference [4]. Finally we may get the lookup table: 2.4. Memory's selection Lookup table eventually need to be passed to the GPU memory,gpu computing environment can operate more than one memory type, we use the NVIDIA GPU's [5] computing environment CUDA [6], Therefore, this paper only discuss the definition of the memory type under CUDA environment. CUDA has six kinds of memory types: register memory, local memory, share memory, constant memory, texture memory, global memory. Register memory and local memory is private for each thread, while the lookup

Liang Hu et al. / Procedia Engineering 29 (2012) 4249 4253 4251 table need to be shared by all threads, so can not select the register memory and local memory; empathy share memory is shared by threads within the block, is not appropriate too. Although the constant memory shared for all threads to meet, but it is limited in size, the largest 64K, When the mode is heavy, the lookup table is far greater than 64K, so it can not choose the constant memory; Both texture memory and global memory are so great that can store lookup table, what s more, texture memory is used for graphics rendering, using CUDA we can access it by texture binding and texture fetch, and it is cached, when hit it only take one clock cycle. Global memory does not have the cache mechanism, access to global memory require 600-800 clock cycles. Taken together, select the texture memory is undoubtedly the most sensible choice. 2.5. Description for the GPU-CSAC algorithm Algorithm input: Pattern set(m1, M2 Mn) TEXT(A1, A2 An) Algorithm description: CPU reads the pattern Mi(0<i<n+1)one by one, generate a state tree; Construct State Tree According to state tree; Allocate a space(texture memory) and transfer the lookup table to the space; read the TEXT and transfer it to memory; Set the grid and block dimensions and Launch the Kernel function, GPU start to matching. CPU processing the returned results. 3. Experimental evaluation This paper evaluate algorithm from time complexity and space complexity. Time complexity: the matching time is an important indicator to measure pattern matching performance algorithm. Space complexity: analyze the size of the storage space. AC algorithm based on GPU proposed by Giorgos Vasiliadis not for storage optimization, so its storage n SIZE approximate formula is(assuming only a handful of models with common prefix): SIZE 3 (256 Pi ) (i is pattern set's number) Through this formula we can calculate the approximate share i= 0of occupied space DFA. This compressed storage after the approximate value of: SIZE 3 ( O ) n P i i= 0 j= 0 Two GPUs (GTS 8800 and Tesla C2050) are applied in the experimental environment, and the CPU is Inter Xeon(R) E5430. The source of data sets consist of 64MB, 128MB data through the packet capture tool as TEXT to be matched, and 1KB, 10KB, 100KB, 1MB, 10MB, 20MB pattern sets are selected. Fig. 1 shows CPU-AC,GPU-NAC(Normal AC Based On GPU), the increasing of the number of GPU- CSAC with the pattern, the relative changes in matching time. we learn from Fig.1. that the matching time of GPU-NAC and GPU-CSAC is far less than the CPU-AC's matching time. Worthy of our concern is that when the pattern set is less than 100KB the time that GPU-NAC takes is shorter than GPU-CSAC. This is because the capacity of the lookup table is relatively small, Directly search the lookup table that in the form of two-dimensional matrix is more efficient than what has been compressed. With the pattern set's increasing, when the number reaches 1MB, the size of the two-dimensional generated by GPU-NAC matrix has been considerable, the lookup table's cache misses resulting in matching efficiency's decrease. j

4252 Liang Hu et al. / Procedia Engineering 29 (2012) 4249 4253 However, when the pattern set increased to 10MB or more, two-dimensional matrix will become very large, much larger than we have optimized the size of lookup table, thereby causing a sharp decline in matching efficiency. The following experiments we will see this phenomenon more apparent. Fig. 1. Comparison in time complexity (a) Matched 64MB data Using CPU and GTS8800 (b) Matched 128MB data Using CPU and GTS8800 In order to prove a more powerful GPU-CSAC performance, then use the Tesla C2050 which is professional graphics chip, here we only compare the two GPU AC algorithm, use 64MB data set size to match the experimental data, such as Fig.2. Employing the Tesla C2050 significantly reduced the time spent, the conclusion is consistent with the previous, when the pattern set to a certain extent, GPU-CSAC advantage may show up. Fig. 2. Comparison in time complexity (a) Matched 64MB data Using CPU and Tesla C2050 (b) Matched 128MB data Using CPU and Tesla C2050 To evaluate the space complexity of proposed method, we employ look-up table formula storage space to find the GPU-NAC and GPU-CSAC for 1KB, 10KB, 100KB, 1MB, 10MB, 20MB pattern set lookup table's share of space size in Fig.3. Fig.3 reflects that the space that GPU-NAC's lookup table occupied is much higher than GPU-CSAC. Through this compression method, we reduce the memory footprint success.

Liang Hu et al. / Procedia Engineering 29 (2012) 4249 4253 4253 4. Conclusions This paper optimized the AC algorithm based on the Giorgos Vasiliadis's algorithm, the method is to compress stored lookup table and to allocate space for each state based on the Fan-out. Then we analyzed CPU-AC, GPU-NAC, GPU-SCAC and draw the final conclusion: GPU-SCAC algorithm may still achieve efficient performance in case of large-scale pattern set.with the development of intrusion technology and anti-intrusion technology, Fig. 3. Comparison in space complexity more and more attack signatures may be added to the intrusion detection rule set. With the increase of attack signature rule set, the GPU-NAC algorithm presented by Giorgos Vasiliadis can not fulfill the real-time processing requirements; in this case, our proposed GPU-SCAC may be applied to improving the performance. Acknowledgments This work was supported in part by the National Grand Fundamental Research 973 Program of China under Grant No. 2009CB320706, the National High Technology Research and Development Program of China under Grant No. 2011AA010101, the National Natural Science Foundation of China under Grant No. 61103197 and 61073009, Program of New Century Excellent Talents in University of Ministry of Education of China under Grant No. NCET-06-0300, the Youth Foundation of Jilin Province of China under Grant No. 201101035, and the Fundamental Research Funds for the Central Universities of China under Grant NO.200903179. References and Notes [1] Giorgos Vasiliadis, Spiros Antonatos, Michalis Polychronakis, Evangelos P.Markatos, and Sotiris Ioannidis(2007)Gnort: High Performance Network Intrusion Detection Using Graphics Processors.Institute of Computer Science, Foundation for Research and Technology Hellas,N. Plastira 100, Vassilika Vouton, GR-700 13 Heraklion, Crete, Greece. [2] A.V.Aho and M.J.Corasick(1975)Efficient string matching:an aid to bibliographic search.communications of the ACM,18(6):333~340. [3] Xinyan Zha, Sahni, S(2008)Highly compressed Aho-Corasick automata for efficient intrusion detection.computers and Communications. ISCC 2008. IEEE Symposium on.298-303. [4] Dimopoulos, V., Papaefstathiou, I., Pnevmatikatos, A(2007) Memory-Efficient Reconfigurable Aho-Corasick FSM Implemen-tation for Intrusion Detection Systems. Embedded Computer Systems : Architectures, Modeling and Simulation. IC- SAMOS 2007. 186-193. Samos. [5] Nvidia(2008), NVIDIA CUDA Compute Unified Device Architecture Programming Guide [OL], http://developer.download.nvidia.com/com-pute/cuda. [6] NVIDIA(2010)CUDA C Programming Guide,Version 3.2.