Accelerating String Matching Using Multi-threaded Algorithm


1 Accelerating String Matching Using Multi-threaded Algorithm on GPU
Cheng-Hung Lin*, Sheng-Yu Tsai**, Chen-Hsiung Liu**, Shih-Chieh Chang**, Jyuo-Min Shyu**
*National Taiwan Normal University, Taiwan
**National Tsing Hua University, Taiwan

2 Introduction
Network intrusion detection systems (NIDS) are widely used to detect network attacks.
The pattern matching engine dominates the performance of an NIDS.
Traditional pattern matching approaches on a uniprocessor are too slow for today's networks.
Hardware approaches for accelerating pattern matching:
Logic-based
Memory-based
Multiprocessor-based

3 GPU for Pattern Matching
Parallel computation on a GPU is well suited to accelerating pattern matching.
Input: AAAAAAAAAAAAAAAAAAAAAAAB
1 thread: 24 cycles
4 segments, 4 threads (Thread #1 to Thread #4): 6 cycles

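4 Boundary Problem
A pattern that straddles the boundary between adjacent segments cannot be detected, which produces false negative results.
Input: AAAAAAAAAAAAAAAAAABBBBBB, split into four segments for Thread #1 to Thread #4
False negative: the occurrence that crosses the boundary between the segments of Thread #3 and Thread #4 is missed.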
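The false negative can be reproduced with a minimal Python sketch. The function name `find_in_segments` is illustrative and not from the slides. Each loop iteration stands in for one thread scanning only its own segment, and the sketch assumes the input length is a multiple of the thread count.

```python
def find_in_segments(text, pattern, num_threads):
    """Scan equal-sized segments independently, one per simulated thread."""
    seg_len = len(text) // num_threads  # assumption: length divides evenly
    hits = []
    for t in range(num_threads):
        start = t * seg_len
        segment = text[start:start + seg_len]  # the thread sees only this slice
        pos = segment.find(pattern)
        while pos != -1:
            hits.append(start + pos)
            pos = segment.find(pattern, pos + 1)
    return hits


text = "A" * 18 + "B" * 6                   # the 24-character input from the slide
print(find_in_segments(text, "AB", 1))      # [17]: one thread sees the whole input
print(find_in_segments(text, "AB", 4))      # []: the match straddles two segments
```

With four segments of six characters, the only occurrence of "AB" starts in the third segment and ends in the fourth, so no simulated thread reports it.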

5 Overlapped Computation
To resolve the boundary problem, each thread scans across the boundary into the next segment.
Problem: overlapped computation adds overhead and reduces throughput.
Input: AAAAAAAAAAAAAAAAAABBBBBB, Thread #1 to Thread #4
With overlapped scanning, Thread #3 can identify "AB".
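A minimal Python sketch of overlapped scanning, under the same assumptions as the slide example (24-character input, four threads). The name `find_with_overlap` and the returned character count are illustrative additions; the count is there only to make the extra work visible.

```python
def find_with_overlap(text, pattern, num_threads, overlap):
    """Each simulated thread scans its segment plus `overlap` extra characters."""
    seg_len = len(text) // num_threads  # assumption: length divides evenly
    hits = []
    scanned = 0
    for t in range(num_threads):
        start = t * seg_len
        end = min(start + seg_len + overlap, len(text))
        window = text[start:end]
        scanned += len(window)
        pos = window.find(pattern)
        # Report only matches that start inside this thread's own segment,
        # so a match in the overlap region is not reported twice.
        while pos != -1 and pos < seg_len:
            hits.append(start + pos)
            pos = window.find(pattern, pos + 1)
    return hits, scanned


text = "A" * 18 + "B" * 6
hits, scanned = find_with_overlap(text, "AB", 4, overlap=1)
print(hits)     # [17]: found by the third simulated thread
print(scanned)  # 27 characters read for a 24-character input
```

The three extra characters read here are the overhead of overlapped computation; the overhead grows with the overlap length and with the number of segments.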

6 Aho-Corasick Algorithm
The Aho-Corasick (AC) algorithm is widely used for pattern matching because it matches multiple patterns in a single pass.
It compiles multiple patterns into a composite state machine.
Patterns: (1) AB (2) ABG (3) BEDE (4) EF
[Figure: AC state machine with states 0 to 9; state 0 loops on [^ABE]]

7 Aho-Corasick Algorithm (cont.)
The Aho-Corasick (AC) state machine is composed of two kinds of transitions:
Solid lines represent valid transitions.
Dotted lines represent failure transitions.
A failure transition backtracks the state machine so that it can recognize patterns that start at different locations.
Input string: A B E D E
[Figure: AC state machine with states 0 to 9; patterns are recognized starting at location 1 and at location 2 of the input]
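The construction can be sketched in Python as a textbook Aho-Corasick build (a trie plus breadth-first failure links). This is not the authors' code. The sketch assumes the pattern set AB, ABG, BEDE, EF, which is consistent with the ten states numbered 0 to 9 in the figure; the function names `build_ac` and `ac_search` are illustrative.

```python
from collections import deque


def build_ac(patterns):
    """Build goto (valid) transitions, failure transitions, and output sets."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    queue = deque(goto[0].values())  # depth-1 states fail back to state 0
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit shorter suffix matches
            queue.append(nxt)
    return goto, fail, out


def ac_search(text, machine):
    """Single pass over text; returns (start index, pattern) pairs."""
    goto, fail, out = machine
    s, matches = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]  # failure transition: retry from a shorter suffix
        s = goto[s].get(ch, 0)
        for p in out[s]:
            matches.append((i - len(p) + 1, p))
    return matches


machine = build_ac(["AB", "ABG", "BEDE", "EF"])
print(ac_search("ABEDE", machine))  # [(0, 'AB'), (1, 'BEDE')]
```

On the input ABEDE, the machine reaches state 2 after "AB", follows a failure transition to the state for "B", and continues to the final state of BEDE, so one pass reports matches that start at two different locations.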

8 Problems of AC on GPU
Direct implementation of AC on a GPU:
To resolve the boundary problem, each thread has a lower bound on its scanning length.
Lower bound = segment length + overlapped length
Overlapped length = length of the longest pattern - 1
This causes the overhead of overlapped computation.
Input: A A A A A A A A A A A A A A A A A A B B B B B B
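The two formulas can be made concrete with a short Python sketch. The helper names `scan_range` and `overlap_overhead` are hypothetical, and the pattern set AB, ABG, BEDE, EF and the six-character segments are assumptions taken from the running example.

```python
def overlapped_length(patterns):
    """Overlapped length = length of the longest pattern - 1."""
    return max(len(p) for p in patterns) - 1


def scan_range(thread_id, segment_len, patterns, text_len):
    """Half-open range [start, end) that one thread must scan."""
    start = thread_id * segment_len
    end = min(start + segment_len + overlapped_length(patterns), text_len)
    return start, end


def overlap_overhead(segment_len, patterns):
    """Extra characters scanned per segment, as a fraction of the segment."""
    return overlapped_length(patterns) / segment_len


patterns = ["AB", "ABG", "BEDE", "EF"]    # assumed pattern set
print(overlapped_length(patterns))        # 3
print(scan_range(2, 6, patterns, 24))     # (12, 21)
print(overlap_overhead(6, patterns))      # 0.5
```

With six-character segments and a longest pattern of four characters, every thread except the last scans nine characters instead of six. The overhead shrinks as segments grow, but larger segments mean fewer threads.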

9 Problems of AC on GPU (cont.)

10 Failureless-AC State Machine
AC state machine:
A failure transition backtracks the state machine so that it can recognize patterns that start at different locations.
Input string: A B E D E (patterns recognized starting at location 1 and location 2)
Failureless-AC state machine:
All failure transitions are removed.
Traversal terminates when there is no valid transition.
Only patterns that start at location 1 are recognized.
Input string: A B E D E (the machine stops once no valid transition remains)
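A minimal Python sketch of a Failureless-AC machine: a plain trie of valid transitions with no failure links. The names `build_failureless` and `match_from` are illustrative, and the pattern set AB, ABG, BEDE, EF is an assumption consistent with the ten states in the figure.

```python
def build_failureless(patterns):
    """Trie of valid transitions only; final[s] holds the pattern ending at s."""
    goto, final = [{}], [None]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                final.append(None)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        final[s] = p
    return goto, final


def match_from(text, start, machine):
    """Recognize only patterns that begin exactly at index `start`."""
    goto, final = machine
    s, found = 0, []
    for i in range(start, len(text)):
        s = goto[s].get(text[i])
        if s is None:          # no valid transition: terminate immediately
            break
        if final[s] is not None:
            found.append(final[s])
    return found


machine = build_failureless(["AB", "ABG", "BEDE", "EF"])
print(match_from("ABEDE", 0, machine))  # ['AB'], then stops at 'E'
print(match_from("ABEDE", 1, machine))  # ['BEDE']
```

Starting at index 0 the traversal finds AB and stops, because the state reached after "AB" has no transition on E. The match at the second location is found only by a separate traversal that starts there.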

11 Parallel Failureless-AC Algorithm
Parallel Failureless-AC (PFAC) algorithm: allocate one thread to each byte of the input, and let each thread traverse the Failureless-AC state machine starting at its own byte.
Input: X X X X X X X X X A B E D E X X X X X X X X X X

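12 Mechanism of PFAC
Input: X X X X A B E D E X X X
Thread #n and Thread #n+1 start at adjacent bytes, and each traverses the Failureless-AC state machine independently.
[Figure: the Failureless-AC state machine as traversed by Thread #n and by Thread #n+1]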
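The mechanism can be simulated sequentially in Python. This is a sketch, not the GPU implementation: each iteration of the outer loop plays the role of one thread, and on a GPU the iterations would run concurrently. The function name `pfac` and the pattern set AB, ABG, BEDE, EF are assumptions.

```python
def pfac(text, patterns):
    """Simulate PFAC: one independent traversal per input position."""
    # Failureless trie: valid transitions only, no failure links.
    goto, final = [{}], [None]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                final.append(None)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        final[s] = p

    matches = []
    for start in range(len(text)):      # one simulated thread per byte
        s = 0
        for i in range(start, len(text)):
            s = goto[s].get(text[i])
            if s is None:               # thread terminates: no valid transition
                break
            if final[s] is not None:
                matches.append((start, final[s]))
    return matches


print(pfac("XXXXABEDEXXX", ["AB", "ABG", "BEDE", "EF"]))
# [(4, 'AB'), (5, 'BEDE')]
```

The simulated thread that starts at A reports AB and the one that starts at the following B reports BEDE. No thread depends on a segment boundary, so no overlap region has to be defined.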

13 Reducing Overlapped Computation
Direct implementation of the AC algorithm:
Each thread has a lower bound on its scanning length.
Overlapped computation (overlapped length = 3).
PFAC algorithm:
No boundary problem.
Each thread has a variable scanning length.
Most threads terminate early, which reduces overlapped computation.
[Figure: example input C C C C C C C C B C C C C C C]
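Early termination can be illustrated by counting how many input characters each simulated PFAC thread reads before it stops. The function name `chars_read_per_thread` is hypothetical, the pattern set AB, ABG, BEDE, EF is an assumption, and the input is the example string from the figure.

```python
def chars_read_per_thread(text, patterns):
    """Characters examined by each per-byte traversal before it terminates."""
    # Failureless trie: valid transitions only.
    goto = [{}]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]

    reads = []
    for start in range(len(text)):
        s, n = 0, 0
        for ch in text[start:]:
            n += 1                  # count the character that ends the traversal too
            s = goto[s].get(ch)
            if s is None:
                break
        reads.append(n)
    return reads


text = "CCCCCCCCBCCCCCC"
reads = chars_read_per_thread(text, ["AB", "ABG", "BEDE", "EF"])
print(reads)       # every thread reads 1 character, except the one at B (2)
print(sum(reads))  # 16 characters in total for a 15-character input
```

Under these assumptions, threads that start at a byte matching no pattern prefix stop after a single character, so the total work stays close to the input length even though there is one thread per byte.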

14 Experimental Environments
CPU: Intel Core i7, 4 cores, 12 GB DDR3 memory
GPU: NVIDIA GeForce GTX 480, 1.4 GHz, 480 cores, 1536 MB GDDR5 memory
Patterns: string patterns of Snort V2.4; 1,998 rules containing 41,997 characters; 27,754 states in total
Input: normal-case and worst-case DEFCON packets

15 Experimental Results
Table 1: Throughput of normal case inputs
Columns: input size; AC_CPU, 1 thread (Gbps); AC_OMP, 8 threads (Gbps); AC_Pthread, 8 threads (Gbps); PFAC, multi-threads (Gbps); speedup to fastest
Rows: 2 MB, 4 MB, 8 MB, 16 MB, 32 MB, 64 MB, 128 MB, 192 MB

16 Comparisons of Normal Case
[Figure: throughput of AC_CPU (1 thread), AC_OMP (8 threads), AC_Pthread (8 threads), and PFAC (multi-threads) for input sizes of 2 MB, 4 MB, 8 MB, 16 MB, 32 MB, 64 MB, 128 MB, and 192 MB]

17 Experimental Results
Table 2: Throughput of worst case inputs
Columns: input size; AC_CPU, 1 thread (Gbps); AC_OMP, 8 threads (Gbps); AC_Pthread, 8 threads (Gbps); PFAC, multi-threads (Gbps); speedup to fastest
Rows: 2 MB, 4 MB, 8 MB, 16 MB, 32 MB, 64 MB, 128 MB, 192 MB

18 Comparisons of Worst Case
[Figure: throughput of AC_CPU (1 thread), AC_OMP (8 threads), AC_Pthread (8 threads), and PFAC (multi-threads) for input sizes of 2 MB, 4 MB, 8 MB, 16 MB, 32 MB, 64 MB, 128 MB, and 192 MB]

19 Comparisons
Columns: approach; character number of rule set; memory (KB); throughput (Gbps); memory efficiency; notes
Approaches compared (GPU used, from the notes column):
PFAC (NVIDIA GTX 480)
Huang et al. [10], Modified WM (NVIDIA 7600 GT)
Schatz et al. [11], Suffix Tree (NVIDIA GTX 8800)
Vasiliadis et al. [12], DFA (NVIDIA 9800 GX2)
Smith et al. [13], XFA (NVIDIA 8800 GTX)
Memory efficiency = (Throughput x number of characters) / Memory

20 Conclusions
We have proposed a novel parallel string matching algorithm that is well suited to GPUs and is free from the boundary detection problem.
The proposed algorithm creates a new state machine that has lower complexity and memory usage than the traditional Aho-Corasick state machine.
The new algorithm achieves a significant speedup over the traditional Aho-Corasick algorithm accelerated by OpenMP on a CPU.
Compared with other GPU approaches, the new algorithm is 11.6 times faster than the state-of-the-art approach.
