Accelerating String Matching Using Multi-threaded Algorithm on GPU
Cheng-Hung Lin*, Sheng-Yu Tsai**, Chen-Hsiung Liu**, Shih-Chieh Chang**, Jyuo-Min Shyu**
*National Taiwan Normal University, Taiwan
**National Tsing Hua University, Taiwan

Introduction
- Network intrusion detection systems (NIDS) are widely used to detect network attacks.
- The pattern matching engine dominates the performance of an NIDS.
- Traditional pattern matching approaches on a uniprocessor are too slow for today's networks.
- Hardware approaches for accelerating pattern matching:
  - Logic-based
  - Memory-based
  - Multiprocessor-based

GPU for Pattern Matching
- Parallel computation on a GPU is suitable for accelerating pattern matching.
- Example input: AAAAAAAAAAAAAAAAAAAAAAAB
  - 1 thread: 24 cycles
  - 4 segments, 4 threads (Thread #1 to Thread #4): 6 cycles

Boundary Problem
- A pattern occurring across the boundary of adjacent segments cannot be detected.
- The result is a false negative.
[Figure: input AAAAAAAAAAAAAAAAAABBBBBB divided among Thread #1 to Thread #4, with a false negative marked at a segment boundary]

Overlapped Computation
- To resolve the boundary problem, each thread scans across the boundary of its segment.
- Problems:
  - Overhead of overlapped computation
  - Throughput reduction
[Figure: input AAAAAAAAAAAAAAAAAABBBBBB divided among Thread #1 to Thread #4; by scanning past the end of its segment, Thread #3 can identify "AB"]

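The boundary problem and its overlapped-scanning fix can be illustrated with a small CPU-side simulation. This is only a sketch: the function name `scan_segments`, the use of Python's `str.find` in place of a real matcher, and the exact input (18 "A" bytes followed by 6 "B" bytes, modeled on the slide's example) are assumptions made for illustration, and each loop iteration stands in for one GPU thread.

```python
def scan_segments(text, pattern, num_threads, overlap):
    """Simulate num_threads workers; each scans its segment plus `overlap` extra bytes."""
    seg_len = len(text) // num_threads  # assumes len(text) is divisible by num_threads
    hits = set()
    for t in range(num_threads):        # one iteration stands in for one thread
        start = t * seg_len
        end = min(start + seg_len + overlap, len(text))
        window = text[start:end]
        pos = window.find(pattern)
        while pos != -1:
            hits.add(start + pos)       # record the match position in the whole input
            pos = window.find(pattern, pos + 1)
    return sorted(hits)

text = "A" * 18 + "B" * 6               # hypothetical 24-byte input, 4 segments of 6 bytes
no_overlap = scan_segments(text, "AB", 4, 0)
with_overlap = scan_segments(text, "AB", 4, len("AB") - 1)
print(no_overlap)    # the match spanning segments 3 and 4 is missed
print(with_overlap)  # the third worker reads one extra byte and finds it
```

Without overlap the only occurrence of "AB" straddles two segments and is lost; with an overlap of one byte the third worker reports it, at the cost of every worker reading extra input.
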
Aho-Corasick Algorithm
- The Aho-Corasick (AC) algorithm is widely used for pattern matching because it matches multiple patterns in a single pass.
- It compiles multiple patterns into a composite state machine.
- Patterns: (1) AB (2) ABG (3) BEDE (4) EF
[Figure: AC state machine for the four patterns, with states 0 to 9 and a self-loop [^ABE] on state 0]

Aho-Corasick Algorithm (cont.)
- The Aho-Corasick (AC) state machine is composed of two kinds of transitions:
  - Solid lines represent valid transitions.
  - Dotted lines represent failure transitions.
- A failure transition backtracks the state machine to recognize patterns that start at different locations.
[Figure: AC state machine traversed with the input string A B E D E, recognizing patterns that start at location 1 and location 2]

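A minimal sketch of an AC state machine with failure transitions is given here. It assumes the patterns AB, ABG, BEDE and EF, a reading of the slide's example that is consistent with its ten-state machine (states 0 to 9); the function names `build_ac` and `ac_search` and the dictionary-based transition table are illustrative choices, not the authors' implementation.

```python
from collections import deque

def build_ac(patterns):
    """Build goto (valid) transitions, failure transitions, and per-state outputs."""
    goto, out = [{}], [set()]
    for p in patterns:                      # phase 1: insert each pattern into a trie
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())         # depth-1 states fail back to state 0
    while queue:                            # phase 2: breadth-first failure links
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            queue.append(nxt)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]      # inherit patterns ending at the fallback state
    return goto, fail, out

def ac_search(text, goto, fail, out):
    """Single pass over text; returns sorted (0-based start position, pattern) pairs."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]                     # follow failure transitions
        s = goto[s].get(ch, 0)
        hits.extend((i - len(p) + 1, p) for p in out[s])
    return sorted(hits)

goto, fail, out = build_ac(["AB", "ABG", "BEDE", "EF"])
matches = ac_search("ABEDE", goto, fail, out)
print(len(goto), matches)
```

On the input ABEDE the search reports AB at offset 0 and BEDE at offset 1, which correspond to the slide's location 1 and location 2 when counted from one. The second match is reached only because the failure transition from the AB state falls back to the B state.
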
Problems of AC on GPU
- Direct implementation of AC on GPU
- To resolve the boundary problem, each thread has a lower bound on its scanning length:
  - Constraint = segment length + overlapped length
  - Overlapped length = length of the longest pattern - 1
- Overhead of overlapped computation
[Figure: input AAAAAAAAAAAAAAAAAABBBBBB with the overlapped scanning region of each thread]

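The scanning-length constraint can be worked through numerically. The sketch below assumes a segment length of 6 bytes (the 24-byte, 4-thread example from the earlier slide) and the patterns AB, ABG, BEDE and EF; both function names are hypothetical.

```python
def overlapped_length(patterns):
    """Overlapped length = length of the longest pattern - 1."""
    return max(len(p) for p in patterns) - 1

def min_scan_length(segment_length, patterns):
    """Constraint = segment length + overlapped length."""
    return segment_length + overlapped_length(patterns)

patterns = ["AB", "ABG", "BEDE", "EF"]   # assumed example patterns
segment_length = 6                       # assumed: 24 bytes split over 4 threads

extra = overlapped_length(patterns)
total = min_scan_length(segment_length, patterns)
overhead = extra / segment_length        # fraction of extra bytes read per thread
print(extra, total, overhead)
```

Under these assumptions every thread must read 9 bytes to cover a 6-byte segment, so the overlapped computation adds 50% to the work of each thread. The overhead grows with the longest pattern and shrinks as segments get longer.
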
Failureless-AC State Machine
- AC state machine
  - A failure transition backtracks the state machine to recognize patterns that start at different locations.
  [Figure: AC state machine traversed with the input string A B E D E, recognizing patterns that start at location 1 and location 2]
- Failureless-AC state machine
  - All failure transitions are removed.
  - Traversal terminates when there is no valid transition.
  - Only patterns that start at location 1 are recognized.
  [Figure: Failureless-AC state machine traversed with the input string A B E D E; traversal stops after the pattern at location 1 is recognized]

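A Failureless-AC traversal can be sketched as a plain trie walk that stops at the first missing transition. The patterns AB, ABG, BEDE and EF are assumed as the example set, and `build_failureless` and `match_from` are hypothetical names chosen for this illustration.

```python
def build_failureless(patterns):
    """Trie of valid transitions only; final[s] holds the pattern ending at state s."""
    goto, final = [{}], [None]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                final.append(None)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        final[s] = p
    return goto, final

def match_from(text, start, goto, final):
    """Recognize only the patterns that begin exactly at position `start`."""
    s, hits = 0, []
    for i in range(start, len(text)):
        s = goto[s].get(text[i])
        if s is None:            # no valid transition: terminate, never backtrack
            break
        if final[s] is not None:
            hits.append(final[s])
    return hits

goto, final = build_failureless(["AB", "ABG", "BEDE", "EF"])
at_first = match_from("ABEDE", 0, goto, final)   # traversal from the first byte
at_second = match_from("ABEDE", 1, goto, final)  # traversal from the second byte
print(at_first, at_second)
```

Starting at the first byte, the walk recognizes AB and then stops at E because no failure transition exists; BEDE is found only by a separate walk that starts at the second byte. That separate walk is what the parallel scheme delegates to another thread.
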
Parallel Failureless-AC Algorithm
- Parallel Failureless-AC (PFAC) algorithm
  - Allocate one thread to each byte of the input to traverse the Failureless-AC state machine.
[Figure: input stream X X X X X X X X X A B E D E X X X X X X X X X X with one thread assigned to each byte]

Mechanism of PFAC
[Figure: Thread #n and Thread #n+1 start at adjacent bytes of the input X X X X A B E D E X X X, and each traverses the Failureless-AC state machine (states 0 to 9) from state 0]

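The PFAC mechanism can be simulated sequentially, with one loop iteration standing in for each GPU thread. This sketch assumes the patterns AB, ABG, BEDE and EF and a 12-byte input modeled on the slide; the names `build_trie`, `pfac_thread` and `pfac` are hypothetical, and a real implementation would run the per-byte traversals as CUDA threads rather than in a Python loop.

```python
def build_trie(patterns):
    """Failureless-AC state machine: valid transitions only."""
    goto, final = [{}], [None]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                final.append(None)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        final[s] = p
    return goto, final

def pfac_thread(text, start, goto, final):
    """Work of the thread assigned to byte `start`: returns (matches, bytes read)."""
    s, hits, read = 0, [], 0
    for i in range(start, len(text)):
        read += 1
        s = goto[s].get(text[i])
        if s is None:                 # no valid transition: this thread terminates
            break
        if final[s] is not None:
            hits.append((start, final[s]))
    return hits, read

def pfac(text, patterns):
    goto, final = build_trie(patterns)
    # One iteration per input byte stands in for one GPU thread.
    results = [pfac_thread(text, i, goto, final) for i in range(len(text))]
    hits = [h for thread_hits, _ in results for h in thread_hits]
    reads = [n for _, n in results]
    return hits, reads

hits, reads = pfac("XXXXABEDEXXX", ["AB", "ABG", "BEDE", "EF"])
print(hits)
print(reads)
```

Each match is reported by the thread that starts at its first byte, so no pattern can be lost at a segment boundary. The `reads` list shows that most threads stop after a single byte, and only the threads that start inside a pattern prefix read further.
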
Reducing Overlapped Computation
- Direct implementation of the AC algorithm
  - Each thread has a lower bound on its scanning length.
  - Overlapped computation (overlapped length = 3)
- PFAC algorithm
  - No boundary problem
  - Each thread has a variable scanning length.
  - Most threads terminate early.
  - Overlapped computation is reduced.
[Figure: example input stream with per-thread labels 1, 1 and 3]

Experimental Environments
- CPU: Intel Core i7 CPU 950 @ 3.07 GHz, 4 cores, 12 GB DDR3 memory
- GPU: NVIDIA GeForce GTX 480 @ 1.4 GHz, 480 cores, 1536 MB GDDR5 memory
- Patterns: string patterns of Snort V2.4
  - 1,998 rules containing 41,997 characters
  - 27,754 states in total
- Input: normal-case and worst-case DEFCON packets

Experimental Results
Table 1: Throughput of normal case inputs

Input size   AC_CPU      AC_OMP      AC_Pthread   PFAC            Speedup
             1 thread    8 threads   8 threads    multi-threads   to fastest
             (Gbps)      (Gbps)      (Gbps)       (Gbps)
2 MB         0.73        2.84        3.98         69.73           17.53
4 MB         0.72        2.98        4.06         80.36           19.78
8 MB         0.72        3.04        4.24         87.66           20.66
16 MB        0.72        3.06        3.26         90.13           27.63
32 MB        0.72        3.04        4.32         88.20           20.43
64 MB        0.73        3.11        4.39         94.89           21.63
128 MB       0.82        3.36        4.71         110.63          23.51
192 MB       0.86        3.43        4.69         117.04          24.94

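The slide does not define the "Speedup to fastest" column, but its values are consistent with dividing the PFAC throughput by the fastest CPU throughput in the same row. The sketch below checks that reading on three rows of Table 1; the function name `speedup_to_fastest` and the row layout are assumptions, and small differences from the printed values are expected because the table entries are rounded.

```python
def speedup_to_fastest(cpu_throughputs, pfac_throughput):
    """PFAC throughput divided by the best CPU throughput (all in Gbps)."""
    return pfac_throughput / max(cpu_throughputs)

# (input size, [AC_CPU, AC_OMP, AC_Pthread], PFAC, speedup printed in Table 1)
rows = [
    ("2 MB",   [0.73, 2.84, 3.98],  69.73, 17.53),
    ("16 MB",  [0.72, 3.06, 3.26],  90.13, 27.63),
    ("192 MB", [0.86, 3.43, 4.69], 117.04, 24.94),
]

computed = {size: speedup_to_fastest(cpu, gpu) for size, cpu, gpu, _ in rows}
for size, _, _, printed in rows:
    print(size, round(computed[size], 2), printed)
```
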
Comparisons of Normal Case
[Figure: bar chart comparing AC_CPU (1 thread), AC_OMP (8 threads), AC_Pthread (8 threads) and PFAC (multi-threads) for input sizes from 2 MB to 192 MB; vertical axis from 0.00 to 140.00]

Experimental Results
Table 2: Throughput of worst case inputs

Input size   AC_CPU      AC_OMP      AC_Pthread   PFAC            Speedup
             1 thread    8 threads   8 threads    multi-threads   to fastest
             (Gbps)      (Gbps)      (Gbps)       (Gbps)
2 MB         0.45        2.40        2.44         12.09           4.96
4 MB         0.45        2.50        2.57         12.61           4.91
8 MB         0.45        2.58        2.62         12.79           4.88
16 MB        0.45        2.61        2.63         12.98           4.93
32 MB        0.46        2.55        2.61         12.95           4.96
64 MB        0.45        2.62        2.63         13.10           4.97
128 MB       0.46        2.57        2.60         13.12           5.04
192 MB       0.46        2.63        2.64         13.10           4.97

Comparisons of Worst Case
[Figure: bar chart comparing AC_CPU (1 thread), AC_OMP (8 threads), AC_Pthread (8 threads) and PFAC (multi-threads) for input sizes from 2 MB to 192 MB; vertical axis from 0.00 to 14.00]

Comparisons

Approach                            Characters    Memory    Throughput   Memory       Notes
                                    in rule set   (KB)      (Gbps)       efficiency
PFAC                                41997         27754     117.04       177.10       NVIDIA GTX 480
Huang et al. [10], Modified WM      1565          230       2.40         16.33        NVIDIA 7600 GT
Schatz et al. [11], Suffix Tree     200000        14125     2.00         28.32        NVIDIA GTX 8800
Vasiliadis et al. [12], DFA         N.A.          200000    6.40         N.A.         NVIDIA 9800 GX2
Smith et al. [13], XFA              N.A.          3000      10.40        N.A.         NVIDIA 8800 GTX

Memory efficiency = (Throughput x number of characters) / Memory

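The memory efficiency metric can be recomputed directly from the table. The sketch below applies the slide's formula to the three rows that report a character count; the function name `memory_efficiency` and the dictionary layout are illustrative assumptions.

```python
def memory_efficiency(throughput_gbps, num_characters, memory_kb):
    """Memory efficiency = (throughput x number of characters) / memory."""
    return throughput_gbps * num_characters / memory_kb

# approach: (throughput in Gbps, characters in rule set, memory in KB), from the table
approaches = {
    "PFAC":               (117.04, 41997, 27754),
    "Huang et al. [10]":  (2.40, 1565, 230),
    "Schatz et al. [11]": (2.00, 200000, 14125),
}

efficiency = {name: memory_efficiency(*row) for name, row in approaches.items()}
for name, value in efficiency.items():
    print(name, round(value, 2))
```

The rows for Vasiliadis et al. [12] and Smith et al. [13] are left out because their character counts are listed as N.A., so the metric cannot be evaluated for them.
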
Conclusions
- We have proposed a novel parallel string matching algorithm that is well suited to GPUs and is free from the boundary detection problem.
- The proposed algorithm creates a new state machine that is less complex and uses less memory than the traditional Aho-Corasick state machine.
- The new algorithm achieves a significant speedup over the traditional Aho-Corasick algorithm accelerated by OpenMP on the CPU.
- Compared to other GPU approaches, the new algorithm is 11.6 times faster than the state-of-the-art approach.