Accelerating String Matching Using Multi-threaded Algorithm

Similar documents
Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

PFAC Library: GPU-Based String Matching Algorithm

소프트웨어기반고성능침입탐지시스템설계및구현

Gregex: GPU based High Speed Regular Expression Matching Engine

A Capability-Based Hybrid CPU/GPU Pattern Matching Algorithm for Deep Packet Inspection

SWM: Simplified Wu-Manber for GPU-based Deep Packet Inspection

String Matching with Multicore CPUs: Performing Better with the Aho-Corasick Algorithm

GeoImaging Accelerator Pansharpen Test Results. Executive Summary

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

G-NET: Effective GPU Sharing In NFV Systems

Detecting Computer Viruses using GPUs

Facial Recognition Using Neural Networks over GPGPU

Configurable String Matching Hardware for Speeding up Intrusion Detection

Multipattern String Matching On A GPU

Accelerating Image Feature Comparisons using CUDA on Commodity Hardware

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

A MULTI-CHARACTER TRANSITION STRING MATCHING ARCHITECTURE BASED ON AHO-CORASICK ALGORITHM. Chien-Chi Chen and Sheng-De Wang

Reliably Scalable Name Prefix Lookup! Haowei Yuan and Patrick Crowley! Washington University in St. Louis!! ANCS 2015! 5/8/2015!

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Space-Time Tradeoffs in Software-Based Deep Packet Inspection

GrAVity: A Massively Parallel Antivirus Engine

Efficient Packet Pattern Matching for Gigabit Network Intrusion Detection using GPUs

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors

A GPU Implementation of Tiled Belief Propagation on Markov Random Fields. Hassan Eslami Theodoros Kasampalis Maria Kotsifakou

THE problem of string matching is to find all locations

Design of Deterministic Finite Automata using Pattern Matching Strategy

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 1

Efficient Signature Matching with Multiple Alphabet Compression Tables

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

SEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi

PERG-Rx: An FPGA-based Pattern-Matching Engine with Limited Regular Expression Support for Large Pattern Database. Johnny Ho

Fast BVH Construction on GPUs

The case for limited-preemptive scheduling in GPUs for real-time systems

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics

PacketShader: A GPU-Accelerated Software Router

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC

Applications of Berkeley s Dwarfs on Nvidia GPUs

Parallel Processing SIMD, Vector and GPU s cont.

A Case Study in Optimizing GNU Radio s ATSC Flowgraph

Face Detection CUDA Accelerating

A Simulated Annealing algorithm for GPU clusters

Improving Signature Matching using Binary Decision Diagrams

CS/ECE 217. GPU Architecture and Parallel Programming. Lecture 16: GPU within a computing system

The Art of Parallel Processing

CME 213 S PRING Eric Darve

TOKEN-BASED DICTIONARY PATTERN MATCHING FOR TEXT ANALYTICS. Raphael Polig, Kubilay Atasu, Christoph Hagleitner

High-Performance Packet Classification on GPU

Processors, Performance, and Profiling

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Stream Processing with CUDA TM A Case Study Using Gamebryo's Floodgate Technology

GPUs and Emerging Architectures

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture

Interval arithmetic on graphics processing units

WaveView. System Requirement V6. Reference: WST Page 1. WaveView System Requirements V6 WST

Improving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine

Improving Intrusion Detection for IoT Networks. A Snort GPGPU Modification Using OpenCL LINUS JOHANSSON OSKAR OLSSON

Performance potential for simulating spin models on GPU

Network Design Considerations for Grid Computing

Hardware-accelerated regular expression matching with overlap handling on IBM PowerEN processor

Leveraging Hybrid Hardware in New Ways: The GPU Paging Cache

Building NVLink for Developers

The rcuda middleware and applications

Lecture 1: Gentle Introduction to GPUs

Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

CS427 Multicore Architecture and Parallel Computing

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

Deep Packet Inspection of Next Generation Network Devices

Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information

Improving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm

Enhancing Byte-Level Network Intrusion Detection Signatures with Context

JCudaMP: OpenMP/Java on CUDA

Parallel Variable-Length Encoding on GPGPUs

Accelerating image registration on GPUs

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

Parallel LZ77 Decoding with a GPU. Emmanuel Morfiadakis Supervisor: Dr Eric McCreath College of Engineering and Computer Science, ANU

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

General Purpose GPU Computing in Partial Wave Analysis

FPGA Implementation of Token-Based Clam AV Regex Virus Signatures with Early Detection

Investigation of GPU-based Pattern Matching

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

KULLEĠĠ SAN BENEDITTU Boys Secondary, Kirkop

GAMER : a GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics GPU 與自適性網格於天文模擬之應用與效能

Accelerating Lossless Data Compression with GPUs

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

Red Fox: An Execution Environment for Relational Query Processing on GPUs

A Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

NVIDIA s Compute Unified Device Architecture (CUDA)

NVIDIA s Compute Unified Device Architecture (CUDA)

GPGPU Programming & Erlang. Kevin A. Smith

NOISE ELIMINATION USING A BIT CAMS

Transcription:

Accelerating String Matching Using Multi-threaded Algorithm on GPU Cheng-Hung Lin*, Sheng-Yu Tsai**, Chen-Hsiung Liu**, Shih-Chieh Chang**, Jyuo-Min Shyu** *National Taiwan Normal University, Taiwan **National Tsing Hua University, Taiwan

Introduction Network Intrusions Detection System (NIDS) has been widely used to detect network attacks. The pattern matching engine dominates the performance of an NIDS. Traditional pattern matching approaches on uniprocessor are too slow for today s networking. Hardware approaches for acceleration pattern matching. Logic-based Memory-based Multiprocessor-based 2

GPU for Pattern Matching Parallel computation on GPU is suitable for accelerating pattern matching. AAAAAAAAAAAAAAAAAAAAAAAB 1 thread 24 cycles AAAAAAAAAAAAAAAAAAAAAAAB Thread #1 Thread #2 Thread #3 Thread #4 4 segments 4 threads 6 cycles 3

Boundary Problem Boundary Problem Pattern occurring in the boundary of adjacent segments cannot be detected. False negative results False Negative AAAAAAAAAAAAAAAAAABBBBBB Thread #1 Thread #2 Thread #3 Thread #4 4

Overlapped Computation To resolve boundary problem Scan across boundaries Problem Overhead of overlapped computation Throughput reduction Thread #1 Thread #3 can identify "AB" AAAAAAAAAAAAAAAAAABBBBBB Thread #2 Thread #3 Thread #4 5

Aho-Corasick Algorithm Aho-Corasick (AC) algorithm has been widely used for pattern matching due to its advantage of matching multiple patterns in a single pass Compiling multiple patterns into a composite state machine Patterns (1) AB (2) ABG (3) BD (4) F [^AB] B G A 1 2 3 B 0 D 4 5 6 8 F 9 7 6

Aho-Corasick Algorithm (cont.) Aho-Corasick (AC) state machine composes of Solid line represents valid transitions. Dotted line represents failure transitions. Failure transition backtracks the state machine to recognize patterns in different start locations. B A 1 2 G 3 Input strings : A B D [^AB] 0 B 4 5 8 F 9 D 6 7 location 1 location 2 7

Problems of AC on GPU Direct implementation of AC on GPU To resolve the boundary problem, each thread has low bound constraint of scanning length Constraint = segment length + overlapped length Overlapped length = the length of longest pattern -1 Overhead of overlapped computation A A A A A A A A A A A A A A A A A A B B B B B B 8

Problems of AC on GPU (cont.) 9

Failureless-AC State Machine AC state machine Failure transition backtracks the state machine to recognize patterns in different start locations. Input strings : A B D [^AB] 0 B A 1 2 B 4 5 G D 6 3 7 location 1 location 2 8 F 9 Failureless-AC state machine Remove failure transition Terminated when no valid transitions Recognize patterns in location 1. Location 1 Input strings : A B D 0 B A 1 2 B G 3 4 5 D 6 7 8 F 9 Stop 10

Parallel Failureless-AC Algorithm Parallel Failureless-AC (PFAC) Algorithm Allocate each byte of input a thread to traverse Failureless-AC state machine. X X X X X X X X X A B D X X X X X X X X X X 11

Mechanism of PFAC Thread #n X X X X A B D X X X Thread #n+1 A 1 B 2 G 3 A 1 B 2 G 3 0 B D 4 5 6 7 0 B D 4 5 6 7 8 F 9 8 F 9 Thread #n Thread #n+1 12

Reducing Overlapped Computation Direct Implementation of AC Algorithm ach thread has low bound constraint of scanning length Overlapped computation (overlapped length = 3) PFAC Algorithm Without boundary problem. ach thread has variable scanning length Most thread terminates early Reducing overlapped computation to 1 1 3 C C C C C C C C B C C C C C C 13

xperimental nvironments CPU: Intel Core i7 CPU 950 @3.07 GHz 4 cores 12 GB DDR3 memory GPU: NVIDIA GeForce GTX 480 @ 1.4 GHz 480 cores 1536MB DDR5 memory Patterns: String pattern of Snort V2.4 1,998 rules containing 41,997 characters Total 27,754 states Input: Normal and worst case DFCON packet 14

xperimental Results Table 1: Throughput of normal case inputs AC_CPU AC_OMP AC_Pthread PFAC Speedup 1 thread (Gbps) 8 threads (Gbps) 8 threads (Gbps) multi-threads (Gbps) to fastest 2 MB 0.73 2.84 3.98 69.73 17.53 4 MB 0.72 2.98 4.06 80.36 19.78 8 MB 0.72 3.04 4.24 87.66 20.66 16 MB 0.72 3.06 3.26 90.13 27.63 32 MB 0.72 3.04 4.32 88.20 20.43 64 MB 0.73 3.11 4.39 94.89 21.63 128 MB 0.82 3.36 4.71 110.63 23.51 192 MB 0.86 3.43 4.69 117.04 24.94 15

Comparisons of Normal Case 140.00 120.00 100.00 80.00 60.00 40.00 AC_CPU 1 thread AC_OMP 8 threads AC_Pthread 8 threads PFAC multi-threads 20.00 0.00 2 MB 4 MB 8 MB 16 MB 32 MB 64 MB 128 MB 192 MB 16

xperimental Results Table 2: Throughput of worst case inputs AC_CPU AC_OMP AC_Pthread PFAC Speedup 1 thread (Gbps) 8 threads (Gbps) 8 threads (Gbps) multi-threads (Gbps) to fastest 2 MB 0.45 2.40 2.44 12.09 4.96 4 MB 0.45 2.50 2.57 12.61 4.91 8 MB 0.45 2.58 2.62 12.79 4.88 16 MB 0.45 2.61 2.63 12.98 4.93 32 MB 0.46 2.55 2.61 12.95 4.96 64 MB 0.45 2.62 2.63 13.10 4.97 128 MB 0.46 2.57 2.60 13.12 5.04 192 MB 0.46 2.63 2.64 13.10 4.97 17

Comparisons of Worst Case 14.00 12.00 10.00 8.00 6.00 4.00 AC_CPU 1 thread AC_OMP 8 threads AC_Pthread 8 threads PFAC multi-threads 2.00 0.00 2 MB 4 MB 8 MB 16 MB 32 MB 64 MB 128 MB 192 MB 18

Comparisons Approaches Character number of rule set Memory (KB) Throughput (Gbps) Memory fficiency Notes PFAC 41997 27754 117.04 177.10 NVIDIA GTX 480 Huang et al. [10] Modified WM Schatz et al. [11] Suffix Tree Vasiliadis et al. [12] DFA Smith et al. [13] XFA 1565 230 2.40 16.33 NVIDIA 7600 GT 200000 14125 2.00 28.32 NVIDIA GTX 8800 N.A. 200000 6.40 NA NVIDIA 9800 GX2 N.A. 3000 10.40 NA NVIDIA 8800 GTX Memory efficiency= (Throughput x # of characters) / Memory 19

Conclusions We have proposed a novel parallel string matching algorithm which is well-suited to be performed on GPUs and is free from the boundary detection problem. The proposed algorithm creates a new state machine which has less complexity and memory usage compared to the traditional Aho-Corasick state machine. The new algorithm achieves a significant speedup compared to the traditional Aho-Corasick algorithm accelerated by OpenMP on CPU. Compared to other GPU approaches, the new algorithm achieves 11.6 times faster than the state-of-the-art approach. 20