A Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms

Charalampos S. Kouzinopoulos and Konstantinos G. Margaritis
Parallel and Distributed Processing Laboratory, Department of Applied Informatics, University of Macedonia, 156 Egnatia str., P.O. Box 1591, 546 Thessaloniki, Greece

Abstract

Multiple keyword matching is an important problem in text processing that involves locating all the positions of an input string where one or more keywords from a finite set occur. Modern multiple keyword matching algorithms can scan the input string in a single pass by preprocessing the keyword set, an essential phase that affects the overall performance of each algorithm. This paper presents a performance evaluation, in terms of preprocessing, of the well-known Commentz-Walter, Wu-Manber, Set Backward Oracle Matching and Salmela-Tarhio-Kytöjoki multiple keyword matching algorithms for different types of keywords and for several problem parameters.

Keywords: Algorithms, Performance Evaluation, Preprocessing, Multiple Keyword Matching, Multiple Pattern Matching

I. INTRODUCTION

Multiple keyword matching is an important problem in text processing and is commonly used to locate all the positions of an input string (the so-called text) where one or more keywords (the so-called patterns) from a finite set of keywords occur. It is the computationally intensive kernel of many security and network applications, including information retrieval, intrusion detection systems, web filtering, virus scanners and spam filters, and it is also used as a powerful tool for locating nucleotide or amino acid sequence keywords in biological sequence databases.
The multiple keyword matching problem can be defined as follows.

Definition 1. Given an input string $T = t_1 t_2 \ldots t_n$ of length $n$ and a finite set of $r$ keywords $P = \{p^1, p^2, \ldots, p^r\}$, where each $p^i$ is a string $p^i = p^i_1 p^i_2 \ldots p^i_m$ of length $m$ over a finite character set $\Sigma$ and the total size of all keywords is denoted as $|P|$, the task is to find all occurrences of any of the keywords in the input string.

Preprocessing is an important phase of multiple keyword matching algorithms. It allows them to scan the input string $T$ in a single pass to locate all occurrences of any keyword from a finite keyword set. This is achieved by processing the keyword set and constructing the necessary data structures, usually automata or tables of hashed keywords. An efficient preprocessing phase is therefore crucial for the overall performance of the algorithms in terms of running time and memory usage.

For the experiments of this paper, the Commentz-Walter (CW) [2], Wu-Manber (WM) [15] and Set Backward Oracle Matching (SBOM) [8] algorithms were used, together with the Salmela-Tarhio-Kytöjoki [12] family of algorithms: Horspool with q-grams (HG), Shift-Or with q-grams (SOG) and BNDM with q-grams (BG). These algorithms are simple, efficient and widely used.

Commentz-Walter combines the filtering functions of the single keyword matching Boyer-Moore algorithm and a suffix automaton to search for the occurrence of multiple keywords in an input string. During the preprocessing phase, the algorithm creates a trie structure from the reversed keywords of the keyword set, where each node corresponds to a single character, constructs the two shift functions of the Boyer-Moore algorithm extended to multiple keywords, and specifies the exit nodes that indicate that a complete match has been found. The whole phase takes $O(|P|)$ time. Wu-Manber is a generalization of the Horspool [3] algorithm for multiple keyword matching.
To achieve good performance as $|P|$ increases, the algorithm essentially enlarges the alphabet size by considering the text as blocks of size $B$ instead of single characters. During the preprocessing phase three tables are built: SHIFT, HASH and PREFIX. SHIFT is used to determine the number of characters that can be safely skipped based on the previous $B$ characters at each text position, PREFIX stores a hashed value of the $B$-character prefix of each keyword, and HASH contains a list of all keywords with the same prefix. As recommended in [15], $B$ can be set to 2 for a small keyword set and to 3 otherwise. Since the experiments of this paper involve large keyword sets, Wu-Manber was implemented with a block size of $B = 3$.

The Set Backward Oracle Matching algorithm uses a factor oracle, an acyclic automaton first introduced in [1], with at most $|P| + 1$ states and a number of transitions that is linear in $|P|$. During the preprocessing phase, the factor oracle is created from the set of the reversed keywords in $O(|P|)$ time. Apart from the transitions that link the nodes of each keyword, a set of at most $|P|$ external transitions must be built so that the oracle recognizes at least every factor of a keyword. The external transitions associate each state $i$ of the oracle with a previous state $j$, called the supply state of $i$, such that $i > j$.

The Salmela-Tarhio-Kytöjoki algorithms are character class filters. They essentially construct a generalized keyword of length $m$ that simultaneously matches all the keywords, in $O(|P|)$ time for BG and SOG and in $O(|P| m)$ time for HG. As $|P|$ increases, the efficiency of the filters should decrease, since a candidate match would occur at almost every position [11]. To solve this problem, the algorithms use a technique similar to that of Wu-Manber: they treat the input string and the keywords in groups of $B$ characters, effectively enlarging the alphabet size to $|\Sigma|^B$ characters. To reduce the required memory space to $2^{21}$ bytes, a hashing technique can be applied. For the experiments of this paper, the HG, SOG and BG algorithms were implemented using hashed 3-grams.

The Commentz-Walter algorithm is substantially faster in practice than the Aho-Corasick algorithm, particularly when long keywords are involved [14][15]. Wu-Manber is considered to be a practical, simple and efficient algorithm for multiple keyword matching [8]. The Set Backward Oracle Matching algorithm has the same performance as Set Backward Dawg Matching but uses a much simpler automaton, and it appears to be very efficient on large keyword sets [8]. Finally, Salmela-Tarhio-Kytöjoki is a recently introduced family of algorithms that reportedly performs well on specific types of data [5].

Table I. Known theoretical preprocessing, worst and average time complexity of the multiple keyword matching algorithms

Algorithm   Preprocessing   Worst case   Average case
CW          $|P|$           $n m$        $n$
SBOM        $|P|$           $n |P|$      $n$
HG          $|P| m$         $n |P|$      $n \log_{|\Sigma|}(|P|) / m$
SOG         $|P|$           $n |P|$      $n$
BG          $|P|$           $n |P|$      $n \log_{|\Sigma|}(|P|) / m$

Table I summarizes the known theoretical preprocessing, worst and average time complexity of the presented algorithms.
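As a rough illustration of the character class filtering idea described above for the Salmela-Tarhio-Kytöjoki algorithms, the following minimal Python sketch builds a generalized keyword (one set of admissible characters per position) and uses it to find candidate positions. It is a simplification under stated assumptions, not the authors' implementation: the function names `build_class_filter` and `candidate_positions` are hypothetical, a naive scan stands in for the Horspool, Shift-Or or BNDM machinery, and the grouping into hashed $B$-grams is omitted.

```python
def build_class_filter(keywords):
    """Build a generalized keyword: one set of admissible characters per position.

    Every character of every keyword is visited once, so the work is
    proportional to the total keyword size |P|.
    """
    m = min(len(k) for k in keywords)      # use the common (shortest) length
    classes = [set() for _ in range(m)]
    for k in keywords:
        for j, ch in enumerate(k[:m]):
            classes[j].add(ch)             # position j accepts character ch
    return classes


def candidate_positions(text, classes):
    """Return all positions where the text matches the generalized keyword.

    Hypothetical helper: a naive scan used here only to show what the
    filter accepts. Candidates still need verification against the real
    keyword set, because the filter can report false positives.
    """
    m = len(classes)
    return [i for i in range(len(text) - m + 1)
            if all(text[i + j] in classes[j] for j in range(m))]


keywords = ["abc", "abd", "cbc"]
classes = build_class_filter(keywords)
candidates = candidate_positions("xcbdabc", classes)
# Verification step: keep only candidates that are true keyword occurrences.
matches = [i for i in candidates if "xcbdabc"[i:i + 3] in set(keywords)]
```

In this example the filter accepts "cbd" at position 1 even though it is not a keyword, which is the kind of false candidate that becomes more frequent as the keyword set grows and that motivates the move to $B$-grams.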
The worst case complexity noted for the HG, SOG and BG algorithms is for keywords that all have the same hash value. If all keywords have different hash values, the worst case time complexity is $O(n(\log r + m))$ instead. The theoretical complexity of the Wu-Manber algorithm could not be calculated and was therefore omitted, as the original paper does not specify the best size of the HASH and SHIFT tables or the hash functions, parameters that affect the complexity [8].

Several experiments on multiple keyword matching algorithms have already been reported in [4], [5], [6], [9], [11], [13] for alphabet input strings, randomly generated data sets, live network traffic and biological sequence databases. Most of that work, though, concentrated on the performance evaluation of the search phase of the algorithms. The performance of the preprocessing phase of different multiple keyword matching algorithms has not been studied extensively in the past, and existing studies usually focus on different implementations of the same algorithm (e.g. [10]).

The aim of this paper is to evaluate the performance of the preprocessing phase, in terms of running time, of the presented multiple keyword matching algorithms. The algorithms are compared for different types of keywords, including randomly generated keywords, alphabet keywords and biological sequence databases, and for several problem parameters such as the total size of the keyword set and the length and alphabet size of the keywords.

II. EXPERIMENTAL METHODOLOGY

The parameters that describe the performance of the preprocessing phase of multiple keyword matching algorithms are the size of the keyword set $r$, the length of the keywords $m$ and the alphabet size $|\Sigma|$ used. The data set was similar to the sets used in [4], [7], [13]. It consisted of randomly generated texts of size n = 4.. with a binary alphabet and an alphabet of size 8, the CIA World Fact Book from the Large Canterbury Corpus with a size of n = and an alphabet of size 94, the genome of Escherichia coli from the Large Canterbury Corpus with a size of n = and an alphabet of size $|\Sigma| = 4$, the SWISS-PROT amino acid sequence database with a size of n = and an alphabet of size $|\Sigma| = 20$, the FASTA Amino Acid sequences of the A-thaliana genome with a size of n = and an alphabet of size $|\Sigma| = 20$, and the FASTA Nucleic Acid sequences of the A-thaliana genome with a size of n = and an alphabet of size $|\Sigma| = 4$. The keyword set used consisted of 1. and 1. keywords, where each keyword had a length of $m = 8$ and $m = 32$ characters.

The experiments were executed locally on an Intel Core 2 Duo CPU with a 3.GHz clock speed, 2 GB of memory, 64 KB of L1 cache and 6 MB of L2 cache. The Ubuntu Linux operating system was used, and during the experiments only the typical background processes ran. To decrease random variation, the time results were averages of 1 runs. All algorithms were implemented in the ANSI C programming language and were compiled using the GCC compiler with the -O2 and -funroll-loops optimization flags.

III. ANALYSIS

Since the performance of the preprocessing phase of different multiple keyword matching algorithms has not been studied extensively in the past, it is not possible to compare the results with other published metrics. The running time results presented in Tables II to V generally agree with previous work as discussed in [5], [6]. Based on the theoretical time complexity of the algorithms as reported in the original papers and summarized in Table I, the time required by an algorithm to construct the necessary data structures and process the keyword set during the preprocessing phase is expected to depend on the size $r$ of the keyword set and the length $m$ of the keywords. In practice, though, it can be seen that the alphabet $\Sigma$ of the keywords is an important factor that also affects the preprocessing time.

Tables II to V present the time spent by the algorithms during the preprocessing phase, along with the running time, for keyword sets consisting of 1. and 1. keywords with lengths of $m = 8$ and $m = 32$ and for all types of data, while Figure 1 depicts the preprocessing time for the same data set as a percentage of the running time of the algorithms.

Table II. Preprocessing and running time (sec) of the algorithms for 1. keywords, for all types of data with $m = 8$. Columns: Random $|\Sigma| = 2$, Random $|\Sigma| = 8$, Swiss Prot. Rows: CW, WM, SBOM, HG, SOG, BG.

Table III. Preprocessing and running time (sec) of the algorithms for 1. keywords, for all types of data with $m = 32$. Columns: Random $|\Sigma| = 2$, Random $|\Sigma| = 8$, Swiss Prot. Rows: CW, WM, SBOM, HG, SOG, BG.

Table IV. Preprocessing and running time (sec) of the algorithms for 1. keywords, for all types of data with $m = 8$. Columns: Random $|\Sigma| = 2$, Random $|\Sigma| = 8$, Swiss Prot. Rows: CW, WM, SBOM, HG, SOG, BG.

Table V. Preprocessing and running time (sec) of the algorithms for 1. keywords, for all types of data with $m = 32$. Columns: Random $|\Sigma| = 2$, Random $|\Sigma| = 8$, Swiss Prot. Rows: CW, WM, SBOM, HG, SOG, BG.

It is clear that the size $r$ of the keyword set was the primary factor that affected the performance of the algorithms in terms of preprocessing time. As $r$ increased to 1. keywords, the preprocessing time of each algorithm increased linearly in $r$, as expected from the theoretical complexity presented in Table I. The preprocessing time to compute the trie of the reversed keywords and the two shift functions of the Commentz-Walter algorithm increased up to 5 times when larger keyword sets were used, although the running time of the algorithm was not affected as much.
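To make the linear dependence on the keyword set concrete, the following minimal Python sketch builds the trie of the reversed keywords that Commentz-Walter starts from. It is an illustrative sketch only: the function name `build_reversed_trie` and the list-of-dictionaries representation are assumptions made here, and the two Boyer-Moore style shift functions that the algorithm also computes are left out.

```python
def build_reversed_trie(keywords):
    """Build a trie from the reversed keywords.

    The trie is a list of nodes, each node a dict mapping a character to
    the index of a child node; node 0 is the root. Returns the node list
    and the set of exit nodes (nodes where a whole keyword ends).

    Each keyword character is processed once and creates at most one new
    node, so the trie has at most |P| + 1 nodes and is built in O(|P|) steps.
    """
    nodes = [{}]                  # node 0 is the root
    exit_nodes = set()
    for kw in keywords:
        state = 0
        for ch in reversed(kw):   # insert the keyword right to left
            nxt = nodes[state].get(ch)
            if nxt is None:       # no edge yet: create a new node
                nodes.append({})
                nxt = len(nodes) - 1
                nodes[state][ch] = nxt
            state = nxt
        exit_nodes.add(state)     # a complete keyword ends here
    return nodes, exit_nodes


keywords = ["he", "she", "his", "hers"]
nodes, exit_nodes = build_reversed_trie(keywords)
total_size = sum(len(k) for k in keywords)   # |P| = 12 for this set
```

Because keywords that share a suffix share a path in the reversed trie ("he" and "she" both start with "eh" when reversed), the node count can be well below $|P| + 1$, but the number of insertion steps still grows with the total keyword size, which is consistent with preprocessing time rising as $r$ increases.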
The Set Backward Oracle Matching algorithm had the slowest preprocessing time compared to the rest of the algorithms when 1. keywords were used. The factor oracle constructed during preprocessing, though, is the reason why SBOM is one of the fastest algorithms in terms of running time, especially on randomly generated binary data sets and biological databases when keywords of length $m = 8$ were used. As depicted in Figure 1, the preprocessing of Set Backward Oracle Matching accounted for 97% of the running time of the algorithm for some types of data. HG performed 2 to 8 times better in terms of preprocessing time than the SOG and BG algorithms when used on sets with 1. keywords, and in general its preprocessing phase was faster than that of the rest of the presented algorithms for most types of data. On sets with 1. keywords, though, the preprocessing time of HG was affected more by the increase in the size of the keyword set than that of the SOG and BG algorithms, which can be explained by its $O(|P| m)$ theoretical preprocessing time compared to $O(|P|)$ for SOG and BG. The exact theoretical preprocessing time of the Wu-Manber algorithm is not clear from the original analysis by Wu and Manber, but it can be concluded from the experimental results that the time spent by the algorithm to construct the necessary hash tables was generally linear in $r$ and independent of the alphabet size. From Tables II to V it can also be seen that, together with the Salmela-Tarhio-Kytöjoki algorithms, the preprocessing time of Wu-Manber increased less than that of the Commentz-Walter and the Set Backward Oracle Matching algorithms on sets with 1. keywords.

Figure 1. Percentage of preprocessing on running time for different types of data. Panels: $m = 8$, 1. keywords; $m = 32$, 1. keywords; $m = 8$, 1. keywords; $m = 32$, 1. keywords. Horizontal axis: algorithm used.

The preprocessing time of the algorithms was affected in different ways by the length $m$ of the keywords.
When $m$ increased from 8 to 32 characters, and for all types of data, the preprocessing time of the Commentz-Walter and the Set Backward Oracle Matching algorithms generally increased linearly in $m$, while the time to complete the preprocessing phase of the Wu-Manber and the Salmela-Tarhio-Kytöjoki algorithms was unaffected, as presented in Tables II to V. For most types of data, the increase in the preprocessing time of the Commentz-Walter and the Set Backward Oracle Matching algorithms negatively affected their total performance. A notable exception was the use of the Commentz-Walter algorithm on randomly generated keyword sets with a binary alphabet and on the Escherichia coli genome and FASTA Nucleic Acid keywords, which have an alphabet of size $|\Sigma| = 4$. In these cases, where a small alphabet was used, the increase in the preprocessing time resulted in a significant decrease in the running time of the algorithm, of up to 17 times.

Although not expected from their theoretical complexity, the preprocessing time of the algorithms also depended on the alphabet size $|\Sigma|$ of the keyword set. As can be concluded from Figure 1, the preprocessing time of the algorithms increased on keyword sets with larger alphabets. When language keywords were used, with an alphabet size of 94 characters, the preprocessing to running time ratio of all algorithms increased drastically. It is interesting that for randomly generated keyword sets where a binary alphabet was used, the performance of the Set Backward Oracle Matching algorithm in terms of preprocessing time also decreased, which indicates the sensitivity of the algorithm to the alphabet size. Finally, it should be noted that the preprocessing time of the Salmela-Tarhio-Kytöjoki algorithms was roughly constant in the alphabet size, while their running time decreased for data sets with a larger alphabet, outperforming the rest of the algorithms on language data sets in terms of running time.

IV. CONCLUSIONS

This paper presented a performance evaluation, in terms of preprocessing time, of the well-known Commentz-Walter, Wu-Manber, Set Backward Oracle Matching and Salmela-Tarhio-Kytöjoki multiple keyword matching algorithms for alphabet keyword sets, randomly generated keyword sets of a binary alphabet and an alphabet of size 8, the Escherichia coli genome, the SWISS-PROT amino acid sequence database and the FASTA Amino Acid and FASTA Nucleic Acid sequences of the A-thaliana genome. The keyword sets used consisted of 1. and 1. keywords with a length of $m = 8$ and $m = 32$.
It was shown that the time required by an algorithm to construct the necessary data structures and process the keyword set during the preprocessing phase depends on the size $r$ of the keyword set and the length $m$ of the keywords. It was also discussed that the alphabet $\Sigma$ of the keywords is an important factor that affects the preprocessing time of the algorithms. More specifically, it was concluded that the preprocessing time of each algorithm generally increased linearly in $r$, as expected from their theoretical complexity. Additionally, the preprocessing time of the Commentz-Walter and Set Backward Oracle Matching algorithms generally increased linearly in $m$, while the time to complete the preprocessing phase of the Wu-Manber and the Salmela-Tarhio-Kytöjoki algorithms was unaffected by the keyword length. Finally, the performance of all algorithms in terms of preprocessing time decreased when used on keyword sets with a large alphabet size, although this was not expected from their theoretical preprocessing complexity.

The work presented in this paper could be extended with a performance evaluation of the preprocessing phase of additional families of pattern matching algorithms, including two dimensional and approximate pattern matching algorithms. A study of the preprocessing phase of the presented algorithms in terms of memory usage would also be interesting.

REFERENCES

[1] Allauzen, C., Crochemore, M., Raffinot, M.: Factor oracle: A new structure for pattern matching. 1725 (1999)
[2] Commentz-Walter, B.: A string matching algorithm fast on the average. Proceedings of the 6th Colloquium on Automata, Languages and Programming (1979)
[3] Horspool, R.: Practical fast searching in strings. Software: Practice and Experience 10(6) (1980)
[4] Kalsi, P., Peltola, H., Tarhio, J.: Comparison of exact string matching algorithms for biological sequences. Communications in Computer and Information Science (2008)
[5] Kouzinopoulos, C., Margaritis, K.: Experimental Results on Algorithms for Multiple Keyword Matching. In: IADIS International Conference on Informatics (2010)
[6] Kouzinopoulos, C., Margaritis, K.: Experimental Results On Multiple Pattern Matching Algorithms For Biological Sequences. In: International Conference on Bioinformatics - Models, Methods and Algorithms (2011)
[7] Lecroq, T.: Fast exact string matching algorithms. Information Processing Letters 102(6) (2007)
[8] Navarro, G., Raffinot, M.: Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences. Cambridge University Press (2002)
[9] Navarro, G., Tarhio, J.: LZgrep: A Boyer-Moore string matching tool for Ziv-Lempel compressed text. Software: Practice and Experience 35(12) (2005)
[10] Nieminen, J., Kilpeläinen, P.: Efficient implementation of Aho-Corasick pattern matching automata using Unicode. Software: Practice and Experience 37(6) (2007)
[11] Salmela, L.: Improved Algorithms for String Searching Problems. Ph.D. thesis, Helsinki University of Technology (2009)
[12] Salmela, L., Tarhio, J., Kytöjoki, J.: Multipattern string matching with q-grams. Journal of Experimental Algorithmics 11, 1-19 (2006)
[13] Sheik, S., Aggarwal, S., Poddar, A., Sathiyabhama, B., Balakrishna, N., Sekar, K.: Analysis of string-searching algorithms on biological sequence databases. Current Science 89(2) (2005)
[14] Watson, B.: Taxonomies and toolkits of regular language algorithms. Ph.D. thesis, Eindhoven University of Technology (1995)
[15] Wu, S., Manber, U.: A fast algorithm for multi-pattern searching (24), technical report TR-94-17


More information

Max-Shift BM and Max-Shift Horspool: Practical Fast Exact String Matching Algorithms

Max-Shift BM and Max-Shift Horspool: Practical Fast Exact String Matching Algorithms Regular Paper Max-Shift BM and Max-Shift Horspool: Practical Fast Exact String Matching Algorithms Mohammed Sahli 1,a) Tetsuo Shibuya 2 Received: September 8, 2011, Accepted: January 13, 2012 Abstract:

More information

Fast Hybrid String Matching Algorithms

Fast Hybrid String Matching Algorithms Fast Hybrid String Matching Algorithms Jamuna Bhandari 1 and Anil Kumar 2 1 Dept. of CSE, Manipal University Jaipur, INDIA 2 Dept of CSE, Manipal University Jaipur, INDIA ABSTRACT Various Hybrid algorithms

More information

Efficient Algorithm for Two Dimensional Pattern Matching Problem (Square Pattern)

Efficient Algorithm for Two Dimensional Pattern Matching Problem (Square Pattern) Efficient Algorithm for Two Dimensional Pattern Matching Problem (Square Pattern) Hussein Abu-Mansour 1, Jaber Alwidian 1, Wael Hadi 2 1 ITC department Arab Open University Riyadh- Saudi Arabia 2 CIS department

More information

An introduction to suffix trees and indexing

An introduction to suffix trees and indexing An introduction to suffix trees and indexing Tomáš Flouri Solon P. Pissis Heidelberg Institute for Theoretical Studies December 3, 2012 1 Introduction Introduction 2 Basic Definitions Graph theory Alphabet

More information

A Multipattern Matching Algorithm Using Sampling and Bit Index

A Multipattern Matching Algorithm Using Sampling and Bit Index A Multipattern Matching Algorithm Using Sampling and Bit Index Jinhui Chen, Zhongfu Ye Department of Automation University of Science and Technology of China Hefei, P.R.China jeffcjh@mail.ustc.edu.cn,

More information

String Matching Algorithms

String Matching Algorithms String Matching Algorithms Georgy Gimel farb (with basic contributions from M. J. Dinneen, Wikipedia, and web materials by Ch. Charras and Thierry Lecroq, Russ Cox, David Eppstein, etc.) COMPSCI 369 Computational

More information

AGREP A FAST APPROXIMATE PATTERN-MATCHING TOOL. (Preliminary version) Sun Wu and Udi Manber 1

AGREP A FAST APPROXIMATE PATTERN-MATCHING TOOL. (Preliminary version) Sun Wu and Udi Manber 1 AGREP A FAST APPROXIMATE PATTERN-MATCHING TOOL (Preliminary version) Sun Wu and Udi Manber 1 Department of Computer Science University of Arizona Tucson, AZ 85721 (sw udi)@cs.arizona.edu ABSTRACT Searching

More information

A Practical Distributed String Matching Algorithm Architecture and Implementation

A Practical Distributed String Matching Algorithm Architecture and Implementation A Practical Distributed String Matching Algorithm Architecture and Implementation Bi Kun, Gu Nai-jie, Tu Kun, Liu Xiao-hu, and Liu Gang International Science Index, Computer and Information Engineering

More information

University of Huddersfield Repository

University of Huddersfield Repository University of Huddersfield Repository Klaib, Ahmad and Osborne, Hugh OE Matching for Searching Biological Sequences Original Citation Klaib, Ahmad and Osborne, Hugh (2009) OE Matching for Searching Biological

More information

A General Weighted Grammar Library

A General Weighted Grammar Library A General Weighted Grammar Library Cyril Allauzen, Mehryar Mohri, and Brian Roark AT&T Labs Research, Shannon Laboratory 80 Park Avenue, Florham Park, NJ 0792-097 {allauzen, mohri, roark}@research.att.com

More information

Enhanced Two Sliding Windows Algorithm For Pattern Matching (ETSW) University of Jordan, Amman Jordan.

Enhanced Two Sliding Windows Algorithm For Pattern Matching (ETSW) University of Jordan, Amman Jordan. Enhanced Two Sliding Windows Algorithm For Matching (ETSW) Mariam Itriq 1, Amjad Hudaib 2, Aseel Al-Anani 2, Rola Al-Khalid 2, Dima Suleiman 1 1. Department of Business Information Systems, King Abdullah

More information

LZW Based Compressed Pattern Matching

LZW Based Compressed Pattern Matching LZW Based Compressed attern Matching Tao Tao, Amar Mukherjee School of Electrical Engineering and Computer Science University of Central Florida, Orlando, Fl.32816 USA Email: (ttao+amar)@cs.ucf.edu Abstract

More information

Fast and Cache-Oblivious Dynamic Programming with Local Dependencies

Fast and Cache-Oblivious Dynamic Programming with Local Dependencies Fast and Cache-Oblivious Dynamic Programming with Local Dependencies Philip Bille and Morten Stöckel Technical University of Denmark, DTU Informatics, Copenhagen, Denmark Abstract. String comparison such

More information

A Fast Order-Preserving Matching with q-neighborhood Filtration Using SIMD Instructions

A Fast Order-Preserving Matching with q-neighborhood Filtration Using SIMD Instructions A Fast Order-Preserving Matching with q-neighborhood Filtration Using SIMD Instructions Yohei Ueki, Kazuyuki Narisawa, and Ayumi Shinohara Graduate School of Information Sciences, Tohoku University, Japan

More information

This chapter is based on the following sources, which are all recommended reading:

This chapter is based on the following sources, which are all recommended reading: Bioinformatics I, WS 09-10, D. Hson, December 7, 2009 105 6 Fast String Matching This chapter is based on the following sorces, which are all recommended reading: 1. An earlier version of this chapter

More information

Data structures for string pattern matching: Suffix trees

Data structures for string pattern matching: Suffix trees Suffix trees Data structures for string pattern matching: Suffix trees Linear algorithms for exact string matching KMP Z-value algorithm What is suffix tree? A tree-like data structure for solving problems

More information

A Depth First Search approach to finding the Longest Common Subsequence

A Depth First Search approach to finding the Longest Common Subsequence A Depth First Search approach to finding the Longest Common Subsequence Fragkiadaki Eleni, Samaras Nikolaos Department of Applied Informatics, University of Macedonia, Greece eleni.fra@gmail.com, samaras@uom.gr

More information

Efficient validation and construction of border arrays

Efficient validation and construction of border arrays Efficient validation and construction of border arrays Jean-Pierre Duval Thierry Lecroq Arnaud Lefebvre LITIS, University of Rouen, France, {Jean-Pierre.Duval,Thierry.Lecroq,Arnaud.Lefebvre}@univ-rouen.fr

More information

Approximate search with constraints on indels with application in SPAM filtering Ambika Shrestha Chitrakar and Slobodan Petrović

Approximate search with constraints on indels with application in SPAM filtering Ambika Shrestha Chitrakar and Slobodan Petrović Approximate search with constraints on indels with application in SPAM filtering Ambika Shrestha Chitrakar and Slobodan Petrović Norwegian Information Security Laboratory, Gjøvik University College ambika.chitrakar2@hig.no,

More information

Experimental Results on String Matching Algorithms

Experimental Results on String Matching Algorithms SOFTWARE PRACTICE AND EXPERIENCE, VOL. 25(7), 727 765 (JULY 1995) Experimental Results on String Matching Algorithms thierry lecroq Laboratoire d Informatique de Rouen, Université de Rouen, Facultés des

More information

Inexact Pattern Matching Algorithms via Automata 1

Inexact Pattern Matching Algorithms via Automata 1 Inexact Pattern Matching Algorithms via Automata 1 1. Introduction Chung W. Ng BioChem 218 March 19, 2007 Pattern matching occurs in various applications, ranging from simple text searching in word processors

More information

MODELING DELTA ENCODING OF COMPRESSED FILES. and. and

MODELING DELTA ENCODING OF COMPRESSED FILES. and. and International Journal of Foundations of Computer Science c World Scientific Publishing Company MODELING DELTA ENCODING OF COMPRESSED FILES SHMUEL T. KLEIN Department of Computer Science, Bar-Ilan University

More information

LAB # 3 / Project # 1

LAB # 3 / Project # 1 DEI Departamento de Engenharia Informática Algorithms for Discrete Structures 2011/2012 LAB # 3 / Project # 1 Matching Proteins This is a lab guide for Algorithms in Discrete Structures. These exercises

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

Importance of String Matching in Real World Problems

Importance of String Matching in Real World Problems www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 3 Issue 6 June, 2014 Page No. 6371-6375 Importance of String Matching in Real World Problems Kapil Kumar Soni,

More information

WAVEFRONT LONGEST COMMON SUBSEQUENCE ALGORITHM ON MULTICORE AND GPGPU PLATFORM BILAL MAHMOUD ISSA SHEHABAT UNIVERSITI SAINS MALAYSIA

WAVEFRONT LONGEST COMMON SUBSEQUENCE ALGORITHM ON MULTICORE AND GPGPU PLATFORM BILAL MAHMOUD ISSA SHEHABAT UNIVERSITI SAINS MALAYSIA WAVEFRONT LONGEST COMMON SUBSEQUENCE ALGORITHM ON MULTICORE AND GPGPU PLATFORM BILAL MAHMOUD ISSA SHEHABAT UNIVERSITI SAINS MALAYSIA 2010 WAVE-FRONT LONGEST COMMON SUBSEQUENCE ALGORITHM ON MULTICORE AND

More information

Applications of Suffix Tree

Applications of Suffix Tree Applications of Suffix Tree Let us have a glimpse of the numerous applications of suffix trees. Exact String Matching As already mentioned earlier, given the suffix tree of the text, all occ occurrences

More information

Experiments on string matching in memory structures

Experiments on string matching in memory structures Experiments on string matching in memory structures Thierry Lecroq LIR (Laboratoire d'informatique de Rouen) and ABISS (Atelier de Biologie Informatique Statistique et Socio-Linguistique), Universite de

More information

Shift-based Pattern Matching for Compressed Web Traffic

Shift-based Pattern Matching for Compressed Web Traffic The Interdisciplinary Center, Herzlia Efi Arazi School of Computer Science Shift-based Pattern Matching for Compressed Web Traffic M.Sc. Dissertation for Research Project Submitted by Victor Zigdon Under

More information

An Index Based Sequential Multiple Pattern Matching Algorithm Using Least Count

An Index Based Sequential Multiple Pattern Matching Algorithm Using Least Count 2011 International Conference on Life Science and Technology IPCBEE vol.3 (2011) (2011) IACSIT Press, Singapore An Index Based Sequential Multiple Pattern Matching Algorithm Using Least Count Raju Bhukya

More information

Study of Data Localities in Suffix-Tree Based Genetic Algorithms

Study of Data Localities in Suffix-Tree Based Genetic Algorithms Study of Data Localities in Suffix-Tree Based Genetic Algorithms Carl I. Bergenhem, Michael T. Smith Abstract. This paper focuses on the study of cache localities of two genetic algorithms based on the

More information

Exclusion-based Signature Matching for Intrusion Detection

Exclusion-based Signature Matching for Intrusion Detection Exclusion-based Signature Matching for Intrusion Detection Evangelos P. Markatos, Spyros Antonatos, Michalis Polychronakis, Kostas G. Anagnostakis Institute of Computer Science (ICS) Foundation for Research

More information

A New Platform NIDS Based On WEMA

A New Platform NIDS Based On WEMA I.J. Information Technology and Computer Science, 2015, 06, 52-58 Published Online May 2015 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2015.06.07 A New Platform NIDS Based On WEMA Adnan A.

More information

Space Efficient Linear Time Construction of

Space Efficient Linear Time Construction of Space Efficient Linear Time Construction of Suffix Arrays Pang Ko and Srinivas Aluru Dept. of Electrical and Computer Engineering 1 Laurence H. Baker Center for Bioinformatics and Biological Statistics

More information

The Effect of Flexible Parsing for Dynamic Dictionary Based Data Compression

The Effect of Flexible Parsing for Dynamic Dictionary Based Data Compression The Effect of Flexible Parsing for Dynamic Dictionary Based Data Compression Yossi Matias Nasir Rajpoot Süleyman Cenk Ṣahinalp Abstract We report on the performance evaluation of greedy parsing with a

More information

Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com

More information

Semantic Search through Pattern Recognition

Semantic Search through Pattern Recognition IJCSET November 2012 Vol 2, Issue 11, 1483-1487 www.ijcset.net ISSN:2231-0711 Semantic Search through Pattern Recognition Shanti Gunna, Dept of Computer Science & Engineering DRK Institute of Science &

More information

Automaton-based Backward Pattern Matching

Automaton-based Backward Pattern Matching Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science and Engineering Automaton-based Backward Pattern Matching Doctoral Thesis Jan Antoš PhD program: Computer

More information

Efficient Pattern Matching With Flexible Wildcard Gaps and One-off Constraint

Efficient Pattern Matching With Flexible Wildcard Gaps and One-off Constraint Efficient Pattern Matching With Flexible Wildcard Gaps and One-off Constraint Thesis submitted in partial fulfillment of the requirements for the award of degree of Master of Engineering in Computer Science

More information

Fast Substring Matching

Fast Substring Matching Fast Substring Matching Andreas Klein 1 2 3 4 5 6 7 8 9 10 Abstract The substring matching problem occurs in several applications. Two of the well-known solutions are the Knuth-Morris-Pratt algorithm (which

More information

Accelerating String Matching Using Multi-threaded Algorithm

Accelerating String Matching Using Multi-threaded Algorithm Accelerating String Matching Using Multi-threaded Algorithm on GPU Cheng-Hung Lin*, Sheng-Yu Tsai**, Chen-Hsiung Liu**, Shih-Chieh Chang**, Jyuo-Min Shyu** *National Taiwan Normal University, Taiwan **National

More information

Knuth-Morris-Pratt. Kranthi Kumar Mandumula Indiana State University Terre Haute IN, USA. December 16, 2011

Knuth-Morris-Pratt. Kranthi Kumar Mandumula Indiana State University Terre Haute IN, USA. December 16, 2011 Kranthi Kumar Mandumula Indiana State University Terre Haute IN, USA December 16, 2011 Abstract KMP is a string searching algorithm. The problem is to find the occurrence of P in S, where S is the given

More information

SigMatch: Fast and Scalable Multi-Pattern Matching

SigMatch: Fast and Scalable Multi-Pattern Matching SigMatch: Fast and Scalable Multi-Pattern Matching Ramakrishnan Kandhan Nikhil Teletia Jignesh M. Patel Computer Sciences Department, University of Wisconsin Madison {ramak, teletia, jignesh}@cs.wisc.edu

More information

Modeling Delta Encoding of Compressed Files

Modeling Delta Encoding of Compressed Files Modeling Delta Encoding of Compressed Files EXTENDED ABSTRACT S.T. Klein, T.C. Serebro, and D. Shapira 1 Dept of CS Bar Ilan University Ramat Gan, Israel tomi@cs.biu.ac.il 2 Dept of CS Bar Ilan University

More information

A very fast string matching algorithm for small. alphabets and long patterns. (Extended abstract)

A very fast string matching algorithm for small. alphabets and long patterns. (Extended abstract) A very fast string matching algorithm for small alphabets and long patterns (Extended abstract) Christian Charras 1, Thierry Lecroq 1, and Joseph Daniel Pehoushek 2 1 LIR (Laboratoire d'informatique de

More information

Comparisons of Efficient Implementations for DAWG

Comparisons of Efficient Implementations for DAWG Comparisons of Efficient Implementations for DAWG Masao Fuketa, Kazuhiro Morita, and Jun-ichi Aoe Abstract Key retrieval is very important in various applications. A trie and DAWG are data structures for

More information

A General Weighted Grammar Library

A General Weighted Grammar Library A General Weighted Grammar Library Cyril Allauzen, Mehryar Mohri 2, and Brian Roark 3 AT&T Labs Research 80 Park Avenue, Florham Park, NJ 07932-097 allauzen@research.att.com 2 Department of Computer Science

More information

A New String Matching Algorithm Based on Logical Indexing

A New String Matching Algorithm Based on Logical Indexing The 5th International Conference on Electrical Engineering and Informatics 2015 August 10-11, 2015, Bali, Indonesia A New String Matching Algorithm Based on Logical Indexing Daniar Heri Kurniawan Department

More information

Combinatorial Pattern Matching. CS 466 Saurabh Sinha

Combinatorial Pattern Matching. CS 466 Saurabh Sinha Combinatorial Pattern Matching CS 466 Saurabh Sinha Genomic Repeats Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Genomic rearrangements are often associated with repeats Trace evolutionary

More information

Abdullah-Al Mamun. CSE 5095 Yufeng Wu Spring 2013

Abdullah-Al Mamun. CSE 5095 Yufeng Wu Spring 2013 Abdullah-Al Mamun CSE 5095 Yufeng Wu Spring 2013 Introduction Data compression is the art of reducing the number of bits needed to store or transmit data Compression is closely related to decompression

More information

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana School of Information Technology A Frequent Max Substring Technique for Thai Text Indexing Todsanai Chumwatana This thesis is presented for the Degree of Doctor of Philosophy of Murdoch University May

More information

Modeling Delta Encoding of Compressed Files

Modeling Delta Encoding of Compressed Files Shmuel T. Klein 1, Tamar C. Serebro 1, and Dana Shapira 2 1 Department of Computer Science Bar Ilan University Ramat Gan, Israel tomi@cs.biu.ac.il, t lender@hotmail.com 2 Department of Computer Science

More information

Packet Inspection on Programmable Hardware

Packet Inspection on Programmable Hardware Abstract Packet Inspection on Programmable Hardware Benfano Soewito Information Technology Department, Bakrie University, Jakarta, Indonesia E-mail: benfano.soewito@bakrie.ac.id In the network security

More information

THE RELATIVE EFFICIENCY OF DATA COMPRESSION BY LZW AND LZSS

THE RELATIVE EFFICIENCY OF DATA COMPRESSION BY LZW AND LZSS THE RELATIVE EFFICIENCY OF DATA COMPRESSION BY LZW AND LZSS Yair Wiseman 1* * 1 Computer Science Department, Bar-Ilan University, Ramat-Gan 52900, Israel Email: wiseman@cs.huji.ac.il, http://www.cs.biu.ac.il/~wiseman

More information

arxiv: v2 [cs.it] 15 Jan 2011

arxiv: v2 [cs.it] 15 Jan 2011 Improving PPM Algorithm Using Dictionaries Yichuan Hu Department of Electrical and Systems Engineering University of Pennsylvania Email: yichuan@seas.upenn.edu Jianzhong (Charlie) Zhang, Farooq Khan and

More information

Hash-Based String Matching Algorithm For Network Intrusion Prevention systems (NIPS)

Hash-Based String Matching Algorithm For Network Intrusion Prevention systems (NIPS) Hash-Based String Matching Algorithm For Network Intrusion Prevention systems (NIPS) VINOD. O & B. M. SAGAR ISE Department, R.V.College of Engineering, Bangalore-560059, INDIA Email Id :vinod.goutham@gmail.com,sagar.bm@gmail.com

More information

School of Engineering and Mathematical Sciences. Packet Pattern Matching for Intrusion Detection

School of Engineering and Mathematical Sciences. Packet Pattern Matching for Intrusion Detection School of Engineering and Mathematical Sciences Packet Pattern Matching for Intrusion Detection by Alireza Shams Project for the Degree of MSc In Telecommunications and Networks Supervisor: Prof Tom Chen

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 9 2. Information Retrieval:

More information

Advanced Pattern Based Virus Detection Algorithm for Network Security

Advanced Pattern Based Virus Detection Algorithm for Network Security Advanced Pattern Based Virus Detection Algorithm for Network Security Binroy T.B. M.E. Communication Systems Department of Electronics and Communication Engineering RVS College of Engineering & Technology,

More information

Lecture 7 February 26, 2010

Lecture 7 February 26, 2010 6.85: Advanced Data Structures Spring Prof. Andre Schulz Lecture 7 February 6, Scribe: Mark Chen Overview In this lecture, we consider the string matching problem - finding all places in a text where some

More information

4. Suffix Trees and Arrays

4. Suffix Trees and Arrays 4. Suffix Trees and Arrays Let T = T [0..n) be the text. For i [0..n], let T i denote the suffix T [i..n). Furthermore, for any subset C [0..n], we write T C = {T i i C}. In particular, T [0..n] is the

More information

A Malicious Pattern Detection Engine for Embedded Security Systems in the Internet of Things

A Malicious Pattern Detection Engine for Embedded Security Systems in the Internet of Things Sensors 2014, 14, 24188-24211; doi:10.3390/s141224188 OPEN ACCESS sensors ISSN 1424-8220 www.mdpi.com/journal/sensors Article A Malicious Pattern Detection Engine for Embedded Security Systems in the Internet

More information