International Journal of Computer Mathematics
Publisher: Taylor & Francis

On-line string matching algorithms: survey and experimental results
P. D. Michailidis; K. G. Margaritis
Parallel and Distributed Processing Laboratory, Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece

To cite this article: Michailidis, P. D. and Margaritis, K. G. (2001) 'On-line string matching algorithms: survey and experimental results', International Journal of Computer Mathematics, 76: 4

Intern. J. Computer Math., Vol. 76
Published by license under the Gordon and Breach Science Publishers imprint. Printed in Singapore.

ON-LINE STRING MATCHING ALGORITHMS: SURVEY AND EXPERIMENTAL RESULTS

P. D. MICHAILIDIS and K. G. MARGARITIS*

Parallel and Distributed Processing Laboratory, Department of Applied Informatics, University of Macedonia, 156 Egnatia Str., P.O. Box 1591, 54006, Thessaloniki, Greece

*Corresponding authors. {panosm, kmarg}@uom.gr

(Received 9 March 2000)

In this paper we present a short survey and experimental results for well-known sequential string matching algorithms. We consider algorithms based on different approaches including classical, suffix automata, bit-parallelism and hashing. We put special emphasis on recently presented algorithms such as the Shift-Or and BNDM algorithms. We compare these algorithms in terms of the number of character comparisons and the running time for four different types of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet.

Keywords: String matching; Pattern matching; String searching; Text searching; Text editing

C. R. Categories: F.2.2,

1. INTRODUCTION

Pattern matching is a basic problem in computer science and it occurs naturally as part of data processing, information retrieval, speech recognition, vision for two-dimensional image recognition and computational biology. The type of pattern matching discussed in this paper is exact string matching. String matching is a special case of pattern matching, where the pattern is described by a finite sequence of symbols from an alphabet Σ. It consists of finding one or, more generally, all the occurrences of a short pattern P = P[0]P[1]...P[m-1] of length m in a large text T = T[0]T[1]...T[n-1] of length n, where m, n > 0 and m ≤ n. Both P and T are built over the same alphabet Σ.

The solution to this problem differs depending on whether the algorithm has to be on-line (that is, the text is not known in advance) or off-line (the text can be preprocessed). In this paper, we focus on on-line algorithms for this problem. Numerous solutions to the string matching problem have been designed [2, 10, 29 and 24].

In general, an on-line string matching algorithm consists of two phases: the preprocessing phase on P and the search phase of P in T. During the preprocessing phase a data structure X is constructed; X is usually proportional to the length of the pattern and its details vary between algorithms. The search phase uses the data structure X and tries to determine quickly whether the pattern occurs in the text. This phase is based on four different approaches: classical, suffix automata, bit-parallelism and hashing. More specifically, the algorithms for the string matching problem can be divided into four categories:

Classical algorithms: Brute-Force [24] algorithm, Knuth-Morris-Pratt [18] algorithm, Simon [14] algorithm, Colussi [8] algorithm, Boyer-Moore [3] algorithm, the variations of the Boyer-Moore algorithm like Galil [12] algorithm, Apostolico-Giancarlo [1] algorithm, Turbo-BM [7] algorithm, Reverse Colussi [9] algorithm, Boyer-Moore-Horspool [16] algorithm, Sunday's algorithms (Quick Search, Optimal Mismatch, Maximal Shift) [30], Boyer-Moore-Horspool-Raita [26] algorithm and Boyer-Moore-Smith [28] algorithm.

Suffix automata algorithms: Reverse Factor [21 and 7] algorithm and Turbo Reverse Factor [7] algorithm.

Bit-parallelism algorithms: Shift-Or [6] algorithm, Shift-And [31] algorithm and BNDM [25] algorithm.

Hashing algorithms: Harrison [15] algorithm and Karp-Rabin [24] algorithm.
Several experiments on string matching algorithms have already been reported [16, 27, 11, 4, 30, 17, 28, 6, 26, 22, 23 and 25]. In this paper we report experiments on eleven well-known algorithms drawn from these categories: the Brute-Force algorithm, the Knuth-Morris-Pratt algorithm, the Boyer-Moore algorithm, the Turbo-BM algorithm, the Boyer-Moore-Horspool algorithm, the Quick Search algorithm, the Boyer-Moore-Smith algorithm, the Reverse Factor algorithm, the Shift-Or algorithm, the BNDM algorithm and the Karp-Rabin algorithm.

This paper is organized as follows: in the next section we present the algorithms tested. In the third section we describe the experimental methodology, including the test environment, the types of test data and the measures used for comparing the algorithms. In section four we present the results of our experiments in the form of performance tables and graphs. In the last section, we discuss the conclusions of this paper and outline some goals for further research.

2. STRING MATCHING ALGORITHMS

In this section we present the basic sequential algorithms tested for solving the string matching problem. For further details and the coding of the algorithms, the reader is referred to [24] and the original references.

2.1. Classical Approach

The classical string matching algorithms are based on character comparisons.

The Brute-Force (in short, BF) [24] algorithm, which is the simplest, performs character comparisons between a character in the text and a character in the pattern from left to right. In any case, after a mismatch or a complete match of the entire pattern it shifts exactly one position to the right. It requires no preprocessing phase and no extra space. The BF algorithm has O(mn) worst-case time complexity. The average number of character comparisons is n(1 + 1/(|Σ| - 1)).

The Knuth-Morris-Pratt (in short, KMP) [18] algorithm, which was the first linear-time string matching algorithm discovered, performs character comparisons from left to right. In case of a mismatch it uses the knowledge of the previously examined characters to compute the next position of the pattern to use. In addition, this algorithm has the advantage that the pointer in the text is never decremented. The preprocessing phase of the KMP algorithm requires O(m) time and space. The searching phase needs O(n) time in the worst and average case.
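The two left-to-right algorithms described above can be illustrated with a minimal Python sketch. It is not the ANSI C code used in the experiments; the function names are hypothetical, and the KMP part uses the plain border (failure) table, which is a common simplified formulation rather than necessarily the exact table of [18].

```python
def brute_force_search(text, pattern):
    """Return (match positions, number of character comparisons)."""
    n, m = len(text), len(pattern)
    positions, comparisons = [], 0
    for start in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break
            j += 1
        if j == m:
            positions.append(start)
        # After a mismatch or a full match the window moves one position.
    return positions, comparisons


def build_failure_table(pattern):
    """fail[j] = length of the longest proper border of pattern[:j + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_search(text, pattern):
    """Return all match positions; the text index only moves forward."""
    fail = build_failure_table(pattern)  # O(m) preprocessing
    positions = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # reuse what is known about the matched prefix
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 1)
            k = fail[k - 1]
    return positions
```

Both functions assume a non-empty pattern, in line with the condition m > 0 stated in the introduction. The comparison counter in `brute_force_search` corresponds to the kind of count used later as a performance measure.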
The next algorithm is the Boyer-Moore (in short, BM) [3] algorithm, which is known to be very fast in practice. It performs character comparisons between a character in the text and a character in the pattern from right to left. After a mismatch or a complete match of the entire pattern it uses two shift heuristics to shift the pattern to the right. These two heuristics are called the occurrence heuristic and the match heuristic. The length of the shift is the maximum of the shifts given by the occurrence heuristic and the match heuristic. For the details of the two heuristics the reader is referred to the original paper [3]. These heuristics are preprocessed in O(m + |Σ|) time and space. The searching phase of the BM algorithm needs O(n + rm) time in the worst case, where r is the number of occurrences of the pattern in the text. Finally, the expected performance of the BM algorithm is sublinear, requiring about n/m character comparisons on average.

The Turbo-BM (in short, TBM) [7] algorithm is a variant of the BM algorithm. It consists in remembering the substring of the text that matched a suffix of the pattern during the last character comparisons (and only if a good suffix shift has been performed). This method has two advantages: a) it is possible to jump over this substring and b) it makes it possible to perform a turbo shift. For the details of the turbo shift the reader is referred to the original paper [7]. It can be shown that the number of character comparisons performed by the TBM algorithm is bounded by 2n.

The Boyer-Moore-Horspool (in short, BMH) [16] algorithm does not use the match heuristic. In case of a mismatch or a match of the pattern, the length of the shift is maximized by using only the occurrence heuristic for the text character corresponding to the rightmost pattern character (and not for the text character where the mismatch occurred). The preprocessing phase of the BMH algorithm requires O(m + |Σ|) time and reduces the space requirements from O(m + |Σ|) to O(|Σ|). Finally, the searching phase requires O(mn) time in the worst case but it can be proved that the average number of character comparisons is n/|Σ|.
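As an illustration of the BMH idea described above, the following Python sketch uses only the occurrence heuristic, applied to the text character aligned with the rightmost pattern position. The function name is hypothetical, and a dictionary stands in for the array of size |Σ| that a C implementation would typically use; it is a sketch under these assumptions, not the code measured in the experiments.

```python
def horspool_search(text, pattern):
    """Return all positions where pattern occurs in text (BMH-style)."""
    n, m = len(text), len(pattern)
    # Occurrence heuristic: for each character of pattern[0..m-2], the
    # distance from its rightmost occurrence to the end of the pattern.
    shift = {}
    for j in range(m - 1):
        shift[pattern[j]] = m - 1 - j
    positions = []
    pos = 0
    while pos <= n - m:
        # Compare from right to left, as in the BM family.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            positions.append(pos)
        # Shift on the text character under the last pattern position,
        # whether the attempt ended in a match or a mismatch.
        pos += shift.get(text[pos + m - 1], m)
    return positions
```

Characters that do not occur in the first m - 1 pattern positions give the maximal shift m, which is where the sublinear average behaviour on large alphabets comes from.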
The Quick Search (in short, QS) [30] algorithm of Sunday performs character comparisons from left to right, starting from the leftmost pattern character, and in case of a mismatch it computes the shift with the occurrence heuristic for the first text character after the last pattern character at the time of the mismatch. The preprocessing and searching times of the QS algorithm are the same as those of the BMH algorithm.

In the Boyer-Moore-Smith (in short, BMS) [28] algorithm, Smith noticed that computing the shift with the text character just after the rightmost text character of the window sometimes gives a shorter shift than using the rightmost text character itself. He therefore advised taking the maximum of the two values. The preprocessing phase of the BMS algorithm takes O(m + |Σ|) time and O(|Σ|) space. Further, this algorithm has O(mn) worst-case time complexity.
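The two shift rules just described can be sketched together in Python. This is a simplified illustration, not the tested implementation: the function name and the `smith` flag are hypothetical, dictionaries replace the |Σ|-sized arrays, and the left-to-right character comparison is expressed as a slice comparison for brevity.

```python
def quick_search(text, pattern, smith=False):
    """Quick Search; with smith=True the shift is the maximum of the
    Quick Search shift and the Horspool shift, as Smith suggested."""
    n, m = len(text), len(pattern)
    # Shift for the text character just after the window (Quick Search).
    qs_shift = {c: m - j for j, c in enumerate(pattern)}
    # Shift for the text character under the last pattern position
    # (Horspool table, needed only for Smith's variant).
    bmh_shift = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    positions = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            positions.append(pos)
        if pos + m >= n:
            break  # no text character after the window, search is over
        step = qs_shift.get(text[pos + m], m + 1)
        if smith:
            step = max(step, bmh_shift.get(text[pos + m - 1], m))
        pos += step
    return positions
```

Because each of the two shifts is safe on its own, taking their maximum never skips an occurrence; the variant simply picks whichever character allows the longer jump.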

2.2. Suffix Automata Approach

This category uses the suffix automaton data structure (frequently called DAWG, for Deterministic Acyclic Word Graph) that recognizes all the suffixes of the pattern [10 and 25].

The Reverse Factor (in short, RF) [21 and 7] algorithm scans the characters of the text from right to left using the smallest suffix automaton of the reverse pattern. The preprocessing phase of the RF algorithm requires linear time and space in the length of the pattern. The searching phase of the RF algorithm has quadratic worst-case time complexity but it is optimal on average: it performs O(n log m / m) character comparisons on average.

2.3. Bit Parallelism Approach

Bit parallelism [6 and 5] uses the intrinsic parallelism of the bit manipulations inside computer words to perform many operations in parallel (we denote by w the number of bits in the computer word). This technique has become a general way to simulate simple nondeterministic finite automata (NFA) instead of converting them to deterministic ones. The main advantages of this approach are simplicity, flexibility and no buffering.

The basic idea of the first such algorithm, Shift-Or (in short, SO) [6], is to represent the state of the search as a number; each search step costs a small number of arithmetic and logical operations, provided that the numbers are large enough to represent all possible states of the search. Assuming that the pattern length is no longer than the computer word of the machine, the time complexity of the preprocessing phase is O((m + |Σ|)⌈m/w⌉) using O(m|Σ|) extra space. Finally, the time complexity of the searching phase is O(n⌈m/w⌉) in the worst and average case, where ⌈m/w⌉ is the time to compute a shift or other simple operation on numbers of m bits using a word size of w bits.

A more recent algorithm is Backward Nondeterministic DAWG Matching (BNDM) [25].
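Before the BNDM description continues, the Shift-Or search step described above can be illustrated with a minimal Python sketch. The function and variable names are hypothetical. Python integers are unbounded, so the sketch masks the state to m bits explicitly; in a C implementation the state would be a single machine word and the pattern length would have to satisfy m ≤ w.

```python
def shift_or_search(text, pattern):
    """Return all positions where pattern occurs in text (Shift-Or)."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # Preprocessing: bit j of masks[c] is 0 exactly when pattern[j] == c.
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << j)
    positions = []
    # Search state: bit j is 0 when pattern[:j + 1] matches the text
    # ending at the current position.
    state = all_ones
    accept = 1 << (m - 1)
    for i, c in enumerate(text):
        # One shift and one OR per text character, independent of m.
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if state & accept == 0:
            positions.append(i - m + 1)
    return positions
```

No explicit character comparison appears in the search loop, which is why only running time, and not a comparison count, is a meaningful measure for this family of algorithms.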
The BNDM algorithm uses a nondeterministic suffix automaton that is simulated using bit-parallelism. The preprocessing time for the BNDM algorithm is O(m + |Σ|) for m ≤ w, using O(m|Σ|) extra space. The searching time is O(mn) in the worst case and O(n log m / m) on average.

2.4. Hashing Approach

We introduce a different approach to string matching, the Karp-Rabin (in short, KR) [24] algorithm, which uses hashing techniques. Hashing provides a simple method to avoid a quadratic number of character comparisons in most practical situations. The main idea of the KR algorithm is to compute the signature (hash value) of each possible m-character substring in the text and check whether it is equal to the signature of the pattern. The preprocessing phase of the KR algorithm requires O(m) time while the searching phase has O(mn) worst-case time complexity. Its expected number of character comparisons is O(m + n).

3. EXPERIMENTAL METHODOLOGY

In this section we present the testing methodology used in our experiments to compare the relative performance of the string matching algorithms. The parameters that describe the performance of the algorithms are: a) the text size, b) the pattern length and c) the alphabet size. It is known that none of the algorithms is optimal or best in all three cases. Therefore, the main goal of our experimental study is to compare the practical performance of the algorithms against the length of the pattern (small and long patterns) under alphabets of different sizes (or types of text), i.e., binary alphabet, alphabet of size 8, English alphabet and DNA alphabet, which have different characteristics.

3.1. Test Environment

The experiments were run on a Sun UltraSparc-1 with a 143 MHz clock, 64 MB of RAM and a 2.1 GB local hard disk; it is a 32-bit machine. The operating system is Solaris 2.5. During all experiments, this machine was not performing other heavy tasks (or processes). The data structures used in the testing were all in physical memory during the experiments. Finally, the algorithms presented in Section 2 have been implemented in the ANSI C programming language [19] in a homogeneous way so as to keep their comparison significant, using the compiler cc.
We made extensive use of the code presented in [4, 13 and 24] for the known algorithms.

3.2. Types of Test Data

Because the performance of string matching algorithms depends on the statistical properties of the pattern and of the text from which the test patterns are obtained, experiments were performed on four different types of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet.

3.2.1. Binary Alphabet

The alphabet is Σ = {0, 1}. The text consists of 150,000 characters and was randomly built. For each pattern length between 2 and 100 we search for 50 randomly built patterns.

3.2.2. Alphabet of Size 8

The alphabet is Σ = {a, b, c, d, e, f, g, h}. The text consists of 150,000 characters and was randomly built. In addition, for each pattern length between 2 and 100 we search for 50 randomly built patterns.

3.2.3. English Alphabet

We used an English-language document from a web page. The alphabet consists of 70 different characters. The text consists of 148,188 characters and we search for 50 patterns of each length from 2 to 100 characters, chosen at random from words inside the text.

3.2.4. DNA Alphabet

The DNA alphabet consists of the four nucleotides a, c, g and t (standing for adenine, cytosine, guanine and thymine, respectively) used to encode DNA. Therefore, the alphabet is Σ = {a, c, g, t}. The text consists of 997,642 characters and we search for 50 patterns of each length from 10 to 100 characters. Finally, the text and the patterns are portions of the GenBank DNA database, as distributed by Hume and Sunday [17].

3.3. Measures of Comparison

For the comparison of the string matching algorithms we used the number of character comparisons and the practical running time as measures. The count of character comparisons is the same as that used by Smith [28], that is, the ratio of the number of characters actually compared to the number of characters passed in the text. Since all algorithms are designed to find all occurrences of a pattern in the text, the number of passed characters in our experiments is always n - m + 1. The running time is the total time of calling an algorithm to search for a pattern in the text, including the preprocessing time of building the auxiliary arrays. The running time is obtained by calling the C function clock() and is measured in seconds. Thus, we measured the number of character comparisons and the running time of all the algorithms in Section 2 in order to examine the effect of the pattern length.

We performed the following test series. We measured the effect of the pattern length in a test series with m = 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 60, 80 and 100. For the DNA alphabet we used longer patterns because its biological applications involve long patterns. For this reason, for this alphabet we measured the effect of the pattern length in a test series with m = 10, 20, 30, 40, 50 and 100. Finally, to decrease random variation, the results of the algorithms are averages of 50 runs with different patterns of each length.

We note that for the bit-parallelism algorithms (SO and BNDM) only the running time is measured because they involve only implicit character comparisons. In addition, they are limited to pattern lengths no greater than the word size in bits. For this reason, in our experimental study the SO and BNDM algorithms are limited to pattern lengths that fit in the 32-bit word of the test machine.

4. EXPERIMENTAL RESULTS

In the previous sections we have briefly presented the most well-known string matching algorithms and the experimental methodology of our tests. In this section, we present the experimental results for the string matching algorithms according to the number of character comparisons and the running time.
Finally, the performance of each algorithm was plotted against the length of the pattern for each type of text.

4.1. Results for the Number of Character Comparisons

Figures 1 to 4 and Tables I to IV show the results for the number of character comparisons for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet respectively, against the pattern length. It can be seen that the KMP and KR algorithms perform in all cases approximately 1 character comparison per text character. Further, the BF algorithm performs approximately the same number of character comparisons as the KMP and KR algorithms for the alphabet of size 8 and for the English alphabet. The BF algorithm requires more character comparisons for small alphabets (i.e., the binary or the DNA alphabet). Based on the empirical results, it is clear that for the binary alphabet and patterns of length greater than 10, the number of character comparisons is approximately 2, twice the number required by the KMP and KR algorithms. For the DNA alphabet the BF algorithm requires on average 1,34 character comparisons. This occurs because a small alphabet leads to many exact pattern matches in the text and as a result the number of character comparisons tends to be greater than 1. However, when a larger alphabet is used this phenomenon is alleviated, according to Figures 2 and 3.

FIGURE 1 Binary alphabet (horizontal axis: pattern length).
FIGURE 2 Alphabet of size 8 (horizontal axis: pattern length).
FIGURE 3 English alphabet (horizontal axis: pattern length).
FIGURE 4 DNA alphabet (horizontal axis: pattern length).

The number of character comparisons of the BM-like algorithms (BM, BMH, QS, BMS and TBM) and of the suffix automata algorithm (RF) is generally less than 1, with the exception of the binary alphabet, where the BMH and QS algorithms perform on average 1,25 and 1,1 character comparisons. Furthermore, it must be noted that the number of character comparisons of the BM-like and the RF algorithms is significantly higher for the binary alphabet than for any other type of text. It should also be observed that for all those algorithms the number of character comparisons decreases significantly as the pattern length increases. Thus the empirical results support the theoretical evidence that the BM-like and the RF algorithms are sublinear in the number of character comparisons. The number of character comparisons decreases more slowly as the pattern length increases because for long patterns the probability is higher that the character just fetched occurs somewhere in the pattern, and therefore the distance the pattern can be moved forward (if a mismatch occurs) is shortened. Moreover, the numbers of character comparisons of all BM-like algorithms are very close to one another and tend to stabilize at a certain level, except for the binary alphabet. Finally, for long patterns the difference between the number of character comparisons performed by the BM-like algorithms and the number performed by the suffix automata algorithm (RF) increases in all cases.
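As a rough cross-check of the Brute-Force figures quoted in this subsection, the average-case estimate n(1 + 1/(|Σ| - 1)) given with the description of the BF algorithm can be evaluated per text character. This is only a sketch: it assumes uniformly random text, which holds for the two synthetic texts but only approximately for the English and DNA texts, and the symbol \bar{c} is introduced here purely for illustration.

```latex
\bar{c}(|\Sigma|) = 1 + \frac{1}{|\Sigma| - 1}, \qquad
\bar{c}(2) = 2, \quad
\bar{c}(4) = \tfrac{4}{3} \approx 1.33, \quad
\bar{c}(8) = \tfrac{8}{7} \approx 1.14, \quad
\bar{c}(70) = \tfrac{70}{69} \approx 1.01 .
```

The values for |Σ| = 2 and |Σ| = 4 are close to the measured figures of about 2 for the binary alphabet and 1,34 for the DNA alphabet, and the values for the two larger alphabets are consistent with the BF algorithm staying near 1 comparison per text character there.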
In all cases, it can be seen that the BM-like algorithms and the suffix automata algorithm (RF) have better results. More specifically, the BM-like algorithms (such as TBM and BMS) and the RF algorithm are much more efficient in terms of the number of character comparisons than the remaining algorithms, for small and long patterns respectively.

TABLE I Number of character comparisons for a binary alphabet (columns: m, BF, KMP, BM, BMH, QS, BMS, TBM, RF, KR)

TABLE II Number of character comparisons for an alphabet of size 8 (columns: m, BF, KMP, BM, BMH, QS, BMS, TBM, RF, KR)

TABLE III Number of character comparisons for an English alphabet (columns: m, BF, KMP, BM, BMH, QS, BMS, TBM, RF, KR)

TABLE IV Number of character comparisons for a DNA alphabet (columns: m, BF, KMP, BM, BMH, QS, BMS, TBM, RF, KR)

4.2. Results for the Running Time

Figures 5 to 8 and Tables V to VIII show the results for the practical running time for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet respectively, against the pattern length. We observe that in all cases the KR algorithm requires much more time than any other algorithm. This observation agrees with the expected behaviour: the computation of the hash values is expensive in terms of machine cycles and so increases the running time of the algorithm. Therefore, this algorithm is not recommended for text applications.

FIGURE 5 Binary alphabet (horizontal axis: pattern length).
FIGURE 6 Alphabet of size 8 (horizontal axis: pattern length).

FIGURE 7 English alphabet (horizontal axis: pattern length).
FIGURE 8 DNA alphabet.

Further, based on the empirical results, it is clear that in all cases the KMP algorithm is somewhat slower than the BF algorithm for almost all pattern lengths, with the exception of the binary alphabet. This behaviour supports the theoretical evidence that the KMP algorithm is not better than the BF algorithm in the average case. Further, it can also be seen that in all cases the BF and KMP algorithms are significantly slower than the BM-like and bit-parallelism algorithms.

The running time of the BM-like and bit-parallelism (like BNDM) algorithms decreases significantly as the pattern length increases. Moreover, it should be noticed that the BM-like algorithms produce running times very close to each other in all cases, with the exception of the binary alphabet. In addition, for long patterns the difference between the running times of the BM-like algorithms and of suffix automata algorithms like RF increases in all cases, with the exception of the English alphabet. This difference is in favour of the RF algorithm.

TABLE V Running times for a binary alphabet (columns: m, BF, KMP, BM, BMH, QS, BMS, TBM, RF, SO, BNDM, KR)

TABLE VI Running times for an alphabet of size 8

TABLE VII Running times for an English alphabet (columns: m, BF, KMP, BM, BMH, QS, BMS, TBM, RF, SO, BNDM, KR)

TABLE VIII Running times for a DNA alphabet

m        BF      KMP     BM      BMH     QS      BMS     TBM     RF      SO      BNDM    KR
10       0,4958  0,6146  0,1704  0,1596  0,157   0,1638  0,317   0,1248  0,3018  0,1334  1,
20       0,4958  0,6108  0,1546  0,1684  0,161   0,1702  0,2902  0,0752  0,3024  0,0746  1,
30       0,4916  0,605   0,144   0,1534  0,1472  0,153   0,261   0,0554  0,3024  0,0528  1,
40       0,494   0,6084  0,1204  0,1498  0,1474  0,1502  0,2256  0,                      ,
50       0,4932  0,61    0,1286  0,1568  0,1556  0,157   0,2412  0,                      ,
100      0,4931  0,62    0,1252  0,156   0,161   0,1581  0,2221  0,                      ,5578
Average  0,      ,       ,       ,       ,       ,       ,       ,0647   0,3022  0,      ,561767

The SO bit-parallelism algorithm outperforms the KR, KMP and BF algorithms for all pattern lengths. SO is faster than the TBM and BNDM algorithms only for small patterns. The latter observation is valid in all cases with the exception of the binary alphabet. However, it can be seen that the SO algorithm outperforms the BM-like and suffix automata algorithms for small patterns, especially for the binary alphabet.

Finally, it can be seen that in the majority of cases the suffix automata algorithm (RF) has a faster running time than the BM-like and the bit-parallelism algorithms for long patterns. Further, the BM-like algorithms have better running times for small patterns, except for the binary alphabet.

5. CONCLUSIONS

We have presented the results of an extensive set of experiments on the most well-known string matching algorithms based on the classical, suffix automata, bit-parallelism and hashing approaches. The conclusions of this paper fall into two main categories: general conclusions regarding the algorithms and their testing procedures, and conclusions relating to the performance of specific algorithms.

As a general conclusion we can say that testing the algorithms on four different types of text (binary alphabet, alphabet of size 8, English alphabet and DNA alphabet) indicates that varying parameters such as the pattern length and the alphabet size can produce different performances.

The specific performance conclusions are:

It should be noticed that the absolute shapes of the lines on the performance graphs are not conclusive. Information can only be derived from the relative positions of the curves for each algorithm at each pattern length. This is because the patterns were chosen at random and the running time is related to how far into the text the pattern occurs.
The running times for all eleven algorithms can be compared at each pattern length because the same type of text and set of patterns were used with each algorithm.

From the empirical evidence it can be concluded that the KR algorithm is linear in the number of character comparisons but has a higher running time, and it should not be used for pattern matching in strings. However, the main advantage of this algorithm lies in its extension to higher-dimensional string matching. It may be used for pattern recognition and image processing and thus in the expanding field of computer graphics.

If you plan on direct searching in simple text, the BF algorithm, which behaves linearly in practice, is a proper choice because it produces relatively good running times despite its striking simplicity. In addition, the BF algorithm has no special memory requirements and needs no preprocessing or complex coding, and thus can be surprisingly fast. But this algorithm should not be used for the binary alphabet in applications such as image processing or software systems.

Despite its theoretical elegance, the KMP algorithm provides no significant speedup over the BF algorithm in practice unless the pattern has highly repetitive subpatterns. However, the KMP algorithm guarantees a linear bound and it is well suited to extensions for more difficult problems. It may be a good choice when the alphabet size is near the text size or when dealing with the binary alphabet.

As far as the variations of the BM approach are concerned, we can make the following remarks. Based on the empirical results, the QS algorithm proves to be much faster in practice than the other BM-like, suffix automata and bit-parallelism algorithms for large alphabets and short patterns. Therefore it is typically suited for searching in the English alphabet. In addition, the BM algorithm is faster than its variations (such as BMH, QS, BMS and TBM) for small alphabets and long patterns. However, in theory BMS and QS are better algorithms than the other BM-like and suffix automata algorithms for short patterns and large alphabets. The TBM and BMS algorithms are also good for small alphabets and short or medium patterns.

We must also note that the main disadvantage of the BM-like algorithms is the preprocessing time and the space required, which depend on the alphabet size and/or the pattern size. For this reason, if the pattern is small (1 to 4 characters) it is better to use the BF algorithm. Furthermore, the BM-like algorithms cannot be used if the type of string matching problem is different from finding the first occurrence of a pattern.
For example, if the problem is to find the first of several possible patterns or to recognize a position in the text defined by a regular expression. This is also because the preprocessing time would be significant. It should be noted that for long patterns the running time of the suffix automata algorithm (RF) increases because of the preprocessing phase, the time for which is equal to the time for the searching phase. Thus, the RF algorithm is efficient in theory and practice for small alphabets and long patterns. Therefore, this algorithm is a good choice to be used for DNA applications. In practice, the bit-parallelism algorithms (SO and BNDM) are always fastest for small alphabets and short patterns. Also, the SO algorithm produces linear running time similar to the BF and KMP algorithms. In particular, the BNDM algorithm is the fastest and outperforms BM-like

algorithms for moderate patterns. However, the main advantage of the bit-parallelism algorithms is that they are simple to implement and support classes of characters (e.g. [a-z]), don't care symbols (a don't care symbol matches any symbol), the complement of a character or a class, and other extensions developed by [31] such as wild cards (a wild card is a symbol that matches all characters), sets of patterns, long patterns, etc., using exactly the same searching time (only the preprocessing is different). On the other hand, these algorithms have the disadvantage that the pattern is limited to 32 or 64 characters (32 or 64 being the word size of many of today's machines). Handling longer patterns is fairly easy (it requires multiprecision bit operations), but it can slow down the algorithms significantly. For many applications, however, a maximum pattern length of 32 or 64 characters is not much of a problem.

In addition, we notice that the theoretical time complexities of the algorithms [24] are valid only in the average case. For instance, the experiments have shown that, on average, algorithms such as BF, BMH, QS, BMS and BNDM behave well. On the other hand, the experiments have shown that, in both the worst and average cases, only the BM, RF and SO algorithms are fast both theoretically and practically.

References

[1] Apostolico, A. and Giancarlo, R. (1986). The Boyer-Moore-Galil string searching strategies revisited, SIAM Journal on Computing, 15(1).
[2] Aho, A. V., Algorithms for finding patterns in strings, Chapter 5 of Leeuwen, J. van (Ed.), Handbook of Theoretical Computer Science, Elsevier Science Publishers, Amsterdam.
[3] Boyer, R. S. and Moore, J. S. (1977). A fast string searching algorithm, Communications of the ACM, 20(10).
[4] Baeza-Yates, R. (1989). Algorithms for string searching: A survey, ACM SIGIR Forum, 23(3-4).
[5] Baeza-Yates, R. (1992). Text Retrieval: Theory and Practice, In: Proc. of the 12th IFIP World Computer Congress (Madrid, Spain), North-Holland.
[6] Baeza-Yates, R. and Gonnet, G. H. (1992). A new approach to text searching, Communications of the ACM, 35(10).
[7] Crochemore, M., Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W. and Rytter, W. (1994). Speeding Up Two String Matching Algorithms, Algorithmica, 12(4-5).
[8] Colussi, L. (1991). Correctness and efficiency of the pattern matching algorithms, Information and Computation, 95(2).
[9] Colussi, L. (1994). Fastest pattern matching in strings, Journal of Algorithms, 16(2).
[10] Crochemore, M. and Rytter, W. (1994). Text Algorithms, Oxford University Press.
[11] Davies, G. and Bowsher, S. (1986). Algorithms for pattern matching, Software-Practice and Experience, 16(6).
[12] Galil, Z. (1979). On improving the worst case running time of the Boyer-Moore string searching algorithm, Communications of the ACM, 22(9).
[13] Gonnet, G. H. and Baeza-Yates, R. (1991). Handbook of Algorithms and Data Structures in Pascal and C, 2nd edition, Addison-Wesley, Wokingham.
[14] Hancart, C. (1993). On Simon's string searching algorithm, Information Processing Letters, 47(2).
[15] Harrison, M. C. (1971). Implementation of the substring test by hashing, Communications of the ACM, 14(12).
[16] Horspool, R. N. (1980). Practical fast searching in strings, Software-Practice and Experience, 10(6).
[17] Hume, A. and Sunday, D. (1991). Fast string searching, Software-Practice and Experience, 21(11).
[18] Knuth, D. E., Morris, J. H. and Pratt, V. R. (1977). Fast pattern matching in strings, SIAM Journal on Computing, 6(2).
[19] Kernighan, B. W. and Ritchie, D. M. (1988). The C Programming Language, 2nd edition, Prentice Hall, Englewood Cliffs, NJ.
[20] Liu, Z., Du, X. and Ishii, N. (1998). An improved adaptive string searching algorithm, Software-Practice and Experience, 28(2).
[21] Lecroq, T. (1992). A variation on the Boyer-Moore algorithm, Theoretical Computer Science, 92(1).
[22] Lecroq, T. (1995). Experimental results on string matching algorithms, Software-Practice and Experience, 25(7).
[23] Manolopoulos, Y. and Faloutsos, C. (1996). Experimenting with pattern matching algorithms, Information Sciences, 90(1-4).
[24] Michailidis, P. and Margaritis, K. (1999). String Matching Algorithms, Technical Report, Department of Applied Informatics, University of Macedonia (in Greek).
[25] Navarro, G. and Raffinot, M. (1998). A Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching, In: Proc. of the 9th Annual Symposium on Combinatorial Pattern Matching, No. 1448, Springer-Verlag, Berlin.
[26] Raita, T. (1992). Tuning the Boyer-Moore-Horspool string searching algorithm, Software-Practice and Experience, 22(10).
[27] Smit, G. and De, V. (1982). A Comparison of Three String Matching Algorithms, Software-Practice and Experience, 12(1).
[28] Smith, P. (1991). Experiments with a very fast substring search algorithm, Software-Practice and Experience, 21(10).
[29] Stephen, A. G. (1994). String Searching Algorithms, World Scientific Press.
[30] Sunday, D. (1990). A very fast substring search algorithm, Communications of the ACM, 33(8).
[31] Wu, S. and Manber, U. (1992). Fast text searching allowing errors, Communications of the ACM, 35(10).
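
As an illustration of the bit-parallel approach discussed in the concluding remarks, the following is a minimal Python sketch of a Shift-Or style search. It is not the implementation used in the experiments. The function name `shift_or_search` and the convention of giving the pattern as a sequence of allowed-character strings are assumptions made for this example only. The sketch shows how a class of characters changes only the preprocessing (the bit masks) and leaves the scanning loop untouched.

```python
def shift_or_search(pattern, text):
    """Return the start index of every match of `pattern` in `text`.

    Hypothetical helper for illustration. `pattern` is a sequence in which
    element i is a string of the characters allowed at position i, so a
    plain string such as "aba" works, and so does a list with classes
    such as ["a", "bc", "d"].
    """
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1

    # Preprocessing: bit i of masks[c] is 0 exactly when c is allowed
    # at pattern position i, and 1 otherwise.
    masks = {}
    for i, allowed in enumerate(pattern):
        for ch in allowed:
            masks[ch] = masks.get(ch, all_ones) & ~(1 << i)

    # Searching: bit i of `state` is 0 exactly when pattern[0..i]
    # matches the text ending at the current character.
    state = all_ones
    accept = 1 << (m - 1)
    hits = []
    for j, ch in enumerate(text):
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & accept == 0:
            hits.append(j - m + 1)
    return hits


print(shift_or_search("aba", "ababa"))                    # [0, 2]
print(shift_or_search(["a", "bc", "d"], "xabd acd abcd"))  # [1, 5]
```

Python integers have arbitrary precision, so the 32 or 64 character limit does not show up in this sketch. In a language with fixed-width words, `state` and the masks would normally be held in a single machine word, which is where the limit on pattern length comes from.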


More information

Algorithms for Weighted Matching

Algorithms for Weighted Matching Algorithms for Weighted Matching Leena Salmela and Jorma Tarhio Helsinki University of Technology {lsalmela,tarhio}@cs.hut.fi Abstract. We consider the matching of weighted patterns against an unweighted

More information

An analysis of the Intelligent Predictive String Search Algorithm: A Probabilistic Approach

An analysis of the Intelligent Predictive String Search Algorithm: A Probabilistic Approach I.J. Information Technology and Computer Science, 2017, 2, 66-75 Published Online February 2017 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2017.02.08 An analysis of the Intelligent Predictive

More information

Fast Searching in Biological Sequences Using Multiple Hash Functions

Fast Searching in Biological Sequences Using Multiple Hash Functions Fast Searching in Biological Sequences Using Multiple Hash Functions Simone Faro Dip. di Matematica e Informatica, Università di Catania Viale A.Doria n.6, 95125 Catania, Italy Email: faro@dmi.unict.it

More information

A string is a sequence of characters. In the field of computer science, we use strings more often as we use numbers.

A string is a sequence of characters. In the field of computer science, we use strings more often as we use numbers. STRING ALGORITHMS : Introduction A string is a sequence of characters. In the field of computer science, we use strings more often as we use numbers. There are many functions those can be applied on strings.

More information

GENERATING SUPPLEMENTARY INDEX RECORDS USING MORPHOLOGICAL ANALYSIS FOR HIGH-SPEED PARTIAL MATCHING ABSTRACT

GENERATING SUPPLEMENTARY INDEX RECORDS USING MORPHOLOGICAL ANALYSIS FOR HIGH-SPEED PARTIAL MATCHING ABSTRACT GENERATING SUPPLEMENTARY INDEX RECORDS USING MORPHOLOGICAL ANALYSIS FOR HIGH-SPEED PARTIAL MATCHING Masahiro Oku NTT Affiliated Business Headquarters 20-2 Nishi-shinjuku 3-Chome Shinjuku-ku, Tokyo 163-1419

More information

Multiple Skip Multiple Pattern Matching Algorithm (MSMPMA)

Multiple Skip Multiple Pattern Matching Algorithm (MSMPMA) Multiple Skip Multiple Pattern Matching (MSMPMA) Ziad A.A. Alqadi 1, Musbah Aqel 2, & Ibrahiem M. M. El Emary 3 1 Faculty Engineering, Al Balqa Applied University, Amman, Jordan E-mail:ntalia@yahoo.com

More information

Multi-Pattern String Matching with Very Large Pattern Sets

Multi-Pattern String Matching with Very Large Pattern Sets Multi-Pattern String Matching with Very Large Pattern Sets Leena Salmela L. Salmela, J. Tarhio and J. Kytöjoki: Multi-pattern string matching with q-grams. ACM Journal of Experimental Algorithmics, Volume

More information

COMPARISON AND IMPROVEMENT OF STRIN MATCHING ALGORITHMS FOR JAPANESE TE. Author(s) YOON, Jeehee; TAKAGI, Toshihisa; US

COMPARISON AND IMPROVEMENT OF STRIN MATCHING ALGORITHMS FOR JAPANESE TE. Author(s) YOON, Jeehee; TAKAGI, Toshihisa; US Title COMPARISON AND IMPROVEMENT OF STRIN MATCHING ALGORITHMS FOR JAPANESE TE Author(s) YOON, Jeehee; TAKAGI, Toshihisa; US Citation 数理解析研究所講究録 (1986), 586: 18-34 Issue Date 1986-03 URL http://hdl.handle.net/2433/99393

More information

A Two-Hashing Table Multiple String Pattern Matching Algorithm

A Two-Hashing Table Multiple String Pattern Matching Algorithm 2013 10th International Conference on Information Technology: New Generations A Two-Hashing Table Multiple String Pattern Matching Algorithm Chouvalit Khancome Department of Computer Science, Faculty of

More information

Data Structures and Algorithms. Course slides: String Matching, Algorithms growth evaluation

Data Structures and Algorithms. Course slides: String Matching, Algorithms growth evaluation Data Structures and Algorithms Course slides: String Matching, Algorithms growth evaluation String Matching Basic Idea: Given a pattern string P, of length M Given a text string, A, of length N Do all

More information

CS/COE 1501

CS/COE 1501 CS/COE 1501 www.cs.pitt.edu/~nlf4/cs1501/ String Pattern Matching General idea Have a pattern string p of length m Have a text string t of length n Can we find an index i of string t such that each of

More information

AGREP A FAST APPROXIMATE PATTERN-MATCHING TOOL. (Preliminary version) Sun Wu and Udi Manber 1

AGREP A FAST APPROXIMATE PATTERN-MATCHING TOOL. (Preliminary version) Sun Wu and Udi Manber 1 AGREP A FAST APPROXIMATE PATTERN-MATCHING TOOL (Preliminary version) Sun Wu and Udi Manber 1 Department of Computer Science University of Arizona Tucson, AZ 85721 (sw udi)@cs.arizona.edu ABSTRACT Searching

More information

Information Processing Letters Vol. 30, No. 2, pp , January Acad. Andrei Ershov, ed. Partial Evaluation of Pattern Matching in Strings

Information Processing Letters Vol. 30, No. 2, pp , January Acad. Andrei Ershov, ed. Partial Evaluation of Pattern Matching in Strings Information Processing Letters Vol. 30, No. 2, pp. 79-86, January 1989 Acad. Andrei Ershov, ed. Partial Evaluation of Pattern Matching in Strings Charles Consel Olivier Danvy LITP DIKU { Computer Science

More information

To cite this article: Raul Rojas (2014) Konrad Zuse's Proposal for a Cipher Machine, Cryptologia, 38:4, , DOI: /

To cite this article: Raul Rojas (2014) Konrad Zuse's Proposal for a Cipher Machine, Cryptologia, 38:4, , DOI: / This article was downloaded by: [FU Berlin] On: 26 February 2015, At: 03:28 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer

More information

17 dicembre Luca Bortolussi SUFFIX TREES. From exact to approximate string matching.

17 dicembre Luca Bortolussi SUFFIX TREES. From exact to approximate string matching. 17 dicembre 2003 Luca Bortolussi SUFFIX TREES From exact to approximate string matching. An introduction to string matching String matching is an important branch of algorithmica, and it has applications

More information

Bit-parallel (δ, γ)-matching and Suffix Automata

Bit-parallel (δ, γ)-matching and Suffix Automata Bit-parallel (δ, γ)-matching and Suffix Automata Maxime Crochemore a,b,1, Costas S. Iliopoulos b, Gonzalo Navarro c,2,3, Yoan J. Pinzon b,d,2, and Alejandro Salinger c a Institut Gaspard-Monge, Université

More information

Algorithms and Data Structures Lesson 3

Algorithms and Data Structures Lesson 3 Algorithms and Data Structures Lesson 3 Michael Schwarzkopf https://www.uni weimar.de/de/medien/professuren/medieninformatik/grafische datenverarbeitung Bauhaus University Weimar May 30, 2018 Overview...of

More information

Improved Parallel Rabin-Karp Algorithm Using Compute Unified Device Architecture

Improved Parallel Rabin-Karp Algorithm Using Compute Unified Device Architecture Improved Parallel Rabin-Karp Algorithm Using Compute Unified Device Architecture Parth Shah 1 and Rachana Oza 2 1 Chhotubhai Gopalbhai Patel Institute of Technology, Bardoli, India parthpunita@yahoo.in

More information

String Processing Workshop

String Processing Workshop String Processing Workshop String Processing Overview What is string processing? String processing refers to any algorithm that works with data stored in strings. We will cover two vital areas in string

More information