
This article was downloaded by: [Universiteit Twente] On: 21 May 2010
Publisher: Taylor & Francis, Informa Ltd, registered in England and Wales. Registered office: Mortimer House, Mortimer Street, London W1T 3JH, UK

International Journal of Computer Mathematics

On-line string matching algorithms: survey and experimental results
P. D. Michailidis and K. G. Margaritis
Parallel and Distributed Processing Laboratory, Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece

To cite this article: Michailidis, P. D. and Margaritis, K. G. (2001) 'On-line string matching algorithms: survey and experimental results', International Journal of Computer Mathematics, 76: 4.

Intern. J. Computer Math., Vol. 76
Reprints available directly from the publisher. Photocopying permitted by license only.
OPA (Overseas Publishers Association) N.V. Published by license under the Gordon and Breach Science Publishers imprint. Printed in Singapore.

ON-LINE STRING MATCHING ALGORITHMS: SURVEY AND EXPERIMENTAL RESULTS

P. D. MICHAILIDIS and K. G. MARGARITIS*

Parallel and Distributed Processing Laboratory, Department of Applied Informatics, University of Macedonia, 156 Egnatia Str., P.O. Box 1591, 54006, Thessaloniki, Greece

(Received 9 March 2000)

*Corresponding authors: {panosm, kmarg}@uom.gr

In this paper we present a short survey and experimental results for well-known sequential string matching algorithms. We consider algorithms based on different approaches, including classical, suffix automata, bit-parallelism and hashing. We put special emphasis on recently presented algorithms such as the Shift-Or and BNDM algorithms. We compare these algorithms in terms of the number of character comparisons and the running time for four different types of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet.

Keywords: String matching; Pattern matching; String searching; Text searching; Text editing

C. R. Categories: F.2.2

1. INTRODUCTION

Pattern matching is a basic problem in computer science. It occurs naturally as part of data processing, information retrieval, speech recognition, vision for two-dimensional image recognition and computational biology. The type of pattern matching discussed in this paper is exact string matching. String matching is a special case of pattern matching in which the pattern is described by a finite sequence of symbols over an alphabet Σ. It consists of finding one, or more generally all, of the occurrences of a short pattern P = P[0]P[1]...P[m-1] of length m in a large text T = T[0]T[1]...T[n-1] of length n, where m, n > 0 and m ≤ n. Both P and T are built over the same alphabet Σ.

The solutions to this problem differ depending on whether the algorithm has to be on-line (that is, the text is not known in advance) or off-line (the text can be preprocessed). In this paper we focus on on-line algorithms. Numerous solutions to the string matching problem have been designed [2, 10, 29 and 24].

In general, an on-line string matching algorithm consists of two phases: the preprocessing phase on P and the search phase of P in T. During the preprocessing phase a data structure X is constructed; X is usually proportional in size to the length of the pattern, and its details vary between algorithms. The search phase uses the data structure X and tries to determine quickly whether the pattern occurs in the text. This phase is based on four different approaches: classical, suffix automata, bit-parallelism and hashing. More specifically, the algorithms for the string matching problem can be divided into four categories:

Classical algorithms: Brute-Force [24] algorithm, Knuth-Morris-Pratt [18] algorithm, Simon [14] algorithm, Colussi [8] algorithm, Boyer-Moore [3] algorithm, and the variations of the Boyer-Moore algorithm such as the Galil [12] algorithm, Apostolico-Giancarlo [1] algorithm, Turbo-BM [7] algorithm, Reverse Colussi [9] algorithm, Boyer-Moore-Horspool [16] algorithm, Sunday's algorithms (Quick Search, Optimal Mismatch, Maximal Shift) [30], Boyer-Moore-Horspool-Raita [26] algorithm and Boyer-Moore-Smith [28] algorithm.

Suffix automata algorithms: Reverse Factor [21 and 7] algorithm and Turbo Reverse Factor [7] algorithm.

Bit-parallelism algorithms: Shift-Or [6] algorithm, Shift-And [31] algorithm and BNDM [25] algorithm.

Hashing algorithms: Harrison [15] algorithm and Karp-Rabin [24] algorithm.
Several experiments on string matching algorithms have already been reported [16, 27, 11, 4, 30, 17, 28, 6, 26, 22, 23 and 25]. In this paper we report experiments on eleven well-known algorithms drawn from these four categories: the Brute-Force algorithm, the Knuth-Morris-Pratt algorithm, the Boyer-Moore algorithm, the Turbo-BM algorithm, the Boyer-Moore-Horspool algorithm, the Quick Search algorithm, the Boyer-Moore-Smith algorithm, the Reverse Factor algorithm, the Shift-Or algorithm, the BNDM algorithm and the Karp-Rabin algorithm.

This paper is organized as follows. In the next section we present the algorithms tested. In the third section we describe the experimental methodology, including the test environment, the types of test data and the measures used to compare the algorithms. In section four we present the results of our experiments in the form of performance tables and graphs. In the last section we discuss the conclusions of this paper and outline some goals for further research.

2. STRING MATCHING ALGORITHMS

In this section we present the basic sequential algorithms tested for solving the string matching problem. For further details and the coding of the algorithms, the reader is referred to [24] and the original references.

2.1. Classical Approach

The classical string matching algorithms are based on character comparisons. The Brute-Force (in short, BF) [24] algorithm, which is the simplest, compares a character in the text with a character in the pattern from left to right. After a mismatch or a complete match of the entire pattern it shifts the pattern exactly one position to the right. It requires no preprocessing phase and no extra space. The BF algorithm has O(mn) worst-case time complexity. The average number of character comparisons is n(1 + 1/(|Σ| - 1)).

The Knuth-Morris-Pratt (in short, KMP) [18] algorithm, which was the first linear-time string matching algorithm discovered, performs character comparisons from left to right. In case of a mismatch it uses the knowledge of the characters already examined to compute the next position of the pattern to use. In addition, this algorithm has the advantage that the pointer in the text is never decremented. The preprocessing phase of the KMP algorithm requires O(m) time and space. The searching phase needs O(n) time in the worst and average case.
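The two left-to-right scans described above can be sketched as follows. This is a minimal illustration in Python, not the ANSI C code used in the paper; the function names and the choice to return the comparison count alongside the match positions are assumptions made for the example.

```python
def brute_force(pattern, text):
    """Return (match positions, number of character comparisons)."""
    m, n = len(pattern), len(text)
    positions, comparisons = [], 0
    for i in range(n - m + 1):          # try every window, shift by one
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            positions.append(i)
    return positions, comparisons


def kmp_failure(pattern):
    """fail[j] = length of the longest proper border of pattern[:j]."""
    m = len(pattern)
    fail = [0] * (m + 1)
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j + 1] = k
    return fail


def kmp(pattern, text):
    """Return (match positions, number of character comparisons)."""
    m = len(pattern)
    fail = kmp_failure(pattern)
    positions, comparisons = [], 0
    j = 0                                # pattern characters matched so far
    for i, c in enumerate(text):         # the text index never moves back
        while True:
            comparisons += 1
            if c == pattern[j]:
                j += 1
                break
            if j == 0:
                break
            j = fail[j]                  # reuse what is already known
        if j == m:
            positions.append(i - m + 1)
            j = fail[m]
    return positions, comparisons
```

For the pattern `aba` in the text `ababa`, both functions report the occurrences at positions 0 and 2; the brute-force scan uses 7 comparisons and the KMP scan 5. Under the uniform random-text assumption behind the average-case estimate above, the brute-force ratio 1 + 1/(|Σ| - 1) evaluates to 2 for a binary alphabet and to 4/3 for a four-letter alphabet.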
The Boyer-Moore (in short, BM) [3] algorithm, which is known to be very fast in practice, compares a character in the text with a character in the pattern from right to left. After a mismatch or a complete match of the entire pattern it uses two shift heuristics to shift the pattern to the right. These two heuristics are called the occurrence heuristic and the match heuristic. The length of the shift is the maximum of the shifts given by the occurrence heuristic and the match heuristic. For the details of the two heuristics the reader is referred to the original paper [3]. These heuristics are preprocessed in O(m + |Σ|) time and space. The searching phase of the BM algorithm needs O(n + rm) time in the worst case, where r is the number of occurrences of the pattern in the text. Finally, the expected performance of the BM algorithm is sublinear, requiring about n/m character comparisons on average.

The Turbo-BM (in short, TBM) [7] algorithm is a variant of the BM algorithm. It remembers the substring of the text that matched a suffix of the pattern during the last character comparisons (and only if a good suffix shift has been performed). This method has two advantages: a) it is possible to jump over this substring and b) it makes it possible to perform a turbo shift. For the details of the turbo shift the reader is referred to the original paper [7]. It can be shown that the number of character comparisons performed by the TBM algorithm is bounded by 2n.

The Boyer-Moore-Horspool (in short, BMH) [16] algorithm does not use the match heuristic. After a mismatch or a match of the pattern, the length of the shift is maximized by using only the occurrence heuristic for the text character corresponding to the rightmost pattern character (and not for the text character where the mismatch occurred). The preprocessing phase of the BMH algorithm requires O(m + |Σ|) time and reduces the space requirements from O(m + |Σ|) to O(|Σ|). Finally, the searching phase requires O(mn) time in the worst case, but it can be proved that the average number of character comparisons is n/|Σ|.

The Quick Search (in short, QS) [30] algorithm of Sunday performs character comparisons from left to right, starting at the leftmost pattern character. After a mismatch it computes the shift with the occurrence heuristic applied to the text character immediately following the current window. The preprocessing and searching times of the QS algorithm are the same as those of the BMH algorithm.

Smith [28], in the Boyer-Moore-Smith (in short, BMS) algorithm, noticed that computing the shift with the text character just after the rightmost text character of the window sometimes gives a shorter shift than using the rightmost text character itself. He therefore advised taking the maximum of the two values. The preprocessing phase of the BMS algorithm takes O(m + |Σ|) time and O(|Σ|) space. Further, this algorithm has O(mn) worst-case time complexity.
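The difference between the BMH, QS and BMS shifts can be made concrete with a short sketch. It is written in Python for brevity rather than in the paper's ANSI C. The names `occurrence_tables` and `search`, the dictionary-based tables and the use of a slice comparison for the window check are assumptions of this illustration. The order in which the window is compared does not affect the shift.

```python
def occurrence_tables(pattern):
    """Build the occurrence-heuristic shift tables for BMH and QS."""
    m = len(pattern)
    # BMH: shift for the text character aligned with the last pattern
    # position; later (rightmost) occurrences overwrite earlier ones.
    bmh = {c: m - 1 - k for k, c in enumerate(pattern[:-1])}
    # QS: shift for the text character just after the current window.
    qs = {c: m - k for k, c in enumerate(pattern)}
    return bmh, qs


def search(pattern, text, variant="BMH"):
    """Return all match positions using the BMH, QS or BMS shift rule."""
    m, n = len(pattern), len(text)
    bmh, qs = occurrence_tables(pattern)
    positions, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            positions.append(i)
        last = bmh.get(text[i + m - 1], m)            # BMH shift
        # QS shift; when no character follows the window, any positive
        # shift ends the scan.
        nxt = qs.get(text[i + m], m + 1) if i + m < n else 1
        i += {"BMH": last, "QS": nxt, "BMS": max(last, nxt)}[variant]
    return positions
```

For example, `search("abc", "xxabcabcx", "QS")` returns `[2, 5]`, and the other two variants return the same positions. Since each of the two shifts is safe on its own, taking their maximum, as in the BMS variant, never skips an occurrence.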

2.2. Suffix Automata Approach

This category uses the suffix automaton data structure (frequently called DAWG, for Deterministic Acyclic Word Graph), which recognizes all the suffixes of the pattern [10 and 25]. The Reverse Factor (in short, RF) [21 and 7] algorithm scans the characters of the text window from right to left using the smallest suffix automaton of the reversed pattern. The preprocessing phase of the RF algorithm requires time and space linear in the length of the pattern. The searching phase of the RF algorithm has quadratic worst-case time complexity but is optimal on average: it performs O(n log m / m) character comparisons on average.

2.3. Bit Parallelism Approach

Bit parallelism [6 and 5] uses the intrinsic parallelism of bit manipulations inside computer words to perform many operations in parallel (we denote by w the number of bits in the computer word). This technique has become a general way to simulate simple nondeterministic finite automata (NFA) instead of converting them to deterministic ones. The main advantages of this approach are simplicity, flexibility and no buffering.

The basic idea of the first such algorithm, Shift-Or (in short, SO) [6], is to represent the state of the search as a number, so that each search step costs a small number of arithmetic and logical operations, provided that the numbers are large enough to represent all possible states of the search. The time complexity of the preprocessing phase is O((m + |Σ|)⌈m/w⌉), using O(m|Σ|) extra space. The time complexity of the searching phase is O(n⌈m/w⌉) in the worst and average case, where ⌈m/w⌉ is the time to compute a shift or other simple operation on numbers of m bits using a word size of w bits. When the pattern is no longer than the computer word, ⌈m/w⌉ = 1.

A new algorithm has appeared recently, called Backward Nondeterministic DAWG Matching (BNDM) [25].
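The Shift-Or state update described earlier in this subsection can be sketched as follows. This is a Python illustration under stated assumptions, not the implementation tested in the paper. Python integers are unbounded, so the word size `w` is an explicit parameter that emulates the single-word restriction, and the function name and mask dictionary are choices made for the example.

```python
def shift_or(pattern, text, w=32):
    """Return all match positions using the Shift-Or recurrence."""
    m = len(pattern)
    if not 0 < m <= w:
        raise ValueError("pattern must fit in one machine word")
    all_ones = (1 << m) - 1
    # mask[c] has bit j cleared exactly when pattern[j] == c.
    mask = {}
    for j, c in enumerate(pattern):
        mask[c] = mask.get(c, all_ones) & ~(1 << j)
    # Bit j of state is 0 when pattern[0..j] matches the text ending here.
    state = all_ones
    positions = []
    for i, c in enumerate(text):
        # One shift and one OR per text character, whatever m is.
        state = ((state << 1) | mask.get(c, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:
            positions.append(i - m + 1)
    return positions
```

For example, `shift_or("aba", "ababa")` returns `[0, 2]`. No explicit character comparison appears in the loop, which is why only running time, and not a comparison count, is meaningful for this family of algorithms.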
The BNDM algorithm uses a nondeterministic suffix automaton that is simulated using bit-parallelism. The preprocessing time for the BNDM algorithm is O(m + |Σ|) for m ≤ w, using O(m|Σ|) extra space. The searching time is O(mn) in the worst case and O(n log m / m) on average.

2.4. Hashing Approach

We now introduce a different approach to string matching, the Karp-Rabin (in short, KR) [24] algorithm, which uses hashing techniques. Hashing provides a simple method to avoid a quadratic number of character comparisons in most practical situations. The main idea of the KR algorithm is to compute the signature (hash function value) of each possible m-character substring of the text and check whether it is equal to the signature of the pattern. The preprocessing phase of the KR algorithm requires O(m) time, while the searching phase has O(mn) worst-case time complexity. Its expected number of character comparisons is O(m + n).

3. EXPERIMENTAL METHODOLOGY

In this section we present the testing methodology used in our experiments to compare the relative performance of the string matching algorithms. The parameters that describe the performance of the algorithms are: a) the text size, b) the pattern length and c) the alphabet size. It is known that no single algorithm is optimal or best for all three. Therefore, the main goal of our experimental study is to compare the practical performance of the algorithms against the length of the pattern (small and long patterns) under alphabets of different sizes (or types of text), i.e., binary alphabet, alphabet of size 8, English alphabet and DNA alphabet, which have different characteristics.

3.1. Test Environment

The experiments were run on a Sun UltraSparc-1, a 32-bit machine with a 143 MHz clock, 64 MB of RAM and a 2.1 GB local hard disk. The operating system is Solaris 2.5. During all experiments this machine was not performing other heavy tasks (or processes). The data structures used in the testing were all in physical memory during the experiments. Finally, the algorithms presented in Section 2 have been implemented in the ANSI C programming language [19] in a homogeneous way, so as to keep their comparison meaningful, using the compiler cc.
We made extensive use of the code presented in [4, 13 and 24] for the known algorithms.

3.2. Types of Test Data

Because the performance of string matching algorithms depends on the statistical properties of the pattern and of the text from which the test patterns are obtained, experiments were performed on four different types of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet.

3.2.1. Binary Alphabet

The alphabet is Σ = {0, 1}. The text consists of 150,000 characters and was randomly built. For each pattern length between 2 and 100 we search for 50 randomly built patterns.

3.2.2. Alphabet of Size 8

The alphabet is Σ = {a, b, c, d, e, f, g, h}. The text consists of 150,000 characters and was randomly built. For each pattern length between 2 and 100 we search for 50 randomly built patterns.

3.2.3. English Alphabet

We used an English-language document from a web page. The alphabet consists of 70 different characters. The text consists of 148,188 characters, and we search for 50 patterns of each length from 2 to 100 characters, chosen at random from words inside the text.

3.2.4. DNA Alphabet

The DNA alphabet consists of the four nucleotides a, c, g and t (standing for adenine, cytosine, guanine and thymine, respectively) used to encode DNA. Therefore, the alphabet is Σ = {a, c, g, t}. The text consists of 997,642 characters, and we search for 50 patterns of each length from 10 to 100 characters. The text and the patterns are a portion of the GenBank DNA database, as distributed by Hume and Sunday [17].

3.3. Measures of Comparison

For the comparison of the string matching algorithms we used the number of character comparisons and the practical running time as measures. The number of character comparisons is counted in the same way as by Smith [28], that is, as the ratio of the number of characters actually compared to the number of characters passed in the text. Since in our experiments all algorithms are designed to find all occurrences of a pattern in the text, the number of passed characters is always n - m + 1. The running time is the total time of calling an algorithm to search for a pattern in the text, including the preprocessing time for building the auxiliary arrays. The running time is obtained by calling the C function clock() and is measured in seconds.

Thus, we measured the number of character comparisons and the running time of all the algorithms in Section 2 in order to examine the effect of the pattern length. We performed the following test series. We measured the effect of the pattern length in a test series with m = 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 60, 80 and 100. In the case of the DNA alphabet we used longer patterns because this alphabet has biological applications involving long patterns. For this reason, for this alphabet we measured the effect of the pattern length in a test series with m = 10, 20, 30, 40, 50 and 100. Finally, to decrease random variation, the results for each algorithm are averages over 50 runs with different patterns of each length.

We note that for the bit-parallelism algorithms (such as SO and BNDM) only the running time measure is used, because they involve only implicit character comparisons. In addition, they are limited to pattern lengths smaller than the word size in bits. For this reason, in our experimental study the SO and BNDM algorithms are run only for pattern lengths that fit in a machine word.

4. EXPERIMENTAL RESULTS

In the previous sections we have briefly presented the most well-known string matching algorithms and the experimental methodology of our tests. In this section we present the experimental results for the string matching algorithms according to the number of character comparisons and the running time.
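The measurement procedure of Section 3.3 might be sketched as below. This is a hypothetical harness, not the one used for the reported results. It is written in Python rather than ANSI C, it uses `time.process_time()` as a stand-in for the C function clock(), and the names `bf_comparisons` and `run_series` as well as the seeded random text generator are assumptions of the example.

```python
import random
import time


def bf_comparisons(pattern, text):
    """Brute-force scan of all windows; returns the comparison count."""
    m, n = len(pattern), len(text)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
    return comparisons


def run_series(search, alphabet, n, lengths, runs=50, seed=0):
    """Average comparisons per passed character and CPU time per length."""
    rng = random.Random(seed)
    text = "".join(rng.choice(alphabet) for _ in range(n))
    results = {}
    for m in lengths:
        ratio_sum = time_sum = 0.0
        for _ in range(runs):
            pattern = "".join(rng.choice(alphabet) for _ in range(m))
            start = time.process_time()       # timing covers the whole call
            comparisons = search(pattern, text)
            time_sum += time.process_time() - start
            # The number of passed characters is always n - m + 1.
            ratio_sum += comparisons / (n - m + 1)
        results[m] = (ratio_sum / runs, time_sum / runs)
    return results
```

A call such as `run_series(bf_comparisons, "01", 150000, [2, 10, 100])` mirrors the binary-alphabet series on a smaller set of lengths. For a uniformly random binary text the brute-force ratio should come out near 2, in line with the average-case estimate of Section 2.1.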
Finally, the performance of each algorithm was plotted against the length of the pattern for each type of text.

4.1. Results for the Number of Character Comparisons

Figures 1 to 4 and Tables I to IV show the results for the number of character comparisons for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet respectively, against the pattern length. It can be seen that the KMP and KR algorithms perform in all cases approximately 1 character comparison per text character. Further, the BF algorithm performs approximately the same number of character comparisons as the KMP and KR algorithms for the alphabet of size 8 and for the English alphabet. The BF algorithm requires more character comparisons for small alphabets (i.e., the binary or the genome alphabet). Based on the empirical results, for patterns of length greater than 10 the BF algorithm performs approximately 2 character comparisons per text character on the binary alphabet, twice the number required by the KMP and KR algorithms. For the DNA alphabet the BF algorithm requires on average 1.34 character comparisons. This occurs because a small alphabet leads to many partial matches of the pattern in the text, and as a result the number of character comparisons tends to be greater than 1. When a larger alphabet is used this phenomenon is alleviated, as Figures 2 and 3 show.

FIGURE 1 Binary alphabet (horizontal axis: pattern length).
FIGURE 2 Alphabet of size 8 (horizontal axis: pattern length).
FIGURE 3 English alphabet (horizontal axis: pattern length).
FIGURE 4 DNA alphabet (horizontal axis: pattern length).

The number of character comparisons of the BM-like algorithms (BM, BMH, QS, BMS and TBM) and of the suffix automata algorithm (RF) is generally less than 1, with the exception of the binary alphabet, where the BMH and QS algorithms perform on average 1.25 and 1.1 character comparisons. Furthermore, the number of character comparisons of the BM-like and RF algorithms is significantly higher for the binary alphabet than for any other type of text. It should also be observed that for all those algorithms the number of character comparisons decreases significantly as the pattern length increases. Thus the empirical results support the theoretical evidence that the BM-like and RF algorithms are sublinear in the number of character comparisons. The decrease slows down as the pattern length increases, because for long patterns the probability is higher that the character just fetched occurs somewhere in the pattern, and therefore the distance the pattern can be moved forward (if a mismatch occurs) is shortened. Moreover, the numbers of character comparisons of all BM-like algorithms are very close to one another and tend to stabilize at a certain level, except for the binary alphabet. Finally, for long patterns the difference between the number of character comparisons performed by the BM-like algorithms and the number performed by the suffix automata algorithm RF increases in all cases.
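The argument above about long patterns can be quantified with a small calculation. It assumes, purely for illustration, that pattern and text characters are drawn independently and uniformly from the alphabet, which holds for the two random texts but not for the English or DNA texts. The function name is hypothetical and the sketch is in Python.

```python
def occurrence_probability(sigma, m):
    """Probability that a given character occurs in a random pattern.

    Assumes m pattern characters drawn independently and uniformly
    from an alphabet of size sigma.
    """
    return 1.0 - (1.0 - 1.0 / sigma) ** m
```

Under this assumption, for an alphabet of size 8 the probability is 15/64 (about 0.23) for m = 2 and exceeds 0.7 for m = 10, so occurrence-heuristic shifts of the full pattern length become steadily rarer as m grows. For the binary alphabet the probability is already above 0.999 at m = 10.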
In all cases, it can be seen that the BM-like algorithms and the suffix automata algorithm (RF) give the best results. More specifically, the BM-like algorithms (such as TBM and BMS) and the RF algorithm are much more efficient in terms of the number of character comparisons than the remaining algorithms, for small and long patterns respectively.

TABLE I Number of character comparisons for a binary alphabet (columns: m, BF, KMP, BM, BMH, QS, BMS, TBM, RF, KR; values not recoverable)

TABLE II Number of character comparisons for an alphabet of size 8 (columns: m, BF, KMP, BM, BMH, QS, BMS, TBM, RF, KR; values not recoverable)

TABLE III Number of character comparisons for an English alphabet (columns: m, BF, KMP, BM, BMH, QS, BMS, TBM, RF, KR; values not recoverable)

TABLE IV Number of character comparisons for a DNA alphabet (only the rows shown are recoverable)

m    BF        KMP       BM        BMH       QS        BMS       TBM       RF       KR
10   1.346732  1.116931  0.301635  0.373474  0.356066  0.242178  0.299615  0.26149  1.000031
20   1.348057  1.109472  …

4.2. Results for the Running Time

Figures 5 to 8 and Tables V to VIII show the results for the practical running time for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet respectively, against the pattern length. We observe that in all cases the KR algorithm requires much more time than any other algorithm. This observation agrees with the expected behaviour: the computation of the hash values is expensive in terms of machine cycles and so increases the running time of the algorithm. Therefore, this algorithm is not recommended for text applications.

FIGURE 5 Binary alphabet (horizontal axis: pattern length).
FIGURE 6 Alphabet of size 8 (horizontal axis: pattern length).

FIGURE 7 English alphabet (horizontal axis: pattern length).
FIGURE 8 DNA alphabet.

Further, based on the empirical results, it is clear that the KMP algorithm is slightly slower than the BF algorithm for almost all pattern lengths, with the exception of the binary alphabet. This behaviour supports the theoretical evidence that the KMP algorithm is not better than the BF algorithm in the average case. It can also be seen that in all cases the BF and KMP algorithms are significantly slower than the BM-like and bit-parallelism algorithms.

The running time of the BM-like algorithms and of the bit-parallelism algorithm BNDM decreases significantly as the pattern length increases. Moreover, the BM-like algorithms produce running times very close to each other in all cases, with the exception of the binary alphabet. In addition, for long patterns the difference between the running times of the BM-like algorithms and of the suffix automata algorithm RF increases in all cases, with the exception of the English alphabet. This difference is in favour of the RF algorithm.

TABLE V Running times for a binary alphabet (columns: m, BF, KMP, BM, BMH, QS, BMS, TBM, RF, SO, BNDM, KR; values not recoverable)

TABLE VI Running times for an alphabet of size 8 (values not recoverable)

TABLE VII Running times for an English alphabet (columns: m, BF, KMP, BM, BMH, QS, BMS, TBM, RF, SO, BNDM, KR; values not recoverable)

TABLE VIII Running times for a DNA alphabet, in seconds (… marks digits lost in this copy; - marks an empty cell)

m        BF       KMP     BM      BMH     QS      BMS     TBM     RF       SO      BNDM    KR
10       0.4958   0.6146  0.1704  0.1596  0.157   0.1638  0.317   0.1248   0.3018  0.1334  1.…
20       ….4958   0.6108  0.1546  0.1684  0.161   0.1702  0.2902  0.0752   0.3024  0.0746  1.…
30       ….4916   0.605   0.144   0.1534  0.1472  0.153   0.261   0.0554   0.3024  0.0528  1.…
40       ….494    0.6084  0.1204  0.1498  0.1474  0.1502  0.2256  0.…      -       -       …
50       ….4932   0.61    0.1286  0.1568  0.1556  0.157   0.2412  0.…      -       -       …
100      ….4931   0.62    0.1252  0.156   0.161   0.1581  0.2221  0.…      -       -       ….5578
Average  0.…      …       …       …       …       …       …       ….0647   0.3022  0.…     ….561767

The SO bit-parallelism algorithm outperforms the KR, KMP and BF algorithms for all pattern lengths. SO is faster than the TBM and BNDM algorithms only for small patterns. The latter observation is valid in all cases with the exception of the binary alphabet. However, it can be seen that the SO algorithm outperforms the BM-like and suffix automata algorithms for small patterns, especially for the binary alphabet. Finally, in the majority of cases the suffix automata algorithm RF has a faster running time than the BM-like and bit-parallelism algorithms for long patterns. Further, the BM-like algorithms have better running times for small patterns, except for the binary alphabet.

5. CONCLUSIONS

We have presented the results of an extensive set of experiments on the most well-known string matching algorithms based on the classical, suffix automata, bit-parallelism and hashing approaches. The conclusions of this paper fall into two main categories: general conclusions regarding the algorithms and their testing procedures, and conclusions relating to the performance of specific algorithms.

As a general conclusion, testing the algorithms on four different types of text (binary alphabet, alphabet of size 8, English alphabet and DNA alphabet) indicates that varying parameters such as the pattern length and the alphabet size can produce different performances. The specific performance conclusions are as follows.

The absolute shapes of the lines on the performance graphs are not conclusive. Information can only be derived from the relative positions of the curves for each algorithm at each pattern length. This is because the patterns were chosen at random, and the running time is related to how far into the text the pattern occurs.
The running times for all the eleven algorithms can be compared at each pattern length because the same type of text and set of patterns were used with each algorithm. From the empirical evidence it can be concluded that the KR algorithm is linear in the number character comparisons but it has higher running time and it shouldn't be used for pattern matching in strings. However, the main advantage of this algorithm lies in its extension to higher dimensional string matching. It may be used for pattern recognition and image processing and thus in the expanding field of computer graphics. If you plan on direct searching with simple text, the linear BF algorithm is a proper choice because it produces relatively good running time results

23 432 P. D. MICHAILIDIS AND K. G. MARGARITIS despite its striking simplicity. In addition, the BF algorithm has no special memory requirements and needs no preprocessing or complex coding and thus can be surprisingly fast. But this algorithm shouldn't used for the binary alphabet in applications such as image processing or software systems. Despite its theoretical elegance, the KMP algorithm provides no significant speedup advantage over the BF algorithm in practice unless the pattern has highly repetitive subpatterns. However the KMP algorithm guarantees a linear bound and it is well suited to extensions for more difficult problems. It may be a good choice when the alphabet size is near the text size or when dealing with the binary alphabet. As far as the variations of the BM approach we can make the following remarks: Based on empirical results, it is clear that the QS algorithm is proved to be much faster algorithm in practice than the rest BM-like, suffix automata and bit-parallelism algorithms for large alphabets and short patterns. Therefore it is typically suited for search in the English alphabet. In addition, the BM algorithm is faster than its variations (such as BMH, QS, BMS and TBM) for small alphabets and long patterns. However, in theory BMS and QS are better algorithms than BM-like and suffix automata algorithms for short patterns and large alphabets. The TBM and BMS algorithms are also good both for small alphabets and short or medium patterns. We must also note that the main disadvantage of BM-like algorithms is the preprocessing time and the space required, which depends on the alphabet size and/or the pattern size. For this reason, if the pattern is small (1 to 4 characters) it is better to use the BF algorithm. Furthermore, the BM-like algorithms can't to be used if the type of string matching problem is different than finding the first occurrence of a pattern. 
Problems that go beyond finding the first occurrence of a single pattern include finding the first of several possible patterns and recognizing a position in the text defined by a regular expression; the BM-like algorithms are unsuitable there partly because the preprocessing time would be significant.

It should be noted that for long patterns the running time of the suffix automata algorithm (RF) increases because of the preprocessing phase, which takes as long as the searching phase. Nevertheless, the RF algorithm is efficient in theory and in practice for small alphabets and long patterns, and is therefore a good choice for DNA applications.

In practice, the bit-parallelism algorithms (SO and BNDM) are always fastest for small alphabets and short patterns. The SO algorithm also has a linear running time, like the BF and KMP algorithms. In particular, the BNDM algorithm is the fastest and outperforms the BM-like algorithms for moderate patterns. However, the main advantage of the bit-parallelism algorithms is that they are simple to implement and support classes of characters (e.g. [a-z]), don't care symbols (a don't care symbol matches any symbol), the complement of a character or a class, and other extensions developed in [31] such as wild cards (a wild card is a symbol that matches all characters), sets of patterns, long patterns, etc., all with exactly the same searching time (only the preprocessing is different). On the other hand, these algorithms have the disadvantage that the pattern is limited to 32 or 64 characters (32 or 64 being the word size of many of today's machines). Handling longer patterns is fairly easy (it requires multiprecision bit operations), but it can slow down the algorithms significantly. For many applications, however, a maximum pattern length of 32 or 64 characters is not much of a problem.

In addition, we notice that the theoretical time complexities of the algorithms [24] are valid only in the average case. For instance, the experiments have shown that, on average, algorithms such as BF, BMH, QS, BMS and BNDM behave well. On the other hand, the experiments have shown that, in both the worst and the average cases, only the BM, RF and SO algorithms are fast both theoretically and practically.

References

[1] Apostolico, A. and Giancarlo, R. (1986). The Boyer-Moore-Galil string searching strategies revisited, SIAM Journal on Computing, 15(1).
[2] Aho, A. V., Algorithms for finding patterns in strings, Chapter 5 of Leeuwen, J. van (Ed.), Handbook of Theoretical Computer Science, Elsevier Science Publishers, Amsterdam.
[3] Boyer, R. S. and Moore, J. S. (1977). A fast string searching algorithm, Communications of the ACM, 20(10).
[4] Baeza-Yates, R. (1989). Algorithms for string searching: A survey, ACM SIGIR Forum, 23(3-4).
[5] Baeza-Yates, R. (1992). Text Retrieval: Theory and Practice, In: Proc. of the 12th IFIP World Computer Congress (Madrid, Spain), North-Holland.
[6] Baeza-Yates, R. and Gonnet, G. H. (1992). A new approach to text searching, Communications of the ACM, 35(10).
[7] Crochemore, M., Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W. and Rytter, W. (1994). Speeding Up Two String Matching Algorithms, Algorithmica, 12(4-5).
[8] Colussi, L. (1991). Correctness and efficiency of the pattern matching algorithms, Information and Computation, 95(2).
[9] Colussi, L. (1994). Fastest pattern matching in strings, Journal of Algorithms, 16(2).
[10] Crochemore, M. and Rytter, W. (1994). Text Algorithms, Oxford University Press.
[11] Davies, G. and Bowsher, S. (1986). Algorithms for pattern matching, Software-Practice and Experience, 16(6).
[12] Galil, Z. (1979). On improving the worst case running time of the Boyer-Moore string searching algorithm, Communications of the ACM, 22(9).
[13] Gonnet, G. H. and Baeza-Yates, R. (1991). Handbook of Algorithms and Data Structures in Pascal and C, 2nd edition, Addison-Wesley, Wokingham.


[14] Hancart, C. (1993). On Simon's string searching algorithm, Information Processing Letters, 47(2), 95-99.
[15] Harrison, M. C. (1971). Implementation of the substring test by hashing, Communications of the ACM, 14(12).
[16] Horspool, R. N. (1980). Practical fast searching in strings, Software-Practice and Experience, 10(6).
[17] Hume, A. and Sunday, D. (1991). Fast string searching, Software-Practice and Experience, 21(11).
[18] Knuth, D. E., Morris, J. H. and Pratt, V. R. (1977). Fast pattern matching in strings, SIAM Journal on Computing, 6(2).
[19] Kernighan, B. W. and Ritchie, D. M. (1988). The C Programming Language, 2nd edition, Prentice Hall, Englewood Cliffs, NJ.
[20] Liu, Z., Du, X. and Ishii, N. (1998). An improved adaptive string searching algorithm, Software-Practice and Experience, 28(2).
[21] Lecroq, T. (1992). A variation on the Boyer-Moore algorithm, Theoretical Computer Science, 92(1).
[22] Lecroq, T. (1995). Experimental results on string matching algorithms, Software-Practice and Experience, 25(7).
[23] Manolopoulos, Y. and Faloutsos, C. (1996). Experimenting with pattern matching algorithms, Information Sciences, 90(1-4).
[24] Michailidis, P. and Margaritis, K. (1999). String Matching Algorithms, Technical Report, Department of Applied Informatics, University of Macedonia (in Greek).
[25] Navarro, G. and Raffinot, M. (1998). A Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching, In: Proc. of the 9th Annual Symposium on Combinatorial Pattern Matching, No. 1448, Springer-Verlag, Berlin.
[26] Raita, T. (1992). Tuning the Boyer-Moore-Horspool string searching algorithm, Software-Practice and Experience, 22(10).
[27] Smit, G. and De, V. (1982). A Comparison of Three String Matching Algorithms, Software-Practice and Experience, 12(1).
[28] Smith, P. (1991). Experiments with a very fast substring search algorithm, Software-Practice and Experience, 21(10).
[29] Stephen, A. G. (1994). String Searching Algorithms, World Scientific Press.
[30] Sunday, D. (1990). A very fast substring search algorithm, Communications of the ACM, 33(8).
[31] Wu, S. and Manber, U. (1992). Fast text searching allowing errors, Communications of the ACM, 35(10).
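As a supplementary illustration of the remarks on bit-parallelism in the conclusions, the following minimal Shift-Or sketch shows how classes of characters change only the preprocessing while the scanning loop stays the same. It is a sketch under assumptions rather than the code used in the experiments: the names `shift_or_search` and `literal`, and the representation of a pattern as a list of character sets, are hypothetical choices made for this example.

```python
def literal(pattern: str) -> list[set[str]]:
    """Turn a plain string into a pattern of single-character classes."""
    return [{c} for c in pattern]


def shift_or_search(text: str, pattern: list[set[str]]) -> list[int]:
    """Return the start index of every match of pattern in text.

    Each pattern position is a set of accepted characters, so a class
    such as [a-z] is simply a larger set at that position.
    """
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1

    # Preprocessing: bit i of masks[c] is 0 iff position i accepts c.
    masks: dict[str, int] = {}
    for i, accepted in enumerate(pattern):
        for c in accepted:
            masks[c] = masks.get(c, all_ones) & ~(1 << i)

    # Searching: one shift and one OR per text character.
    state = all_ones
    hits = []
    for j, c in enumerate(text):
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:  # bit m-1 clear: full match ends at j
            hits.append(j - m + 1)
    return hits
```

Python integers have arbitrary precision, so this sketch does not exhibit the 32 or 64 character limit; they play the role of the multiprecision bit operations, whereas an implementation on fixed-width machine words would need several words per state for longer patterns.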
