A Depth First Search approach to finding the Longest Common Subsequence


Fragkiadaki Eleni, Samaras Nikolaos, Department of Applied Informatics, University of Macedonia, Greece
Harhalakis Stefanos, Department of Informatics, TEI of Thessaloniki, Greece

Abstract. While examining the problem of the Longest Common Subsequence (LCS) of two strings, we faced the vast amount of memory required by the existing algorithms. When using large strings, the memory requirements usually exceed the RAM of the computer and consume all of the available swap space on disk. These algorithms are based on the classic LCS algorithm. A short description of this algorithm is included in the paper, as well as a new algorithm that reduces the amount of memory required when examining two strings, the second of which is of small length; there are only small limitations on the length of the first string. The algorithm presented in this paper introduces a new way of storing the data of the problem and of handling the information required to solve it. The experimental results show that, in the cases described above, this algorithm has smaller execution time and significantly smaller memory requirements than the classic LCS algorithm.

Keywords. Algorithm, Longest Common Subsequence, Data structures, Data storage.

1. INTRODUCTION

Finding the Longest Common Subsequence (LCS) of two strings is a well-known problem. The algorithm of Hunt-Szymanski was one of the first algorithms presented to solve it [2], [7]. The algorithms that solve this problem find application in many different fields. First of all, they are used in bioinformatics for finding the LCS of two strings of DNA, searching for similarities among different people and living organisms. These algorithms are also used when comparing two fragments of text or code [1]. Many algorithms have been introduced that solve the LCS problem, most of them using dynamic programming.
These algorithms present one major problem: the size of the memory required for storing the information that finally leads them to the Longest Common Subsequence. In order to reduce the amount of memory required, we used a different approach for solving the LCS problem and for handling the data of the algorithm. This paper is organized as follows. It includes a description of the classic LCS algorithm and a more detailed description of its major disadvantage. The description of the DFS-LCS algorithm is then presented: we describe the steps the algorithm follows in order to compute the Longest Common Subsequence, the structures it uses, and an illustrative example. The computer, the programming language and compiler, the operating system and the experimental results are presented in detail using tables and charts. The last section of this paper states our final conclusions about the usage of this algorithm; it also states our conclusions concerning the comparison of the DFS-LCS algorithm with the classic LCS algorithm.

Σχεδίαση Λειτουργιών, Ανάκτηση Πληροφοριών και Διαχείριση Γνώσης [Design of Operations, Information Retrieval and Knowledge Management]

2. ALGORITHM DESCRIPTION

There are many algorithms that deal with the problem of finding the Longest Common Subsequence of two strings. These algorithms consist of two main parts. The first part creates the initial structure that will be used throughout the execution of the algorithm; by the end of the construction of this structure the length of the Longest Common Subsequence is known, but the subsequence itself is not. This first part is usually common among the LCS algorithms. The algorithms differ in the second part, the one that actually returns the longest common subsequence of the two strings. Because the second part differs, these algorithms vary in time and space complexity [3], [6]. Apart from the algorithms that compute the LCS, there are algorithms that compute the number of LCSs, or even all of them [4], [5].

2.1 DESCRIPTION OF THE CLASSIC LCS ALGORITHM

The classic algorithm used for finding the longest common subsequence follows the methodology described here. Given two strings X, Y and their lengths n, m, the algorithm returns the length as well as the Longest Common Subsequence of those two strings. By Longest Common Subsequence we mean the longest sequence of characters that appears in both strings in the same order, but not necessarily in adjacent positions. The main purpose of the algorithm is to construct an n x m table (table L), where n is the number of rows and m is the number of columns. During the execution of the algorithm, we examine whether or not the current character of X equals the current character of Y. If this condition is satisfied, we set the L[i,j] element of the table equal to the L[i-1,j-1] element plus one.
If the condition is not satisfied, we set the L[i,j] element equal to the maximum of the L[i-1,j] and L[i,j-1] elements of the table. Once the construction of the table following the procedure described above is complete, the number that indicates the length of the longest common subsequence of the two strings is located in the bottom right corner of the table (element L[n,m]). This table yields not only the LCS of the two strings but also the LCS of all prefixes of the strings (a prefix of a string is a substring of the initial string which starts with the first character and has length equal to or less than the length of the initial string). In other words, the element L[i,j] is a number that indicates the length of the Longest Common Subsequence when using the first i elements of string X and the first j elements of string Y. The major disadvantage of this algorithm is the vast amount of memory that it requires. As mentioned before, the algorithm constructs a table of n x m elements. For example, with a long string X and a string Y of 500 characters, the table the algorithm constructs can require approximately 5 GB of memory. Such an amount of memory is not common for personal computers, and the algorithm ends up swapping to the hard drive. The speed of the hard disk cannot be compared with the speed of the much faster RAM, so transferring blocks between the hard disk and memory is a very slow procedure.

2.2 PSEUDOCODE OF THE CLASSIC LCS ALGORITHM

Input: strings X, Y
Output: the length L[i,j] of an LCS of X[0..i] and Y[0..j]

for i ← 0 to n-1 do
    L[i,-1] ← 0
for j ← 0 to m-1 do
    L[-1,j] ← 0
for i ← 0 to n-1 do
    for j ← 0 to m-1 do
        if X[i] = Y[j] then
            L[i,j] ← L[i-1,j-1] + 1
        else
            L[i,j] ← max{L[i-1,j], L[i,j-1]}
return table L

2.3 DESCRIPTION OF THE DFS LCS ALGORITHM

The aim of the DFS-LCS algorithm is to reduce the amount of memory required to solve the LCS problem. To do so, the algorithm uses a recursive function. This function examines the different combinations of characters in Y, to determine which of these can be found in X and which is the longest among them. The algorithm decides whether using the current character of Y produces a subsequence that exists in both X and Y and is the longest one. In order to improve the performance of the algorithm and to reduce the number of combinations examined, the algorithm uses a procedure that ignores some of the characters in Y. For instance, suppose the current character of Y is the letter A (which exists in X and is available for use) and we are examining the case in which this letter is not included in the LCS; if the next character to be examined is also A, there is no reason to examine it, because it presents no difference from the previous A that we ignored. This procedure of ignoring A stops as soon as a different character gets included in the LCS. Using this procedure significantly reduces the number of combinations examined by the algorithm. Although each combination is checked only once, there are common parts between the combinations that the algorithm examines. To overcome this repetition in the calculations we use a cache memory. This memory stores information about each character we use from X and Y. For instance, if we use a character from Y in position j, and this character is located in position i in X, it is possible that this character will appear in many different combinations of Y.
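As a concrete reference point, the classic algorithm of Section 2.2 can be rendered in Python. This is an illustrative sketch (table construction plus a standard backtracking pass to recover one LCS), not the authors' implementation:

```python
def classic_lcs(X, Y):
    """Classic dynamic-programming LCS: build the n x m table, then backtrack."""
    n, m = len(X), len(Y)
    # L[i][j] = length of an LCS of the first i characters of X and first j of Y
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if X[i - 1] == Y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # walk back from the bottom-right corner to recover one LCS
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out)), L[n][m]
```

For the example strings used later in this paper, classic_lcs("GGCTACACC", "CAAG") yields ("CAA", 3); the (n+1) x (m+1) table it allocates is exactly the memory bottleneck discussed above.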
Each such character is stored in the cache memory, along with its position in X, its position in Y, the length of the LCS found from this character to the end, and the LCS itself. By doing so, every time we want to examine a new character from Y, we first check the cache memory to see whether there is a cached instance of this character; if there is, we reuse that instance. If there is no entry for this character in the cache memory, we proceed with the calculations, and as soon as we complete them we add this character to the cache for future use. In this way, each different combination (position of character in X, position of character in Y) is calculated only once. The algorithm consists of two parts, the initialization and the main body. The initialization consists of two steps. During the first step the algorithm determines the different characters from which the two strings have been constructed; by the end of this step we know the alphabet and the number of different characters in it. During the next step of the initialization we construct the Index Table of X. The Index Table of X is a table whose every element is a list: the first list registers the positions in X where the first character of the alphabet can be found, and so on. In order to create this index we read each character of X and add its position to the proper list.
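The Index Table construction, together with the position lookup used later by indextable_find_first() and indextable_get(), can be sketched as follows. The function names mirror the paper's, but we pass the index explicitly and use binary search over the position lists, which is our choice; the paper does not mandate a particular lookup method:

```python
from bisect import bisect_left

def build_index_table(X):
    """One ordered position list per character of X's alphabet."""
    index = {}
    for pos, ch in enumerate(X):
        index.setdefault(ch, []).append(pos)
    return index

def indextable_find_first(index, ch, x_pos):
    """Slot of the first occurrence of ch at or after x_pos, or -1 if none."""
    positions = index.get(ch, [])
    slot = bisect_left(positions, x_pos)
    return slot if slot < len(positions) else -1

def indextable_get(index, ch, slot):
    """Actual position in X stored at the given slot of ch's list."""
    return index[ch][slot]
```

For X = "GGCTACACC" this produces {'G': [0, 1], 'C': [2, 5, 7, 8], 'T': [3], 'A': [4, 6]}, the Index Table used in the example of Section 3.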

The main body is the algorithm itself, a recursive function. Each call to the recursive function find_lcs() has as inputs the two strings X, Y, their lengths n, m, the starting positions in X (x_pos) and Y (y_pos) and, finally, the ignoremask. The ignoremask is a 1 x s table, where s is the size of the alphabet; each element represents one character of the alphabet. If the value stored in the first position of the table is 1, the first character of the alphabet should be ignored during the execution of the current call of the recursive function; if the value is 0, it should not be ignored. The recursive function produces as output the length of the LCS (lcssz) and the LCS itself (lcs). We begin with the first letter of Y and make our way to the end. First the algorithm examines whether we have reached the end of one of the two strings. If we have, the algorithm returns that the length of the LCS beginning from x_pos in X and y_pos in Y is 0, and the function call ends. If we have not, we set the variable sz2 (the length of the LCS from x_pos in X and y_pos in Y to the end when using the character in position y_pos of Y) to -1, and likewise the variable sz1 (the same length when not using that character) to -1. Then the algorithm checks whether or not the current character of Y should be ignored. If the value in the corresponding position of the ignoremask table is 0, the current character of Y is not ignored and the calculations continue. The current character of Y is stored in the y_element variable. The next step is to determine whether the current character of Y exists in X and is available for use. To do so the algorithm calls the indextable_find_first() function.
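The ignoremask can be pictured as a 1 x s array of flags over the alphabet. A minimal sketch (the helper names here are illustrative, not from the paper's implementation):

```python
alphabet = ['A', 'C', 'G', 'T']                      # derived during initialization
char_slot = {c: i for i, c in enumerate(alphabet)}   # character -> position in alphabet

ignoremask = [0] * len(alphabet)   # one flag per alphabet character; 1 = ignore
ignoremask[char_slot['A']] = 1     # ignore further copies of 'A' in the current call

def should_ignore(ch):
    return ignoremask[char_slot[ch]] == 1
```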
The indextable_find_first() function requires two parameters: the current character of Y, and the position in X (x_pos) from which the search for this character must start. It returns either -1, if the character was not found in X, or the position within the corresponding list of the Index Table of X at which the character was found. The result is stored in the pos_found variable. If the function returns -1, the variable sz2 is set to 0: no character was found, so the LCS from this character to the end has length 0. If the value of pos_found is not -1, we set the variable is_cached to 0 and the variable tocheck to 1. The variable is_cached records whether or not an instance of the specific character exists in the cache memory. The variable tocheck has value 1 when the specific character still needs to be examined, and 0 when it has already been found in the cache memory. The function indextable_get() returns the actual position in X of the current character of Y. It requires two parameters: the current character of Y (y_element) and the position in the Index Table of X (pos_found) at which we found it. The result returned from this function is stored in the variable k. At this point of the execution, the algorithm has determined that the current character of Y in position y_pos, stored in the variable y_element, exists in X (in position k). We do not yet know whether the specific character exists in the cache memory. To determine that we use the function cache_find(), which requires two parameters: the position of the character in X and the position of the character in Y. This function returns the length of the LCS stored in the cache memory and the actual LCS; these results are stored in the variables sz2 and lcs2. If the value stored in the sz2 variable is larger than or equal to 0, we succeeded in finding the character in the cache memory.
In that case the variable tocheck is set to 0 and the variable is_cached is set to 1. If we did not find the character in the cache memory, we need to continue with our calculations to determine the LCS from this character to the end. A new temporary ignore mask, im, is set here. The algorithm calls the indextable_get_cur_begin() function, which requires one parameter, the current element of Y. This function returns the position in the corresponding list of the Index Table from which our calculations should continue, and the value is stored in the pos variable. After that the algorithm calls the indextable_set_cur_begin() function, which requires two parameters: the current character of Y (y_element) and the position in the corresponding list of the Index Table of X where we found the current character of Y, incremented by one. Then the algorithm recursively calls the function find_lcs() to determine the LCS from that point forward. The starting position in X has been incremented by one (k+1), as has the starting position in Y (y_pos+1). The ignore mask for the new call of the function is the one determined previously (table im). The results from this call are stored in the variables lcsx (the LCS) and sz2 (the length of the LCS). The LCS when including the character in position y_pos of Y is that character (found in position k of X) followed by the best LCS found from that position forward. After the completion of this step, the current position from which we should begin our search in the Index Table of X is restored to its previous value (by calling indextable_set_cur_begin() with the parameter pos). The length of the LCS when using the current character of Y is incremented by one. At this point, if the value of the variable is_cached is 0, the element that we examined is added to the cache memory using the function cache_update(), which takes four parameters: the position of the character in X (k), the position of the character in Y (y_pos), the length of the LCS found from this character to the end (sz2), and the LCS itself (lcs2). Up to this point the algorithm has examined the case in which the current character of Y is used in the LCS: the character was found in X, was available for use, and was not to be ignored, and we determined the LCS both when the specific character was found in the cache memory and when it was not. Now we examine the case in which the current character of Y should be ignored. In this case the length of the LCS is set to -1, to indicate that the character was not used in the LCS.
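The cache operations referred to above can be modeled as a dictionary keyed by the pair (position in X, position in Y), with a negative length signalling a miss, matching the sz2 >= 0 test in the pseudocode of Section 2.4. A minimal sketch:

```python
cache = {}   # (pos in X, pos in Y) -> (LCS length from here to the end, the LCS itself)

def cache_update(k, y_pos, sz, lcs):
    cache[(k, y_pos)] = (sz, lcs)

def cache_find(k, y_pos):
    # a negative length means: not cached yet
    return cache.get((k, y_pos), (-1, ""))
```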
Next the algorithm calculates the LCS without the use of the current character of Y (it actually ignores the current character of Y, and the starting position in Y is incremented by 1). The variable oldignore has value 1 if the current character of Y has been ignored before, or 0 if it has not been ignored up to this point. If the current character of Y has not been ignored up to now, we set the element in the corresponding position of the ignoremask table to 1. Then the algorithm calls the find_lcs() function to determine the LCS without the use of the current character of Y. The starting position in X is x_pos, the starting position in Y is y_pos+1, and we use the ignoremask table presented earlier. The results of this call are stored in the variables sz1 (the length of the LCS returned) and lcs1 (the LCS). In order to keep the data stored in the ignore mask table consistent with its previous state, we set the element representing the ignore state of the current letter of Y back to its previous value. Finally, the algorithm determines the best result: it compares the LCS lengths of the two cases (using the current character of Y in the LCS and not using it), and stores the best length in the sz variable and the corresponding LCS in the lcs variable.

2.4 PSEUDOCODE

find_lcs()
input: strings X, Y, n, m, x_pos, y_pos, ignoremask
output: lcs, lcssz

lcs ← []
lcssz ← -1
if (x_pos = n or y_pos = m)
    lcssz ← 0
    return
sz1 ← -1
sz2 ← -1
if (ignoremask(Y(y_pos)) = 0)
    y_element ← Y(y_pos)
    pos_found ← indextable_find_first(y_element, x_pos)
    if (pos_found = -1)
        sz2 ← 0
    else
        is_cached ← 0
        tocheck ← 1
        k ← indextable_get(y_element, pos_found)
        (sz2, lcs2) ← cache_find(k, y_pos)
        if (sz2 >= 0)
            tocheck ← 0
            is_cached ← 1
        if (tocheck = 1)
            im ← []
            pos ← indextable_get_cur_begin(y_element)
            indextable_set_cur_begin(y_element, pos_found + 1)
            (lcsx, sz2) ← find_lcs(k+1, y_pos+1, im)
            lcs2 ← [k lcsx]
            indextable_set_cur_begin(y_element, pos)
            sz2 ← sz2 + 1
            if (is_cached = 0)
                cache_update(k, y_pos, sz2, lcs2)
else
    sz2 ← -1
k ← Y(y_pos)
oldignore ← ignoremask(k)
if (oldignore = 0)
    ignoremask(k) ← 1
(lcs1, sz1) ← find_lcs(x_pos, y_pos+1, ignoremask)
if (oldignore = 0)
    ignoremask(k) ← 0
sz ← 0
if (sz1 > 0 or sz2 > 0)
    if (sz1 >= sz2)
        lcs ← lcs1
        sz ← sz1
    else
        lcs ← lcs2
        sz ← sz2
lcssz ← sz
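The pseudocode above can be condensed into a short runnable sketch. The following Python rendering is ours, not the authors' C/C++ implementation: the ignore mask becomes an immutable set (so restoring oldignore is implicit), the cache stores the LCS string itself, and the cur_begin bookkeeping is replaced by a binary search over the index lists:

```python
from bisect import bisect_left

def dfs_lcs(X, Y):
    """DFS-LCS sketch: recursive search over combinations of Y's characters,
    with an (x_pos, y_pos) cache and an ignore set for skipped characters."""
    index = {}                       # Index Table of X: character -> ordered positions
    for i, ch in enumerate(X):
        index.setdefault(ch, []).append(i)
    cache = {}                       # (pos in X, pos in Y) -> best LCS from there on

    def find_lcs(x_pos, y_pos, ignore):
        if x_pos >= len(X) or y_pos >= len(Y):
            return ""
        with_char = None
        c = Y[y_pos]
        if c not in ignore:                          # ignoremask check
            positions = index.get(c, [])
            slot = bisect_left(positions, x_pos)     # indextable_find_first
            if slot < len(positions):
                k = positions[slot]                  # indextable_get
                if (k, y_pos) in cache:              # cache_find
                    with_char = cache[(k, y_pos)]
                else:
                    # include Y[y_pos]: fresh (empty) ignore mask, as in the paper
                    with_char = c + find_lcs(k + 1, y_pos + 1, frozenset())
                    cache[(k, y_pos)] = with_char    # cache_update
        # skip Y[y_pos]; identical characters stay ignored until one is included
        without_char = find_lcs(x_pos, y_pos + 1, ignore | {c})
        if with_char is not None and len(with_char) >= len(without_char):
            return with_char
        return without_char

    return find_lcs(0, 0, frozenset())
```

On the strings used in the next section, dfs_lcs("GGCTACACC", "CAAG") returns "CAA", matching the walkthrough step by step.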

3. AN ILLUSTRATIVE EXAMPLE

The alphabet used to create the following example consists of 4 letters: A, C, G, T. String X consists of 9 characters and string Y of 4 characters. Specifically, X is GGCTACACC and Y is CAAG, so in our example the size of the alphabet is 4. At this stage in the execution of the algorithm the Index Table of X is created. It is presented in the table below, where each line refers to one letter of the alphabet, beginning with the first. The cache memory is also initialized, as well as the ignoremask table. The ignoremask table has size equal to the size of the alphabet used (in our example it has 4 elements).

Table 1: The index of string X
A: 4, 6
C: 2, 5, 7, 8
G: 0, 1
T: 3

Once the initialization stage has completed, the execution of the main body of the algorithm begins with the initial call of the recursive function. The parameters passed in this call are the starting positions in X and Y, which are 0 and 0, and the ignoremask table (with all its elements set to 0). Each recursive call has two branches, one that searches for the LCS including the current character of Y and one that searches for the LCS without it. Each branch is a call to the recursive function: the first includes the current character, and the second advances one character in Y and searches for the LCS from that point forward.

Call 1, positions to start from in X, Y: (0, 0):
Step 1.1: With the first letter of Y, which is C. The algorithm examines whether or not C exists in X and is available for use. The letter exists in X, and the first available C in X is located in position 2.
Step 1.2: The algorithm examines whether or not the letter in position 2 of X and position 0 of Y exists in the cache memory. Searching the cache for this element (2, 0) does not produce a result.
Step 1.3: Call 2, positions to start from in X, Y: (2+1, 0+1):
Step 2.1: (same as step 1.1) Letter found in position 4.
Step 2.2: (same as step 1.2) Element in positions (4, 1) was not found in cache memory.
Step 2.3: Call 3, positions to start from in X, Y: (4+1, 1+1):
Step 3.1: (same as step 1.1) Letter found in position 6.
Step 3.2: (same as step 1.2) Element in positions (6, 2) was not found in cache memory.
Step 3.3: Call 4, positions to start from in X, Y: (6+1, 2+1):
Step 4.1: With the letter in position 3 of string Y, which is G. The algorithm examines whether or not G exists in X and is available for use. There is no letter G in X in or after position 7. The computations stop.

Step 4.2: Without the letter in position 3 of string Y, which is G. We advance one character in Y.
Step 4.3: Call 5, positions to start from in X, Y: (7, 3+1):
Step 5.1: We have reached the end of Y. No more computations can be done from this point on, so we go back to the previous call.
Step 4.4: The search of this call of the function stops; no character could be used in the LCS. Call 4 returns 0 to call 3.
Step 3.4: (same as step 4.2)
Step 3.5: Call 6, positions to start from in X, Y: (5, 2+1):
Step 6.1: (same as step 4.1)
Step 6.2: (same as step 4.2)
Step 6.3: Call 7, positions to start from in X, Y: (5, 3+1):
Step 7.1: (same as step 5.1)
Step 6.4: We have finished with the calculations for this call; this call returns LCS 0 to the previous one.
Step 3.6: We have finished searching from element (6, 2) onward, and we add to the cache that the element in position 6 of X and 2 of Y yields an LCS of 1, and that LCS is A.
Step 2.4: (same as step 3.6) We add to the cache that the element in position 4 of X and 1 of Y yields an LCS of 2, and that LCS is AA.
Step 2.5: (same as step 4.2)
Step 2.6: Call 8, positions to start from in X, Y: (3, 1+1):
Step 8.1: In this step we would normally examine the letter in position 2 of Y, which is A. Since we are in the case where we examine the LCS without the previous A, there is no reason to include this A in our computations; therefore this A is ignored.
Step 8.2: (same as step 4.2)
Step 8.3: Call 9, positions to start from in X, Y: (3, 2+1):
Step 9.1: (same as step 4.1)
Step 9.2: (same as step 4.2)
Step 9.3: Call 10, positions to start from in X, Y: (3, 3+1):
Step 10.1: (same as step 5.1)
Step 9.4: The computations for this call end here, and this call returns 0 to the previous one.
Step 8.4: The computations for this call end here, and this call returns 0 to the previous one.
Step 2.7: The computations for this call end here, and this call returns LCS 2 to the previous one.
Step 1.4: (same as step 3.6) We add to the cache that the element in position 2 of X and 0 of Y yields an LCS of 3, and that LCS is CAA.
Step 1.5: (same as step 4.2)
Step 1.6: Call 11, positions to start from in X, Y: (0, 0+1):
Step 11.1: (same as step 1.1) Letter found in position 4.

Step 11.2: The algorithm examines whether or not the letter in position 4 of X and position 1 of Y exists in the cache memory. Searching the cache for this element (4, 1) returns the result that including character A in the LCS gives an LCS of size 2, which is AA. This step ends here and returns LCS 2.
Step 11.3: (same as step 4.2)
Step 11.4: Call 12, positions to start from in X, Y: (0, 1+1):
Step 12.1: In this step we would normally examine the letter in position 2 of Y, which is A. Since we are in the case where we examine the LCS without the previous A, there is no reason to include this A in our computations; therefore this A is ignored.
Step 12.2: (same as step 4.2)
Step 12.3: Call 13, positions to start from in X, Y: (0, 2+1):
Step 13.1: (same as step 1.1) Letter found in position 0.
Step 13.2: Call 14, positions to start from in X, Y: (0+1, 3+1):
Step 14.1: (same as step 5.1)
Step 13.3: (same as step 4.2)
Step 13.4: Call 15, positions to start from in X, Y: (0, 3+1):
Step 15.1: (same as step 5.1)
Step 13.5: (same as step 3.6) We add to the cache that the element in position 0 of X and 3 of Y yields an LCS of 1, and that LCS is G. This call returns LCS 1 to the previous one.
Step 12.4: We have finished the computations for this call. This call returns LCS 1 to the previous one.
Step 11.5: We have finished the computations for this call. This call returns LCS 2, which is AA, to the previous one.
Step 1.7: We have ended the computations for this call. This call returns LCS 3, which is CAA.

4. COMPUTATIONAL EXPERIMENTS

The computer we used to run the experiments was an Intel Celeron with a 2.4 GHz processor, 768 MB of RAM and 2.048 GB of swap memory. The operating system was Debian GNU/Linux. The DFS-LCS algorithm was implemented in C/C++ and the classic LCS algorithm in C; both were built with the Debian system compiler.
The examples were created randomly, using the random function of the operating system, which uses a non-linear additive feedback random number generator employing a default table of 31 long integers to return successive pseudo-random numbers in the range from 0 to RAND_MAX. We ran a total of 33 experiments, examined from the angles of time and memory usage. The results presented in Table 2 concern the time required by each algorithm to solve the LCS problem. The first column indicates the chart of this section in whose construction each row's results were used. The second column presents the length of string X and the third the length of string Y. The fourth and fifth columns present the number of different characters used to construct strings X and Y; the actual alphabet size referred to in the algorithm is max{size of alphabet of X, size of alphabet of Y}. The next two columns refer to the DFS-LCS algorithm and the last two to the classic LCS algorithm. The User time is the actual time the algorithm required to complete its calculations, and the System time is the time consumed on behalf of the algorithm by the operating system; the CPU time for each algorithm is the sum of the User and System times. The word "killed" means that the execution of the algorithm was terminated by the operating system because the memory demands of the algorithm exceeded the available memory of the computer. In many of the examples the classic LCS algorithm was killed, while the DFS-LCS algorithm completed.

Table 2: Experimental results of time usage. [The numeric entries of this table were garbled in the source and are not reproduced.]

The results presented in Table 3 concern the memory required by each algorithm to solve the LCS problem. The first 5 columns represent the same data as in the previous table. The next two columns refer to the DFS-LCS algorithm and the last one to the classic LCS algorithm. The columns named Memory give the maximum memory consumed by each algorithm, measured in megabytes. The second column under the DFS-LCS algorithm is the cache memory mentioned earlier; its size is included in the Memory column. In the larger examples the classic LCS algorithm was killed, while the DFS-LCS algorithm completed.

Table 3: Experimental results of memory usage. [The numeric entries of this table were garbled in the source and are not reproduced.]
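The random test strings described above were produced with the operating system's C random function; an equivalent Python sketch (illustrative only, not the paper's generator):

```python
import random

def random_string(length, alphabet):
    """Sample `length` characters uniformly from `alphabet`."""
    return "".join(random.choice(alphabet) for _ in range(length))

X = random_string(100000, "ACGT")   # a large X
Y = random_string(100, "ACGT")      # a small Y, as in several of the experiments
```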

Chart 1: Time usage with a large X and a small Y (X and Y were created using the same alphabet, which increases from example to example). [Chart data not reproduced.]

Chart 1 illustrates that, for both algorithms, the increase in the alphabet size does not severely affect the execution time. The DFS-LCS algorithm has small time requirements because it does not have large demands in memory. The classic LCS algorithm's time is significantly larger due to the vast amount of memory used, which causes swapping to the hard disk of the computer and requires the operating system to transfer blocks between the disk and the RAM. In the second example, for instance, the CPU time for the classic LCS algorithm is 7.17 seconds, which is 22.2 times the CPU time the DFS-LCS algorithm required to solve the same problem. Our next experiment focuses on the time required by the DFS-LCS algorithm to find the longest common subsequence with a large, fixed X. The length of Y increases by 100 characters in each example, and both strings were constructed using the same alphabet of 4 different characters. In the corresponding chart we can see the increase of the execution time as the length of Y increases.

Chart 2: Time usage with a large X and a small Y (X and Y have the same alphabet, and the size of Y increases by 100 characters in each example). [Chart data not reproduced.]

As mentioned earlier, the algorithm depends on the size of string Y. When the same alphabet is used to construct strings X and Y, using a string Y with length greater than 1500 characters causes the system to run out of both memory and swap memory. This situation does not occur when the alphabet of Y is larger than the alphabet of X: in this case the algorithm takes advantage of the difference and computes the LCS of X and Y, and the length of Y can increase significantly. We should also mention that the classic LCS algorithm could not compute the LCS for the examples used in this experiment, because it caused the system to run out of memory. Our next experiment consists of a string X of small length (3000 characters) and a string Y whose length increases in each example; both strings were created from 4 different characters. Charts 3 and 4 show the time requirements for both algorithms. In chart 3, when the length of Y reaches 1600 the algorithm runs out of memory and cannot compute the LCS.

Charts 3 & 4: Time usage with a small X and a small Y (X and Y have the same alphabet, and the size of Y increases by 100 characters in each example). [Chart data not reproduced.]

The following experiment illustrates the amount of memory that each algorithm requires to solve the LCS problem when using a large X and a small Y (100 characters). In the first example we can see that the classic LCS algorithm requires 50.8 times the memory the DFS-LCS algorithm does for solving the same problem.

Charts 5 & 6: Memory usage with a large X and a small Y (X and Y were created using the same alphabet, which increases from example to example). [Chart data not reproduced.]

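The memory gap reported in the charts can be sanity-checked with a back-of-envelope calculation: the classic algorithm must allocate the full DP table, so its footprint is roughly (|X|+1)·(|Y|+1) cells. The lengths below are hypothetical illustration values (the exact string lengths used in the experiments are not reproduced here), assuming 4-byte integer cells:

```python
def dp_table_bytes(m, n, cell_bytes=4):
    """Size in bytes of the full (m+1) x (n+1) classic LCS table."""
    return (m + 1) * (n + 1) * cell_bytes

# Hypothetical example: a long X against a 100-character Y.
size = dp_table_bytes(500_000, 100)
print(size / 2**20)  # on the order of a couple of hundred MiB
```

Even a modest Y of a few hundred characters multiplies this footprint proportionally, which is consistent with the classic algorithm exhausting memory in the experiments above.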
The difference in the alphabet size of X and Y affects the memory required for the execution of the DFS LCS algorithm. As mentioned before, a character found in both X and Y is registered in the cache memory. This memory increases in size as more common characters are found between the two strings. In the case examined, where characters included in Y are not included in X, there is a smaller number of registrations in the cache memory and therefore the DFS LCS algorithm requires even less memory, as shown in charts 7 and 8. In the fourth example the classic LCS algorithm requires 98.2 times the memory the DFS LCS algorithm does.

[Charts 7 & 8: Memory usage with a large X and a small Y (X was created from an alphabet with 4 different characters and Y was created using an alphabet which increases in size from example to example); y-axis in megabytes; series: memory and cache for the DFS LCS algorithm, memory for the Classic LCS algorithm.]

In our last experiment we used a large X and a Y whose length increases by 100 characters in each example. The Classic LCS algorithm could not compute the solution to these problems because it caused the operating system to run out of memory.

[Chart 9: Memory usage with a large X and a Y whose length increases by 100 in every example (X and Y were created using the same alphabet); y-axis in megabytes; series: memory and cache for the DFS LCS algorithm.]

5. CONCLUSIONS

The algorithm presented in this paper illustrates a new approach to the LCS problem. We have demonstrated that in all cases in which the length of Y is small, the algorithm produces the output (the LCS) in execution time that either does not differ significantly from the execution time of the classic LCS algorithm or is better. Also, when using this algorithm there

are no specific limitations on the length of X. On the contrary, the classic LCS algorithm tested on the same examples required a significantly larger amount of memory and produced delays due to swapping to the hard drive. This algorithm depends not only on the size of Y but also on the size of the alphabet used to produce the two strings. In cases where the alphabet used to construct Y is significantly larger than the one used to produce X, the length of Y can increase notably. The classic LCS algorithm cannot exploit these cases and therefore produces worse results.

References

[1] Michael T. Goodrich, Roberto Tamassia (2001), Algorithm Design, pp., Wiley Publications.
[2] James W. Hunt, Thomas G. Szymanski (1977), A Fast Algorithm for Computing Longest Common Subsequences, Communications of the ACM, 20, pp.
[3] H. Goeman, M. Clausen (1999), A New Practical Linear Space Algorithm for the Longest Common Subsequence Problem, The Prague Stringology Club Workshop.
[4] Ronald I. Greenberg (2002), Fast and Simple Computation of All Longest Common Subsequences, arXiv:cs.DS/ v1.
[5] Ronald I. Greenberg (2003), Computing the Number of Longest Common Subsequences, arXiv:cs.DS/ v1.
[6] L. Bergroth, H. Hakonen, T. Raita (2000), A Survey of Longest Common Subsequence Algorithms, Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (IEEE).
[7] Maxime Crochemore, Costas S. Iliopoulos, Yoan J. Pinzon (2003), Speeding-up Hirschberg and Hunt-Szymanski LCS Algorithms, Fundamenta Informaticae, 56, pp.
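A closing note on the dependence on Y: when only the length of the LCS is needed, the classic table can be collapsed to two rows of |Y|+1 cells. This is the standard linear-space refinement of the dynamic-programming recurrence (a generic technique, not the paper's DFS LCS algorithm), and it makes the same point the conclusions do — that memory can be made to scale with the shorter string:

```python
def lcs_length_two_rows(x, y):
    """LCS length using O(|y|) extra memory by keeping only two DP rows."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0] * (len(y) + 1)
        for j, yj in enumerate(y, start=1):
            # Same recurrence as the full table, computed one row at a time.
            curr[j] = prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(y)]
```

Recovering the subsequence itself in linear space requires more machinery (e.g. Hirschberg's divide-and-conquer scheme, as discussed in reference [7]).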


More information

Evaluating find a path reachability queries

Evaluating find a path reachability queries Evaluating find a path reachability queries Panagiotis ouros and Theodore Dalamagas and Spiros Skiadopoulos and Timos Sellis Abstract. Graphs are used for modelling complex problems in many areas, such

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part I: Operating system overview: Memory Management 1 Hardware background The role of primary memory Program

More information

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics

More information

The Language for Specifying Lexical Analyzer

The Language for Specifying Lexical Analyzer The Language for Specifying Lexical Analyzer We shall now study how to build a lexical analyzer from a specification of tokens in the form of a list of regular expressions The discussion centers around

More information

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS TRIE BASED METHODS FOR STRING SIMILARTIY JOINS Venkat Charan Varma Buddharaju #10498995 Department of Computer and Information Science University of MIssissippi ENGR-654 INFORMATION SYSTEM PRINCIPLES RESEARCH

More information

Enemy Territory Traffic Analysis

Enemy Territory Traffic Analysis Enemy Territory Traffic Analysis Julie-Anne Bussiere *, Sebastian Zander Centre for Advanced Internet Architectures. Technical Report 00203A Swinburne University of Technology Melbourne, Australia julie-anne.bussiere@laposte.net,

More information

Outline. Computer Science 331. Course Information. Assessment. Contact Information Assessment. Introduction to CPSC 331

Outline. Computer Science 331. Course Information. Assessment. Contact Information Assessment. Introduction to CPSC 331 Outline Computer Science 331 Introduction to CPSC 331 Mike Jacobson Department of Computer Science University of Calgary Lecture #1 1 Contact Information 2 3 Expected Background 4 How to Succeed 5 References

More information

Optimizing Fusion iomemory on Red Hat Enterprise Linux 6 for Database Performance Acceleration. Sanjay Rao, Principal Software Engineer

Optimizing Fusion iomemory on Red Hat Enterprise Linux 6 for Database Performance Acceleration. Sanjay Rao, Principal Software Engineer Optimizing Fusion iomemory on Red Hat Enterprise Linux 6 for Database Performance Acceleration Sanjay Rao, Principal Software Engineer Version 1.0 August 2011 1801 Varsity Drive Raleigh NC 27606-2072 USA

More information

A GPU Algorithm for Comparing Nucleotide Histograms

A GPU Algorithm for Comparing Nucleotide Histograms A GPU Algorithm for Comparing Nucleotide Histograms Adrienne Breland Harpreet Singh Omid Tutakhil Mike Needham Dickson Luong Grant Hennig Roger Hoang Torborn Loken Sergiu M. Dascalu Frederick C. Harris,

More information

Informatics 1. Lecture 1: Hardware

Informatics 1. Lecture 1: Hardware Informatics 1. Lecture 1: Hardware Kristóf Kovács, Ferenc Wettl Budapest University of Technology and Economics 2017-09-04 Requirements to pass 3 written exams week 5, 9, 14 each examination is worth 20%

More information

So far... Finished looking at lower bounds and linear sorts.

So far... Finished looking at lower bounds and linear sorts. So far... Finished looking at lower bounds and linear sorts. Next: Memoization -- Optimization problems - Dynamic programming A scheduling problem Matrix multiplication optimization Longest Common Subsequence

More information

CSCI 5454 Ramdomized Min Cut

CSCI 5454 Ramdomized Min Cut CSCI 5454 Ramdomized Min Cut Sean Wiese, Ramya Nair April 8, 013 1 Randomized Minimum Cut A classic problem in computer science is finding the minimum cut of an undirected graph. If we are presented with

More information