Study of Data Localities in Suffix-Tree Based Genetic Algorithms

Carl I. Bergenhem, Michael T. Smith

Abstract. This paper studies the cache locality of two genetic algorithms built on the Suffix Tree structure, and describes the cache performance of the Suffix Tree itself.

Keywords. Suffix Tree, SimpleScalar, REPuter, Probe Selection Problem Algorithm, Cache Aware

1. Introduction

Suffix Trees are a well-known data structure for algorithms that require string comparisons. A Suffix Tree can be used for various problems such as suffix matching, substring matching, index-at, longest common substring, and genome-related applications such as string merging. A Suffix Tree can solve most of these problems in O(m) time, where m is the length of the query substring. It is the defining structure of the Suffix Tree that enables this kind of quick search time.

One of the most basic implementations of this structure is a Suffix Trie. This implementation starts by defining a root node and then attaches the full input string, a suffix of length n (where n is the length of the input string), character by character. It then adds the suffix of length n-1, obtained by removing the first character of the string. The algorithm continues through the entire string until the suffix consisting only of the terminating character $ has been added, which yields a complete Suffix Trie.

Fig 1. Suffix Trie Generated from Cocoa
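
The Suffix Trie construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this study: the names TrieNode, build_suffix_trie, count_nodes, and find_all are hypothetical, and the lookup routine assumes that each leaf stores the index at which its suffix starts.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.start = None    # set on leaves: index where the suffix begins


def build_suffix_trie(text):
    """Insert every suffix of text, one character per node (O(n^2) nodes worst case)."""
    root = TrieNode()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            # Re-use an existing node when one matches, otherwise create it.
            node = node.children.setdefault(ch, TrieNode())
        node.start = i
    return root


def count_nodes(node):
    """Total number of nodes in the trie, including the root."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


def find_all(root, pattern):
    """Return the sorted start indexes of every occurrence of pattern."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []        # mismatch: pattern does not occur
    found, stack = [], [node]
    while stack:             # collect every leaf below the matched node
        current = stack.pop()
        if current.start is not None:
            found.append(current.start)
        stack.extend(current.children.values())
    return sorted(found)


if __name__ == "__main__":
    trie = build_suffix_trie("cocoa$")
    print(count_nodes(trie), find_all(trie, "co"))
```

For the input cocoa$ the sketch builds 19 nodes (the root plus one node per distinct non-empty substring), and looking up co returns the start indexes 0 and 2.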

The Suffix Trie construction gives each character its own node, unless a previously attached node can be re-used for the suffix currently being attached to the Trie. To find a matching substring within this structure, one starts from the root node and matches the first character of the input substring against the children of the root. When a match is found, one matches the second character of the input substring against the children of the node that matched the previous character, and so on until either a full match has been found or a mismatch occurs. When a full match has occurred, one traverses the subtree rooted at the last matching node until all leaf nodes (nodes associated with the terminating character $) have been found. These leaf nodes contain the index at which their suffix starts in the initial string. This implementation has the O(m) search time desired for a Suffix Trie; however, building it can take as long as O(n^2). The overall memory efficiency of this implementation is also very low, with a worst-case space requirement of O(n^2).

A more efficient version of the Suffix Tree algorithm is the Compressed Suffix Tree (alternatively: Suffix Tree). This implementation removes the redundancy that comes with the Suffix Trie and is more efficient in both runtime and space. The most obvious difference between the compressed and uncompressed tree structures is the number of nodes.

Fig 2. Compressed Suffix Tree generated from Cocoa

Within the compressed structure each node has a label whose length can be anywhere from 1 to n, where n is the size of the input string.
To achieve this during construction, the algorithm searches the labels of the relevant nodes in the current, incomplete tree until a partial match, a full match, or no match is found. When a partial match is found, a node is created that separates the characters the existing branch shares with the current suffix from the unmatched characters. The current suffix can then reuse the matched characters and attach only its remaining characters to this new node. This compression reduces the space requirement from O(n^2) to O(n); reaching O(n) construction time as well requires a more refined construction algorithm. The Compressed Suffix Tree was not perfected until Esko Ukkonen published his construction algorithm. Previous Suffix Tree constructions were not online algorithms; in other words, they had to know the entire input before construction could start. With Ukkonen's algorithm, the Suffix Tree can be built character by character while the input is read from left to right (some earlier constructions processed the input backwards).

Even though the Suffix Tree structure has reached these theoretical run times and space requirements, there is always the issue of the real world. In practice, the running time can be far worse than these estimates for several reasons. The one we focus on is the degradation of Suffix Tree performance due to poor cache behavior. In all Suffix Tree implementations, nodes are stored in memory in the order in which they are created as the tree is built. That order generally differs from the order in which a search visits them, so there is a low probability that every node traversed during a search results in a cache hit.

Fig 3. Example hits and misses throughout a simple search

As seen in Figure 3, when a search pattern is followed there can be a high number of misses (assuming the cache block size is large enough for a single node) if the pattern visits nodes that are scattered across memory. For each miss, the CPU spends a number of cycles fetching the data from another level of memory, which delays the execution of the instructions that depend on that cache block. This, along with other factors, can result in actual runtimes that are far worse than the runtimes computed theoretically.

In order to match the theoretical values with the actual runtime, one can modify algorithms to become either cache aware or cache oblivious. A cache-aware algorithm is tuned to the type of cache implemented by the system running it, and reduces the number of cache misses by adhering to that specific system. A cache-oblivious algorithm maintains the same consistency in runtime as the theoretical values regardless of what kind of cache the host system is utilizing.

2. Previous Work

Many people have delved into the issue of cache performance with regard to algorithm and data structure performance. As our work is based on two genetic algorithms and the supporting Suffix Tree data structure, the most relevant research concerns the memory performance of tree data structures and the implementations of the Suffix Tree algorithm and the genetic algorithms. Our work could not have been accomplished without the previous work of our referenced authors. The primary resources we used were guides dealing with the installation and use of SimpleScalar and standard implementations of the Suffix Tree algorithm. Our future work will rely more on the cited thesis papers dealing with cache-aware data structures.

3. Methodology

To observe and track the performance of our algorithms we used the SimpleScalar suite. The suite is a collection of programs that lets a user specify a CPU architecture and then simulate that architecture running code written in FORTRAN or C. This allows a user to write a program and then measure how well the code would perform on a specific architecture. For this research we used the program sim-cache. Sim-cache is a cache simulator that simulates the L1 instruction and data caches as well as the L2 cache. It also allows the user to specify the level of associativity and the replacement algorithm to use on cache misses. To simulate the CPU with the given code, a cross-compiler in the Linux environment is needed to compile the code specifically for SimpleScalar.
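
To make the cache parameters mentioned above (cache size, associativity, replacement policy) concrete, the following is a minimal Python model of a set-associative cache. It is not sim-cache and does not reproduce its options or output. The class name Cache, the function traversal_hit_rate, the 32-byte block size, the 4-byte element size, and the LRU replacement policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class Cache:
    """Toy set-associative cache with LRU replacement (illustrative only)."""

    def __init__(self, size_bytes, block_bytes, ways):
        self.block_bytes = block_bytes
        self.ways = ways
        self.num_sets = size_bytes // (block_bytes * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.block_bytes
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)        # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(lines) == self.ways:
                lines.popitem(last=False)   # evict the least recently used block
            lines[block] = True

    def hit_rate(self):
        return self.hits / (self.hits + self.misses)


def traversal_hit_rate(rows, cols, row_major, elem_bytes=4,
                       size_bytes=1024, block_bytes=32, ways=1):
    """Hit rate for visiting a rows x cols array stored in row-major order."""
    cache = Cache(size_bytes, block_bytes, ways)
    outer, inner = (rows, cols) if row_major else (cols, rows)
    for a in range(outer):
        for b in range(inner):
            r, c = (a, b) if row_major else (b, a)
            cache.access((r * cols + c) * elem_bytes)
    return cache.hit_rate()


if __name__ == "__main__":
    print(traversal_hit_rate(256, 256, row_major=True))
    print(traversal_hit_rate(256, 256, row_major=False))
```

Under these assumed parameters (a 1 kilobyte direct-mapped cache and a 256 x 256 array), visiting the elements in storage order hits on 7 of every 8 accesses (87.5%), while visiting them with a stride of one full row never hits, because every access in a column maps to the same cache set.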
Once the code is compiled, running it with sim-cache generates an output file that can be read with any Linux text editor. This output file contains detailed information such as the total number of cache references, misses, and hits. A guide for SimpleScalar installation, along with the commands needed to run sim-cache, is included in section 4.2.

3.1 REPuter Algorithm

The REPuter algorithm is used by genetic researchers to find maximal repeats in a given genomic sequence. A maximal repeat is any sub-section of the genome that appears in multiple locations within the genome. A simple example would be the string an within banana. A valid maximal repeat is any substring of the given text that appears at least twice and has a length of at least a set threshold. For the above example to be valid, the length threshold would have to be set to 2.

A slightly larger example is as follows. Take the sequence banabana and a threshold of two. This sequence has several maximal repeats, as the conditions are that the substring appears at least twice and that its length is at least the threshold. Thus, bana appears twice, ana appears twice, an appears twice, and na appears twice.

Applied to the genome, this algorithm allows researchers to find recurring motifs within a DNA sequence. It can also become part of a larger algorithm that finds recurring sequences with minor mutations. What makes this algorithm so powerful is the data structure it is implemented with, the Suffix Tree. The Suffix Tree allows all repeating substrings, and their locations, to be found efficiently. This means that the algorithm runs in time and space linear in the length of the genome sequence being operated on.

The algorithm operates in the following way. Starting from the root, and for every node thereafter, proceed as follows.

    REPUTER(Node current_node)
        If the current node is a leaf node, return 1 as a counter (marks an occurrence)
        If the current node is not a leaf
            Keep a sum starting at 0
            For each child node/path, add the result of calling REPUTER on the child node to the sum
            If the sum is at least 2 (the number of occurrences)
            and the length of the common string is at least the threshold
                A maximal repeat has been found
            Return the sum to keep a tally for the parent nodes

The above over-simplified algorithm will find all the maximal repeats; however, a few details have been left out for ease of understanding. What is clear from the description, though, is that the whole tree must be traversed.
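
The simplified REPUTER steps above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the REPuter implementation itself: the tree is built with a naive quadratic-time insertion instead of Ukkonen's algorithm, the input is assumed to end with a unique terminator such as $, and the names Node, build_suffix_tree, reputer, and find_repeats are hypothetical.

```python
class Node:
    def __init__(self, start, end):
        self.start = start    # the edge label into this node is text[start:end]
        self.end = end
        self.children = {}    # first character of a child edge -> Node


def build_suffix_tree(text):
    """Naive O(n^2) compressed suffix tree; text must end with a unique terminator."""
    assert text and text.count(text[-1]) == 1, "need a unique terminator such as $"
    n = len(text)
    root = Node(0, 0)
    for i in range(n):                      # insert the suffix text[i:]
        node, j = root, i
        while True:
            child = node.children.get(text[j])
            if child is None:               # no match: attach the rest as a leaf
                node.children[text[j]] = Node(j, n)
                break
            k = child.start
            while k < child.end and text[k] == text[j]:
                k += 1
                j += 1
            if k == child.end:              # whole edge matched: descend
                node = child
                continue
            # Partial match: split the edge at position k.
            mid = Node(child.start, k)
            node.children[text[mid.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[j]] = Node(j, n)
            break
    return root


def reputer(node, depth, threshold, text, repeats):
    """Return the number of leaves below node; record qualifying internal nodes."""
    if not node.children:
        return 1                            # a leaf marks one occurrence
    total = 0
    for child in node.children.values():
        total += reputer(child, depth + child.end - child.start,
                         threshold, text, repeats)
    if total >= 2 and depth >= threshold:
        # The length comes from stored indexes; no character comparison is needed.
        repeats[text[node.end - depth:node.end]] = total
    return total


def find_repeats(text, threshold):
    repeats = {}
    reputer(build_suffix_tree(text), 0, threshold, text, repeats)
    return repeats


if __name__ == "__main__":
    print(find_repeats("banabana$", 2))
```

For banabana$ with a threshold of 2, this sketch reports bana, ana, and na, each occurring twice. A repeat such as an, which ends in the middle of an edge, is covered by the longer label ana and is not reported separately; this is one of the details the simplified description leaves out.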
The operation at each node takes constant time: finding the length is a simple operation using the start and end indexes stored within the node, and no actual character comparisons need to be performed, since an internal node means all its children share that substring. That is the power that the Suffix Tree offers the REPuter algorithm: the ability to find all maximal repeats while only iterating a number of times linearly proportional to the size of the input sequence.

The problem encountered while measuring the cache hit ratio of the algorithm with SimpleScalar was the sim-cache tool configuration. Using a file with a sample sequence of 1 million characters consisting of A, C, G, and T resulted in hit rates that were implausibly high. Building the tree itself gave a hit rate in the upper 90s percent, and the REPuter algorithm running on top of that was only slightly lower. This should not happen, because the Suffix Tree sprawls across many memory blocks: nodes are inserted out of order, so the last inserted node could be the first node from the root. We ran our tests with a 1 kilobyte first-level data cache and a tree size of around 20 MB (based on a 20-byte node).

When the tree is constructed, different paths of the tree are accessed in varying orders, which alone should yield a low hit rate, as the algorithm has low spatial locality. The REPuter function should not perform much better, as a full tree traversal must be performed. In this traversal, each path consists of nodes in different memory blocks, all of which must be loaded for that path to be evaluated. When the next path is traversed, different blocks must be loaded, or the same blocks in a different order, leading to constant replacements within the data cache.

3.2 Probe Selection Problem (PSP) Algorithm

To identify viruses that cause diseases and to control the quality of items in the food industry, DNA arrays are very popular for fast identification of biological agents present in a given sample. A large part of this is the selection of oligos that are to be attached to the array surface. Given a set A of genomic sequences, one has to find at least one oligonucleotide (probe) for each sequence S. This probe must be chosen so that it does not hybridize with any sequence other than its target. Also, all probes must hybridize to their specific targets under the same reaction conditions. The most important condition is the temperature T under which the experiment is conducted. The Probe Selection Problem Algorithm, using the Suffix Tree structure, allows the temperature T to be computed efficiently.

Before any aspect of the Suffix Tree could be modified, an understanding and an implementation were required. Initially a simple program was written in Java. Through a graphical interface, this program allowed a user to load the string for the suffix tree from a text file.
It also allowed a search to be done on the tree, giving as output all occurrences of the substring within the original string. Another feature generates a random string of length L consisting only of A, T, C, and G, along with a substring of length K over the same letters, to allow randomization of the experiments. Unfortunately, the later use of SimpleScalar forced the use of the C language. The installation of SimpleScalar raised another string of problems: the latest cross-compiler designed for SimpleScalar turned out to be severely outdated, and thus an old version of Linux was required in order to configure SimpleScalar. Once set up on Red Hat Linux 9, SimpleScalar was configured and sim-cache was tested on a simple program.

Once an implementation of the Suffix Tree was written, it was run through the sim-cache utility with cache sizes ranging from 0.5 to 8 kilobytes and 1- to 8-way associativity. All CPU configurations had direct-mapping as the replacement structure. The results, however, were not what we expected. According to the output files generated by sim-cache, the hit rate of the Suffix Tree implementation ranged from 97% to 99% during creation and between 95% and 97% for a search. As seen in the previous example, when searching for a substring within the suffix tree the expected hit rate should be around 50% or 60%. To confirm that the SimpleScalar suite was working correctly, a simple program was designed that generated a two-dimensional array and filled each entry with a number. Then a traversal of the array was done both row-wise and column-wise. These different forms of traversal should have yielded a large difference in the hit ratio: for the row-wise traversal the next index in the array is most likely in the same cache block as the current one, making the miss ratio fairly small. For the column-wise traversal, however, there should be a cache miss for almost every index that is traversed.

4. Conclusion

Although our theoretical computations predicted a hit ratio around 50%, when run through sim-cache the implementations of the Suffix Tree had hit ratios around 98%. Even the check program, a simple two-dimensional array traversed row-wise as well as column-wise, produced high hit ratios. This was especially unexpected for the column-wise traversal, which theoretically should have a lower hit rate than the row-wise traversal. This hints towards the conclusion that there is an issue with the SimpleScalar suite. Whether this issue came from our usage of sim-cache or from sim-cache itself is left for further investigation. The fact that both programs yielded much higher results than expected is at least consistent, and thus a claim can still be made that the implementations of the Suffix Tree data structure are correct and can be used for future research on the Suffix Tree.

4.1 Future Work

The value of our work as presented in this paper is that it will serve as a launch pad for exploring modifications to the algorithms and their impact on runtime. As the procedures and commands have now been documented on more up-to-date systems, the SimpleScalar suite can be used easily and effectively to monitor cache performance, along with the many other tools it offers. A last hurdle is understanding why the sim-cache simulator yielded such high hit rates when they should be much lower.
However, once that is past, serious modifications and improvements can be made to the Suffix Tree creation algorithm and the two genetic algorithms, allowing for decreased actual runtime and increased productivity for the researchers who rely on these tools. One of the larger modifications that could be made to the Suffix Tree is a reconstruction of how the tree is allocated in memory. Although nodes are created out of order compared to the way they may be accessed later, a simple mechanism that allocates related nodes (parents and children) to the same block in memory in constant time would dramatically improve the performance of tree traversals.

4.2 SimpleScalar Installation Guide

As a major complication arose in understanding the use of the SimpleScalar toolset, the following is a brief guide on how to use the rather unmaintained SimpleScalar simulator package. The following was tested on an Ubuntu 7.04 system. The source files and 'installer' were created by Cameron Palmer and are hosted by csrl.unt.edu. From a terminal window run the following commands:

    sudo apt-get install subversion
    svn co http://csrl.unt.edu/svn/simplescalar
    sudo apt-get install bison
    sudo apt-get install g++-3.3 gcc-3.3
    cd simplescalar/
    sudo sh simpleinstaller-little.sh

The above takes care of the installation. To compile your programs and run them, the following two commands, run from the simplescalar/ directory, will work:

    bin/sslittle-na-sstrix-gcc -o simple_program simple_program.c
    simplesim-3.0/sim-cache simple_program

5. Acknowledgements

This project was supported in part by the National Science Foundation Grant CCF-0755373, and was supervised by Professor Chun-Hsi Huang.

6. References

1. SimpleScalar 3.0 BBSWiki, http://wiki.bigbuddysociety.net/index.php?title=simplescalar_3.0
2. Data Structures, Algorithms, & Applications in Java: Suffix Trees, http://www.cise.ufl.edu/~sahni/dsaaj/enrich/c16/suffix.htm
3. Thomas B. Puzak: The Effects of Spatial Locality on the Cache Performance of Binary Search Trees, MS Thesis, University of Connecticut Department of Computer Science
4. Stefan Kurtz, Chris Schleiermacher: REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics Applications. Vol 15, 426-427
5. ANSI C implementation of a Suffix Tree, http://mila.cs.technion.ac.il/~yona/suffix_tree/
6. SimpleScalar LLC, http://www.simplescalar.com
7. SimpleScalar Evolved: Archived Mail, http://csrl.unt.edu/pipermail/research/2007-August/000070.html
8. Growing A Suffix Tree, http://pauillac.inria.fr/~quercia/documents-info/luminy-98/albert/JAVA+html/SuffixTreeGrow.html
9. Fast String Searching with Suffix Trees, http://marknelson.us/1996/08/01/suffixtrees/
10. dynamicsimplescalar, http://www-ali.cs.umass.edu/dss/
11. Suffix Tree, http://www.allisons.org/ll/algds/tree/suffix/