Study of Data Localities in Suffix-Tree Based Genetic Algorithms


Carl I. Bergenhem, Michael T. Smith

Abstract. This paper studies the cache locality of two genetic algorithms built on the Suffix Tree data structure, and describes the cache performance of the Suffix Tree itself.

Keywords. Suffix Tree, SimpleScalar, REPuter, Probe Selection Problem Algorithm, Cache Aware

1. Introduction

Suffix Trees are a well-known data structure for algorithms that require string comparisons. A Suffix Tree can be used for problems such as suffix matching, substring matching, index-at, longest common substring, and genome-related applications such as string merging. A Suffix Tree can solve most of these problems in O(m) time, where m is the length of the query string. It is the defining structure of the Suffix Tree that enables this quick search time.

One of the most basic implementations of this structure is the Suffix Trie. Construction starts by defining a root node and attaching to it the full suffix of length n (n is the length of the input string), beginning at the first character of the input. It then attaches the suffix of length n-1, obtained by dropping the first character, and so on. The algorithm proceeds through the entire string until the suffix consisting only of the terminating character $ has been attached, which yields a complete Suffix Trie.

Fig 1. Suffix Trie generated from "cocoa"

This construction gives each character its own node, unless a previously attached node can be reused for the suffix currently being attached to the Trie. To find a matching substring within this structure, one starts from the root node and matches the first character of the query against the children of the root. When a match is found, the second character of the query is matched against the children of the node that matched the previous character, and so on until either a full match has been found or a mismatch occurs. After a full match, one traverses the subtree rooted at the last matching node until all of its leaf nodes (nodes associated with the terminating character $) have been found. Each leaf node contains the index at which its suffix starts in the original string.

This implementation has the O(m) search time that is desired for a Suffix Trie; however, building it can take as long as O(n^2). Its memory efficiency is also very low, with a worst-case space requirement of O(n^2).

A more efficient version is the Compressed Suffix Tree (or simply Suffix Tree). This implementation removes the redundancy of the Suffix Trie and is more efficient in both runtime and space. The most obvious difference between the compressed and uncompressed structures is the number of nodes.

Fig 2. Compressed Suffix Tree generated from "cocoa"

Within the compressed structure each node has a label whose length can be anywhere from 1 to n, where n is the length of the input string.
To achieve this during construction, the algorithm searches the labels of the relevant nodes in the current, incomplete tree until a partial or full match is found, or no match is found. When a partial match is found, a node is created that separates the characters the existing branch shares with the current suffix from the characters that do not match. The current suffix can then reuse the matched characters and attach its remaining characters below this new node. The compressed representation reduces the space requirement from O(n^2) to O(n), and with a linear-time construction the build time drops to O(n) as well.

The Compressed Suffix Tree was not perfected until Esko Ukkonen published his construction algorithm. Earlier constructions were not online algorithms; in other words, they had to know the entire input before construction could start. Ukkonen's algorithm not only builds the Suffix Tree character by character, it also reads the input from left to right, whereas some earlier constructions processed the input backwards.

Even though the Suffix Tree has reached these theoretical bounds on running time and space, practice is another matter. The actual running time can be far worse than these estimates for several reasons. The one we focus on is the degradation of the Suffix Tree due to poor cache performance. In every implementation of a Suffix Tree, nodes are stored in memory in the order in which they are created. That order generally differs from the order in which a search visits them, so there is a low probability that every node traversed during a search produces a cache hit.

Fig 3. Example hits and misses throughout a simple search

As seen in Figure 3, a search can incur a large number of misses (assuming the cache block size is large enough for a single node) if the nodes on the search path are scattered across memory. Each miss costs a number of CPU cycles to fetch the data from another memory source, which delays the instructions that depend on that cache block. This, along with other factors, can result in actual runtimes that are far worse than the runtimes computed theoretically. To bring the actual runtime closer to the theoretical values, one can modify an algorithm to become either cache aware or cache oblivious.
A cache-aware algorithm is tuned to the type of cache used by the system it runs on, and reduces the number of cache misses by adhering to that specific system. A cache-oblivious algorithm is designed to use the cache efficiently without knowing its parameters, so its runtime stays close to the theoretical values regardless of what kind of cache the host system uses.

2. Previous Work

Many researchers have studied how cache behaviour affects the performance of algorithms and data structures. As our work is based on two genetic algorithms and the supporting Suffix Tree data structure, the most relevant prior work concerns the memory performance of tree data structures and implementations of the Suffix Tree algorithm and of the genetic algorithms. Our work could not have been accomplished without the previous work of our referenced authors. The primary resources we used were guides to the installation and use of SimpleScalar and standard implementations of the Suffix Tree algorithm. Our future work will rely more on the cited thesis papers dealing with cache-aware data structures.

3. Methodology

To observe and track the performance of our algorithms we used the SimpleScalar suite. The suite is a collection of programs that lets a user specify a CPU architecture and then simulate code written in FORTRAN or C on that architecture, so that one can measure how well the code would perform on it. For this research we used the program sim-cache, a cache simulator that simulates the L1 instruction and data caches as well as the L2 cache. It also lets the user specify the level of associativity and the replacement algorithm to use on cache misses. To simulate the CPU with the given code, a cross-compiler in the Linux environment is needed to compile the code specifically for SimpleScalar.
Once the code is compiled, running it under sim-cache generates an output file that can be read with any Linux text editor. This output file contains detailed information such as the total number of cache references, misses, and hits. A guide to installing SimpleScalar, along with the commands needed to use sim-cache, is included in Section 4.2.

3.1 REPuter Algorithm

The REPuter algorithm is used by genetic researchers to find maximal repeats in a given genomic sequence. For the purposes of this paper, a maximal repeat is any subsection of the genome that appears in multiple locations within the genome. A simple example is the string "an" within "banana". A valid maximal repeat is any substring of the given text that appears at least twice and has a length of at least a set threshold. For the above example to be valid, the length threshold would have to be set to 2.

A slightly larger example is as follows. Take the sequence "banabana" and a threshold of two. This sequence has several maximal repeats, since the conditions are that the substring appears at least twice and that its length is at least the threshold. Thus "bana", "ana", "an", and "na" each appear twice, among others.

Applied to a genome, this algorithm lets researchers find recurring motifs within a DNA sequence. It can also become part of a larger algorithm that finds recurring sequences with minor mutations. What makes the algorithm so powerful is the data structure it is implemented with, the Suffix Tree, which allows all repeating substrings and their locations to be found efficiently. As a result, the algorithm runs in time and space linear in the length of the genome sequence being operated on.

The algorithm operates as follows, starting from the root and proceeding through every node thereafter.

REPUTER(Node current)
    If current is a leaf node:
        return 1 as a counter (marks an occurrence)
    If current is not a leaf:
        keep a sum starting at 0
        for each child node/path:
            add the result of calling REPUTER on the child to the sum
        if the sum is at least 2 (the number of occurrences)
        and the length of the common string is at least the threshold:
            a maximal repeat has been found
        return the sum to keep a tally for the parent nodes

The above over-simplified algorithm will find all the maximal repeats; a few details have been left out for ease of understanding. What is clear from the description, though, is that the whole tree must be traversed.
The operation at each node takes constant time: finding the length is a simple computation on the start and end indexes stored in the node, and no character comparisons need to be performed, since an internal node means that all of its children share that substring. That is the power the Suffix Tree offers the REPuter algorithm: the ability to find all maximal repeats while performing a number of steps linearly proportional to the size of the input sequence.

The problem we encountered while measuring the cache hit ratio of the algorithm with SimpleScalar was the configuration of the sim-cache tool. Using a file with a sample sequence of 1 million characters consisting of A, C, G, and T resulted in hit rates that were far higher than expected. Building the tree itself produced a hit rate in the upper 90s percent, and the REPuter algorithm running on top of it was only slightly lower. This should not be the case, because the Suffix Tree sprawls across many memory blocks: nodes are inserted out of order, so the last inserted node could be the first node below the root. We ran our tests with a 1-kilobyte first-level data cache and a tree of around 20 MB (based on a 20-byte node).

While the tree is being constructed, paths of the tree are accessed in varying order, which alone should yield a low hit rate because the algorithm has low spatial locality. The REPuter function should not perform much better, since it requires a full tree traversal. Each path consists of nodes in different memory blocks, all of which must be loaded for the path to be evaluated. When the next path is traversed, different blocks must be loaded, or the same blocks in a different order, leading to constant replacements within the data cache.

3.2 Probe Selection Problem (PSP) Algorithm

To identify viruses that cause diseases and to control the quality of items in the food industry, DNA arrays are widely used for fast identification of the biological agents present in a given sample. A large part of this is the selection of the oligos that are attached to the array surface. Given a set A of genomic sequences, one has to find at least one oligonucleotide (probe) for each sequence S. The probe must be chosen so that it does not hybridize with any sequence other than its target. In addition, all probes must hybridize to their specific targets under the same reaction conditions, the most important of which is the temperature T under which the experiment is conducted. The Probe Selection Problem Algorithm, using the Suffix Tree structure, allows the temperature T to be computed efficiently.

Before any aspect of the Suffix Tree could be modified, we needed to understand and implement it. Initially a simple program was written in Java. Through a graphical interface, it allowed a user to load the string for the suffix tree from a text file.
It also allowed a search to be run on the tree, outputting all occurrences of the substring within the original string. Another feature generated a random string of length L consisting only of A, T, C, and G, together with a random substring of length K over the same letters, to allow randomized experiments. Unfortunately, the later use of SimpleScalar forced a switch to the C language. Installing SimpleScalar brought a further string of problems: the latest cross-compiler designed for SimpleScalar turned out to be severely outdated, and thus an old version of Linux was required in order to configure SimpleScalar. Once set up on Red Hat Linux 9, SimpleScalar was configured and sim-cache was tested on a simple program.

Once an implementation of the Suffix Tree was written, it was run through the sim-cache utility with a range of cache sizes in kilobytes, along with 1- to 8-way associativity. All CPU configurations had direct mapping as the replacement structure. The results were not what we expected. According to the output files generated by sim-cache, the hit rate of the Suffix Tree implementation ranged from 97% to 99% during creation and from 95% to 97% during a search. As seen in the previous example, the expected hit rate when searching for a substring within the suffix tree should be around 50% or 60%.

To confirm that the SimpleScalar suite was working correctly, a simple program was written that generated a two-dimensional array and filled each entry with a number. The array was then traversed both row-wise and column-wise. These two forms of traversal should have yielded a large difference in hit ratio: in the row-wise traversal the next element is most likely in the same cache block as the current one or in the adjacent block, making the miss ratio fairly small, whereas in the column-wise traversal there should be a cache miss for almost every element visited.

4. Conclusion

Although our theoretical estimates predicted a hit ratio around 50%, the implementations of the Suffix Tree had hit ratios around 98% when run through sim-cache. Even the check program, a simple two-dimensional array traversed row-wise as well as column-wise, produced high hit ratios, including the column-wise traversal, which theoretically should have a lower hit rate than the row-wise traversal. This hints that there is an issue with the SimpleScalar suite. Whether the issue lies in our usage of sim-cache or in sim-cache itself remains to be investigated. The fact that both programs yielded much higher results than expected is at least consistent, and so we still claim that the implementations of the Suffix Tree data structure are correct and can be used for future research on the Suffix Tree.

4.1 Future Work

The value of the work presented in this paper is that it will serve as a launch pad for exploring modifications to the algorithms and their impact on runtime. As the procedures and commands have now been documented on more up-to-date systems, the SimpleScalar suite can be used easily and effectively to monitor cache performance, along with the many other tools it offers. A last hurdle is understanding why the sim-cache simulator yielded such high hit rates when they should be much lower.
Once that hurdle is past, serious modifications and improvements can be made to the Suffix Tree creation algorithm and the two genetic algorithms, decreasing actual runtime and increasing productivity for the researchers who rely on these tools. One of the larger modifications that could be made to the Suffix Tree is a reconstruction of how the tree is allocated in memory. Although nodes are created in a different order from the one in which they may later be accessed, a simple mechanism that allocates related nodes (parents and children) to the same block in memory in constant time would dramatically improve the performance of tree traversals.

4.2 SimpleScalar Installation Guide

As a major complication arose in understanding the use of the SimpleScalar toolset, the following is a brief guide on how to use the rather unmaintained SimpleScalar simulator package. The following was tested on an Ubuntu 7.04 system. The source files and installer were created by Cameron Palmer and are hosted by csrl.unt.edu. From a terminal window run the following commands:

    sudo apt-get install subversion
    svn co
    sudo apt-get install bison
    sudo apt-get install g gcc-3.3
    cd simplescalar/
    sudo sh simpleinstaller-little.sh

The above takes care of the installation. To compile your programs and run them, the following two commands, run from the simplescalar/ directory, will work:

    bin/sslittle-na-sstrix-gcc -o simple_program simple_program.c
    simplesim-3.0/sim-cache simple_program

5. Acknowledgements

This project was supported in part by the National Science Foundation Grant CCF, and was supervised by Professor Chun-Hsi Huang.

6. References

1. SimpleScalar 3.0 BBSWiki
2. Data Structures, Algorithms, & Applications in Java: Suffix Trees
3. Thomas B. Puzak: The Effects of Spatial Locality on the Cache Performance of Binary Search Trees, MS Thesis, University of Connecticut Department of Computer Science
4. Stefan Kurtz, Chris Schleiermacher: REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics Applications, Vol 15
5. ANSI C implementation of a Suffix Tree
6. SimpleScalar LLC
7. SimpleScalar Evolved: Archived Mail, August/ html
8. Growing A Suffix Tree, 98/albert/JAVA+html/SuffixTreeGrow.html
9. Fast String Searching with Suffix Trees
10. dynamicsimplescalar
11. Suffix Tree

Analysis of parallel suffix tree construction

Analysis of parallel suffix tree construction 168 Analysis of parallel suffix tree construction Malvika Singh 1 1 (Computer Science, Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, India. Email: malvikasingh2k@gmail.com)

More information

An Efficient Algorithm for Identifying the Most Contributory Substring. Ben Stephenson Department of Computer Science University of Western Ontario

An Efficient Algorithm for Identifying the Most Contributory Substring. Ben Stephenson Department of Computer Science University of Western Ontario An Efficient Algorithm for Identifying the Most Contributory Substring Ben Stephenson Department of Computer Science University of Western Ontario Problem Definition Related Problems Applications Algorithm

More information

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises 308-420A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises Section 1.2 4, Logarithmic Files Logarithmic Files 1. A B-tree of height 6 contains 170,000 nodes with an

More information

Given a text file, or several text files, how do we search for a query string?

Given a text file, or several text files, how do we search for a query string? CS 840 Fall 2016 Text Search and Succinct Data Structures: Unit 4 Given a text file, or several text files, how do we search for a query string? Note the query/pattern is not of fixed length, unlike key

More information

New Implementation for the Multi-sequence All-Against-All Substring Matching Problem

New Implementation for the Multi-sequence All-Against-All Substring Matching Problem New Implementation for the Multi-sequence All-Against-All Substring Matching Problem Oana Sandu Supervised by Ulrike Stege In collaboration with Chris Upton, Alex Thomo, and Marina Barsky University of

More information

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42 String Matching Pedro Ribeiro DCC/FCUP 2016/2017 Pedro Ribeiro (DCC/FCUP) String Matching 2016/2017 1 / 42 On this lecture The String Matching Problem Naive Algorithm Deterministic Finite Automata Knuth-Morris-Pratt

More information

High-throughput Sequence Alignment using Graphics Processing Units

High-throughput Sequence Alignment using Graphics Processing Units High-throughput Sequence Alignment using Graphics Processing Units Michael Schatz & Cole Trapnell May 21, 2009 UMD NVIDIA CUDA Center Of Excellence Presentation Searching Wikipedia How do you find all

More information

Lecture 26. Introduction to Trees. Trees

Lecture 26. Introduction to Trees. Trees Lecture 26 Introduction to Trees Trees Trees are the name given to a versatile group of data structures. They can be used to implement a number of abstract interfaces including the List, but those applications

More information

Growth of the Internet Network capacity: A scarce resource Good Service

Growth of the Internet Network capacity: A scarce resource Good Service IP Route Lookups 1 Introduction Growth of the Internet Network capacity: A scarce resource Good Service Large-bandwidth links -> Readily handled (Fiber optic links) High router data throughput -> Readily

More information

11/5/09 Comp 590/Comp Fall

11/5/09 Comp 590/Comp Fall 11/5/09 Comp 590/Comp 790-90 Fall 2009 1 Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Genomic rearrangements are often associated with repeats Trace evolutionary secrets Many tumors

More information

11/5/13 Comp 555 Fall

11/5/13 Comp 555 Fall 11/5/13 Comp 555 Fall 2013 1 Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Phenotypes arise from copy-number variations Genomic rearrangements are often associated with repeats Trace

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)

More information

Analysis of Algorithms

Analysis of Algorithms Algorithm An algorithm is a procedure or formula for solving a problem, based on conducting a sequence of specified actions. A computer program can be viewed as an elaborate algorithm. In mathematics and

More information

Module 4: Index Structures Lecture 13: Index structure. The Lecture Contains: Index structure. Binary search tree (BST) B-tree. B+-tree.

Module 4: Index Structures Lecture 13: Index structure. The Lecture Contains: Index structure. Binary search tree (BST) B-tree. B+-tree. The Lecture Contains: Index structure Binary search tree (BST) B-tree B+-tree Order file:///c /Documents%20and%20Settings/iitkrana1/My%20Documents/Google%20Talk%20Received%20Files/ist_data/lecture13/13_1.htm[6/14/2012

More information

Lecture 5: Suffix Trees

Lecture 5: Suffix Trees Longest Common Substring Problem Lecture 5: Suffix Trees Given a text T = GGAGCTTAGAACT and a string P = ATTCGCTTAGCCTA, how do we find the longest common substring between them? Here the longest common

More information

CIS265/ Trees Red-Black Trees. Some of the following material is from:

CIS265/ Trees Red-Black Trees. Some of the following material is from: CIS265/506 2-3-4 Trees Red-Black Trees Some of the following material is from: Data Structures for Java William H. Ford William R. Topp ISBN 0-13-047724-9 Chapter 27 Balanced Search Trees Bret Ford 2005,

More information

A Suffix Tree Construction Algorithm for DNA Sequences

A Suffix Tree Construction Algorithm for DNA Sequences A Suffix Tree Construction Algorithm for DNA Sequences Hongwei Huo School of Computer Science and Technol Xidian University Xi 'an 710071, China Vojislav Stojkovic Computer Science Department Morgan State

More information

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey G. Shivaprasad, N. V. Subbareddy and U. Dinesh Acharya

More information

Figure 1. The Suffix Trie Representing "BANANAS".

Figure 1. The Suffix Trie Representing BANANAS. The problem Fast String Searching With Suffix Trees: Tutorial by Mark Nelson http://marknelson.us/1996/08/01/suffix-trees/ Matching string sequences is a problem that computer programmers face on a regular

More information

Indexing Variable Length Substrings for Exact and Approximate Matching

Indexing Variable Length Substrings for Exact and Approximate Matching Indexing Variable Length Substrings for Exact and Approximate Matching Gonzalo Navarro 1, and Leena Salmela 2 1 Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl 2 Department of

More information

Accelerating Protein Classification Using Suffix Trees

Accelerating Protein Classification Using Suffix Trees From: ISMB-00 Proceedings. Copyright 2000, AAAI (www.aaai.org). All rights reserved. Accelerating Protein Classification Using Suffix Trees Bogdan Dorohonceanu and C.G. Nevill-Manning Computer Science

More information

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT IADIS International Conference Applied Computing 2006 USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT Divya R. Singh Software Engineer Microsoft Corporation, Redmond, WA 98052, USA Abdullah

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Memory Management (2)

Memory Management (2) EECS 3221.3 Operating System Fundamentals No.9 Memory Management (2) Prof. Hui Jiang Dept of Electrical Engineering and Computer Science, York University Memory Management Approaches Contiguous Memory

More information

Question 13 1: (Solution, p 4) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate.

Question 13 1: (Solution, p 4) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate. Questions 1 Question 13 1: (Solution, p ) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate. Question 13 : (Solution, p ) In implementing HYMN s control unit, the fetch cycle

More information

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi. Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data

More information

Chapter 12: Indexing and Hashing (Cnt(

Chapter 12: Indexing and Hashing (Cnt( Chapter 12: Indexing and Hashing (Cnt( Cnt.) Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition

More information
