Study of Data Localities in Suffix-Tree Based Genetic Algorithms
Carl I. Bergenhem, Michael T. Smith

Abstract. This paper studies the cache locality of two genetic algorithms built on the Suffix Tree data structure, and describes the cache performance of the Suffix Tree itself.

Keywords. Suffix Tree, SimpleScalar, REPuter, Probe Selection Problem Algorithm, Cache Aware

1. Introduction

Suffix Trees are a well-known data structure for algorithms that require string comparisons. A Suffix Tree can be used for problems such as suffix matching, substring matching, index-at, longest common substring, and genome-related applications such as string merging. A Suffix Tree can solve most of these problems in O(m) time, where m is the length of the query substring. It is the defining structure of the Suffix Tree that enables this quick search time.

One of the most basic implementations of this structure is a Suffix Trie. This implementation starts by defining a root node and then attaches to it the full suffix of length n that begins at the first character of the input string (n is the length of the input string). It then adds the suffix of length n-1, obtained by removing a character from the beginning of the string. The algorithm proceeds through the entire string until the suffix consisting only of the terminating character $ has been added, which yields a complete Suffix Trie.

Fig 1. Suffix Trie Generated from Cocoa
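The construction just described can be sketched in C roughly as follows. This is a minimal illustration and not the authors' implementation: the names TrieNode, build_suffix_trie, find_occurrences, and free_trie are hypothetical, the node layout (one child pointer per byte value) is chosen for simplicity rather than memory efficiency, and the input is assumed to end with a single $ terminator. A small lookup helper is included so the result of the construction can be checked.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ALPHABET 256 /* one child slot per byte value: simple, not compact */

/* Hypothetical node layout for an uncompressed suffix trie. */
typedef struct TrieNode {
    struct TrieNode *child[ALPHABET];
    int suffix_start; /* start index of the suffix ending here, or -1 */
} TrieNode;

static TrieNode *new_node(void) {
    TrieNode *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->suffix_start = -1;
    return n;
}

/* Insert every suffix of text, longest first. O(n^2) time and space. */
TrieNode *build_suffix_trie(const char *text) {
    size_t n = strlen(text);
    TrieNode *root = new_node();
    for (size_t i = 0; i < n; i++) {
        TrieNode *cur = root;
        for (size_t j = i; j < n; j++) {
            unsigned char c = (unsigned char)text[j];
            if (cur->child[c] == NULL)
                cur->child[c] = new_node(); /* no reusable node: create one */
            cur = cur->child[c];
        }
        cur->suffix_start = (int)i; /* leaf reached through the terminator */
    }
    return root;
}

/* Gather the start index stored at every leaf below node. */
static int collect(const TrieNode *node, int *out, int count, int max) {
    if (node->suffix_start >= 0 && count < max)
        out[count++] = node->suffix_start;
    for (int c = 0; c < ALPHABET; c++)
        if (node->child[c] != NULL)
            count = collect(node->child[c], out, count, max);
    return count;
}

/* Walk one character per step from the root; return the number of
 * occurrences of pattern written to out (at most max). */
int find_occurrences(const TrieNode *root, const char *pattern,
                     int *out, int max) {
    const TrieNode *cur = root;
    for (const char *p = pattern; *p != '\0'; p++) {
        cur = cur->child[(unsigned char)*p];
        if (cur == NULL)
            return 0; /* mismatch: pattern is not a substring */
    }
    return collect(cur, out, 0, max);
}

void free_trie(TrieNode *node) {
    if (node == NULL) return;
    for (int c = 0; c < ALPHABET; c++)
        free_trie(node->child[c]);
    free(node);
}
```

For the trie built from "cocoa$", looking up "co" with this sketch reports two occurrences, at indexes 2 and 0 (children are visited in byte order, so the order of results is not sorted by position).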
This particular algorithm gives each character its own node, unless a previously attached node can be reused for the suffix currently being attached to the Trie. To find a matching substring within this structure, one starts from the root node and matches the first character of the query substring against the children of the root. When a match is found, the second character of the query is matched against the children of the node that matched the previous character, and so on until either a full match has been found or a mismatch occurs. When a full match has occurred, one traverses the subtree rooted at the last matching node until all of its leaf nodes (nodes associated with the terminating character $) have been found. These leaf nodes contain the index at which their suffix starts in the original string.

This implementation has the O(m) search time desired for a Suffix Trie; however, building it can take as long as O(n^2) time. Its memory efficiency is also very low, with a worst-case space requirement of O(n^2).

A more efficient version is the Compressed Suffix Tree (alternatively: Suffix Tree). This implementation removes the redundancy of the Suffix Trie and is more efficient in both runtime and space. The most obvious difference between the compressed and uncompressed structures is the number of nodes.

Fig 2. Compressed Suffix Tree generated from Cocoa

Within the compressed structure each node has a label whose length can be anywhere from 1 to n, where n is the size of the input string.
To build such labels during construction, the algorithm searches the labels of the relevant nodes in the current, incomplete tree until a partial or full match is found, or until it is clear that no match exists. When a partial match is found, a node is created that separates the characters the existing branch shares with the current suffix from the characters that do not match. The current suffix can then reuse the matched characters and attach only its remaining characters below this new node. This reduces the space requirement from O(n^2) to O(n); the construction time also drops to O(n) when a linear-time construction algorithm is used.

The Compressed Suffix Tree was not perfected until Esko Ukkonen published his construction algorithm. Previous Suffix Tree constructions were not online algorithms; in other words, they had to know the entire input before the
construction could start. With Ukkonen's algorithm, the Suffix Tree can be built character by character, and the input can be read from left to right (previous versions processed the input backwards).

Even though the Suffix Tree has reached these theoretical run times and space requirements, real-world behaviour is a separate issue. In practice, running time can be far worse than these estimates for several reasons. The one we focus on is poor cache performance. Common to all Suffix Tree implementations is that nodes are stored in memory in the order in which they are created as the tree is generated. That order generally differs from the order in which a search visits them, so there is a low probability that every node traversed during a search produces a cache hit.

Fig 3. Example hits and misses throughout a simple search

As seen in Figure 3, a search can incur a large number of misses (assuming the cache block size is large enough to hold a single node) if the nodes on the search path are scattered across memory. For each miss, the CPU spends a number of cycles fetching the data from a slower level of memory, which delays the instructions that depend on that cache block. This, along with other factors, can result in actual runtimes far worse than the theoretically computed ones.

To bring the actual runtime closer to the theoretical values, one can modify algorithms to become either cache aware or cache oblivious.
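The hit and miss behaviour illustrated in Figure 3 can be reproduced with a very small cache model. The sketch below is an assumption-laden illustration, not part of SimpleScalar: count_hits is a hypothetical helper that models a direct-mapped cache with num_lines lines of block_size bytes each, and it only looks at the starting address of each node access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Count hits for a sequence of byte addresses in a direct-mapped cache.
 * Simplification: each access touches only the block that contains the
 * given address, even if a real node would straddle two blocks. */
size_t count_hits(const uintptr_t *addrs, size_t n,
                  size_t num_lines, size_t block_size) {
    assert(num_lines > 0 && block_size > 0);
    uintptr_t *tags = malloc(num_lines * sizeof *tags);
    char *valid = calloc(num_lines, 1);
    assert(tags != NULL && valid != NULL);

    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        uintptr_t block = addrs[i] / block_size;   /* memory block number */
        size_t line = (size_t)(block % num_lines); /* its only possible line */
        if (valid[line] && tags[line] == block) {
            hits++;
        } else {
            valid[line] = 1; /* miss: load the block, evicting the occupant */
            tags[line] = block;
        }
    }
    free(tags);
    free(valid);
    return hits;
}
```

With 16 lines of 64 bytes, eight 20-byte nodes laid out back to back (addresses 0, 20, ..., 140) give 5 hits out of 8 accesses in this model, while eight nodes placed 4096 bytes apart give none, which is the contrast between a path stored contiguously and one scattered across memory.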
A cache-aware algorithm is tuned to the type of cache used by the system it runs on, and can thereby reduce the number of cache misses by
adhering to the specific system. A cache-oblivious algorithm keeps its runtime consistent with the theoretical values regardless of what kind of cache the host system uses.

2. Previous Work

Many researchers have examined how cache behaviour affects the performance of algorithms and data structures. As our work is based on two genetic algorithms and the supporting Suffix Tree data structure, the most relevant research concerns the memory performance of tree data structures and implementations of the Suffix Tree and of the genetic algorithms. Our work could not have been accomplished without the previous work of our referenced authors. The primary resources we used were guides to installing and using SimpleScalar and standard implementations of the Suffix Tree algorithm. Our future work will rely more on the cited theses dealing with cache-aware data structures.

3. Methodology

To observe and track the performance of our algorithms we used the SimpleScalar suite. The suite is a collection of programs that lets a user specify a CPU architecture and then simulate that architecture running code written in FORTRAN or C. This allows a user to write a program and measure how well it would perform on a specific architecture. For this research we used the program sim-cache. Sim-cache is a cache simulator that simulates the L1 instruction and data caches as well as the L2 cache. It also lets the user specify the level of associativity and the replacement algorithm to use on cache misses. To simulate the CPU with the given code, a cross-compiler in the Linux environment is needed to compile the code specifically for SimpleScalar.
Once the code is compiled, running it under sim-cache generates an output file that can be read with any Linux text editor. This output file contains detailed information such as the total number of cache references, misses, and hits. A guide to installing SimpleScalar, along with the commands needed to use sim-cache, is included in section 4.2.

3.1 REPuter Algorithm

The REPuter algorithm is used by genetic researchers to find maximal repeats in a given genomic sequence. A maximal repeat is any subsection of the genome that appears in multiple locations within the genome. A simple example is the string an within banana. A valid maximal repeat is any substring of the given
text that appears at least twice and has a length of at least a set threshold. For the above example to be valid, the length threshold would have to be set to 2.

A slightly larger example is as follows. Take the sequence banabana and a threshold of two. This sequence has several maximal repeats, since the conditions are that the substring appears at least twice and its length is at least the threshold. Thus bana, ana, an, and na each appear twice.

Applying this algorithm to a genome allows researchers to find recurring motifs within a DNA sequence. It can also become part of a larger algorithm for finding recurring sequences that have minor mutations. What makes this algorithm so powerful is the data structure it is implemented with, the Suffix Tree. The Suffix Tree allows all repeating substrings, and their locations, to be found efficiently. This means that the algorithm runs in time and space linear in the length of the genome sequence being operated on.

The algorithm operates as follows, starting from the root and proceeding through every node thereafter.

REPUTER(Node current node)
    If the current node is a leaf node
        Return 1 as a counter (marks an occurrence)
    If the current node is not a leaf
        Keep a sum starting at 0
        For each child node/path, add the result of calling REPUTER on the child node to the sum
        If the sum is at least 2 (the number of occurrences)
        and the length of the common string is at least the threshold
            A maximal repeat has been found
        Return the sum to keep a tally for the parent nodes

The above over-simplified algorithm will find all the maximal repeats; a few details have been left out for ease of understanding. What is clear from the description, though, is that the whole tree must be traversed.
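The recursion above can be sketched in C as follows. This is a simplified illustration under several assumptions rather than the REPuter implementation: it runs on a naive suffix trie instead of a compressed Suffix Tree, it reports only at branching nodes (which correspond to the internal nodes of the compressed tree), it counts repeats instead of listing them, and the names Node, build, reputer, and count_repeats are hypothetical. The input is assumed to end with a single $ terminator.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 256

/* Hypothetical trie node: one child slot per byte value. */
typedef struct Node {
    struct Node *child[ALPHABET];
} Node;

static Node *new_node(void) {
    Node *n = calloc(1, sizeof *n);
    assert(n != NULL);
    return n;
}

/* Naive trie of all suffixes of text. */
static Node *build(const char *text) {
    Node *root = new_node();
    for (size_t i = 0; text[i] != '\0'; i++) {
        Node *cur = root;
        for (size_t j = i; text[j] != '\0'; j++) {
            unsigned char c = (unsigned char)text[j];
            if (cur->child[c] == NULL)
                cur->child[c] = new_node();
            cur = cur->child[c];
        }
    }
    return root;
}

static void free_trie(Node *node) {
    if (node == NULL) return;
    for (int c = 0; c < ALPHABET; c++)
        free_trie(node->child[c]);
    free(node);
}

/* Returns the number of leaves (occurrences) below node. Increments
 * *repeats for each branching node whose string depth is at least
 * threshold and whose leaf count is at least 2. */
static int reputer(const Node *node, int depth, int threshold, int *repeats) {
    int sum = 0, kids = 0;
    for (int c = 0; c < ALPHABET; c++) {
        if (node->child[c] != NULL) {
            kids++;
            sum += reputer(node->child[c], depth + 1, threshold, repeats);
        }
    }
    if (kids == 0)
        return 1; /* leaf: marks one occurrence */
    if (kids >= 2 && sum >= 2 && depth >= threshold)
        (*repeats)++; /* a repeat of length depth has been found */
    return sum; /* tally for the parent node */
}

/* Count the repeats of length >= threshold reported by the sketch. */
int count_repeats(const char *text, int threshold) {
    Node *root = build(text);
    int repeats = 0;
    reputer(root, 0, threshold, &repeats);
    free_trie(root);
    return repeats;
}
```

For "banana$" with a threshold of 2, this sketch reports two branching nodes, ana and na; lowering the threshold to 1 adds the node for a.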
The operation at each node takes constant time: finding the length is a simple computation using the start and end indexes stored in the node, and no character comparisons need to be performed, since an internal node means all its children share that substring. That is the power the Suffix Tree offers the REPuter algorithm: the ability to find all maximal repeats while iterating a number of times linearly proportional to the size of the input sequence.

The problem encountered while trying to measure the cache hit ratio of the algorithm with SimpleScalar was the configuration of the sim-cache tool. Using a file with a sample sequence 1 million characters in length consisting of A, C, G, and T resulted in hit rates that were implausibly high. Building the tree itself had a hit rate in the upper 90s percent, and the REPuter algorithm running on top of it was only slightly lower. This should not be the case, because the Suffix Tree sprawls across many memory blocks: nodes are inserted out of order, meaning the last inserted node could be the first node below the root. We ran our tests with a 1-kilobyte first-level data cache and a tree size of around 20 MB (based on a 20-byte node).
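To put these sizes in proportion, a rough calculation helps. It assumes that 1 kilobyte means 1024 bytes and that 20 MB means 20 x 10^6 bytes (about one million 20-byte nodes); both are interpretations of the figures above rather than values stated by the configuration.

```latex
\[
\frac{1024\ \text{bytes}}{20\ \text{bytes/node}} \approx 51\ \text{nodes},
\qquad
\frac{1024\ \text{bytes}}{20 \times 10^{6}\ \text{bytes}}
  \approx 5.1 \times 10^{-5} \approx 0.005\%
\]
```

Under these assumptions the first-level data cache can hold roughly 51 nodes at a time, or about 0.005% of the tree, which is why hit rates in the upper 90s look suspicious for a traversal with little locality.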
When the tree is constructed, different paths of the tree are accessed in varying orders, which alone should yield a low hit rate since the algorithm has low spatial locality. The REPuter function should not perform much better, as a full tree traversal must be performed. In this traversal, each path consists of nodes in different memory blocks, all of which must be loaded for that one path to be evaluated. When the next path is traversed, different blocks must be loaded, or the same blocks in a different order, leading to constant replacements within the data cache.

3.2 Probe Selection Problem (PSP) Algorithm

To identify viruses that cause diseases and to control the quality of items in the food industry, DNA arrays are widely used for fast identification of biological agents present in a given sample. A large part of this is the selection of oligos to be attached to the array surface. Given a set A of genomic sequences, one has to find at least one oligonucleotide (probe) for each sequence S. This probe must be chosen so that it does not hybridize with any sequence other than the target. Also, all probes must hybridize to their specific targets under the same reaction conditions. The most important condition is the temperature T at which the experiment is conducted. The Probe Selection Problem Algorithm, using the Suffix Tree structure, allows the temperature T to be computed efficiently.

Before any aspect of the Suffix Tree could be modified, an understanding and an implementation were required. Initially a simple program was written in Java. Through a graphical interface, this program allowed a user to load the string for the suffix tree from a text file.
It also allowed a search to be performed on the tree, outputting all occurrences of the substring within the original string. Another feature generated a random string of length L consisting only of A, T, C, and G, along with a substring of length K over the same letters, in order to randomize the experiments. Unfortunately, the later use of SimpleScalar forced the use of the C language.

The installation of SimpleScalar generated another string of problems. We discovered that the latest cross-compiler designed for SimpleScalar was severely outdated, and thus an old version of Linux was required in order to configure SimpleScalar. Once set up on Red Hat Linux 9, SimpleScalar was configured and sim-cache was tested on a simple program. Once an implementation of the Suffix Tree was written, it was run through the sim-cache utility with cache sizes ranging from kilobytes, along with 1-8 way associativity. All CPU configurations had direct-mapping as the replacement structure.

The results, however, were not what we expected. According to the output files generated by sim-cache, the hit rate of the Suffix Tree implementation ranged from 97% to 99% during creation and between 95% and 97% for a search. As seen in the previous example, the expected hit rate when searching for a substring within the suffix tree is around 50% or 60%. To confirm that the SimpleScalar suite was working correctly, a simple program was designed that generated a two-dimensional array and filled each entry with a number. Then a
traversal of the array was done both row-wise and column-wise. These two forms of traversal should have yielded a large difference in the hit ratio. In the row-wise traversal, the next index in the array is most likely in the same cache block or the one following it, making the miss ratio fairly small. In the column-wise traversal, however, there should be a cache miss for almost every index that is traversed.

4. Conclusion

Although our theoretical computations predicted a hit ratio of around 50%, the implementations of the Suffix Tree had hit ratios of around 98% when run through sim-cache. Even the check program, a simple two-dimensional array traversed row-wise as well as column-wise, produced high hit ratios. This was especially surprising for the column-wise traversal, which theoretically should have a lower hit rate than the row-wise traversal. This hints that there is an issue with the SimpleScalar suite. Whether the issue lies in our usage of sim-cache or in sim-cache itself remains to be investigated. The fact that both programs yielded much higher results than expected is at least consistent, and thus a claim can still be made that the implementations of the Suffix Tree data structure are correct and can be used for future Suffix Tree research.

4.1 Future Work

The value of the work presented in this paper is that it will serve as a launch pad for exploring modifications to the algorithms and their impact on runtime. As the procedures and commands have now been documented on more up-to-date systems, the SimpleScalar suite can be used easily and effectively to monitor cache performance, along with the many other tools it offers. A last hurdle is understanding why the sim-cache simulator yielded such high hit rates when they were expected to be much lower.
However, once that is resolved, serious modifications and improvements can be made to the Suffix Tree creation algorithm and the two genetic algorithms, decreasing actual runtime and increasing productivity for the researchers who rely on these tools. One of the larger modifications that could be made to the Suffix Tree is a redesign of how the tree is allocated in memory. Although nodes are created out of order compared to the way they may later be accessed, a simple mechanism that allocates related nodes (parents and children) to the same block of memory in constant time would dramatically improve the performance of tree traversals.

4.2 SimpleScalar Installation Guide

As a major complication arose in understanding the use of the SimpleScalar toolset, the following is a brief guide on how to use the rather unmaintained SimpleScalar
simulator package. The following was tested on an Ubuntu 7.04 system. The source files and 'installer' were created by Cameron Palmer and are hosted by csrl.unt.edu. From a terminal window run the following commands:

    sudo apt-get install subversion
    svn co
    sudo apt-get install bison
    sudo apt-get install g gcc-3.3
    cd simplescalar/
    sudo sh simpleinstaller-little.sh

The above takes care of the installation. To compile your programs and run them, the following two commands, run from the simplescalar/ directory, will work.

    bin/sslittle-na-sstrix-gcc -o simple_program simple_program.c
    simplesim-3.0/sim-cache simple_program

5. Acknowledgements

This project was supported in part by the National Science Foundation Grant CCF , and was supervised by Professor Chun-Hsi Huang.

6. References

1. SimpleScalar 3.0 BBSWiki
2. Data Structures, Algorithms, & Applications in Java Suffix Trees,
3. Thomas B. Puzak, B.S.: The Effects of Spatial Locality on the Cache Performance of Binary Search Trees, MS Thesis, University of Connecticut Department of Computer Science
4. Stefan Kurtz, Chris Schleiermacher: REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics Applications. Vol 15,
5. ANSI C implementation of a Suffix Tree,
6. SimpleScalar LLC,
7. SimpleScalar Evolved: Archived Mail, August/ html
8. Growing A Suffix Tree, 98/albert/JAVA+html/SuffixTreeGrow.html
9. Fast String Searching with Suffix Trees,
10. dynamicsimplescalar,
11. Suffix Tree,
Much that I bound, I could not free; Much that I freed returned to me. Lee Wilson Dodd Will you walk a little faster? said a whiting to a snail, There s a porpoise close behind us, and he s treading on
More informationSpecial course in Computer Science: Advanced Text Algorithms
Special course in Computer Science: Advanced Text Algorithms Lecture 5: Suffix trees and their applications Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory
More informationDatabase System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use
Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files Static
More informationParallel Distributed Memory String Indexes
Parallel Distributed Memory String Indexes Efficient Construction and Querying Patrick Flick & Srinivas Aluru Computational Science and Engineering Georgia Institute of Technology 1 In this talk Overview
More information19 Much that I bound, I could not free; Much that I freed returned to me. Lee Wilson Dodd
19 Much that I bound, I could not free; Much that I freed returned to me. Lee Wilson Dodd Will you walk a little faster? said a whiting to a snail, There s a porpoise close behind us, and he s treading
More information(for more info see:
Genome assembly (for more info see: http://www.cbcb.umd.edu/research/assembly_primer.shtml) Introduction Sequencing technologies can only "read" short fragments from a genome. Reconstructing the entire
More informationPAPER Constructing the Suffix Tree of a Tree with a Large Alphabet
IEICE TRANS. FUNDAMENTALS, VOL.E8??, NO. JANUARY 999 PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet Tetsuo SHIBUYA, SUMMARY The problem of constructing the suffix tree of a tree is
More informationTREES. Trees - Introduction
TREES Chapter 6 Trees - Introduction All previous data organizations we've studied are linear each element can have only one predecessor and successor Accessing all elements in a linear sequence is O(n)
More informationChapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,
Introduction Chapter 5 Hashing hashing performs basic operations, such as insertion, deletion, and finds in average time 2 Hashing a hash table is merely an of some fixed size hashing converts into locations
More informationBasic Compression Library
Basic Compression Library Manual API version 1.2 July 22, 2006 c 2003-2006 Marcus Geelnard Summary This document describes the algorithms used in the Basic Compression Library, and how to use the library
More informationBinary Search Tree (3A) Young Won Lim 6/2/18
Binary Search Tree (A) /2/1 Copyright (c) 2015-201 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2
More informationFriday Four Square! 4:15PM, Outside Gates
Binary Search Trees Friday Four Square! 4:15PM, Outside Gates Implementing Set On Monday and Wednesday, we saw how to implement the Map and Lexicon, respectively. Let's now turn our attention to the Set.
More informationOverview of Presentation. Heapsort. Heap Properties. What is Heap? Building a Heap. Two Basic Procedure on Heap
Heapsort Submitted by : Hardik Parikh(hjp0608) Soujanya Soni (sxs3298) Overview of Presentation Heap Definition. Adding a Node. Removing a Node. Array Implementation. Analysis What is Heap? A Heap is a
More informationsplitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014
splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014 Outline 1 Overview 2 Data Structures 3 splitmem Algorithm 4 Pan-genome Analysis Objective Input! Output! A B C D Several
More informationN N Sudoku Solver. Sequential and Parallel Computing
N N Sudoku Solver Sequential and Parallel Computing Abdulaziz Aljohani Computer Science. Rochester Institute of Technology, RIT Rochester, United States aaa4020@rit.edu Abstract 'Sudoku' is a logic-based
More informationTHE WEB SEARCH ENGINE
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com
More informationReport Seminar Algorithm Engineering
Report Seminar Algorithm Engineering G. S. Brodal, R. Fagerberg, K. Vinther: Engineering a Cache-Oblivious Sorting Algorithm Iftikhar Ahmad Chair of Algorithm and Complexity Department of Computer Science
More informationSymbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management
Hashing Symbol Table Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management In general, the following operations are performed on
More informationChapter 8 & Chapter 9 Main Memory & Virtual Memory
Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array
More informationComputer Science 210 Data Structures Siena College Fall Topic Notes: Trees
Computer Science 0 Data Structures Siena College Fall 08 Topic Notes: Trees We ve spent a lot of time looking at a variety of structures where there is a natural linear ordering of the elements in arrays,
More informationPhysical Level of Databases: B+-Trees
Physical Level of Databases: B+-Trees Adnan YAZICI Computer Engineering Department METU (Fall 2005) 1 B + -Tree Index Files l Disadvantage of indexed-sequential files: performance degrades as file grows,
More informationCS229 Lecture notes. Raphael John Lamarre Townshend
CS229 Lecture notes Raphael John Lamarre Townshend Decision Trees We now turn our attention to decision trees, a simple yet flexible class of algorithms. We will first consider the non-linear, region-based
More informationSuffix trees and applications. String Algorithms
Suffix trees and applications String Algorithms Tries a trie is a data structure for storing and retrieval of strings. Tries a trie is a data structure for storing and retrieval of strings. x 1 = a b x
More informationOrganizing Spatial Data
Organizing Spatial Data Spatial data records include a sense of location as an attribute. Typically location is represented by coordinate data (in 2D or 3D). 1 If we are to search spatial data using the
More informationTrees. Courtesy to Goodrich, Tamassia and Olga Veksler
Lecture 12: BT Trees Courtesy to Goodrich, Tamassia and Olga Veksler Instructor: Yuzhen Xie Outline B-tree Special case of multiway search trees used when data must be stored on the disk, i.e. too large
More informationCombinatorial Pattern Matching. CS 466 Saurabh Sinha
Combinatorial Pattern Matching CS 466 Saurabh Sinha Genomic Repeats Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Genomic rearrangements are often associated with repeats Trace evolutionary
More informationProject Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio
Project Proposal ECE 526 Spring 2006 Modified Data Structure of Aho-Corasick Benfano Soewito, Ed Flanigan and John Pangrazio 1. Introduction The internet becomes the most important tool in this decade
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:
More informationPractical methods for constructing suffix trees
The VLDB Journal (25) 14(3): 281 299 DOI 1.17/s778-5-154-8 REGULAR PAPER Yuanyuan Tian Sandeep Tata Richard A. Hankins Jignesh M. Patel Practical methods for constructing suffix trees Received: 14 October
More informationDDS Dynamic Search Trees
DDS Dynamic Search Trees 1 Data structures l A data structure models some abstract object. It implements a number of operations on this object, which usually can be classified into l creation and deletion
More informationIntroduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far
Chapter 5 Hashing 2 Introduction hashing performs basic operations, such as insertion, deletion, and finds in average time better than other ADTs we ve seen so far 3 Hashing a hash table is merely an hashing
More informationProperties of red-black trees
Red-Black Trees Introduction We have seen that a binary search tree is a useful tool. I.e., if its height is h, then we can implement any basic operation on it in O(h) units of time. The problem: given
More informationExact String Matching Part II. Suffix Trees See Gusfield, Chapter 5
Exact String Matching Part II Suffix Trees See Gusfield, Chapter 5 Outline for Today What are suffix trees Application to exact matching Building a suffix tree in linear time, part I: Ukkonen s algorithm
More informationCSE 530A. B+ Trees. Washington University Fall 2013
CSE 530A B+ Trees Washington University Fall 2013 B Trees A B tree is an ordered (non-binary) tree where the internal nodes can have a varying number of child nodes (within some range) B Trees When a key
More informationText Compression through Huffman Coding. Terminology
Text Compression through Huffman Coding Huffman codes represent a very effective technique for compressing data; they usually produce savings between 20% 90% Preliminary example We are given a 100,000-character
More informationChapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join
More informationADAPTATION OF REPRESENTATION IN GP
1 ADAPTATION OF REPRESENTATION IN GP CEZARY Z. JANIKOW University of Missouri St. Louis Department of Mathematics and Computer Science St Louis, Missouri RAHUL A DESHPANDE University of Missouri St. Louis
More informationUniversity of Waterloo CS240R Fall 2017 Review Problems
University of Waterloo CS240R Fall 2017 Review Problems Reminder: Final on Tuesday, December 12 2017 Note: This is a sample of problems designed to help prepare for the final exam. These problems do not
More informationIMPROVING A GREEDY DNA MOTIF SEARCH USING A MULTIPLE GENOMIC SELF-ADAPTATING GENETIC ALGORITHM
Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 4th, 2007 IMPROVING A GREEDY DNA MOTIF SEARCH USING A MULTIPLE GENOMIC SELF-ADAPTATING GENETIC ALGORITHM Michael L. Gargano, mgargano@pace.edu
More informationCSE 214 Computer Science II Introduction to Tree
CSE 214 Computer Science II Introduction to Tree Fall 2017 Stony Brook University Instructor: Shebuti Rayana shebuti.rayana@stonybrook.edu http://www3.cs.stonybrook.edu/~cse214/sec02/ Tree Tree is a non-linear
More informationEfficient Non-Sequential Access and More Ordering Choices in a Search Tree
Efficient Non-Sequential Access and More Ordering Choices in a Search Tree Lubomir Stanchev Computer Science Department Indiana University - Purdue University Fort Wayne Fort Wayne, IN, USA stanchel@ipfw.edu
More informationComputer Science 136 Spring 2004 Professor Bruce. Final Examination May 19, 2004
Computer Science 136 Spring 2004 Professor Bruce Final Examination May 19, 2004 Question Points Score 1 10 2 8 3 15 4 12 5 12 6 8 7 10 TOTAL 65 Your name (Please print) I have neither given nor received
More informationTradeoff between coverage of a Markov prefetcher and memory bandwidth usage
Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end
More informationCS301 - Data Structures Glossary By
CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm
More informationChapter 11: Indexing and Hashing
Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree
More informationBioinformatics I, WS 09-10, D. Huson, February 10,
Bioinformatics I, WS 09-10, D. Huson, February 10, 2010 189 12 More on Suffix Trees This week we study the following material: WOTD-algorithm MUMs finding repeats using suffix trees 12.1 The WOTD Algorithm
More informationComputer Caches. Lab 1. Caching
Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main
More informationB-Trees. Introduction. Definitions
1 of 10 B-Trees Introduction A B-tree is a specialized multiway tree designed especially for use on disk. In a B-tree each node may contain a large number of keys. The number of subtrees of each node,
More informationSuffix Tree and Array
Suffix Tree and rray 1 Things To Study So far we learned how to find approximate matches the alignments. nd they are difficult. Finding exact matches are much easier. Suffix tree and array are two data
More informationCSCI-401 Examlet #5. Name: Class: Date: True/False Indicate whether the sentence or statement is true or false.
Name: Class: Date: CSCI-401 Examlet #5 True/False Indicate whether the sentence or statement is true or false. 1. The root node of the standard binary tree can be drawn anywhere in the tree diagram. 2.
More information