splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014
1 splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014
2 Outline 1 Overview 2 Data Structures 3 splitmem Algorithm 4 Pan-genome Analysis
3 Objective Input: several complete genomes (A, B, C, D). Available today for many microbial species, and in the near future for higher eukaryotes. Pan-genome: analyze multiple genomes of a species together. Output: compressed de Bruijn graph. Graphical representation depicts how population variants relate to each other, especially where they diverge at branch points. How well conserved is a sequence? What are the network properties?
4 de Bruijn graph Node for each distinct kmer Directed edge connects consecutive kmers Nodes overlap by k-1 bp Self-loops, multi-edges AGAAGTCC ATAAGTTA Reconstruct original sequence: Eulerian path through graph, visit each edge once
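A minimal sketch of the construction described on this slide, using the two example strings shown there. The function name `de_bruijn_graph` and the choice k=3 are assumptions made for illustration; successor sets are used, so multi-edges are collapsed rather than counted.

```python
from collections import defaultdict

def de_bruijn_graph(sequences, k):
    """Map each distinct k-mer to the set of k-mers that follow it in some sequence."""
    edges = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[kmer]  # touching the key creates the node even if it has no successor
            if i + k < len(seq):
                # consecutive k-mers overlap by k-1 characters
                edges[kmer].add(seq[i + 1:i + k + 1])
    return dict(edges)

graph = de_bruijn_graph(["AGAAGTCC", "ATAAGTTA"], k=3)
for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
```

In this toy graph the two sequences share the k-mers AAG and AGT and diverge after AGT, which is the kind of branch point the pan-genome graph is meant to expose.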
5 Compressed de Bruijn graph Merge non-branching chains of nodes. Minimum number of nodes that preserve path labels. Usually built from the uncompressed graph. We build directly in O(n log n) time and space.
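The sketch below illustrates the usual route mentioned on this slide: build the uncompressed graph, then merge non-branching chains. It is not the direct splitMEM construction. The function name, k=3, and the decision to ignore isolated cycles are assumptions of this example.

```python
from collections import defaultdict

def compressed_de_bruijn(sequences, k):
    """Return the labels of the compressed nodes (merged non-branching chains)."""
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            a = seq[i:i + k]
            nodes.add(a)
            if i + k < len(seq):
                b = seq[i + 1:i + k + 1]
                succ[a].add(b)
                pred[b].add(a)

    def mergeable(a, b):
        # a -> b is the only edge leaving a and the only edge entering b
        return len(succ[a]) == 1 and len(pred[b]) == 1 and a != b

    labels = []
    for start in sorted(nodes):
        # skip k-mers that get absorbed into their unique predecessor's chain
        if len(pred[start]) == 1:
            (p,) = pred[start]
            if mergeable(p, start):
                continue
        label, cur = start, start
        while len(succ[cur]) == 1:
            (nxt,) = succ[cur]
            if not mergeable(cur, nxt):
                break
            label += nxt[-1]  # each merged k-mer contributes one new character
            cur = nxt
        labels.append(label)
    return labels  # note: chains that form an isolated cycle are not reported

print(compressed_de_bruijn(["AGAAGTCC", "ATAAGTTA"], k=3))
```

For the two example strings the ten 3-mers collapse into five compressed nodes, each still overlapping its neighbours by k-1 characters.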
6 Compressed de Bruijn graph 9 strains of Bacillus anthracis k=25
7 Compressed de Bruijn graph 9 strains of Bacillus anthracis k=1000
8 Outline 1 Overview 2 Data Structures 3 splitmem Algorithm 4 Pan-genome Analysis
9 Suffix Tree Rooted, directed tree with leaf for each suffix. Each internal node, except the root, has at least two children. Each edge is labeled with nonempty substring. No two siblings begin with the same character. Path from root to leaf i spells suffix S[i... n]. Append special character $ to guarantee each suffix ends at leaf.
10 Constructing Suffix Tree Naïve Algorithm S=banana$ Insert suf 1: banana$
11 Constructing Suffix Tree Naïve Algorithm S=banana$ Insert suf 2: anana$
12 Constructing Suffix Tree Naïve Algorithm S=banana$ Insert suf 3: nana$
13 Constructing Suffix Tree Naïve Algorithm S=banana$ Insert suf 4: ana$
14 Constructing Suffix Tree Naïve Algorithm S=banana$ Tree after inserting suf 4: ana$
15 Constructing Suffix Tree Naïve Algorithm O(n^2) time S=banana$ Suffixes: suf 1 banana$, suf 2 anana$, suf 3 nana$, suf 4 ana$, suf 5 na$, suf 6 a$, suf 7 $
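The naive insertion procedure animated on these slides can be sketched as follows. This is an illustrative quadratic-time version, not Ukkonen's algorithm; the class and function names are assumptions of this example, and the input is assumed to end with a unique terminator `$`.

```python
class Node:
    def __init__(self):
        self.children = {}        # first character -> (edge label, child Node)
        self.suffix_index = None  # 1-based start position, set on leaves only

def build_suffix_tree(s):
    """Naive O(n^2) construction: insert suffixes one by one, splitting edges on mismatch."""
    if not s.endswith("$") or s.count("$") != 1:
        raise ValueError("expected a single terminating '$'")
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.suffix_index = i + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            j = 0  # length of the common prefix of the edge label and the remaining suffix
            while j < len(label) and j < len(rest) and label[j] == rest[j]:
                j += 1
            if j < len(label):
                # mismatch inside the edge: split it at position j
                mid = Node()
                mid.children[label[j]] = (label[j:], child)
                node.children[first] = (label[:j], mid)
                child = mid
            node, rest = child, rest[j:]
    return root

def internal_labels(node, prefix=""):
    """Path labels of all internal nodes below `node` (the root itself is excluded)."""
    out = []
    for label, child in node.children.values():
        if child.children:
            out.append(prefix + label)
            out += internal_labels(child, prefix + label)
    return out

tree = build_suffix_tree("banana$")
print(sorted(internal_labels(tree)))
```

For S=banana$ the internal nodes spell a, ana and na, which are the candidate MEM nodes examined later in the deck.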
16 Constructing Suffix Tree O(n) time: On-line Construction of Suffix Trees, E. Ukkonen, Algorithmica (1995). Suffix links: ana, na, a
17 Suffix Tree Query S=banana$ Search for ban
20 Suffix Tree Query S=banana$ Search for ban: found 1 occurrence
21 Suffix Tree Query S=banana$ Search for band: not found
22 Suffix Tree Query S=banana$ Search for an: found 2 occurrences
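The three queries above can be reproduced with a small stand-in structure. This sketch uses an uncompressed suffix trie (quadratic space) rather than a true suffix tree, purely to keep the example short; the function names and the per-node counter are assumptions of this example.

```python
def build_suffix_trie(s):
    """Insert every suffix character by character; each node counts the suffixes passing through it."""
    root = {"count": 0, "next": {}}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node["next"].setdefault(ch, {"count": 0, "next": {}})
            node["count"] += 1
    return root

def count_occurrences(trie, pattern):
    """Walk down from the root; the count at the final node is the number of occurrences."""
    node = trie
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return 0  # fell off the trie: pattern does not occur
    return node["count"]

trie = build_suffix_trie("banana$")
for pattern in ("ban", "band", "an"):
    print(pattern, count_occurrences(trie, pattern))
```

The query cost depends only on the pattern length, which is the property the suffix tree slides rely on; a real suffix tree gives the same answers in linear space.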
23 Suffix Tree Many applications in computational biology. Linear-time construction algorithms. Linear-time solutions to: genome alignment, finding the longest common substring, all-pairs suffix-prefix matching, locating all maximal repetitions, and many more.
24 MEMs Maximal Exact Match (MEM): exact match within a sequence that cannot be extended left or right without introducing a mismatch. Example: TGCACGCAA. We are interested in MEMs of length ≥ k.
25 MEMs Maximal Exact Match (MEM): exact match within a sequence that cannot be extended left or right without introducing a mismatch. MEMs are internal nodes in the suffix tree that have left-diverse descendants (descendant leaves that represent suffixes with different characters preceding them). Linear-time suffix tree traversal to locate MEMs.
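To make the definition concrete, here is a brute-force check that avoids the suffix tree altogether. It is far slower than the linear-time traversal mentioned above and is only meant to show the two conditions: a repeated substring must be followed by different characters (it would be an internal node) and preceded by different characters (left-diverse). The function name and the parameter `k` as a minimum length are assumptions of this example.

```python
from collections import defaultdict

def maximal_repeats(s, k=1):
    """Substrings of length >= k that occur at least twice and are both left- and right-diverse."""
    n = len(s)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + k, n + 1):
            occurrences[s[i:j]].append(i)
    result = []
    for sub, starts in occurrences.items():
        if len(starts) < 2:
            continue
        # characters just before and just after each occurrence ("" at the string boundary)
        left = {s[i - 1] if i > 0 else "" for i in starts}
        right = {s[i + len(sub)] if i + len(sub) < n else "" for i in starts}
        if len(left) > 1 and len(right) > 1:
            result.append(sub)
    return sorted(result)

print(maximal_repeats("banana$"))
print(maximal_repeats("banana$", k=2))
```

For banana$ the substring na is repeated but always preceded by a, so it fails the left-diversity test, while a and ana pass.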
26 MEMs in Suffix Tree S=banana$ MEMs are internal nodes in the suffix tree with left-diverse descendants. Possible MEMs (internal nodes): a, ana, na. Suffixes: suf 1 banana$, suf 2 anana$, suf 3 nana$, suf 4 ana$, suf 5 na$, suf 6 a$, suf 7 $
27 MEMs in Suffix Tree S=banana$ Node ana: its leaves suf 2 and suf 4 are preceded by different characters (banana$, nana$), so ana is left-diverse and is a MEM. MEMs: a, ana
28 MEMs in Suffix Tree S=banana$ Node na: its leaves suf 3 and suf 5 are preceded by the same character (anana$, ana$), so na is not left-diverse and is not a MEM. MEMs: a, ana
29 Outline 1 Overview 2 Data Structures 3 splitmem Algorithm 4 Pan-genome Analysis
30 Compressed de Bruijn graph Types of nodes: i. repeatNodes ii. uniqueNodes Input: AGAAGTCC$ATAAGTTA
31 splitMEM Nodes in the compressed de Bruijn graph are classified as i. repeatNodes ii. uniqueNodes. Algorithm: 1. Construct the set of repeatNodes 2. Sort the start positions of the repeatNodes 3. Create edges and uniqueNodes to link noncontiguous repeatNodes
32 repeatNodes 1 Construct set of repeatNodes 1. Build suffix tree of genome 2. Mark internal nodes that are MEMs of length ≥ k 3. Preprocess suffix tree for LMA queries 4. Compute repeatNodes in the compressed de Bruijn graph by decomposing MEMs and extracting overlapping components of length ≥ k
33 1 MEM occurs twice TGCAC GGCAA GCA
34 Overlapping MEMs TGCCATCGCCAACCAT
35 Tandem Repeat AGGCTTGGCTTGGCTTGGCTA
36 repeatNodes 1 Construct set of repeatNodes 1. Build suffix tree of genome 2. Mark internal nodes that are MEMs of length ≥ k 3. Preprocess suffix tree for LMA queries 4. Compute repeatNodes in the compressed de Bruijn graph by decomposing MEMs and extracting overlapping components of length ≥ k
37 Split MEM to repeatNodes [figure: suffix tree for the sequences xxyzαβ, yxyzαβ, uαγ, with MEM nodes marked]
38 Split MEM to repeatNodes Find MEM in suffix tree.
39 Split MEM to repeatNodes Traverse suffix link. Look for MEM as ancestor.
42 Split MEM to repeatNodes Found MEM as ancestor. Decompose. Remove embedded MEM (suffix links). Find next embedded MEM.
43 Suffix Skips Reduce O(n^2) time to O(n log n) time. Suffix link: quickly navigate to a distant part of the tree. Pointer from the internal node labeled xS to the node labeled S. Trim 1 character in O(1) time; trim c characters in O(c) time. Suffix skip: trim c characters in O(log c) time.
44 Suffix Skips Genome: babab Suffix skips 0 (dist=1; suffix links), suffix skips 1 (dist=2), suffix skips 2 (dist=4). Additional preprocessing: pointer jumping to rapidly add additional links.
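The pointer-jumping idea can be sketched as a doubling table over suffix links. This is an illustration under simplifying assumptions, not the splitMEM implementation: nodes are represented by their path labels, the suffix links are those of the internal nodes of the suffix tree of babab$ worked out by hand, the root links to itself, and the function names are hypothetical.

```python
def build_suffix_skips(suffix_link):
    """Level j maps each node to the node reached by following 2**j suffix links."""
    skips = [dict(suffix_link)]  # level 0 is the ordinary suffix link (dist = 1)
    levels = max(1, len(suffix_link).bit_length())
    for _ in range(1, levels):
        prev = skips[-1]
        # pointer jumping: two jumps of the previous distance make one jump of double distance
        skips.append({v: prev[prev[v]] for v in prev})
    return skips

def trim(skips, node, c):
    """Trim c leading characters using one jump per set bit of c, i.e. O(log c) jumps."""
    c = min(c, len(skips[0]))  # beyond the root nothing changes (root links to itself)
    level = 0
    while c:
        if c & 1:
            node = skips[level][node]
        c >>= 1
        level += 1
    return node

# internal nodes of the suffix tree of babab$, keyed by path label; "" is the root
links = {"bab": "ab", "ab": "b", "b": "", "": ""}
skips = build_suffix_skips(links)
print(len(skips), trim(skips, "bab", 1), trim(skips, "bab", 2), repr(trim(skips, "bab", 3)))
```

With four nodes the table has three levels (distances 1, 2 and 4), mirroring the three rows of skips listed on the slide.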
45 splitMEM software C++, open source. Input modes: single genome (fasta file); pan-genome (multi-fasta file). Multi k-mer: construct several compressed de Bruijn graphs without rebuilding the suffix tree.
46 Outline 1 Overview 2 Data Structures 3 splitmem Algorithm 4 Pan-genome Analysis
47 B. anthracis and E. coli Pan-genome analysis Examine graph properties: number of nodes, edges, avg. degree; node length distribution; genome sharing among nodes; distribution of node distances to core genome. Other properties that can be studied: girth, diameter, modularity, network motifs, etc. Functional enrichment of highly conserved or genome-specific genes.
48 Pan-genome analysis
49 Pan-genome analysis Graphs of main chromosomes: 9 strains of Bacillus anthracis; selection of 9 strains of Escherichia coli. [Table: Species, k, Nodes, Edges, Avg. Degree; three rows each for B. anthracis and E. coli]
50 Pan-genome analysis B. anthracis and E. coli Examine graph properties: node length distribution; genome sharing among nodes; distribution of node distances to core genome
51 Pan-genome analysis Histogram of Node Lengths
52 Pan-genome analysis Histogram of Node Lengths Spike at 2k: SNPs
53 Pan-genome analysis B. anthracis and E. coli Examine graph properties: node length distribution; genome sharing among nodes; distribution of node distances to core genome
54 Pan-genome analysis B. anthracis E. coli Fraction of nodes with each level of genome sharing
55 Pan-genome analysis B. anthracis and E. coli Examine graph properties: node length distribution; genome sharing among nodes; distribution of node distances to core genome
56 Pan-genome analysis Graph encodes sequence context of segments. Core genome: subsequences that occur in at least 70% of the underlying genomes. Branch and bound search. A node can be farther from the core in hops while being closer in base pairs.
57 Pan-genome analysis Node distances to core genome Searched hop radius
58 Summary Identify pan-genome relationships graphically. Topological relationship between suffix tree and compressed de Bruijn graph. Direct construction of compressed de Bruijn graph for single or pan-genome. Introduce suffix skips. Explore pan-genome graphs of B. anthracis, E. coli. SplitMEM: Graphical pan-genome analysis with suffix skips. Marcus, S, Lee, H, Schatz, MC (2014) BioRxiv
59 Future work Improve splitMEM software: reduce space using a compressed full-text index instead of a suffix tree; approximate indexing of strains to form a pan-genome graph; alignment of reads to the pan-genome. Biological applications: functional enrichment of core-genome and genome-specific segments; expand the study to a larger collection of microbes and larger genomes.
60 Acknowledgments Michael Schatz Hayan Lee Giuseppe Narzisi James Gurtowski Schatz Lab IT department Todd Heywood
61 Thank You!
62 Pan-genome analysis Branch and bound search (like Dijkstra's shortest path algorithm) to compute the bp distance from each non-core node to the core genome. Traverse all distinct paths from the source until either a core node is reached or the current node was already visited by a shorter path. Bounded search: once a core node is found, its distance bounds the maximum search distance along other paths.
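A hedged sketch of such a search follows, written as a Dijkstra-style traversal with a priority queue. Several details are assumptions of this example rather than statements about the splitMEM analysis: edges are treated as directed, the distance is taken to be the total length in bp of the non-core nodes passed through after the source, and the names `distance_to_core`, `adj`, `length` and `core` are hypothetical.

```python
import heapq

def distance_to_core(adj, length, core, source):
    """Fewest base pairs on any path from `source` to a core node, or None if none is reachable."""
    if source in core:
        return 0
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best.get(node, float("inf")):
            continue  # this node was already visited by a shorter path
        if node in core:
            return dist  # first core node popped bounds every remaining path
        for nxt in adj.get(node, ()):
            # assumption: passing through a non-core node costs its length in bp
            cand = dist + (0 if nxt in core else length[nxt])
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(heap, (cand, nxt))
    return None

# toy graph: the core is two hops away through one long node, or three hops through two short ones
adj = {"s": ["a", "b"], "a": ["core1"], "b": ["c"], "c": ["core2"]}
length = {"s": 50, "a": 1000, "b": 10, "c": 10}
core = {"core1", "core2"}
print(distance_to_core(adj, length, core, "s"))
```

The toy graph shows why hops and base pairs can disagree: the path with more hops is the shorter one in bp.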
Local Sequence Alignment & Heuristic Local Aligners Lectures 18 Nov 28, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall
More informationData Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.
Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data
More informationSolution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.
Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree. The basic approach is the same as when using a suffix tree,
More informationLecture: Analysis of Algorithms (CS )
Lecture: Analysis of Algorithms (CS483-001) Amarda Shehu Spring 2017 1 The Fractional Knapsack Problem Huffman Coding 2 Sample Problems to Illustrate The Fractional Knapsack Problem Variable-length (Huffman)
More informationDivya R. Singh. Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques. February A Thesis Presented by
Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques A Thesis Presented by Divya R. Singh to The Faculty of the Graduate College of the University of Vermont In Partial Fulfillment of
More informationLempel-Ziv compression: how and why?
Lempel-Ziv compression: how and why? Algorithms on Strings Paweł Gawrychowski July 9, 2013 s July 9, 2013 2/18 Outline Lempel-Ziv compression Computing the factorization Using the factorization July 9,
More informationThis paper proposes: Mining Frequent Patterns without Candidate Generation
Mining Frequent Patterns without Candidate Generation a paper by Jiawei Han, Jian Pei and Yiwen Yin School of Computing Science Simon Fraser University Presented by Maria Cutumisu Department of Computing
More informationGraph and Digraph Glossary
1 of 15 31.1.2004 14:45 Graph and Digraph Glossary A B C D E F G H I-J K L M N O P-Q R S T U V W-Z Acyclic Graph A graph is acyclic if it contains no cycles. Adjacency Matrix A 0-1 square matrix whose
More informationGraph Algorithms in Bioinformatics
Graph Algorithms in Bioinformatics Computational Biology IST Ana Teresa Freitas 2015/2016 Sequencing Clone-by-clone shotgun sequencing Human Genome Project Whole-genome shotgun sequencing Celera Genomics
More informationBuilding approximate overlap graphs for DNA assembly using random-permutations-based search.
An algorithm is presented for fast construction of graphs of reads, where an edge between two reads indicates an approximate overlap between the reads. Since the algorithm finds approximate overlaps directly,
More informationCSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions
CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions Note: Solutions to problems 4, 5, and 6 are due to Marius Nicolae. 1. Consider the following algorithm: for i := 1 to α n log e n
More informationShotgun sequencing. Coverage is simply the average number of reads that overlap each true base in genome.
Shotgun sequencing Genome (unknown) Reads (randomly chosen; have errors) Coverage is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 just draw a line
More informationAlgorithms Dr. Haim Levkowitz
91.503 Algorithms Dr. Haim Levkowitz Fall 2007 Lecture 4 Tuesday, 25 Sep 2007 Design Patterns for Optimization Problems Greedy Algorithms 1 Greedy Algorithms 2 What is Greedy Algorithm? Similar to dynamic
More informationComputing the Longest Common Substring with One Mismatch 1
ISSN 0032-9460, Problems of Information Transmission, 2011, Vol. 47, No. 1, pp. 1??. c Pleiades Publishing, Inc., 2011. Original Russian Text c M.A. Babenko, T.A. Starikovskaya, 2011, published in Problemy
More information