Suffix trees and applications. String Algorithms

14 Tries A trie is a data structure for storing and retrieving strings.

Example: x1 = ab, x2 = abc. The trie has the single path a, b, c; the node reached by ab is marked 1 and the node reached by abc is marked 2.

Observations: shared prefixes imply shared initial paths. Often we want each string to correspond to a unique root-to-leaf path, i.e. make sure that no input string is a prefix of another. How? Append a terminator $ that occurs in no input string: x1 = ab$, x2 = abc$, x3 = $.

Application: Given a query string y[1..m], we can determine if y equals one of the input strings (or a prefix of one) in time O(m)...
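
The O(m) lookup can be sketched as follows. This is a minimal illustration, not taken from the slides: the names TrieNode, build_trie and lookup are made up here, and children are kept in a Python dict, so each step costs expected O(1).

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # letter -> TrieNode
        self.is_input = False   # True if an input string ends at this node


def build_trie(strings):
    """Insert every string letter by letter, sharing common prefixes."""
    root = TrieNode()
    for s in strings:
        node = root
        for letter in s:
            node = node.children.setdefault(letter, TrieNode())
        node.is_input = True
    return root


def lookup(root, y):
    """Return "equal", "prefix" or None after at most len(y) steps."""
    node = root
    for letter in y:
        node = node.children.get(letter)
        if node is None:
            return None          # y leaves the trie: not stored, not a prefix
    return "equal" if node.is_input else "prefix"


root = build_trie(["ab", "abc"])
print(lookup(root, "abc"), lookup(root, "a"), lookup(root, "abd"))
```

Here the is_input flag plays the role that the terminator $ plays on the slides: it distinguishes a stored string from a proper prefix of one.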

15 What is the space complexity?

17 Compacted tries Saving space: Eliminate all internal nodes of degree 2 (one parent, one child) by merging their two edges into a single edge labelled with a string...

Example: x1 = ab$, x2 = abc$, x3 = $. The compacted trie has a root with the edges ab and $ (leaf 3); below ab are the edges $ (leaf 1) and c$ (leaf 2).

If we have n input strings, then the trie has n+1 leaves (the n strings plus the empty string $) and at most n internal nodes, i.e. space O(n) for the tree. What about the labels? Each label can be represented in space O(1) as a triple, e.g. ab = (1,1,2): string x1, positions 1 to 2.
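
A small sketch of the triple representation, assuming the reading (k, i, j) = "positions i..j of string x_k" with 1-based, inclusive indices. The helper name decode_label is hypothetical.

```python
def decode_label(strings, triple):
    """Expand a constant-size edge label (k, i, j) into the substring
    x_k[i..j], using 1-based inclusive positions as on the slide."""
    k, i, j = triple
    return strings[k - 1][i - 1:j]


strings = ["ab$", "abc$", "$"]            # x1, x2, x3 from the example
print(decode_label(strings, (1, 1, 2)))   # the edge label ab
print(decode_label(strings, (2, 3, 4)))   # the edge label c$
```

The tree stores only the three integers per edge; the letters themselves live in the input strings.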

18 If there are n strings... How many leaves are there? How many inner nodes? How many edges?

21 Suffix tree The suffix tree T(x) of string x[1..n] is the compacted trie of all suffixes x[i..n] for i = 1,..,n+1, i.e. including the empty suffix.

Example for x = tatat: the suffixes are tatat$, atat$, tat$, at$, t$ and $ (the empty suffix ε followed by $).

25 A larger example Node has path-label ssi and is at depth 3... Path-label of leaf i is suffix i, i.e. x[i..n]$... Path-label of the lowest common ancestor of leaves i and j is the longest common prefix of suffixes i and j of x...

26 What is the space complexity?

27 Space consumption

32 Constructing suffix trees Constructing T(x) by inserting each suffix one by one takes time O(n^2). Can we do better? [Weiner 1973]: T(x) can be constructed in time O(n)... There are two practical algorithms that construct the suffix tree in linear time: McCreight (1976) and Ukkonen (1993)...
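
The quadratic construction can be sketched like this. It is the naive suffix-by-suffix insertion with edge splitting, not Weiner's, McCreight's or Ukkonen's algorithm. The names Node, build_suffix_tree and leaf_labels are illustrative, and the sketch assumes that $ does not occur in x.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start = start    # the edge into this node is text[start:end]
        self.end = end
        self.children = {}    # first letter of an outgoing edge -> Node
        self.leaf = leaf      # 1-based suffix number for leaves, else None


def build_suffix_tree(x):
    """Insert the suffixes of x$ one by one; O(n^2) time in the worst case."""
    text = x + "$"
    root = Node()
    for i in range(len(text)):
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:                 # no edge starts with this letter
                node.children[text[pos]] = Node(pos, len(text), leaf=i + 1)
                break
            k = child.start                   # walk along the edge label
            while k < child.end and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:                # whole edge matched: descend
                node = child
                continue
            mid = Node(child.start, k)        # mismatch inside the edge: split
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[pos]] = Node(pos, len(text), leaf=i + 1)
            break
    return text, root


def leaf_labels(text, node, prefix=""):
    """Yield (suffix number, path-label) for every leaf below node."""
    label = prefix + text[node.start:node.end]
    if node.leaf is not None:
        yield node.leaf, label
    for child in node.children.values():
        yield from leaf_labels(text, child, label)


text, root = build_suffix_tree("tatat")
for number, label in sorted(leaf_labels(text, root)):
    print(number, label)
```

Because $ is unique, no suffix is a prefix of another, so every insertion ends in a mismatch and creates exactly one new leaf.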

33 What about applications?... exact matching, finding repeats, longest common substring...

34 Exact matching Given string x and pattern y, report where y occurs in x. If y occurs in x at position i, then y is a prefix of suffix i of x, so y is spelled by an initial part of the path from the root to leaf i in T(x).

39 Exact matching Pattern ata occurs at position 2 in tatat. Time: O(|y|) using the suffix tree T(x).

44 Exact matching Pattern tatt does not occur in tatat. Time: O(|y|) using the suffix tree T(x).
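
The search can be sketched as below. To keep the example short it uses an uncompacted suffix trie (quadratic space) instead of the suffix tree T(x); the walk from the root is the same. The names SuffixTrieNode, build_suffix_trie and occurrences are made up for this sketch.

```python
class SuffixTrieNode:
    def __init__(self):
        self.children = {}    # letter -> SuffixTrieNode
        self.suffix = None    # 1-based start position if a suffix ends here


def build_suffix_trie(x):
    """Uncompacted trie of all suffixes of x$ (simple, but O(n^2) space)."""
    text = x + "$"
    root = SuffixTrieNode()
    for i in range(len(text)):
        node = root
        for letter in text[i:]:
            node = node.children.setdefault(letter, SuffixTrieNode())
        node.suffix = i + 1
    return root


def occurrences(root, y):
    """Spell y from the root, then report every leaf below the end point."""
    node = root
    for letter in y:
        node = node.children.get(letter)
        if node is None:
            return []                      # y falls off the trie: no match
    found, stack = [], [node]
    while stack:
        v = stack.pop()
        if v.suffix is not None:
            found.append(v.suffix)
        stack.extend(v.children.values())
    return sorted(found)


root = build_suffix_trie("tatat")
print(occurrences(root, "ata"), occurrences(root, "tatt"))
```

Spelling y takes |y| steps; the leaves below the end point are exactly the suffixes that have y as a prefix, i.e. the positions where y occurs.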

45 Repeats A pair of substrings R = (S[i_1..j_1], S[i_2..j_2]) is a...

48 Finding exact repeats

55 A larger example

56 Finding maximal exact repeats

61 Other repeats Maximal repeats with bounded gap in time O(n log n + z); tandem repeats in time O(n log n + z); palindromic repeats in time O(n + z), where z is the size of the output... all using suffix trees...

63 More strings The longest common substring of x[1..n] and y[1..m] is the longest string z which occurs in both x and y... Can this be found efficiently using a suffix tree? z is the longest string that is a common prefix of some pair of suffixes x[i..n] and y[j..m].

70 More strings z is the longest common prefix of any pair of suffixes x[i..n] and y[j..m]. Idea: Build a compacted trie of all suffixes of x and y, such that each suffix of x and y corresponds to a unique root-to-leaf path... Example for x = tatat and y = aataa: the suffixes of x are tatat$, atat$, tat$, at$, t$, ε$ and the suffixes of y are aataa, ataa, taa, aa, a, ε.

72 More strings Observe: z is the path-label of the deepest node with suffixes from both x and y as leaves in its sub-tree... Time: O(n+m)
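
The observation can be illustrated with a small sketch. It again uses an uncompacted trie, so it runs in quadratic rather than O(n+m) time; each node carries a bit mask recording whether suffixes of x, of y, or of both pass through it. The names MarkedNode and longest_common_substring are hypothetical.

```python
class MarkedNode:
    def __init__(self):
        self.children = {}   # letter -> MarkedNode
        self.mask = 0        # bit 1: below a suffix of x, bit 2: of y


def longest_common_substring(x, y):
    """Path-label of the deepest node reached by suffixes of both strings."""
    root = MarkedNode()
    for bit, s in ((1, x), (2, y)):
        for i in range(len(s)):
            node = root
            for letter in s[i:]:
                node = node.children.setdefault(letter, MarkedNode())
                node.mask |= bit
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for letter, child in node.children.items():
            if child.mask == 3:            # both x and y occur below child
                stack.append((child, label + letter))
    return best


print(longest_common_substring("tatat", "aataa"))
```

For tatat and aataa the only common substring of length 3 is ata, and none of length 4 exists. In the compacted generalized suffix tree the same marks can be computed bottom-up in one traversal, which gives the O(n+m) bound stated on the slide.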

73 Generalized suffix tree This is the generalized suffix tree of tatat and aataa. It can be constructed by constructing the suffix tree of... tatat$aataa

74 Generalized suffix tree [Figure: the concatenation x$y, with x at positions 1..n, $ at position n+1 and y at positions n+2..n+m+1]... we must argue that we get the same branching structure...

77 Generalized suffix tree Case 1: both suffixes start in x, at positions i and j; the leaves are labelled (1, i) and (1, j).

78 Generalized suffix tree Case 2: both suffixes start in y, at positions i' and j' of y, i.e. at positions i'+(n+1) and j'+(n+1) of the concatenation; the leaves are labelled (2, i') and (2, j').

79 Generalized suffix tree Case 3: one suffix starts in x at position i and the other in y at position j', i.e. at position j'+(n+1) of the concatenation; the leaves are labelled (1, i) and (2, j').

80 Is everything great?

83 Space consumption Fact: T(x) requires O(n) space, where n = |x|... but how much space does it consume in practice? Somewhere between 10 and 40 bytes per letter in x... Is this a problem? Depends on n, if then yes...

85 Alphabet size How much time does it take to find the proper edge out from a node when searching in a suffix tree? Time proportional to the out-degree of the node, so for alphabet A and pattern P a search in practice takes time O(|A| |P|)... If |A| is large, e.g. 256, this matters!!

86 Alphabet size Idea 1: Organising the children in a search tree reduces the search time per node from O(|A|) to O(log |A|)... (requires an ordered alphabet)

87 Alphabet size Idea 2: Organising the children in a vector of size |A| indexed by letters reduces the search time per node from O(|A|) to O(1)... (requires a finite alphabet)

88 Alphabet size Idea 3: Use some other dictionary for mapping letters to children... the alphabet size matters in practice...
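
The three ideas can be compared in a short sketch. The class names are invented for illustration. Idea 1 is approximated by binary search over a sorted list rather than a real search tree, Idea 2 assumes letters with code points below 256, and Idea 3 uses Python's built-in hash table.

```python
import bisect


class SortedChildren:
    """Idea 1 (approximated): sorted letters, O(log |A|) lookup by bisection."""
    def __init__(self):
        self.letters, self.nodes = [], []

    def put(self, letter, node):
        i = bisect.bisect_left(self.letters, letter)
        if i < len(self.letters) and self.letters[i] == letter:
            self.nodes[i] = node
        else:
            self.letters.insert(i, letter)
            self.nodes.insert(i, node)

    def get(self, letter):
        i = bisect.bisect_left(self.letters, letter)
        if i < len(self.letters) and self.letters[i] == letter:
            return self.nodes[i]
        return None


class ArrayChildren:
    """Idea 2: one slot per letter, O(1) lookup but |A| slots per node."""
    def __init__(self, alphabet_size=256):
        self.slots = [None] * alphabet_size

    def put(self, letter, node):
        self.slots[ord(letter)] = node

    def get(self, letter):
        return self.slots[ord(letter)]


class DictChildren:
    """Idea 3: some other dictionary, here a hash table (expected O(1))."""
    def __init__(self):
        self.table = {}

    def put(self, letter, node):
        self.table[letter] = node

    def get(self, letter):
        return self.table.get(letter)


for cls in (SortedChildren, ArrayChildren, DictChildren):
    children = cls()
    for letter in "ta$":
        children.put(letter, "node-" + letter)
    print(cls.__name__, children.get("a"), children.get("x"))
```

The array variant is the fastest per step but multiplies the per-node space by |A|, which ties back to the space concerns above.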


More information

Divya R. Singh. Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques. February A Thesis Presented by

Divya R. Singh. Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques. February A Thesis Presented by Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques A Thesis Presented by Divya R. Singh to The Faculty of the Graduate College of the University of Vermont In Partial Fulfillment of

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:

More information

SORTING. Practical applications in computing require things to be in order. To consider: Runtime. Memory Space. Stability. In-place algorithms???

SORTING. Practical applications in computing require things to be in order. To consider: Runtime. Memory Space. Stability. In-place algorithms??? SORTING + STRING COMP 321 McGill University These slides are mainly compiled from the following resources. - Professor Jaehyun Park slides CS 97SI - Top-coder tutorials. - Programming Challenges book.

More information

x-fast and y-fast Tries

x-fast and y-fast Tries x-fast and y-fast Tries Problem Problem Set Set 7 due due in in the the box box up up front. front. That's That's the the last last problem problem set set of of the the quarter! quarter! Outline for Today

More information

CMPUT 403: Strings. Zachary Friggstad. March 11, 2016

CMPUT 403: Strings. Zachary Friggstad. March 11, 2016 CMPUT 403: Strings Zachary Friggstad March 11, 2016 Outline Tries Suffix Arrays Knuth-Morris-Pratt Pattern Matching Tries Given a dictionary D of strings and a query string s, determine if s is in D. Using

More information

Data Structures and Algorithms(12)

Data Structures and Algorithms(12) Ming Zhang "Data s and Algorithms" Data s and Algorithms(12) Instructor: Ming Zhang Textbook Authors: Ming Zhang, Tengjiao Wang and Haiyan Zhao Higher Education Press, 28.6 (the "Eleventh Five-Year" national

More information

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far Chapter 5 Hashing 2 Introduction hashing performs basic operations, such as insertion, deletion, and finds in average time better than other ADTs we ve seen so far 3 Hashing a hash table is merely an hashing

More information

Data Structures and Methods. Johan Bollen Old Dominion University Department of Computer Science

Data Structures and Methods. Johan Bollen Old Dominion University Department of Computer Science Data Structures and Methods Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen January 20, 2004 Page 1 Lecture Objectives 1. To this point:

More information

Predecessor. Predecessor. Predecessors. Predecessors. Predecessor Problem van Emde Boas Tries. Predecessor Problem van Emde Boas Tries.

Predecessor. Predecessor. Predecessors. Predecessors. Predecessor Problem van Emde Boas Tries. Predecessor Problem van Emde Boas Tries. Philip Bille s problem. Maintain a set S U = {,..., u-} supporting predecessor(x): return the largest element in S that is x. sucessor(x): return the smallest element in S that is x. insert(x): set S =

More information

Space Efficient Linear Time Construction of

Space Efficient Linear Time Construction of Space Efficient Linear Time Construction of Suffix Arrays Pang Ko and Srinivas Aluru Dept. of Electrical and Computer Engineering 1 Laurence H. Baker Center for Bioinformatics and Biological Statistics

More information

String Patterns and Algorithms on Strings

String Patterns and Algorithms on Strings String Patterns and Algorithms on Strings Lecture delivered by: Venkatanatha Sarma Y Assistant Professor MSRSAS-Bangalore 11 Objectives To introduce the pattern matching problem and the important of algorithms

More information

Syllabus. 5. String Problems. strings recap

Syllabus. 5. String Problems. strings recap Introduction to Algorithms Syllabus Recap on Strings Pattern Matching: Knuth-Morris-Pratt Longest Common Substring Edit Distance Context-free Parsing: Cocke-Younger-Kasami Huffman Compression strings recap

More information

Lowest Common Ancestor (LCA) Queries

Lowest Common Ancestor (LCA) Queries Lowest Common Ancestor (LCA) Queries A technique with application to approximate matching Chris Lewis Approximate Matching Match pattern to text Insertion/Deletion/Substitution Applications Bioinformatics,

More information

University of Waterloo CS240 Spring 2018 Help Session Problems

University of Waterloo CS240 Spring 2018 Help Session Problems University of Waterloo CS240 Spring 2018 Help Session Problems Reminder: Final on Wednesday, August 1 2018 Note: This is a sample of problems designed to help prepare for the final exam. These problems

More information

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises 308-420A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises Section 1.2 4, Logarithmic Files Logarithmic Files 1. A B-tree of height 6 contains 170,000 nodes with an

More information

Alphabet-Dependent String Searching with Wexponential Search Trees

Alphabet-Dependent String Searching with Wexponential Search Trees Alphabet-Dependent String Searching with Wexponential Search Trees Johannes Fischer and Pawe l Gawrychowski February 15, 2013 arxiv:1302.3347v1 [cs.ds] 14 Feb 2013 Abstract It is widely assumed that O(m

More information

K-structure, Separating Chain, Gap Tree, and Layered DAG

K-structure, Separating Chain, Gap Tree, and Layered DAG K-structure, Separating Chain, Gap Tree, and Layered DAG Presented by Dave Tahmoush Overview Improvement on Gap Tree and K-structure Faster point location Encompasses Separating Chain Better storage Designed

More information

Mapping a Dynamic Prefix Tree on a P2P Network

Mapping a Dynamic Prefix Tree on a P2P Network Mapping a Dynamic Prefix Tree on a P2P Network Eddy Caron, Frédéric Desprez, Cédric Tedeschi GRAAL WG - October 26, 2006 Outline 1 Introduction 2 Related Work 3 DLPT architecture 4 Mapping 5 Conclusion

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 6: Alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg

More information

Algorithms and Data Structures

Algorithms and Data Structures Algorithms and Data Structures Optimal Search Trees; Tries Ulf Leser Content of this Lecture Optimal Search Trees Definition Construction Analysis Searching Strings: Tries Ulf Leser: Algorithms and Data

More information

MITOCW watch?v=ninwepprkdq

MITOCW watch?v=ninwepprkdq MITOCW watch?v=ninwepprkdq The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To

More information

Linearized Suffix Tree: an Efficient Index Data Structure with the Capabilities of Suffix Trees and Suffix Arrays

Linearized Suffix Tree: an Efficient Index Data Structure with the Capabilities of Suffix Trees and Suffix Arrays Algorithmica (2008) 52: 350 377 DOI 10.1007/s00453-007-9061-2 Linearized Suffix Tree: an Efficient Index Data Structure with the Capabilities of Suffix Trees and Suffix Arrays Dong Kyue Kim Minhwan Kim

More information

Suffix Arrays Slides by Carl Kingsford

Suffix Arrays Slides by Carl Kingsford Suffix Arrays 02-714 Slides by Carl Kingsford Suffix Arrays Even though Suffix Trees are O(n) space, the constant hidden by the big-oh notation is somewhat big : 20 bytes / character in good implementations.

More information

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS TRIE BASED METHODS FOR STRING SIMILARTIY JOINS Venkat Charan Varma Buddharaju #10498995 Department of Computer and Information Science University of MIssissippi ENGR-654 INFORMATION SYSTEM PRINCIPLES RESEARCH

More information

L02 : 08/21/2015 L03 : 08/24/2015.

L02 : 08/21/2015 L03 : 08/24/2015. L02 : 08/21/2015 http://www.csee.wvu.edu/~adjeroh/classes/cs493n/ Multimedia use to be the idea of Big Data Definition of Big Data is moving data (It will be different in 5..10 yrs). Big data is highly

More information

CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions

CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions Note: Solutions to problems 4, 5, and 6 are due to Marius Nicolae. 1. Consider the following algorithm: for i := 1 to α n log e n

More information

Final Exam Solutions

Final Exam Solutions COS 226 FINAL SOLUTIONS, FALL 214 1 COS 226 Algorithms and Data Structures Fall 214 Final Exam Solutions 1. Digraph traversal. (a) 8 5 6 1 4 2 3 (b) 4 2 3 1 5 6 8 2. Analysis of algorithms. (a) N (b) log

More information

CMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: Suffix trees Suffix arrays

CMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: Suffix trees Suffix arrays CMSC423: Bioinformatic Algorithms, Databases and Tools Exact string matching: Suffix trees Suffix arrays Searching multiple strings Can we search multiple strings at the same time? Would it help if we

More information

Exact Matching: Hash-tables and Automata

Exact Matching: Hash-tables and Automata 18.417 Introduction to Computational Molecular Biology Lecture 10: October 12, 2004 Scribe: Lele Yu Lecturer: Ross Lippert Editor: Mark Halsey Exact Matching: Hash-tables and Automata While edit distances

More information

Efficient Sequential Algorithms, Comp309. Motivation. Longest Common Subsequence. Part 3. String Algorithms

Efficient Sequential Algorithms, Comp309. Motivation. Longest Common Subsequence. Part 3. String Algorithms Efficient Sequential Algorithms, Comp39 Part 3. String Algorithms University of Liverpool References: T. H. Cormen, C. E. Leiserson, R. L. Rivest Introduction to Algorithms, Second Edition. MIT Press (21).

More information

Algorithms Theory. 15 Text Search (2)

Algorithms Theory. 15 Text Search (2) Algorithms Theory 15 Text Search (2) Construction of suffix trees Prof. Dr. S. Albers Suffix tree t = x a b x a $ 1 2 3 4 5 6 a x a b x a $ 1 $ a x b $ $ 4 3 $ b x a $ 6 5 2 2 : implicit suffix trees Definition:

More information

Lexical Analysis. Implementation: Finite Automata

Lexical Analysis. Implementation: Finite Automata Lexical Analysis Implementation: Finite Automata Outline Specifying lexical structure using regular expressions Finite automata Deterministic Finite Automata (DFAs) Non-deterministic Finite Automata (NFAs)

More information

Chapter 12 Digital Search Structures

Chapter 12 Digital Search Structures Chapter Digital Search Structures Digital Search Trees Binary Tries and Patricia Multiway Tries C-C Tsai P. Digital Search Tree A digital search tree is a binary tree in which each node contains one element.

More information

Parallel Distributed Memory String Indexes

Parallel Distributed Memory String Indexes Parallel Distributed Memory String Indexes Efficient Construction and Querying Patrick Flick & Srinivas Aluru Computational Science and Engineering Georgia Institute of Technology 1 In this talk Overview

More information

splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014

splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014 splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014 Outline 1 Overview 2 Data Structures 3 splitmem Algorithm 4 Pan-genome Analysis Objective Input! Output! A B C D Several

More information

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio Project Proposal ECE 526 Spring 2006 Modified Data Structure of Aho-Corasick Benfano Soewito, Ed Flanigan and John Pangrazio 1. Introduction The internet becomes the most important tool in this decade

More information