An introduction to suffix trees and indexing


1 An introduction to suffix trees and indexing Tomáš Flouri Solon P. Pissis Heidelberg Institute for Theoretical Studies December 3, 2012

2 Outline: 1 Introduction. 2 Basic Definitions (graph theory; alphabet and strings). 3 Dictionaries (trie; Patricia tree). 4 Suffix tree (suffix trie; suffix tree; Ukkonen's algorithm). 5 Example. 6 Overview.

5 Introduction. Two main problem areas in text retrieval: (1) string matching; (2) indexing and querying. Both have exact and approximate cases.

6 Introduction: Exact string matching. Many efficient algorithms exist: Knuth-Morris-Pratt; Boyer-Moore, Boyer-Moore-Horspool, Turbo-Boyer-Moore, etc.; Aho-Corasick; ...

7 Introduction Indexing - 1 Problem Given a text T, we need to construct an efficient data structure D which will serve as an index of T, so that we can efficiently query text T. What do we expect from an efficient indexing data structure?

8 Introduction: Indexing - 2. Given a query pattern P, we want to find all occurrences of P in the preprocessed text T using the indexing data structure D. The data structure D is efficient if: it can be built in time linear in the size of T, \(O(|T|)\); it occupies space linear in the size of T, \(O(|T|)\); it can answer whether P occurs in T in time linear in the size of P, \(O(|P|)\); it can report all occurrences of P in T in time \(O(|P| + occ)\), where occ is the number of occurrences.

9 Introduction: Indexing - 2. Some efficient indexing data structures include: suffix automata (DAWG) and variations such as the CDAWG; suffix trees; position heaps; suffix arrays. In this lecture we will concentrate only on suffix trees.

15 Graph theory: Graph, Cycle, Path. Graph: A graph is a pair \(G = (V, E)\) of sets such that \(E \subseteq V \times V\). Path: A path of length \(n\) in a graph \(G = (V, E)\) is a sequence \(v_0, v_1, \ldots, v_n \in V\) such that \((v_0, v_1), (v_1, v_2), \ldots, (v_{n-1}, v_n) \in E\). Cycle: A path \(v_0, v_1, \ldots, v_n, v_0\), where \(n \ge 2\), is called a cycle.

17 Graph theory: Rooted tree, subtree, tree height, node height. Tree: A rooted tree is a connected acyclic graph \(T = (V, E)\) with a special vertex \(v \in V\) called the root. Nodes with degree 1 are called leaves.

22 Alphabet and strings. Definition (Alphabet): An alphabet \(\Sigma\) is a finite non-empty set whose elements are called letters. Definition (String): A string on an alphabet \(\Sigma\) is a finite, possibly empty, sequence of elements of \(\Sigma\). The zero-letter sequence is called the empty string and is denoted by \(\varepsilon\). The set of all possible strings on the alphabet \(\Sigma\) is denoted by \(\Sigma^*\). Definition (Length of string): The length of a string \(x\) is the length of the sequence associated with \(x\), and is denoted by \(|x|\).

24 Alphabet and strings. We denote by \(x[i]\), for all \(1 \le i \le |x|\), the letter at index \(i\) of \(x\). We also call index \(i\), for all \(1 \le i \le |x|\), a position in \(x\) when \(x \ne \varepsilon\). It follows that the \(i\)th letter of \(x\) is the letter at position \(i\) in \(x\), and that \(x = x[1..|x|]\). Definition (Factor of string): A string \(x\) is a factor (substring) of a string \(y\) if there exist two strings \(u\) and \(v\) such that \(y = uxv\). We denote the factor (substring) of \(x\) starting at position \(i\) and ending at position \(j\) by \(x[i..j]\).

26 Trie (from "retrieval"). Construct a dictionary for the set of words {amy, andy, ann, rob, roger, ben, betty}. [Figure: trie for the word set, one letter per edge, with nodes labeled A to T.]

27 Trie (from "retrieval"). [Figure: the same trie with every word terminated by the end marker $.]
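The dictionary on the two trie slides can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the nested-dict representation and the names `build_trie` and `contains` are assumptions made here, and the end marker `$` follows the second figure.

```python
END = "$"  # end marker, as in the second trie figure


def build_trie(words):
    """Build a trie as nested dicts: one letter per edge, '$' closes a word."""
    root = {}
    for word in words:
        node = root
        for letter in word + END:
            node = node.setdefault(letter, {})
    return root


def contains(trie, word):
    """Return True if `word` was inserted as a whole word."""
    node = trie
    for letter in word + END:
        if letter not in node:
            return False
        node = node[letter]
    return True


words = ["amy", "andy", "ann", "rob", "roger", "ben", "betty"]
trie = build_trie(words)
print(contains(trie, "roger"), contains(trie, "rog"))
```

A lookup follows one edge per letter, so it takes time proportional to the length of the query word, independent of the number of stored words (assuming constant-time dictionary access).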

28 Patricia tree. 1 Construct a trie. 2 Remove nodes with out-degree 1 and concatenate the labels of the corresponding edges into one edge. [Figure: the trie for {amy, andy, ann, rob, roger, ben, betty} is compacted step by step; the resulting Patricia tree has the edge labels a, my, n, dy, n, ro, b, ger, be, n, tty.]
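The two construction steps can be sketched as follows. The nested-dict representation and the function names are assumptions of this sketch; to reproduce the edge labels of the figure it uses no end marker, which is only safe because no word in the example set is a prefix of another.

```python
def build_trie(words):
    """Step 1: plain trie as nested dicts, one letter per edge (no end marker)."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
    return root


def compress(node):
    """Step 2: merge chains of out-degree-1 nodes into single labeled edges."""
    compact = {}
    for letter, child in node.items():
        label = letter
        while len(child) == 1:           # child has out-degree 1: absorb it
            (next_letter, grandchild), = child.items()
            label += next_letter
            child = grandchild
        compact[label] = compress(child)
    return compact


words = ["amy", "andy", "ann", "rob", "roger", "ben", "betty"]
patricia = compress(build_trie(words))
print(patricia)
```

The root is never removed, and every remaining internal node has at least two children, which is the property the later size bound for suffix trees relies on.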

32 Suffix trie. Given some text, e.g. t = banana, construct the suffix trie: 1 Generate the set Suff(t) of suffixes of t. 2 Construct a trie from Suff(t). The resulting data structure is called a suffix trie. Example: Given the text t = banana$, the set Suff(t) is Suff(t) = {banana$, anana$, nana$, ana$, na$, a$}.

33 Suffix trie - Example. Given the text t = banana$, construct the suffix trie. [Figure: suffix trie of banana$, one letter per edge, with leaves numbered 1 to 6 by the starting position of the corresponding suffix.]
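A minimal sketch of the two steps, assuming a nested-dict trie. The names `suffixes`, `build_suffix_trie` and `is_factor` are chosen here for illustration; like the slide, the sketch leaves the bare end marker `$` out of Suff(t).

```python
def suffixes(t):
    """Suff(t) as on the slide: every suffix except the bare end marker."""
    return [t[i:] for i in range(len(t) - 1)]


def build_suffix_trie(t):
    """Insert every suffix of t into a trie of nested dicts."""
    root = {}
    for suffix in suffixes(t):
        node = root
        for letter in suffix:
            node = node.setdefault(letter, {})
    return root


def is_factor(trie, pattern):
    """A pattern is a factor of t iff it can be spelled from the root."""
    node = trie
    for letter in pattern:
        if letter not in node:
            return False
        node = node[letter]
    return True


trie = build_suffix_trie("banana$")
print(suffixes("banana$"))
print(is_factor(trie, "nan"), is_factor(trie, "nab"))
```

Every factor of t is a prefix of some suffix, which is why a walk from the root answers the query. The price is size: the trie stores one node per distinct factor, which can be quadratic in the length of t.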

34 Suffix tree Suffix tree Definition A suffix tree is a patricia tree of the suffix trie. Construction 1 Construct a suffix trie of text x 2 Eliminate all nodes with out-degree 1 and concatenate the labels in the corresponding edges to one edge.

35 Suffix tree - Example. [Figure: the suffix trie of banana$ is compacted into the suffix tree, whose edges carry the labels a, na, na$, $ and banana$, and whose leaves are numbered 1 to 6.]

38 Suffix tree: Size of suffix tree. Theorem: A suffix tree consists of at most \(2n - 1\) nodes (or \(2n\) if the empty suffix $ is taken into account). Internal nodes of a suffix tree branch (out-degree at least 2), and higher out-degrees only reduce the number of internal nodes, so the binary case treated below is the worst case. Proof (by induction). Base case: For 2 leaves we have 1 internal node. Inductive step: Assume that any binary tree with \(m < N\) leaves consists of exactly \(m - 1\) internal nodes. We must prove that a binary tree with \(N\) leaves has exactly \(N - 1\) internal nodes. A binary tree with \(N\) leaves is made up of: a root node; a left binary tree with \(k\) leaves; a right binary tree with \(N - k\) leaves.

39 Suffix tree: Size of suffix tree. Proof (by induction, continued). By the induction assumption, the left binary tree with \(k\) leaves consists of \(k - 1\) internal nodes, and the right binary tree with \(N - k\) leaves consists of \(N - k - 1\) internal nodes. Therefore, the total number of internal nodes in a binary tree with \(N\) leaves is \((k - 1) + (N - k - 1) + 1 = N - 1\), and thus the total number of nodes is \(2N - 1\).
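The counting argument can be checked empirically with a small sketch. It assumes that the text ends with a unique end marker and, unlike the example slide, inserts all suffixes including the bare `$`, so that the root always branches. The nodes of the suffix tree are then the leaves, the root and the branching nodes of the suffix trie. The function name is hypothetical.

```python
def count_suffix_tree_nodes(t):
    """Return (leaves, internal) of the suffix tree of t, read off its suffix trie.

    Assumes t ends with a unique end marker such as '$'.
    """
    root = {}
    for i in range(len(t)):                 # every suffix, including the bare '$'
        node = root
        for letter in t[i:]:
            node = node.setdefault(letter, {})
    leaves = internal = 0
    stack = [(root, True)]
    while stack:
        node, is_root = stack.pop()
        if not node:
            leaves += 1                     # no children: a leaf
        elif is_root or len(node) >= 2:
            internal += 1                   # survives the compaction step
        stack.extend((child, False) for child in node.values())
    return leaves, internal


for text in ["banana$", "aaaa$", "abcabxabc$"]:
    leaves, internal = count_suffix_tree_nodes(text)
    print(text, leaves, internal, internal <= leaves - 1)
```

With N leaves the sketch never finds more than N - 1 internal nodes; the text aaaa$ reaches the bound exactly, because its suffix tree is binary.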

40 Suffix tree: Suffix tree construction algorithms. Weiner's algorithm (1973): introduced as the position tree; construction in linear time (for constant-size alphabets); characterized as "algorithm of the year". McCreight's algorithm (1976): improved space requirements over Weiner's method; construction in linear time (for constant-size alphabets). Ukkonen's algorithm (1995): same time and space requirements as McCreight's; easier to understand; on-line. Farach's algorithm (1997): linear-time construction for any type of alphabet; hard to implement; the basis for new algorithms, e.g. linear-time construction of position heaps and suffix arrays.

41 Ukkonen's algorithm: Implicit suffix tree. Definition: An implicit suffix tree for a string x is a tree obtained from the suffix tree of x$ by: 1 removing $ from all edge labels; 2 removing any edge that has no label; 3 removing any node with only one child. [Figure: the suffix tree of banana$ is transformed step by step into the implicit suffix tree of banana, whose edges are labeled banana, anana and nana.]

44 Ukkonen's algorithm: Implicit suffix tree. The implicit suffix tree of a string is what results from applying Ukkonen's algorithm to the string without an added end marker $. All suffixes are included, but not necessarily as labels of complete paths leading to leaves. If a unique character (in our case $) is appended to the end of the string, the implicit suffix tree is essentially the same as the (true) suffix tree (only without $).

45 Ukkonen's algorithm: String paths of implicit suffix trees. Given a string y[1..n], the implicit suffix tree \(I_i\) contains each suffix y[1..i], y[2..i], ..., y[i] of the prefix y[1..i] as the label of some path (possibly ending in the middle of an edge). That is, a string path is a string that can be matched along the edges, starting from the root, or equivalently a prefix of some node label.

46 Ukkonen's algorithm. 1 Start with \(T = I_1\). 2 Consecutively update T to \(I_2, I_3, \ldots, I_{n+1}\) in n phases, where \(I_i\) is the implicit suffix tree of the prefix y[1..i]. Phase i+1 updates T from \(I_i\) (with all suffixes of y[1..i]) to \(I_{i+1}\) (with all suffixes of y[1..i+1]). Each phase i+1 consists of extensions j = 1, 2, ..., i+1 (one for each suffix of y[1..i+1]). Extension j ensures that the suffix y[j..i+1] is in \(I_{i+1}\).

47 Ukkonen's algorithm: Suffix extension rules. Rule 1: y[j..i] ends at a leaf. Append y[i+1] to the end of the edge label. Rule 2: y[j..i] doesn't end at a leaf, and the following character is not y[i+1]. Connect the end of the path to a new leaf j by an edge labeled y[i+1]. If the path ended in the middle of an edge, split that edge and insert a new node as the parent of leaf j. Rule 3: The path y[j..i+1] is already in the tree. No update.

48 Ukkonen's algorithm: Complexity. The algorithmic approach presented so far runs in \(O(n^3)\). Proof: Consider a single phase i+1. Each extension rule can be applied in O(1), so applying all i+1 extensions takes time \(\Theta(i)\). Locating the ends of the string paths y[1..i], ..., y[i] by traversing the edge labels takes time \(\sum_{k=1}^{i} k = \Theta(i^2)\). Therefore, the total time for all phases i = 1, 2, ..., n is \(\sum_{i=1}^{n} i^2 = \Theta(n^3)\). This is even worse than the naive algorithm, which runs in \(O(n^2)\). We will see how this approach, with the use of some simple tricks, can achieve linear run-time.
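The untuned phase/extension scheme can be sketched as follows. For brevity the sketch works on an uncompressed trie of nested dicts (an assumption made here, so Rule 2 never has to split an edge) and uses 0-based indices; the function names are hypothetical. Each extension walks its path character by character, which is exactly the cubic behaviour described above.

```python
def build_implicit_suffix_trie(y):
    """Run all phases and extensions naively; return (trie, rules applied)."""
    root = {}
    applied = []                        # rule number used by each extension
    for i in range(len(y)):             # this phase adds the character y[i]
        for j in range(i + 1):          # extension for the suffix starting at j
            node = root
            for letter in y[j:i]:       # locate the end of the existing path
                node = node[letter]
            if node is not root and not node:
                node[y[i]] = {}         # Rule 1: path ends at a leaf, extend it
                applied.append(1)
            elif y[i] not in node:
                node[y[i]] = {}         # Rule 2: branch off a new leaf
                applied.append(2)
            else:
                applied.append(3)       # Rule 3: already in the tree, no update
    return root, applied


def has_path(root, s):
    """True if s can be spelled from the root."""
    node = root
    for letter in s:
        if letter not in node:
            return False
        node = node[letter]
    return True


root, applied = build_implicit_suffix_trie("xabxac")
print(applied)
print(all(has_path(root, "xabxac"[j:]) for j in range(6)))
```

The suffix links, the skip/count trick and the implicit extensions discussed on the following slides all serve to remove the repeated path walk in the innermost loop.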

49 Ukkonen's algorithm: Suffix links. The extensions of phase i+1 need to locate the ends of all i+1 suffixes of y[1..i], and apply Rules 1-3. How to do this efficiently? For each internal node v of \(I_i\) labeled \(x\alpha\), where \(x \in \Sigma\) and \(\alpha \in \Sigma^*\), define s(v) to be the node labeled \(\alpha\). (Do these nodes actually exist?) Then a pointer from v to s(v) is called the suffix link of v. Note: If node v is labeled by a single character, then \(\alpha = \varepsilon\) and s(v) is the root node.

50 Ukkonen's algorithm: Example of suffix links. [Figure: suffix tree for x = xabxac with the edge labels xa, a, bxac and c, and its suffix links.]

51 Ukkonen's algorithm: Why do we need suffix links? Extension j (of phase i+1) finds the end of the path y[j..i] in the tree (and extends it with character y[i+1]). Extension j+1 similarly finds the end of the path y[j+1..i]. Assume that v is an internal node whose string path \(y[j]\alpha\) is (essentially) a prefix of y[j..i]. Then we can avoid traversing path \(\alpha\) when locating the end of path y[j+1..i], by starting from node s(v). Do suffix links always exist?

52 Ukkonen's algorithm: Suffix links existence. Observation: If an internal node v is created during extension j (of phase i+1), then extension j+1 will find the node s(v). Let v be labeled \(x\alpha\). Node v can only be created by extension Rule 2. That is, v is inserted at the end of path y[j..i], which continued with some character \(c \ne y[i+1]\). Therefore, paths \(x\alpha c\) and \(\alpha c\) have been entered before phase i+1. Hence, in extension j+1, node s(v) is either found or created at the end of path \(\alpha\) = y[j+1..i].

53 Ukkonen's algorithm: Speeding up path traversals. Consider the extensions of phase i+1. Extension 1 extends path y[1..i] with character y[i+1]. Extension 1 is easy, as path y[1..i] always ends at leaf 1 and is thus extended by Rule 1. We can perform extension 1 in constant time if we maintain a pointer to the edge at the end of y[1..i]. What about subsequent extensions j+1 (for j = 1, 2, ..., i)?

54 Ukkonen's algorithm: Locating subsequent paths. Extension j has located the end of path y[j..i]. Starting from there, walk up at most one node, either (1) to the root, or (2) to a node v with a suffix link to s(v). In case of (1), traverse path y[j+1..i] explicitly downwards from the root.

55 Ukkonen's algorithm: Locating subsequent paths. In case of (2), let \(x\alpha\) be the label of v, so that y[j..i] = \(x\alpha\beta\) for some \(\beta \in \Sigma^*\). Then follow the suffix link of v, and continue by matching \(\beta\) downwards from node s(v) (whose string path is \(\alpha\)). Having found the end of path \(\alpha\beta\) = y[j+1..i], apply the extension rules to ensure that it extends with y[i+1]. Finally, if a new internal node w was created in extension j, set its suffix link to point to the end node of path y[j+1..i].

56 Ukkonen's algorithm: Locating subsequent paths - Illustration. [Figure: node v with string path \(x\alpha\), its suffix link to s(v) with string path \(\alpha\), and the remainder \(\beta\) = abcd matched downwards below both nodes.]

57 Ukkonen's algorithm: Speeding up explicit traversals. Skip/count trick: In phase i+1, each path y[j..i] that is followed in extension j is known to exist in the tree. The path can therefore be followed by choosing the correct edges, instead of examining every character. Let y[k] be the next character to be matched on path y[j..i]. Now an edge labeled by y[p..q] can be traversed simply by checking that y[p] = y[k], and skipping the next q - p characters of y[j..i]. The time to traverse a path is then proportional to the number of nodes on the path (instead of its string length).
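A minimal sketch of the skip/count walk, under stated assumptions: edges are stored as index pairs (p, q) into y with 0-based, inclusive indices (the slides count from 1), the `Node` class and the helper names are hypothetical, and the suffix tree of xabxac is built by hand rather than by Ukkonen's algorithm.

```python
class Node:
    def __init__(self):
        self.children = {}      # first letter -> (p, q, child); label is y[p..q]


def add_edge(y, parent, p, q):
    """Attach a new child below `parent` via an edge labeled y[p..q]."""
    child = Node()
    parent.children[y[p]] = (p, q, child)
    return child


def skip_count(y, node, k, end):
    """Follow the path y[k..end], which is known to exist below `node`.

    Returns (last node reached, number of path characters lying on the
    edge below that node, number of edges fully traversed).
    """
    hops = 0
    while k <= end:
        p, q, child = node.children[y[k]]   # one comparison selects the edge
        length = q - p + 1
        remaining = end - k + 1
        if remaining < length:              # the path ends inside this edge
            return node, remaining, hops
        k += length                         # skip the rest of the label unread
        node = child
        hops += 1
    return node, 0, hops


# Hand-built suffix tree of y = xabxac (leaf numbers omitted).
y = "xabxac"
root = Node()
u = add_edge(y, root, 0, 1)     # edge "xa"
w = add_edge(y, root, 1, 1)     # edge "a"
add_edge(y, root, 2, 5)         # edge "bxac"
add_edge(y, root, 5, 5)         # edge "c"
for v in (u, w):
    add_edge(y, v, 2, 5)        # edge "bxac"
    add_edge(y, v, 5, 5)        # edge "c"

node, on_edge, hops = skip_count(y, root, 0, 3)   # path "xabx"
print(node is u, on_edge, hops)
```

The walk for xabx inspects only two characters, one per edge chosen, although the path has length four.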

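58 Ukkonen's algorithm: Speeding up explicit traversals. Lemma: For any node v with a suffix link to s(v), it holds that \(\mathrm{depth}(v) - 1 \le \mathrm{depth}(s(v))\). Sketch of proof: The suffix links of the ancestors of v lead to distinct ancestors of s(v), except possibly for one ancestor whose label is a single character and whose suffix link leads to the root.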

59 Ukkonen's algorithm: Linear bound for any single phase. Theorem: Using suffix links and the skip/count trick, a single phase takes time O(n). Proof: There are \(i + 1 \le n + 1\) extensions in phase i+1. In any extension, the work other than tree traversal (that is, applying the extension rules) takes O(1) time. How to bound the work for traversing the tree? To find the end of the next path, an extension first moves at most one level up. Then a suffix link may be followed, which is followed by a down-traversal to match the rest of the path.

60 Ukkonen's algorithm: Linear bound for any single phase. The up-walk in any extension decreases the current node depth by at most one (since it moves up at most one node), and each suffix link traversal decreases the node depth by at most another one (previous Lemma). Thus the current node depth is decremented at most 2n times during the entire phase. On the other hand, the current node depth cannot exceed n, so it is incremented (by following downward edges) at most 3n times. The total run-time of a phase is thus O(n). Improvement: Since there are n phases, the total run-time is \(O(n^2)\).

61 Ukkonen's algorithm: Final improvements (1). Some extensions turn out to be unnecessary to compute explicitly. Observation 1 - Rule 3 terminates the current phase: If path y[j..i+1] is already in the tree, so are paths y[j+1..i+1], ..., y[i+1]. Phase i+1 can therefore be finished at the first extension j that applies Rule 3.

62 Ukkonen's algorithm: Final improvements (2). Observation 2 - Once a leaf, always a leaf: A node created as a leaf remains a leaf thereafter, because no extension rule adds children to a leaf. If extension j created a leaf (numbered j), extension j of any later phase i+1 applies Rule 1 (appending the next character y[i+1] to the label of the edge ending at leaf j). Explicit applications of Rule 1 can be eliminated as follows: use a compressed edge representation (i.e. indices p and q instead of the substring y[p..q]), and represent the end position of each terminal edge by a global value e, for the current end position (phase).
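The effect of the global value e can be illustrated with a short sketch. The class names `GlobalEnd` and `Edge` are hypothetical, and indices are 1-based and inclusive as on the slides: every leaf edge shares one end object, so a single assignment performs all Rule 1 extensions of a phase at once.

```python
class GlobalEnd:
    """Shared, mutable end position e of all leaf edges."""

    def __init__(self, value):
        self.value = value


class Edge:
    """Compressed edge label y[start..end]; `end` is an int or the GlobalEnd."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def length(self):
        end = self.end.value if isinstance(self.end, GlobalEnd) else self.end
        return end - self.start + 1


e = GlobalEnd(3)
leaf_edges = [Edge(1, e), Edge(2, e), Edge(3, e)]   # leaf edges (1, e), (2, e), (3, e)
internal_edge = Edge(1, 2)                          # an internal edge such as (1, 2)

e.value = 4    # start of the next phase: all leaf edges grow by one character
print([edge.length() for edge in leaf_edges], internal_edge.length())
```

Internal edges keep fixed end indices, so they are unaffected by the update.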

63 Ukkonen's algorithm: Eliminating extensions. Denote by \(j_i\) the last non-void extension of phase i (that is, an application of Rule 1 or 2). Obs 1 \(\Rightarrow\) extensions \(1, \ldots, j_i\) of phase i are non-void \(\Rightarrow\) leaves \(1, \ldots, j_i\) have been created by the end of phase i. Obs 2 \(\Rightarrow\) extensions \(1, \ldots, j_i\) of any subsequent phase all apply Rule 1 \(\Rightarrow\) \(j_{i+1} \ge j_i\). Hence execute only extensions \(j_i + 1, j_i + 2, \ldots\) explicitly in phase i+1.

64 Ukkonen's algorithm: Single phase algorithm. Algorithm for phase i+1 with unnecessary extensions eliminated: 1 Set e = i+1 (this implements extensions \(1, \ldots, j_i\) implicitly). 2 Compute extensions \(j_i + 1, \ldots, j\) until j > i+1 or Rule 3 was applied in extension j. 3 Set \(j_{i+1} = j - 1\) (for the next phase). All these tricks together can be shown to lead to linear run-time.

65 Ukkonen's algorithm: Complexity of the tuned implementation (1). Theorem: Ukkonen's algorithm builds the suffix tree for y[1..n] in time O(n), when implemented using the mentioned tricks. Proof: The extensions computed explicitly in any two phases i and i+1 are disjoint except for extension j, which may be computed anew in phase i+1. The second computation of extension j can be done in O(1) by remembering the end of the path entered in the previous computation.

66 Ukkonen's algorithm: Complexity of the tuned implementation (2). Let j = 1, ..., n+1 denote the index of the current extension. Over all phases 2, ..., n+1 the index j never decreases, but it can remain the same at the start of phases 3, ..., n+1. Hence at most 2n extensions are computed explicitly. Similarly to the previous proof (skip/count), the current node depth can be decremented at most 4n times, and thus the total length of all downward traversals is bounded by 5n.

67 Ukkonen's algorithm: Obtaining the true suffix tree. Finally, the implicit suffix tree \(I_{n+1}\) can be converted to the true suffix tree of y[1..n]$ in the following way: all occurrences of the current end position marker e on edge labels are replaced by n+1 (with a simple tree traversal, in time O(n)).

68 Ukkonen's algorithm. Reads a string y of size n from left to right. The algorithm is on-line, i.e. at step \(1 \le i \le n\) it constructs an implicit suffix tree of the prefix y[1..i], which can then be easily converted to the (true) suffix tree by appending a unique symbol $ that has not appeared before. Runs in O(n) time for constant-size alphabets or O(n log n) for general alphabets. Requires O(n) space.

70 Suffix tree - Example. y = abcabxabc$. Edge labels are shown as index pairs (p, q), where e is the global end position. [Figure sequence, slides 70 to 126: the tree after every step; only the step captions and the changing edge labels are recoverable.]
Phase 1: Explicit, Rule 2: leaf 1 is created with edge (1, e).
Phase 2: Implicit extension of leaf 1. Explicit: leaf 2 is created with edge (2, e).
Phase 3: Implicit extensions of leaves 1 and 2. Explicit: leaf 3 is created.
Phase 4: Implicit extensions of leaves 1, 2 and 3. Explicit, Rule 3.
Phase 5: Implicit extensions of leaves 1, 2 and 3. Explicit, Rule 3.
Phase 6: Skip all implicit extensions. Explicit, Rule 2: edge (1, e) is split, leaving (1, 2), and leaf 4 is created. Explicit, Rule 2: edge (2, e) is split, leaving (2, 2). Create suffix link. Explicit, Rule 2.
Phases 7, 8 and 9: Skip all implicit extensions. Explicit, Rule 3.
Phase 10: Skip all implicit extensions. Explicit, Rule 3. Explicit, Rule 2: new edges (3, 3), (4, e) and (10, e). Follow suffix link. Explicit, Rule 2: a second edge (3, 3) and a second leaf edge (10, e). Follow suffix link. Explicit, Rule 2: a third edge (3, 3), with leaf edges (4, e) and (10, e). Create suffix link. Explicit, Rule 2: a further leaf edge (10, e).

127 Application - finding all occurrences of a query. y = abcabxabc$. Query the string a. 1 Find the node to which the string path a leads. 2 Get the leaves of that node. 3 The leaves indicate the starting positions of a. [Figure: suffix tree of y with the edge labels ab, b, c, $, xabc$ and abxabc$.]
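The three query steps can be sketched as follows. To stay short, the sketch builds the suffix tree by naive suffix-by-suffix insertion in quadratic time, not by Ukkonen's algorithm, and stores edge labels as strings rather than index pairs. The class and function names are hypothetical, positions are 1-based as on the slides, and the text is assumed to end with a unique end marker.

```python
class Node:
    def __init__(self, start=None):
        self.children = {}      # first letter -> (edge label, child)
        self.start = start      # starting position of the suffix (leaves only)


def build_suffix_tree(text):
    """Naive O(n^2) construction; assumes text ends with a unique end marker."""
    assert text and text.count(text[-1]) == 1
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while True:
            if rest[0] not in node.children:
                node.children[rest[0]] = (rest, Node(start=i + 1))
                break
            label, child = node.children[rest[0]]
            k = 0                                   # common prefix length
            while k < len(label) and label[k] == rest[k]:
                k += 1
            if k == len(label):                     # edge fully matched: descend
                node, rest = child, rest[k:]
                continue
            mid = Node()                            # mismatch inside the edge: split
            mid.children[label[k]] = (label[k:], child)
            mid.children[rest[k]] = (rest[k:], Node(start=i + 1))
            node.children[rest[0]] = (label[:k], mid)
            break
    return root


def occurrences(root, pattern):
    """Starting positions of pattern in the text, in increasing order."""
    node, rest = root, pattern
    while rest:                                     # step 1: follow the string path
        entry = node.children.get(rest[0])
        if entry is None:
            return []
        label, child = entry
        m = min(len(label), len(rest))
        if label[:m] != rest[:m]:
            return []
        node, rest = child, rest[m:]
    found, stack = [], [node]
    while stack:                                    # steps 2 and 3: collect the leaves
        v = stack.pop()
        if v.start is not None:
            found.append(v.start)
        stack.extend(child for _, child in v.children.values())
    return sorted(found)


tree = build_suffix_tree("abcabxabc$")
print(occurrences(tree, "a"), occurrences(tree, "abc"))
```

Following the string path costs time proportional to the pattern length, and the subtree below the reached position has one leaf per occurrence; the final sort is only there to make the output order deterministic.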

132 Overview. We had a quick look at indexing: preprocessing a given text; efficient querying afterwards. We've seen what suffix trees are and some of their properties: Patricia trees of suffix tries for a string x[1..n]; at most \(2n - 1\) nodes; exactly n leaves. We've seen Ukkonen's algorithm: fairly simple to understand; linear-time construction for constant-size alphabets.

133 Reminder - Next week. Next week's lecture will take place at SR 148, Building 50.34.


More information

Exact Matching Part III: Ukkonen s Algorithm. See Gusfield, Chapter 5 Visualizations from

Exact Matching Part III: Ukkonen s Algorithm. See Gusfield, Chapter 5 Visualizations from Exact Matching Part III: Ukkonen s Algorithm See Gusfield, Chapter 5 Visualizations from http://brenden.github.io/ukkonen-animation/ Goals for Today Understand how suffix links are used in Ukkonen's algorithm

More information

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi. Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data

More information

Ukkonen s suffix tree algorithm

Ukkonen s suffix tree algorithm Ukkonen s suffix tree algorithm Recall McCreight s approach: For i = 1.. n+1, build compressed trie of {x[..n]$ i} Ukkonen s approach: For i = 1.. n+1, build compressed trie of {$ i} Compressed trie of

More information

Given a text file, or several text files, how do we search for a query string?

Given a text file, or several text files, how do we search for a query string? CS 840 Fall 2016 Text Search and Succinct Data Structures: Unit 4 Given a text file, or several text files, how do we search for a query string? Note the query/pattern is not of fixed length, unlike key

More information

Applied Databases. Sebastian Maneth. Lecture 14 Indexed String Search, Suffix Trees. University of Edinburgh - March 9th, 2017

Applied Databases. Sebastian Maneth. Lecture 14 Indexed String Search, Suffix Trees. University of Edinburgh - March 9th, 2017 Applied Databases Lecture 14 Indexed String Search, Suffix Trees Sebastian Maneth University of Edinburgh - March 9th, 2017 2 Recap: Morris-Pratt (1970) Given Pattern P, Text T, find all occurrences of

More information

EE 368. Weeks 5 (Notes)

EE 368. Weeks 5 (Notes) EE 368 Weeks 5 (Notes) 1 Chapter 5: Trees Skip pages 273-281, Section 5.6 - If A is the root of a tree and B is the root of a subtree of that tree, then A is B s parent (or father or mother) and B is A

More information

17 dicembre Luca Bortolussi SUFFIX TREES. From exact to approximate string matching.

17 dicembre Luca Bortolussi SUFFIX TREES. From exact to approximate string matching. 17 dicembre 2003 Luca Bortolussi SUFFIX TREES From exact to approximate string matching. An introduction to string matching String matching is an important branch of algorithmica, and it has applications

More information

Figure 1. The Suffix Trie Representing "BANANAS".

Figure 1. The Suffix Trie Representing BANANAS. The problem Fast String Searching With Suffix Trees: Tutorial by Mark Nelson http://marknelson.us/1996/08/01/suffix-trees/ Matching string sequences is a problem that computer programmers face on a regular

More information

marc skodborg, simon fischer,

marc skodborg, simon fischer, E F F I C I E N T I M P L E M E N TAT I O N S O F S U F - F I X T R E E S marc skodborg, 201206073 simon fischer, 201206049 master s thesis June 2017 Advisor: Christian Nørgaard Storm Pedersen AARHUS AU

More information

11/5/13 Comp 555 Fall

11/5/13 Comp 555 Fall 11/5/13 Comp 555 Fall 2013 1 Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Phenotypes arise from copy-number variations Genomic rearrangements are often associated with repeats Trace

More information

Range Minimum Queries Part Two

Range Minimum Queries Part Two Range Minimum Queries Part Two Recap from Last Time The RMQ Problem The Range Minimum Query (RMQ) problem is the following: Given a fixed array A and two indices i j, what is the smallest element out of

More information

Non-context-Free Languages. CS215, Lecture 5 c

Non-context-Free Languages. CS215, Lecture 5 c Non-context-Free Languages CS215 Lecture 5 c 2007 1 The Pumping Lemma Theorem (Pumping Lemma) Let be context-free There exists a positive integer divided into five pieces Proof for for each and Let and

More information

Chapter 7. Space and Time Tradeoffs. Copyright 2007 Pearson Addison-Wesley. All rights reserved.

Chapter 7. Space and Time Tradeoffs. Copyright 2007 Pearson Addison-Wesley. All rights reserved. Chapter 7 Space and Time Tradeoffs Copyright 2007 Pearson Addison-Wesley. All rights reserved. Space-for-time tradeoffs Two varieties of space-for-time algorithms: input enhancement preprocess the input

More information

Compressed Indexes for Dynamic Text Collections

Compressed Indexes for Dynamic Text Collections Compressed Indexes for Dynamic Text Collections HO-LEUNG CHAN University of Hong Kong and WING-KAI HON National Tsing Hua University and TAK-WAH LAM University of Hong Kong and KUNIHIKO SADAKANE Kyushu

More information

11/5/09 Comp 590/Comp Fall

11/5/09 Comp 590/Comp Fall 11/5/09 Comp 590/Comp 790-90 Fall 2009 1 Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Genomic rearrangements are often associated with repeats Trace evolutionary secrets Many tumors

More information

Data Structure Lecture#10: Binary Trees (Chapter 5) U Kang Seoul National University

Data Structure Lecture#10: Binary Trees (Chapter 5) U Kang Seoul National University Data Structure Lecture#10: Binary Trees (Chapter 5) U Kang Seoul National University U Kang (2016) 1 In This Lecture The concept of binary tree, its terms, and its operations Full binary tree theorem Idea

More information

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet IEICE TRANS. FUNDAMENTALS, VOL.E8??, NO. JANUARY 999 PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet Tetsuo SHIBUYA, SUMMARY The problem of constructing the suffix tree of a tree is

More information

Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.

Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree. Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree. The basic approach is the same as when using a suffix tree,

More information

String Matching Algorithms

String Matching Algorithms String Matching Algorithms Georgy Gimel farb (with basic contributions from M. J. Dinneen, Wikipedia, and web materials by Ch. Charras and Thierry Lecroq, Russ Cox, David Eppstein, etc.) COMPSCI 369 Computational

More information

Analysis of Algorithms

Analysis of Algorithms Analysis of Algorithms Concept Exam Code: 16 All questions are weighted equally. Assume worst case behavior and sufficiently large input sizes unless otherwise specified. Strong induction Consider this

More information

CMSC th Lecture: Graph Theory: Trees.

CMSC th Lecture: Graph Theory: Trees. CMSC 27100 26th Lecture: Graph Theory: Trees. Lecturer: Janos Simon December 2, 2018 1 Trees Definition 1. A tree is an acyclic connected graph. Trees have many nice properties. Theorem 2. The following

More information

March 20/2003 Jayakanth Srinivasan,

March 20/2003 Jayakanth Srinivasan, Definition : A simple graph G = (V, E) consists of V, a nonempty set of vertices, and E, a set of unordered pairs of distinct elements of V called edges. Definition : In a multigraph G = (V, E) two or

More information

Advanced Algorithms: Project

Advanced Algorithms: Project Advanced Algorithms: Project (deadline: May 13th, 2016, 17:00) Alexandre Francisco and Luís Russo Last modified: February 26, 2016 This project considers two different problems described in part I and

More information

Range Minimum Queries Part Two

Range Minimum Queries Part Two Range Minimum Queries Part Two Recap from Last Time The RMQ Problem The Range Minimum Query (RMQ) problem is the following: Given a fied array A and two indices i j, what is the smallest element out of

More information

Foundations of Computer Science Spring Mathematical Preliminaries

Foundations of Computer Science Spring Mathematical Preliminaries Foundations of Computer Science Spring 2017 Equivalence Relation, Recursive Definition, and Mathematical Induction Mathematical Preliminaries Mohammad Ashiqur Rahman Department of Computer Science College

More information

Suffix Vector: A Space-Efficient Suffix Tree Representation

Suffix Vector: A Space-Efficient Suffix Tree Representation Lecture Notes in Computer Science 1 Suffix Vector: A Space-Efficient Suffix Tree Representation Krisztián Monostori 1, Arkady Zaslavsky 1, and István Vajk 2 1 School of Computer Science and Software Engineering,

More information

Algorithms and Theory of Computation. Lecture 7: Priority Queue

Algorithms and Theory of Computation. Lecture 7: Priority Queue Algorithms and Theory of Computation Lecture 7: Priority Queue Xiaohui Bei MAS 714 September 5, 2017 Nanyang Technological University MAS 714 September 5, 2017 1 / 15 Priority Queues Priority Queues Store

More information

Introduction to Suffix Trees

Introduction to Suffix Trees Algorithms on Strings, Trees, and Sequences Dan Gusfield University of California, Davis Cambridge University Press 1997 Introduction to Suffix Trees A suffix tree is a data structure that exposes the

More information

(2,4) Trees. 2/22/2006 (2,4) Trees 1

(2,4) Trees. 2/22/2006 (2,4) Trees 1 (2,4) Trees 9 2 5 7 10 14 2/22/2006 (2,4) Trees 1 Outline and Reading Multi-way search tree ( 10.4.1) Definition Search (2,4) tree ( 10.4.2) Definition Search Insertion Deletion Comparison of dictionary

More information

Suffix Trees and Arrays

Suffix Trees and Arrays Suffix Trees and Arrays Yufei Tao KAIST May 1, 2013 We will discuss the following substring matching problem: Problem (Substring Matching) Let σ be a single string of n characters. Given a query string

More information

COMP4128 Programming Challenges

COMP4128 Programming Challenges Multi- COMP4128 Programming Challenges School of Computer Science and Engineering UNSW Australia Table of Contents 2 Multi- 1 2 Multi- 3 3 Multi- Given two strings, a text T and a pattern P, find the first

More information

Binary search trees. Binary search trees are data structures based on binary trees that support operations on dynamic sets.

Binary search trees. Binary search trees are data structures based on binary trees that support operations on dynamic sets. COMP3600/6466 Algorithms 2018 Lecture 12 1 Binary search trees Reading: Cormen et al, Sections 12.1 to 12.3 Binary search trees are data structures based on binary trees that support operations on dynamic

More information

Final Examination CSE 100 UCSD (Practice)

Final Examination CSE 100 UCSD (Practice) Final Examination UCSD (Practice) RULES: 1. Don t start the exam until the instructor says to. 2. This is a closed-book, closed-notes, no-calculator exam. Don t refer to any materials other than the exam

More information

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018 Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018 Lecture 11 Ana Bove April 26th 2018 Recap: Regular Languages Decision properties of RL: Is it empty? Does it contain this word? Contains

More information

Fast Substring Matching

Fast Substring Matching Fast Substring Matching Andreas Klein 1 2 3 4 5 6 7 8 9 10 Abstract The substring matching problem occurs in several applications. Two of the well-known solutions are the Knuth-Morris-Pratt algorithm (which

More information

Randomized incremental construction. Trapezoidal decomposition: Special sampling idea: Sample all except one item

Randomized incremental construction. Trapezoidal decomposition: Special sampling idea: Sample all except one item Randomized incremental construction Special sampling idea: Sample all except one item hope final addition makes small or no change Method: process items in order average case analysis randomize order to

More information

Introduction to Computers and Programming. Concept Question

Introduction to Computers and Programming. Concept Question Introduction to Computers and Programming Prof. I. K. Lundqvist Lecture 7 April 2 2004 Concept Question G1(V1,E1) A graph G(V, where E) is V1 a finite = {}, nonempty E1 = {} set of G2(V2,E2) vertices and

More information

Priority Queues. 1 Introduction. 2 Naïve Implementations. CSci 335 Software Design and Analysis III Chapter 6 Priority Queues. Prof.

Priority Queues. 1 Introduction. 2 Naïve Implementations. CSci 335 Software Design and Analysis III Chapter 6 Priority Queues. Prof. Priority Queues 1 Introduction Many applications require a special type of queuing in which items are pushed onto the queue by order of arrival, but removed from the queue based on some other priority

More information

Alphabet-Dependent String Searching with Wexponential Search Trees

Alphabet-Dependent String Searching with Wexponential Search Trees Alphabet-Dependent String Searching with Wexponential Search Trees Johannes Fischer and Pawe l Gawrychowski February 15, 2013 arxiv:1302.3347v1 [cs.ds] 14 Feb 2013 Abstract It is widely assumed that O(m

More information

SFU CMPT Lecture: Week 9

SFU CMPT Lecture: Week 9 SFU CMPT-307 2008-2 1 Lecture: Week 9 SFU CMPT-307 2008-2 Lecture: Week 9 Ján Maňuch E-mail: jmanuch@sfu.ca Lecture on July 8, 2008, 5.30pm-8.20pm SFU CMPT-307 2008-2 2 Lecture: Week 9 Binary search trees

More information

DO NOT. In the following, long chains of states with a single child are condensed into an edge showing all the letters along the way.

DO NOT. In the following, long chains of states with a single child are condensed into an edge showing all the letters along the way. CS61B, Fall 2009 Test #3 Solutions P. N. Hilfinger Unless a question says otherwise, time estimates refer to asymptotic bounds (O( ), Ω( ), Θ( )). Always give the simplest bounds possible (O(f(x)) is simpler

More information

Cache-Oblivious String Dictionaries

Cache-Oblivious String Dictionaries Cache-Oblivious String Dictionaries Gerth Stølting Brodal Rolf Fagerberg Abstract We present static cache-oblivious dictionary structures for strings which provide analogues of tries and suffix trees in

More information

University of Waterloo CS240R Fall 2017 Solutions to Review Problems

University of Waterloo CS240R Fall 2017 Solutions to Review Problems University of Waterloo CS240R Fall 2017 Solutions to Review Problems Reminder: Final on Tuesday, December 12 2017 Note: This is a sample of problems designed to help prepare for the final exam. These problems

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)

More information

Module 2: Classical Algorithm Design Techniques

Module 2: Classical Algorithm Design Techniques Module 2: Classical Algorithm Design Techniques Dr. Natarajan Meghanathan Associate Professor of Computer Science Jackson State University Jackson, MS 39217 E-mail: natarajan.meghanathan@jsums.edu Module

More information

An Efficient Algorithm for Identifying the Most Contributory Substring. Ben Stephenson Department of Computer Science University of Western Ontario

An Efficient Algorithm for Identifying the Most Contributory Substring. Ben Stephenson Department of Computer Science University of Western Ontario An Efficient Algorithm for Identifying the Most Contributory Substring Ben Stephenson Department of Computer Science University of Western Ontario Problem Definition Related Problems Applications Algorithm

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Notes on Binary Dumbbell Trees

Notes on Binary Dumbbell Trees Notes on Binary Dumbbell Trees Michiel Smid March 23, 2012 Abstract Dumbbell trees were introduced in [1]. A detailed description of non-binary dumbbell trees appears in Chapter 11 of [3]. These notes

More information

Two Dimensional Dictionary Matching

Two Dimensional Dictionary Matching Two Dimensional Dictionary Matching Amihood Amir Martin Farach Georgia Tech DIMACS September 10, 1992 Abstract Most traditional pattern matching algorithms solve the problem of finding all occurrences

More information

Disjoint-set data structure: Union-Find. Lecture 20

Disjoint-set data structure: Union-Find. Lecture 20 Disjoint-set data structure: Union-Find Lecture 20 Disjoint-set data structure (Union-Find) Problem: Maintain a dynamic collection of pairwise-disjoint sets S = {S 1, S 2,, S r }. Each set S i has one

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Graph Algorithms Using Depth First Search

Graph Algorithms Using Depth First Search Graph Algorithms Using Depth First Search Analysis of Algorithms Week 8, Lecture 1 Prepared by John Reif, Ph.D. Distinguished Professor of Computer Science Duke University Graph Algorithms Using Depth

More information

Problem Set 5 Solutions

Problem Set 5 Solutions Introduction to Algorithms November 4, 2005 Massachusetts Institute of Technology 6.046J/18.410J Professors Erik D. Demaine and Charles E. Leiserson Handout 21 Problem Set 5 Solutions Problem 5-1. Skip

More information

CS2223: Algorithms Sorting Algorithms, Heap Sort, Linear-time sort, Median and Order Statistics

CS2223: Algorithms Sorting Algorithms, Heap Sort, Linear-time sort, Median and Order Statistics CS2223: Algorithms Sorting Algorithms, Heap Sort, Linear-time sort, Median and Order Statistics 1 Sorting 1.1 Problem Statement You are given a sequence of n numbers < a 1, a 2,..., a n >. You need to

More information

MODELING DELTA ENCODING OF COMPRESSED FILES. and. and

MODELING DELTA ENCODING OF COMPRESSED FILES. and. and International Journal of Foundations of Computer Science c World Scientific Publishing Company MODELING DELTA ENCODING OF COMPRESSED FILES SHMUEL T. KLEIN Department of Computer Science, Bar-Ilan University

More information

Binary search trees 3. Binary search trees. Binary search trees 2. Reading: Cormen et al, Sections 12.1 to 12.3

Binary search trees 3. Binary search trees. Binary search trees 2. Reading: Cormen et al, Sections 12.1 to 12.3 Binary search trees Reading: Cormen et al, Sections 12.1 to 12.3 Binary search trees 3 Binary search trees are data structures based on binary trees that support operations on dynamic sets. Each element

More information

Lower Bound on Comparison-based Sorting

Lower Bound on Comparison-based Sorting Lower Bound on Comparison-based Sorting Different sorting algorithms may have different time complexity, how to know whether the running time of an algorithm is best possible? We know of several sorting

More information

Algorithms Dr. Haim Levkowitz

Algorithms Dr. Haim Levkowitz 91.503 Algorithms Dr. Haim Levkowitz Fall 2007 Lecture 4 Tuesday, 25 Sep 2007 Design Patterns for Optimization Problems Greedy Algorithms 1 Greedy Algorithms 2 What is Greedy Algorithm? Similar to dynamic

More information

Algorithms Theory. 15 Text Search (2)

Algorithms Theory. 15 Text Search (2) Algorithms Theory 15 Text Search (2) Construction of suffix trees Prof. Dr. S. Albers Suffix tree t = x a b x a $ 1 2 3 4 5 6 a x a b x a $ 1 $ a x b $ $ 4 3 $ b x a $ 6 5 2 2 : implicit suffix trees Definition:

More information

Suffix-based text indices, construction algorithms, and applications.

Suffix-based text indices, construction algorithms, and applications. Suffix-based text indices, construction algorithms, and applications. F. Franek Computing and Software McMaster University Hamilton, Ontario 2nd CanaDAM Conference Centre de recherches mathématiques in

More information

An undirected graph is a tree if and only of there is a unique simple path between any 2 of its vertices.

An undirected graph is a tree if and only of there is a unique simple path between any 2 of its vertices. Trees Trees form the most widely used subclasses of graphs. In CS, we make extensive use of trees. Trees are useful in organizing and relating data in databases, file systems and other applications. Formal

More information

Suffix Tree and Array

Suffix Tree and Array Suffix Tree and rray 1 Things To Study So far we learned how to find approximate matches the alignments. nd they are difficult. Finding exact matches are much easier. Suffix tree and array are two data

More information

1 The range query problem

1 The range query problem CS268: Geometric Algorithms Handout #12 Design and Analysis Original Handout #12 Stanford University Thursday, 19 May 1994 Original Lecture #12: Thursday, May 19, 1994 Topics: Range Searching with Partition

More information

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g)

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g) Introduction to Algorithms March 11, 2009 Massachusetts Institute of Technology 6.006 Spring 2009 Professors Sivan Toledo and Alan Edelman Quiz 1 Solutions Problem 1. Quiz 1 Solutions Asymptotic orders

More information

Cache-Oblivious String Dictionaries

Cache-Oblivious String Dictionaries Cache-Oblivious String Dictionaries Gerth Stølting Brodal University of Aarhus Joint work with Rolf Fagerberg #"! Outline of Talk Cache-oblivious model Basic cache-oblivious techniques Cache-oblivious

More information

implementing the breadth-first search algorithm implementing the depth-first search algorithm

implementing the breadth-first search algorithm implementing the depth-first search algorithm Graph Traversals 1 Graph Traversals representing graphs adjacency matrices and adjacency lists 2 Implementing the Breadth-First and Depth-First Search Algorithms implementing the breadth-first search algorithm

More information

Search Trees. Undirected graph Directed graph Tree Binary search tree

Search Trees. Undirected graph Directed graph Tree Binary search tree Search Trees Undirected graph Directed graph Tree Binary search tree 1 Binary Search Tree Binary search key property: Let x be a node in a binary search tree. If y is a node in the left subtree of x, then

More information

Hashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong

Hashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong Department of Computer Science and Engineering Chinese University of Hong Kong In this lecture, we will revisit the dictionary search problem, where we want to locate an integer v in a set of size n or

More information

MITOCW watch?v=ninwepprkdq

MITOCW watch?v=ninwepprkdq MITOCW watch?v=ninwepprkdq The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To

More information

University of Waterloo CS240R Fall 2017 Review Problems

University of Waterloo CS240R Fall 2017 Review Problems University of Waterloo CS240R Fall 2017 Review Problems Reminder: Final on Tuesday, December 12 2017 Note: This is a sample of problems designed to help prepare for the final exam. These problems do not

More information

Recursively Defined Functions

Recursively Defined Functions Section 5.3 Recursively Defined Functions Definition: A recursive or inductive definition of a function consists of two steps. BASIS STEP: Specify the value of the function at zero. RECURSIVE STEP: Give

More information

University of Waterloo CS240R Winter 2018 Help Session Problems

University of Waterloo CS240R Winter 2018 Help Session Problems University of Waterloo CS240R Winter 2018 Help Session Problems Reminder: Final on Monday, April 23 2018 Note: This is a sample of problems designed to help prepare for the final exam. These problems do

More information

Dynamic indexes vs. static hierarchies for substring search

Dynamic indexes vs. static hierarchies for substring search Dynamic indexes vs. static hierarchies for substring search Nils Grimsmo 25-6-15 2 Preface This is a master thesis for the Master of Technology program at the Department of Computer and Information Science

More information

Analysis of Algorithms

Analysis of Algorithms Algorithm An algorithm is a procedure or formula for solving a problem, based on conducting a sequence of specified actions. A computer program can be viewed as an elaborate algorithm. In mathematics and

More information

BUNDLED SUFFIX TREES

BUNDLED SUFFIX TREES Motivation BUNDLED SUFFIX TREES Luca Bortolussi 1 Francesco Fabris 2 Alberto Policriti 1 1 Department of Mathematics and Computer Science University of Udine 2 Department of Mathematics and Computer Science

More information

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18 istributed and Paged Suffix Trees for Large Genetic atabases Raphaël Clifford and Marek Sergot raphael@clifford.net, m.sergot@ic.ac.uk Imperial College London, UK istributed and Paged Suffix Trees for

More information

Introduction to Automata Theory. BİL405 - Automata Theory and Formal Languages 1

Introduction to Automata Theory. BİL405 - Automata Theory and Formal Languages 1 Introduction to Automata Theory BİL405 - Automata Theory and Formal Languages 1 Automata, Computability and Complexity Automata, Computability and Complexity are linked by the question: What are the fundamental

More information

Space Efficient Linear Time Construction of

Space Efficient Linear Time Construction of Space Efficient Linear Time Construction of Suffix Arrays Pang Ko and Srinivas Aluru Dept. of Electrical and Computer Engineering 1 Laurence H. Baker Center for Bioinformatics and Biological Statistics

More information

Small-Space 2D Compressed Dictionary Matching

Small-Space 2D Compressed Dictionary Matching Small-Space 2D Compressed Dictionary Matching Shoshana Neuburger 1 and Dina Sokol 2 1 Department of Computer Science, The Graduate Center of the City University of New York, New York, NY, 10016 shoshana@sci.brooklyn.cuny.edu

More information

Binary Heaps in Dynamic Arrays

Binary Heaps in Dynamic Arrays Yufei Tao ITEE University of Queensland We have already learned that the binary heap serves as an efficient implementation of a priority queue. Our previous discussion was based on pointers (for getting

More information