Position Heaps: A Simple and Dynamic Text Indexing Data Structure


Andrzej Ehrenfeucht, Ross M. McConnell, Nissa Osheim, Sung-Whan Woo

Dept. of Computer Science, 430 UCB, University of Colorado at Boulder, Boulder, CO, USA.
Dept. of Computer Science, Colorado State University, Fort Collins, CO, USA

Email addresses: andrzej@cs.colorado.edu (Andrzej Ehrenfeucht), rmm@cs.colostate.edu (Ross M. McConnell), osheim@cs.colostate.edu (Nissa Osheim), woo@cs.colostate.edu (Sung-Whan Woo)

Preprint submitted to Discrete Algorithms, January 9, 0

Abstract

We address the problem of finding the locations of all instances of a string P in a text T, where preprocessing of T is allowed in order to facilitate the queries. Previous data structures for this problem include the suffix tree, the suffix array, and the compact DAWG. We modify a data structure called a sequence tree, which was proposed by Coffman and Eve for hashing [], and adapt it to the new problem. We can then produce a list of k occurrences of any string P in T in O(|P| + k) time. Because of properties shared by suffixes of a text that are not shared by arbitrary hash keys, we can build the structure in O(|T|) time, which is much faster than Coffman and Eve's algorithm. These bounds are as good as those for the suffix tree, suffix array, and the compact DAWG. The advantages are the elementary nature of some of the algorithms for constructing and using the data structure and the asymptotic bounds we can give for updating the data structure when the text is edited.

Keywords: position heap, string searching.

1. Introduction

In this paper, we consider the problem of finding occurrences of a pattern string P in a text T, where preprocessing of T is allowed in order to create a data structure that speeds up the search. In this paper, we let m denote the length |P| of P, n denote the length |T| of T, and k denote the number of positions in T where P occurs as a substring. We assume that the size of the alphabet Σ is fixed.

We describe two data structures, the position heap and the augmented position heap. We gave a primitive version of the position heap in [], and some of the results of this paper were sketched in [], where we described a structure that is closely related to the augmented position heap, a contracted suffix tree. The position heap and the augmented position heap have helpful combinatorial properties that the contracted suffix tree does not. The position heap of T is unique, while a contracted suffix tree is not.

Definition 1.1. Let h(T) be the length of the longest substring X of T that is repeated at least |X| times in T.

A few moments' reflection reveals that h(T) can be expected to be quite small for most practical applications. The expected value of h(T) is O(log n) when T is a randomly generated string. However, since few applications deal with random strings, a more important observation is that long repeated substrings in T have little impact on the value of h(T) unless they are repeated an inordinate number of times. We discuss properties of h(T) in greater detail below.

In this paper, we give the following results:

1. We describe the augmented position heap for the first time. The data structure is a trie with n nodes and height O(h(T)). It is augmented with some additional pointers to facilitate queries.

2. As a starting point for algorithms on the augmented position heap we review extremely simple algorithms for constructing the position heap in O(n h(T)) time, and for querying it in O(min(m^2, m h(T))) time []. Though these worst-case bounds are inferior to bounds we give below, the O(min(m^2, m h(T))) bound for the query algorithm is overly pessimistic in practice. For instance, when T is a randomly constructed string and the construction of P does not depend on T, or if P is a randomly constructed string and construction of T does not depend on P, then this simple query algorithm takes O(m + k) expected time. Because of the simplicity of these algorithms and the expectation that h(T) is usually small in practice, they are probably of practical interest in some contexts, and they are of pedagogical interest, as they can be taught and implemented in undergraduate data structures courses. In [], we also gave a more sophisticated O(n) algorithm for constructing the position heap that we generalize to the augmented position heap here.
3. We show how to get an O(n) bound for constructing the augmented position heap and a simple O(m + k) bound for finding the k occurrences of P in T. For the case where the user may wish to have the option to abandon the query unexpectedly after k′ < k occurrences have been returned, we show how to construct an iterator in O(m) time that then gives occurrences of P in T in left-to-right order of occurrence in O(log k′) time apiece.

4. We show how to adapt the position heap and the dynamic position heap for dynamically changing texts. When a consecutive block of b characters is deleted from T, we show how to update the augmented position heap in O((h(T) + b) h(T) log n) amortized time. When a consecutive block of b characters is inserted to T, yielding a text T′, we show how to update the augmented position heap in O((h(T′) + b) h(T′) log n) amortized time. The operations are based on the sift-up and sift-down operations on standard heaps. The reason for the log n factor and the amortization of the time bound is due to the data structure we use for maintaining the dynamic text, not for updating the augmented position heap. The tradeoff of implementing the position heap to accommodate string edits is that searches take O(m log n + k) amortized time, rather than O(m + k) time.

Previous data structures for this problem include the suffix tree [4], the compact directed acyclic word graph (compact DAWG) [5], and the suffix array [6]. The first two approaches take O(n) time to build the data structure, and O(m + k) time to find the k positions where the pattern string occurs. The suffix array can be constructed in O(n) time, and takes O(m + log n) time to produce a pointer to a list of occurrences of P in T. A slightly slower approach takes O(m log n) time, and this approach is of practical interest because of its simplicity. When the text has low entropy, the FM-index scheme allows searching on a version of the text that is compressed with the Burrows-Wheeler transform [7] with no significant slowdown in the query time [8].

The biggest disadvantage of our structure when compared with suffix arrays is its larger space requirement. Like the suffix tree and the compact DAWG, and unlike the suffix array, our bounds must be increased by a log |Σ| factor when the size of the alphabet, Σ, is introduced as a variable. This factor comes from the time required to find the child of a node on the child edge labeled by a given letter of Σ. This can be improved to O(1) expected time with a hash table that returns the child, given a hash key consisting of the parent and a letter. This is nevertheless also a disadvantage when compared to suffix arrays.

There has been previous work on updating indexing structures when the text is edited. The generalized suffix tree allows search for a pattern string in a collection of texts.
In [9], it is shown that it is possible to implement it to allow insertion and removal of any text X in the collection in O(|X|) time, and in [8], it is shown how to do this on a collection of Burrows-Wheeler compressed texts in near-linear time. However, X must be inserted or removed in its entirety and arbitrary edits on X are not supported.

The results most comparable to ours have been given recently by Salson et al. [0, ]. They have given an approach that takes O(n) worst-case time to modify the Burrows-Wheeler transform and the suffix array after an arbitrary edit operation on T [0, ]. Though, in the worst case, this is as bad as the cost of discarding the suffix array and rebuilding it from the beginning, they argue that their approach is much more efficient in practice, and support this with empirical studies on benchmarks.

Note that h(T) is also Θ(n) in the worst case, such as when T = a^n. However, our bounds are stronger than O(n) because we can always rebuild the data structure from scratch in O(n) time if O((h(T) + b) h(T) log n) exceeds this cost, and it characterizes analytically the relationship between the running time and an easily understood property of the text.

Salson et al. identify a^n as a text that requires Θ(n) time for their algorithm also. However, the performance of their algorithms can suffer greatly from a single repetition of a large string in T. An illustration of this phenomenon is where W is a string on Σ, $ is a special character they append to T that is less than any letter in Σ in lexical order, and T is the concatenation WW$. Suppose # is a character that is larger than any character in Σ in lexical order. Let T′ = WW#$. Then Θ(n) changes to the suffix array of T are required to obtain the suffix array of T′. By contrast, since h(T) = O(h(W)) in this case, it takes O((h(W) + 1) h(W) log n) amortized time to update our data structure after this same edit operation. This is seen by the following lemma, which gives an overly pessimistic upper bound on h(T) in terms of h(W).

Lemma 1.2. If T = WW, then h(T) ≤ 2h(W).

Proof: Let X be a string of length h(T) that occurs h(T) times in T. For each of these occurrences, either the first h(T)/2 characters of the occurrence lie in the first occurrence of W, or the last h(T)/2 characters lie in the second occurrence of W. By the pigeonhole principle, one of these is a string of length h(T)/2 that occurs h(T)/2 times in W, giving a lower bound of h(T)/2 on h(W).

Implementations of algorithms and data structures given in this paper can be found at

2. Preliminaries

Let λ be the null string. If X = x_1 x_2 … x_j is a string, we let |X| denote the length j of X. The reverse of X is the string X^R = x_j x_{j-1} … x_1. For reasons that will become clear shortly, we adopt the convention of numbering the positions of the text T from right to left, so T = t_n t_{n-1} … t_1. Let T_i denote the suffix t_i t_{i-1} … t_1 beginning at position i. Let us distinguish a substring P = p_1 p_2 … p_m of T from an instance i of P in T, where P = t_i t_{i-1} … t_{i-m+1}. The null substring, λ, is considered to occur at every position.

If X and Y are strings, we denote their concatenation by XY. If X is a prefix of P, we let P − X denote the suffix of P consisting of the last |P| − |X| letters of P. If X is a suffix of P, we let P \ X denote the prefix of P consisting of the first |P| − |X| letters of P.

Definition 2.1. A rooted tree has the heap property if each node carries a label from an ordered set, such as the integers, and, for every internal node X, the labels of the children of X are greater than the label of X.

Definition 2.2. A trie on alphabet Σ denotes a rooted tree T with the following properties:

1. Each edge is labeled with a character;
2. For each node u and letter a ∈ Σ, there is at most one edge with label a from u to a child of u.
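Returning briefly to the quantity h(T) defined in the Introduction, the following brute-force sketch may help make it concrete for small texts. It is only an illustration, not one of the paper's algorithms: the function names `count_occurrences` and `h` are our own, overlapping occurrences are counted (an assumption consistent with the remark that h(T) is Θ(n) for T = a^n), and the running time is far from efficient.

```python
def count_occurrences(text: str, sub: str) -> int:
    """Count occurrences of sub in text, allowing overlaps."""
    return sum(1 for i in range(len(text) - len(sub) + 1)
               if text.startswith(sub, i))


def h(text: str) -> int:
    """Brute-force h(T): the largest L such that some substring of
    length L occurs at least L times in text (0 for the empty text)."""
    best = 0
    for length in range(1, len(text) + 1):
        candidates = {text[i:i + length]
                      for i in range(len(text) - length + 1)}
        if any(count_occurrences(text, s) >= length for s in candidates):
            best = length
    return best
```

For instance, `h("abab")` is 2 because "ab" occurs twice while no substring of length 3 occurs three times, which agrees with the bound h(WW) ≤ 2h(W) for W = "ab".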

Figure 1: The sequence hash tree of a sequence of strings. We refer to each node by the string of letter labels on the path from the root to the node. For example, the node labeled 6 can be thought of as synonymous with the string . Each string in the sequence is installed at a new node that is the shortest prefix of the string that isn't already a node of the sequence hash tree. These prefixes are underlined. For example, when string 9 is inserted, its prefix is already a node of the tree, but its prefix is not, so a pointer to string 9 is inserted at a new node, .

Given a trie, let us say that the label of a path from the root to a node u is the string given by the sequence X of characters that occur on edges of the path. This is the path label of u. Because of the second property, the path label uniquely identifies u. We therefore adopt the convention of treating the node and its path label as interchangeable objects. For example, we may consider whether a string X is a node of the trie, or whether one node is a substring of another. Note that one node is a prefix of another if and only if it is an ancestor in the trie.

A basic operation on a trie takes an input string P = p_1 p_2 … p_m and finds the largest prefix P′ of P that is a node of the trie. Since |Σ| is fixed, this is easily accomplished in O(|P′|) time by starting at the root and iteratively taking edges labeled with the sequence of letters from P, until P is exhausted or a node is encountered that doesn't have a child on the next letter of P. Let us call this operation indexing into the trie.

3. Sequence Hash Trees

A data structure of Coffman and Eve [], called a sequence hash tree, was designed for the problem of implementing hash tables (dictionaries) whose keys are strings. It consists of a trie for indexing into the table. The structure of the tree depends on the order in which the strings are inserted. We describe a minor variant that is easier to adapt to our substring matching problem, below.

Let S = (S_1, S_2, …, S_n) be a given ordering of the strings. Without loss of generality for our purposes, we may assume that no string in S is a prefix of any other.
The trie H_n that they construct is defined by induction, as follows. If i = 1, the trie H_1 is just a root node with a pointer to S_1. If i > 1, then H_i is obtained from H_{i-1} by finding the shortest prefix Xa of S_i that is not already a node of the trie. A new node Xa is added as the child of node X on an edge labeled a, and a pointer is installed from it to S_i.
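The inductive definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than Coffman and Eve's original code: the names `SHTNode`, `build_sequence_hash_tree`, and `lookup` are hypothetical, children are kept in a dictionary, the input list is assumed nonempty, and no string is assumed to be a prefix of another.

```python
from typing import Dict, List, Optional


class SHTNode:
    def __init__(self, key_index: int):
        self.key_index = key_index            # index of the string stored here
        self.children: Dict[str, "SHTNode"] = {}


def build_sequence_hash_tree(strings: List[str]) -> SHTNode:
    """Insert the strings in the given order (assumes a nonempty list)."""
    root = SHTNode(0)                         # H_1: the root points to S_1
    for idx in range(1, len(strings)):
        s, node, depth = strings[idx], root, 0
        # Index into the trie as far as possible along s.
        while depth < len(s) and s[depth] in node.children:
            node = node.children[s[depth]]
            depth += 1
        if depth == len(s):
            raise ValueError("string is a prefix of an earlier string")
        # s[:depth + 1] is the shortest prefix of s that is not yet a node.
        node.children[s[depth]] = SHTNode(idx)
    return root


def lookup(root: SHTNode, strings: List[str], key: str) -> Optional[int]:
    """Return the index of key in strings, or None if it was never inserted."""
    node, depth = root, 0
    while True:
        if strings[node.key_index] == key:
            return node.key_index
        if depth == len(key) or key[depth] not in node.children:
            return None
        node = node.children[key[depth]]
        depth += 1
```

A lookup walks the path spelled by the key and compares the key with the string stored at each node it passes, since a string is always installed at one of its own prefixes.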

Figure 2: Incremental construction of the position heap. Suffixes T_1, T_2, …, T_n are inserted in ascending order of length. The figure depicts the insertion of T_i when i = . Indexing into the heap on T_i identifies the longest prefix () of T_i that is already a node Y of the heap. The shortest prefix of T_i that is not already a node of the heap () is inserted as a child of Y and labeled with position i.

Figure 1 gives an example. Coffman and Eve's paper has received little attention since it was published in 1970, due, no doubt, to the existence of superior ways of implementing a hash table.

In the present paper, we show that this data structure is much richer when considered in the context of the new problem. The structure of the set of suffixes of a text T allows us to derive interesting and algorithmically useful properties that do not apply in the general case addressed by Coffman and Eve. In particular, we show that it has height at most 2h(T), and show that if the suffixes are inserted in ascending order of length, it is now possible to build the data structure in time that is linear in n = |T|, that is, in O(1) time, amortized, per hash key. We show how the tree can be augmented with maximal-reach pointers so that finding all k entries that have P as a prefix takes O(m + k) worst-case time, independently of the height of the tree.

4. The Position Heap

Up until the last two sections of this paper, we assume that T is static. We can therefore suppose that T and P are stored in character arrays, which supports lookup of the character in a given position i in O(1) time.

Definition 4.1. The position heap H(T) of a text T is obtained by iteratively inserting the suffixes (T_1, T_2, …, T_n) of T, in ascending order of length, into Coffman and Eve's data structure using their insertion operation. That is, T_i is inserted by creating a new node that is the shortest prefix of T_i that is not already a node of the tree, and labeling it with position i.

Let us call the algorithm implied by this constructive definition the naive construction algorithm. Figure 2 gives an illustration. Coffman and Eve assume that each inserted string ends with a special character $, because one must ensure that no inserted string is already a node of the tree when it is inserted. The use of a special character to accomplish this is unnecessary in our context, since each string T_i is longer than any string inserted before it, and each node previously inserted is a prefix of some T_j for j < i.

The construction can be executed for any text T, and, since it is deterministic, the position heap H(T) for a text is unique. The algorithm is simple enough to be taught and programmed in undergraduate data structures classes.

4.1. A time bound for constructing the position heap

We now give a time bound for using the above constructive definition of the position heap as an algorithm. We improve the time bound to O(n) below, at the expense of adding elements to the data structure.

Lemma 4.2. The height of the position heap of a text T is at most 2h(T).

Proof: Let X = x_j x_{j-1} … x_1 be a deepest leaf of the tree. Let X_i denote the prefix x_j x_{j-1} … x_i of X. For each i from 1 through j, X_i occurs at least i times in T because it has at least i descendants, {X_i, X_{i-1}, …, X_1}, and each of these contains an occurrence of a substring of which X_i is a prefix. Therefore, X_{j/2} has length j/2 and occurs at least j/2 times in T. It must be that j/2 is a lower bound on h(T), so the height j is bounded by 2h(T) + 1.

Corollary 4.3. The naive construction algorithm takes O(n h(T)) time.

Proof: Indexing into the heap to find the parent of the new node to be inserted for position i takes time that is bounded by the height of the heap, hence O(h(T)) time. Adding the new child takes O(1) time. Summing this over all positions gives an O(n h(T)) bound.

5. The Naive Query Algorithm

We now give a time bound for querying the position heap. We improve the time bound below, at the expense of adding elements to the data structure.

Definition 5.1. The naive query algorithm for finding all occurrences of a pattern string P in T consists of the following steps. Index into the position heap to find the longest prefix X of P that is a node of H(T).
For each ancestor X′ of X (including X), look up the position i stored in X′. Position i is an occurrence of X′. Determine whether this occurrence is followed by P − X′. If it is, report i as an occurrence of P. If X = P, also report all positions stored at descendants of X.

Figure 3 gives an example. This algorithm is also simple enough to be taught and programmed in undergraduate data structures classes.
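As a rough illustration of how little code the naive construction and the naive query need, here is a minimal Python sketch. The names `Node`, `build_position_heap`, and `naive_query` are our own, children are stored in a dictionary, the text is assumed nonempty, and the root is assumed to hold position 1, following the inductive definition in which the first inserted string is installed at the root. Positions are numbered from right to left, so position i corresponds to 0-based index n − i.

```python
from typing import Dict, List


class Node:
    def __init__(self, pos: int):
        self.pos = pos                        # position label; 1 is the rightmost character
        self.children: Dict[str, "Node"] = {}


def build_position_heap(text: str) -> Node:
    """Naive construction for a nonempty text: insert T_1, ..., T_n in order."""
    n = len(text)
    root = Node(1)                            # assumption: the root holds position 1
    for i in range(2, n + 1):
        node, k = root, n - i                 # T_i starts at 0-based index n - i
        while text[k] in node.children:       # index into the heap along T_i
            node = node.children[text[k]]
            k += 1
        node.children[text[k]] = Node(i)      # shortest prefix of T_i not yet a node
    return root


def naive_query(root: Node, text: str, pattern: str) -> List[int]:
    """Return the positions (numbered right to left) where pattern occurs."""
    n = len(text)
    path, node = [root], root
    for ch in pattern:                        # index into the heap on the pattern
        if ch not in node.children:
            break
        node = node.children[ch]
        path.append(node)
    # Inspect the text at the position stored in each node on the path.
    hits = [x.pos for x in path if text.startswith(pattern, n - x.pos)]
    if len(path) == len(pattern) + 1:         # the whole pattern is a node
        stack = list(node.children.values())
        while stack:                          # every descendant is an occurrence
            d = stack.pop()
            hits.append(d.pos)
            stack.extend(d.children.values())
    return sorted(hits)
```

For the text "abaab", `naive_query(build_position_heap("abaab"), "abaab", "ab")` returns `[2, 5]`: the pattern starts at 0-based indices 3 and 0, which are positions 2 and 5 in the right-to-left numbering.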

Figure 3: The naive query algorithm. To find the occurrences of , index in on to the node labeled 6. All positions at descendants of this node ({6,, 9}) are occurrences of . In addition, some ancestors can be occurrences of . This is determined by inspection at the positions in ancestors, whereupon it is determined that is also an occurrence. A string such as that is not a node of the position heap is handled slightly differently. Index in on the longest prefix that is a node of the string, in this case. Only the positions {,, 6, 9} in ancestors of this node can be occurrences. Which ones are occurrences is determined by inspection at these positions, whereupon it is determined that 9 is the only occurrence.

Lemma 5.2. The naive query algorithm is correct.

Proof: A node X′ contains a position i where X′ occurs in T. If X′ is a prefix of P, then it is an ancestor of X, and i may or may not be an occurrence of P in T, depending on whether the occurrence of X′ at i is followed by P − X′. The test for this condition returns i if and only if it is an occurrence of P. If P is a prefix of X′, then X = P, and since all prefixes of X′ occur at position i, so does P. This is reported during the traversal of the subtree rooted at P. If the longest common prefix Y of P and X′ is neither P nor X′, then the occurrence of Y at i is followed by the first letter of X′ − Y, which is not the first letter of P − Y. Therefore, i is not an occurrence of P. The query does not report i in this case.

Lemma 5.3. The naive query algorithm runs in O(min(m^2, m h(T)) + k) time.

Proof: If X is the longest prefix of P that is a node of the heap, it takes O(|X|) time to find X by indexing into the heap on P. For each of the |X| + 1 ancestors of X, we must look up the position i stored in the ancestor, and determine in O(m) time whether P occurs at position i. Since |X| ≤ m, this gives an O(m^2) bound for this step. Since |X| is O(h(T)), this also gives an O(m h(T)) bound for this part.

If X = P, that is, if P is a node of the position heap, it also takes O(1) time to return each of the positions in the subtree rooted at X, for a total of O(m^2 + k) and O(m h(T) + k).

Lemma 5.4. If T is a randomly constructed string and the construction of P does not depend on T, or if P is a randomly constructed string and construction of T does not depend on P, then the naive query algorithm takes O(m + k) expected time, where m and k are as in Lemma 5.3.

Proof: The m h(T) term comes from the fact that at each of O(h(T)) nodes X′, we must check whether the occurrence of X′ at the position i that it stores is followed by P − X′. This requires checking whether |P − X′| letters of P match at |P − X′| positions of T. The check halts when a mismatch is detected. The probability of any of the positions matching is 1/|Σ|, so the expected number of checks before halting is ((|Σ| − 1)/|Σ|) Σ_{i=1}^{|P − X′|} 1/|Σ|^i = O(1).

6. The Augmented Position Heap

The only obstacle to an O(m + k) worst-case bound for returning the k occurrences of P is the time to check whether P occurs at the positions stored at ancestors of the largest prefix X of P that is a node of the position heap.

Definition 6.1. Let i be the position stored at node X in H(T), and let Y be the largest prefix of T_i that is a node of H(T). The maximal-reach pointer for X is a pointer from node X to node Y. The augmented position heap for T is obtained by labeling each node X of H(T) with its maximal-reach pointer and X's discovery and finishing time in a depth-first traversal of H(T) []. We also associate with the heap an array N[] such that N[i] contains a pointer to the node of the heap that contains position i. Let H′(T) denote the augmented position heap.

Figure 4 gives an example. The N[] array and the discovery and finishing times are omitted. The maximal-reach pointers are depicted with dashed arrows. For example, the maximal-reach pointer from the node labeled 4 points to the node labeled 7, since is the longest substring beginning at position 4 that is a node of the position heap.

A naive algorithm for obtaining the augmented position heap is obtained as follows. Create the position heap for T.
The pointers can be installed in N[] and the discovery and finishing times can be assigned to nodes of the heap during a depth-first traversal of H(T). Then for each suffix T_i of T, index as far as possible into H(T) on T_i to find the maximal node Y that is a prefix of T_i. Install a maximal-reach pointer to Y from the node X pointed to by N[i].
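The naive augmentation just described might look as follows in Python. This is a sketch under our own naming (`Node`, `walk`, `build_augmented`), assuming a nonempty text, dictionary-based children, a root that holds position 1, and an iterative depth-first traversal in place of recursion; none of these choices is prescribed by the text.

```python
from typing import Dict, List, Tuple


class Node:
    def __init__(self, pos: int):
        self.pos = pos
        self.children: Dict[str, "Node"] = {}
        self.disc = self.fin = 0              # DFS discovery and finishing times
        self.maxreach: "Node" = self          # maximal-reach pointer


def walk(root: Node, text: str, start: int) -> Tuple[Node, int]:
    """Index into the heap along text[start:]; return the deepest node
    reached and the index of the first character not consumed."""
    node, k = root, start
    while k < len(text) and text[k] in node.children:
        node = node.children[text[k]]
        k += 1
    return node, k


def build_augmented(text: str) -> Tuple[Node, List[Node]]:
    n = len(text)
    root = Node(1)
    N: List[Node] = [root, root]              # N[i]: node holding position i (N[0] unused)
    for i in range(2, n + 1):                 # naive construction of H(T)
        parent, k = walk(root, text, n - i)
        child = parent.children[text[k]] = Node(i)
        N.append(child)
    clock, stack = 0, [(root, False)]         # iterative depth-first traversal
    while stack:
        node, finished = stack.pop()
        clock += 1
        if finished:
            node.fin = clock
        else:
            node.disc = clock
            stack.append((node, True))
            stack.extend((c, False) for c in node.children.values())
    for i in range(1, n + 1):                 # maximal-reach pointers
        N[i].maxreach, _ = walk(root, text, n - i)
    return root, N
```

For the text "abaab", the node holding position 1 is the root, and its maximal-reach pointer leads to the node "b", because T_1 = "b" is itself a node of the heap.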

6.1. Queries in O(m + k) time

It is well-known that a node x of a rooted tree is an ancestor of a node y if and only if the discovery time of x is less than the discovery time of y and the finishing time of x is larger than the finishing time of y []. This gives the following:

Lemma 6.2. Given pointers to two nodes X and Y of H′(T), it takes O(1) time to determine whether X is an ancestor of Y.

Lemma 6.3. Given a pointer to a node X of H′(T) and a position i, it takes O(1) time to determine whether i is an occurrence of string X in T.

Proof: It takes O(1) time to find the node Y that contains i using N[]. By Lemma 6.2, it takes O(1) time to determine whether Y is a descendant of X. If so, then since i is an occurrence of Y and X is a prefix of Y, i is an occurrence of X. If not, it takes O(1) time to determine whether Y is an ancestor of X, by Lemma 6.2. If it is, then let Z be the node pointed to by the maximal-reach pointer of Y. Position i is an occurrence of X if and only if X is a prefix of T_i. Z is the maximal prefix of T_i that is a node of the heap. Therefore, X occurs at position i if and only if it is a (not necessarily proper) prefix of Z, that is, if and only if Z is a descendant of X. This takes O(1) time to determine, by Lemma 6.2.

For example, in Figure 4, given a pointer to the node (the one labeled 6), we can tell that is a descendant by looking in N[] to find a pointer to its node, and using the discovery and finishing times of and to determine that is a descendant. Therefore, it is an occurrence. We can tell that is an occurrence by looking in N[] to find its node, using the discovery/finishing times to find that it is an ancestor of node , using its maximal-reach pointer to find the node , and using the discovery/finishing times of and to determine that is a descendant of . We can tell that is not an occurrence, because its maximal-reach pointer doesn't point to a descendant of .

Corollary 6.4. Let Xc be a string such that X is a node of the tree and Xc is not. Given a pointer to X and a position j, it takes O(1) time to determine whether j is an occurrence of Xc.
Proof: By Lemma 6.3, it takes O(1) time to determine whether j is an occurrence of X. If it is, then it is an occurrence of Xc if c occurs at position j − |X|, which takes O(1) time to check when T is stored in an array.

Before giving pseudocode for the linear-time query algorithm, we illustrate the main ideas in Figure 4. There are two cases: Case 1, where the search string is a node of the position heap, and Case 2, where it is not.

Case 1 is illustrated by , which is the node labeled 6. By Lemma 6.3, we can now check in O(1) time apiece which of the positions {,} at proper ancestors are occurrences of . Only is; its node is the only proper ancestor with a maximal-reach pointer into 's subtree. That is O(m) time so far. In addition, we report the labels of descendants {6,,9} in O(1) time apiece, as before, in O(k) time, for a total of O(m + k) time.

Case 2 is illustrated by , which is not a node of the heap. Our strategy is to partition the string into segments , , and , which can be handled efficiently by Corollary 6.4 and Lemma 6.3. We use the corollary to find the occurrences of , and discard those that are not followed by . This gives the occurrences of . We then use the lemma to discard from these occurrences those that are not followed by .

To apply the corollary, we want all the segments except the last to be of the form Xc, where X is a node of the tree and Xc is not. The first such segment is . This is our current substring. As in the naive query algorithm, only ancestors of X = can be positions of . These are labeled {,,6,9}. By Corollary 6.4, we can determine which are occurrences of the current substring in O(1) time apiece, for a total of O(|Xc|) time. This leaves positions {6,9}. The string becomes the finished prefix, its positions {6,9} are known, and the rest of the query string, , is the remaining suffix.

We now look for the prefix of the remaining suffix of the form Xc, where X is a node of the heap and Xc is not. This is . We want to find which occurrences of Xc follow occurrences of the finished prefix. To do this, we subtract the length of the finished prefix from each of the positions of the finished prefix and determine in O(1) time whether it is an occurrence of Xc, by Corollary 6.4. In the example, subtracting = 4 from 9 gives 5, and we determine that 5 is an occurrence of . Therefore, 9 is an occurrence of . Subtracting 4 from 6 gives , and we determine that is not an occurrence of . Therefore, of the initial possible positions of the search string, {6,9}, only 9 survives the test. The finished prefix is now , the positions where it occurs are known to be {9}, and the remaining suffix is .

When the remaining suffix is short enough to be a node of the tree, let us denote it Y. (In the example, Y = .)
We subtract the length of the finished prefix from each of its occurrences ({9} for this example), and check whether each of these positions ({} in this example) is an occurrence of Y, using Lemma 6.3. Since position is an occurrence of Y = , position 9 is an occurrence of the original search string.

Generalizing from these examples, we get the algorithm of Table 1.

Lemma 6.5. The linear query algorithm is correct.

Proof: For Case 1, the procedure is the same as the naive algorithm, except that at each ancestor X of P, we determine whether i is an occurrence of P in O(1) time, instead of O(|P|) time, using Lemma 6.3.

For Case 2, by induction on the number of times FinishedPrefix is assigned, I is the set of positions where FinishedPrefix occurs in T. In the final line, P = FinishedPrefix + RemainingSuffix, and I is assigned to be those positions of FinishedPrefix that are followed by RemainingSuffix. After the final step, FinishedPrefix = P, hence I is the set of positions in T where P occurs.

Table 1: The linear query algorithm for use with the augmented position heap

Case 1: P is a node of H′(T). This is detected by indexing into H′(T) on P, and gives a node P.
    For each proper ancestor X of P, look up the position i stored at X, and determine whether it is an occurrence of P.
    In addition, report all positions recorded in the subtree rooted at P.

Case 2: P is not a node of H′(T).
    // Find an initial set of candidate positions
    Let CurrentSubstring be the shortest prefix of P that is not a node of H′(T)
    Let I be the set of positions where CurrentSubstring occurs
    // Invariants: FinishedPrefix + RemainingSuffix = P;
    // I is the set of positions where FinishedPrefix occurs in T
    FinishedPrefix = CurrentSubstring
    RemainingSuffix = P - CurrentSubstring
    while RemainingSuffix is not a node of H′(T)
        Let CurrentSubstring be the shortest prefix of RemainingSuffix that is not a node of H′(T)
        I := {j | j ∈ I and the occurrence of FinishedPrefix at j in T is followed by an occurrence of CurrentSubstring}
        RemainingSuffix = RemainingSuffix - CurrentSubstring
        FinishedPrefix = FinishedPrefix + CurrentSubstring
    CurrentSubstring = RemainingSuffix
    Let I := {j | j ∈ I and the occurrence of FinishedPrefix at j is followed by CurrentSubstring}
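To make the table concrete, the following Python sketch builds an augmented position heap naively and then runs the two cases of the query. It is one possible reading of the pseudocode under stated assumptions: all names (`Node`, `walk`, `AugmentedPositionHeap`, `occurs`, `occurs_xc`, `query`) are ours, the text is nonempty, the root holds position 1, and candidate sets are plain Python lists. The construction is the naive one, so only the query reflects the O(m + k) idea.

```python
from typing import Dict, List


class Node:
    def __init__(self, pos: int, depth: int):
        self.pos, self.depth = pos, depth
        self.children: Dict[str, "Node"] = {}
        self.disc = self.fin = 0
        self.maxreach: "Node" = self


def walk(root: Node, s: str, start: int) -> List[Node]:
    """Index along s[start:]; return the path of nodes visited, root first."""
    path = [root]
    for k in range(start, len(s)):
        nxt = path[-1].children.get(s[k])
        if nxt is None:
            break
        path.append(nxt)
    return path


class AugmentedPositionHeap:
    def __init__(self, text: str):
        self.text, n = text, len(text)
        self.root = Node(1, 0)
        self.N = [self.root, self.root]       # N[i]: node holding position i (N[0] unused)
        for i in range(2, n + 1):             # naive construction
            parent = walk(self.root, text, n - i)[-1]
            child = Node(i, parent.depth + 1)
            parent.children[text[n - i + parent.depth]] = child
            self.N.append(child)
        clock, stack = 0, [(self.root, False)]
        while stack:                          # DFS discovery and finishing times
            node, finished = stack.pop()
            clock += 1
            if finished:
                node.fin = clock
            else:
                node.disc = clock
                stack.append((node, True))
                stack.extend((c, False) for c in node.children.values())
        for i in range(1, n + 1):             # maximal-reach pointers
            self.N[i].maxreach = walk(self.root, text, n - i)[-1]

    @staticmethod
    def is_ancestor(x: Node, y: Node) -> bool:
        return x.disc <= y.disc and y.fin <= x.fin   # not necessarily proper

    def occurs(self, x: Node, i: int) -> bool:
        """Constant-time test: does the string of node x occur at position i?"""
        if not 1 <= i <= len(self.text):
            return False
        y = self.N[i]
        if self.is_ancestor(x, y):
            return True
        return self.is_ancestor(y, x) and self.is_ancestor(x, y.maxreach)

    def occurs_xc(self, x: Node, c: str, j: int) -> bool:
        """Does x followed by character c occur at position j?"""
        after = j - x.depth                   # position of the character after x
        return (self.occurs(x, j) and after >= 1
                and self.text[len(self.text) - after] == c)

    def query(self, pattern: str) -> List[int]:
        path, m = walk(self.root, pattern, 0), len(pattern)
        x = path[-1]
        if x.depth == m:                      # Case 1: the pattern is a node
            hits = [a.pos for a in path[:-1] if self.occurs(x, a.pos)]
            stack = [x]
            while stack:
                d = stack.pop()
                hits.append(d.pos)
                stack.extend(d.children.values())
            return sorted(hits)
        # Case 2: peel off segments Xc, where X is a node and Xc is not.
        cands = [a.pos for a in path if self.occurs_xc(x, pattern[x.depth], a.pos)]
        done = x.depth + 1                    # length of FinishedPrefix
        while True:
            x = walk(self.root, pattern, done)[-1]
            if done + x.depth == m:           # RemainingSuffix is a node
                break
            c = pattern[done + x.depth]
            cands = [j for j in cands if self.occurs_xc(x, c, j - done)]
            done += x.depth + 1
        if done < m:
            cands = [j for j in cands if self.occurs(x, j - done)]
        return sorted(cands)
```

As a usage example, `AugmentedPositionHeap("abaab").query("baab")` returns `[4]`: the segment "ba" is found at position 4, and the remaining suffix "ab" is confirmed at position 4 − 2 = 2 through the maximal-reach pointer of the node holding position 2.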

Figure 4: The linear query algorithm on strings and on the augmented position heap. Maximal-reach pointers are dashed, and maximal-reach pointers that are loops are omitted from the diagram.

Lemma 6.6. The linear query algorithm can be implemented in O(m + k) time using the augmented position heap.

Proof: Case 1 differs from the naive approach only in that it uses Lemma 6.3 to determine which ancestors of P contain the position of an occurrence of P, reducing each of these tests from O(|P|) to O(1). Since there are |P| + 1 ancestors of P, this takes O(|P|) time. As in the naive query algorithm, all other occurrences of P are found in O(1) time apiece during a traversal of the subtree rooted at P, for a total of O(|P| + k) time.

For Case 2, let (P_1, P_2, …, P_l) be the values taken on by CurrentSubstring, and let (I_1, I_2, …, I_l) be the values taken on by I. To find the i-th value P_i of CurrentSubstring, index as far as possible on RemainingSuffix into H′(T), yielding a node X_i, and let a be the next character of RemainingSuffix following prefix X_i. P_i = X_i a. Over all iterations, this takes time proportional to Σ_{i=1}^{l} |P_i| = O(|P|).

For 2 ≤ i ≤ l, and each j ∈ I_{i-1}, it takes O(1) time to determine whether the instance of FinishedPrefix at position j is followed by X_i; this is determined by finding whether j − |FinishedPrefix| is an occurrence of X_i, using Lemma 6.3. It then takes O(1) time to determine whether this occurrence of X_i is followed by an occurrence of a at position j − |FinishedPrefix| − |X_i|. This determines whether the occurrence of FinishedPrefix at position j is followed by X_i a = P_i. Therefore, it takes O(1) time to determine, for each element of I_{i-1}, whether it remains in I_i.

By the naive algorithm, each P_i has O(|P_i|) occurrences, because P_i is not a node of the tree, hence its occurrences can only be recorded in ancestors (prefixes) of X_i. Therefore, |I_i| = O(|P_i|). Determining I_l therefore takes O(Σ_{i=1}^{l} |I_i|) = O(Σ_{i=1}^{l} |P_i|) = O(|P|) time.

6.2. Returning positions one-by-one in left-to-right order

It is sometimes claimed that the suffix array returns all k occurrences of P in O(m + log n) time, even though k can be superlinear in this bound. The reason is that it gives a pointer to a list of the positions. This time bound captures the fact that if the user wants to examine the first k′ positions, this takes O(k′) rather than Θ(k) time. One way to view this is that it returns an iterator in O(m + log n) time that then takes O(1) time per position to return the positions. The position heap can be implemented to have this property also, using a depth-first search that maintains a stack of active calls that have not yet made a recursive call on their last child.

One use of such an iterator, however, is to examine the first k′ positions in left-to-right order. This is a common operation in text editors, for example. This can be implemented in O(log k) worst-case time per element, due to the fact that the node labels have the heap property (Definition 2.1). We illustrate how to produce an iterator that returns them in right-to-left order; left-to-right order can be obtained by building the augmented position heap for the reverse of the text.

The positions of nodes on the indexing path X take O(1) time each to check. If P = X, then the descendants of X might also have to be returned in left-to-right order. Keep a priority queue on the topmost nodes of the subtree of X whose positions have not yet been returned. Because the positions have the heap property (Definition 2.1), the minimum position is among these nodes. Initially the priority queue has X in it. Each time a new position is asked for, the minimum index i in the priority queue is returned, and the positions in the children of the node containing i are inserted to the priority queue.
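The priority-queue idea can be sketched with Python's `heapq` module as a generator that yields positions in increasing order, which is right-to-left order in the text. This is a simplified illustration with hypothetical names (`Node`, `build_position_heap`, `occurrences_by_position`): it uses the naive construction and the naive character-by-character check on the indexing path rather than the constant-time test of the augmented heap, and it assumes a nonempty text whose root holds position 1.

```python
import heapq
from typing import Dict, Iterator


class Node:
    def __init__(self, pos: int):
        self.pos = pos
        self.children: Dict[str, "Node"] = {}


def build_position_heap(text: str) -> Node:
    """Naive construction for a nonempty text."""
    n = len(text)
    root = Node(1)
    for i in range(2, n + 1):
        node, k = root, n - i
        while text[k] in node.children:
            node = node.children[text[k]]
            k += 1
        node.children[text[k]] = Node(i)
    return root


def occurrences_by_position(root: Node, text: str, pattern: str) -> Iterator[int]:
    """Lazily yield the positions of pattern in increasing order."""
    n = len(text)
    path, node, matched = [root], root, True
    for ch in pattern:
        if ch not in node.children:
            matched = False
            break
        node = node.children[ch]
        path.append(node)
    # Labels increase with depth, so the indexing path is already sorted.
    ancestors = path[:-1] if matched else path
    for x in ancestors:
        if text.startswith(pattern, n - x.pos):
            yield x.pos
    if not matched:
        return
    # Priority queue of the topmost unreported nodes in the pattern's subtree.
    queue = [(node.pos, node)]
    while queue:
        pos, x = heapq.heappop(queue)
        yield pos
        for child in x.children.values():
            heapq.heappush(queue, (child.pos, child))
```

Because the function is a generator, a caller who only needs the first few occurrences can stop early, for example with `itertools.islice`, without paying for the rest of the subtree.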
Since |Σ| = O(1) and the size of the subtree is O(k), the size of the priority queue is O(k), and extracting i and inserting its children takes O(log k) time.

7. Building the Position Heap in O(n) Time

Each time a node is added to the position heap, its parent must be located so that it can be added as a child. The reason the above algorithm for constructing the position heap from the root does not take O(n) time is that indexing from the root to find this parent at each iteration is not an O(1) operation.

7.1. The strategy

Indexing into the heap from the root is not the only way to find the parent of the new node at step i. Let X_i be the node added at step i, let the first letter of T_i be a, and let X_{i-1} be the node added at step i-1. Since X_i is a
prefix of T_i, X_i = aYb, where aY is the parent of X_i. By Lemma 7.3, below, |aYb| ≤ |X_{i-1}| + 1, so Yb is a (not necessarily proper) ancestor of X_{i-1}. This suggests the idea of searching upward from X_{i-1} instead of downward from the root in order to find the parent aY of the new node at step i. Since Y can be much shorter than X_{i-1}, the upward search might have to proceed through a large number of nodes on the path from X_{i-1} toward the root before Y is reached. However, the new node at step i, aYb, is then much shorter than the node, X_{i-1}, inserted at the previous iteration. The cost of the operation is proportional to the decrease in depth from one iteration to the next. What makes the approach more efficient than the above approach is that the depth of the new node inserted at successive iterations can grow by at most 1 from one iteration to the next, by Lemma 7.3. This allows us to amortize occasional large costs incurred in iterations where the depth decreases by a large amount over many iterations where the depth slowly builds up again at the rate of one per iteration. The argument is the same as that for a stack with a multipop operation described in the chapter "Amortized Analysis" in the textbook [].

7.2. Implementation

The following lemma is the basis of the claim that the depth in the tree at which the algorithm works must build up again slowly if there is a sudden large and costly decrease in the depth.

Lemma 7.1. If P is not a node of H(T), it has fewer than |P| occurrences in T.

Proof: Every suffix of T that has P as a prefix results in a new node of the tree that is either a proper prefix of P or that has P as a prefix. Since P does not occur in the tree, it is not a prefix of any node in the tree. Therefore, the number of suffixes of T that have P as a prefix, hence the number of occurrences of P, is bounded by the number of proper prefixes of P.

Let us say that a set S of strings is hereditary if, whenever X ∈ S, every substring of X is also in S.

Lemma 7.2. The nodes of the position heap are a hereditary set of strings.

For example, in the final tree of Figure 4, node [...] is labeled with position [...].
Its substrings [...] and the empty string are all nodes of the position heap; they are labeled with positions 0,,5,6,,,,, respectively.

Proof: Let us show this by induction on the length of T_i = t_i t_{i-1} ... t_1. The lemma is trivially true for H(T_1), which has only one node, the empty string. Otherwise, we adopt as the induction hypothesis that the nodes of H(T_{i-1}) have the hereditary property. Since H(T_i) differs from H(T_{i-1}) only by the addition of a node X, H(T_i) can only fail to have the hereditary property if some proper substring of X fails to be a node of H(T_i).

This can't be the case if |X| < 2, since λ is a node of H(T_i). Suppose |X| ≥ 2. We can then write X as aX'b. The parent of X is aX', hence it is a node of H(T_{i-1}). Since aX' is longer than X', X' is a node in H(T_{i-2}) by the induction hypothesis. Also, X'b is a prefix of T_{i-1}, and since X' is a node of H(T_{i-2}), X'b is either added at step i-1 or is already a node of H(T_{i-2}). In either case, it is a node of H(T_{i-1}). We conclude that aX' and X'b are nodes of H(T_{i-1}). By the induction hypothesis, every substring of aX' and X'b is a node of H(T_{i-1}), hence of H(T_i), and these are every proper substring of the new node X = aX'b.

This hereditary property is not shared by arbitrary instances of Coffman and Eve's data structure, as the node labeled 9 in Figure [...] is the string [...], but its substring [...] is not a node of the tree. It is not even true when the keys are the suffixes of a text T when they are not inserted in ascending order of length. Figure 5 gives an example.

Figure 5: The hereditary property doesn't necessarily apply when the suffixes are not inserted in order of ascending length. The figure depicts the Coffman and Eve structure where the insertion order of the suffixes is (T_[...], T_4, T_[...], T_7, T_5, T_6, T_[...]). String [...] is the node labeled with position 5, but its substring [...] is not a node of the tree.

Lemma 7.3. For 1 < i ≤ |T|, if X_i is the node inserted at step i and X_{i-1} is the node inserted at step i-1, then |X_i| ≤ |X_{i-1}| + 1.

Proof: Let a denote the first letter of T_i. X_{i-1} is the shortest prefix of T_{i-1} that is not already a node of H(T_{i-2}) and X_i is the shortest such prefix of T_i = aT_{i-1}. Let b denote the last letter of X_i. Then X_i can be written as aYb for some string Y. Suppose |X_i| > |X_{i-1}| + 1. Then X_{i-1} is a proper prefix of Yb. Since X_{i-1} is the shortest prefix of T_{i-1} that is not a node of H(T_{i-2}), and it is the only node added at step i-1, Yb is not a node of H(T_{i-1}). By the hereditary property, Yb is a node of H(T_i), since it is a substring of X_i, which is a node of H(T_i). The only new node added to H(T_{i-1}) to get H(T_i) is aYb, so Yb was already a node of H(T_{i-1}), a contradiction.

To insert a node to the position heap, we must find the parent.
Since inserting the node after the parent is found takes O(1) time, the only obstacle to getting a linear time bound is repeated indexing into the position heap to find the parent of each node to be added. We must use an alternative method to find this parent.
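For reference, the root-indexing construction that the alternative method is meant to replace can be sketched in Python as follows. This is a minimal sketch under assumptions of our own: positions are 0-based start indices counted from the left (the text numbers them from the right), suffixes are inserted shortest first, and the names `Node`, `build_position_heap_naive`, and `labels` are hypothetical.

```python
class Node:
    """Hypothetical position-heap node."""

    def __init__(self, pos):
        self.pos = pos        # 0-based start index of the suffix stored here
        self.children = {}    # letter -> Node


def build_position_heap_naive(text):
    """Insert suffixes in ascending order of length, indexing from the root."""
    n = len(text)
    if n == 0:
        return None
    root = Node(n - 1)  # the shortest suffix is stored at the root (empty label)
    for start in range(n - 2, -1, -1):
        node, depth = root, 0
        # Walk down while the prefix read so far is already a node.
        while text[start + depth] in node.children:
            node = node.children[text[start + depth]]
            depth += 1
        # The shortest prefix that is not yet a node becomes a new leaf.
        node.children[text[start + depth]] = Node(start)
    return root


def labels(node, prefix=""):
    """Map each node's path label to the position it stores."""
    out = {prefix: node.pos}
    for letter, child in node.children.items():
        out.update(labels(child, prefix + letter))
    return out
```

On a text consisting of one repeated letter, each insertion walks one level deeper than the last, so the downward walks of this sketch add up to quadratic time. That is the cost the upward search is designed to avoid.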

Figure 6: Given the node X_{i-1} added at step i-1, find the parent of the node X_i added at step i.

The idea of our O(n) alternative method is given in Figure 6. At step i-1 = 4, we add X_{i-1} = [...] as a new node. At step i = 5, we must add the shortest prefix of T_i that is not already a node of the position heap. Let a denote the first letter of T_i. If the string a does not already occur as a node of the position heap, then it can be added as a child of the root in O(1) time. Otherwise, as in the proof of Lemma 7.3, the new node must be aYb for some prefix Y of the node X_{i-1} added in step i-1, where b is the character occurring |Y| + 1 positions into T_i. Below, we show how to find, for each such prefix Y of X_{i-1}, whether aY is already a node of the position heap, and if so, to return a pointer to it, in O(1) time apiece. We try this on all proper prefixes of X_{i-1} in descending order of length until we find the first. In the figure, we let Y take on the sequence of values ([...]), whereupon it is discovered that aY = [...] is already a node of the position heap, and since the concatenation of a and [...] is not, [...] is the longest prefix of T_i that is already a node of the position heap. We have found the desired parent of the new node. The new node, X_i = aYb, is added as a child of aY on an edge labeled with letter b.

This does not give an O(1) bound to add each node of the tree. However, we can amortize the variable costs, showing that they sum to O(n) over all iterations. The reason the cost of step i is not O(1) is that we might have to try many prefixes Y of T_{i-1} before we find the one such that aY is already a node of the heap. Let the decrease in depth denote the difference |X_{i-1}| - |X_i| of the depth of the node added at position i-1 and the depth of the node added at position i. If this is negative, call it an increase in depth. If at step i, we try k_i prefixes before finding Y such that aY is already a node of the tree, then we spent O(k_i) time on the step, and |X_i| = |aYb| = |X_{i-1}| - (k_i - 2). The decrease in depth is k_i - 2.
The first two prefixes take O(1) time, so the time spent at step i is O(1) plus the decrease in depth. By Lemma 7.3, the depth can increase by at most 1 at each iteration, so the total increase in depth is O(n)
over all iterations. The total decrease in depth can't exceed the total increase in depth, which means that over all iterations, the total decrease in depth is O(n). Therefore, the total time spent by the algorithm is n·O(1) + O(n) = O(n).

It remains to describe how to get an O(1) bound for finding, for each prefix Y of X_{i-1}, whether aY is already a node of the heap.

Definition 7.4. Let the dual D(T) of the position heap H(T) be the trie where for each node X of H(T), the reverse X^R of X is a node of D(T) (see Figure 7). We continue to refer to each node by its path label X in the position heap, even when considering it as a node of the dual. Equivalently, each node of D(T) is denoted by the sequence X of labels on edges from the node to the root of D(T). It is tempting to think that the dual is just the position heap of the reverse of the text, but it is easily verified that this is not the case.

Figure 7: The position heap and its dual for the text [...]. The labels of the path leading to a node in the dual are the reverse of the labels of the path leading to it in the position heap.

Lemma 7.5. The set of nodes of D(T) is the same as the set of nodes of H(T).

Proof: Because for every node X of H(T), there is a node X in D(T), where X is the string of labels from the node to the root in D(T), every node of H(T) is a node of D(T). It remains to show that every node of D(T) is a node of H(T). Let X be an arbitrary node of H(T). By Lemma 7.2, not only is every prefix of a node X of H(T) a node of H(T), but so is every suffix. This implies that every ancestor of X in D(T) is a node of H(T). There are no nodes on any path of D(T) that fail to be a node of H(T).

We implement the position heap and its dual on the same set of nodes, so that each node has both a parent in the position heap and a parent in the dual. We concurrently construct the position heap and its dual. Suppose that at step i we already have H(T_{i-1}) and D(T_{i-1}). We show how to update both to get H(T_i) and D(T_i) in O(k_i) time.
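As a rough sketch of this concurrent construction, the Python code below keeps heap edges and dual edges on the same node objects and finds each new parent by climbing from the previously added node. It is an illustration under our own assumptions rather than the authors' code: positions are 0-based start indices counted from the left, and the names `Node`, `build_position_heap`, and `labels` are hypothetical.

```python
class Node:
    """Hypothetical node shared by the position heap and its dual."""

    def __init__(self, pos, parent=None, depth=0):
        self.pos = pos          # 0-based start index of the suffix stored here
        self.parent = parent    # parent in the position heap
        self.depth = depth      # length of the node's path label
        self.children = {}      # heap edges: letter b -> node labelled (label + b)
        self.dual = {}          # dual edges: letter a -> node labelled (a + label)


def build_position_heap(text):
    """Build the heap and its dual together, shortest suffix first."""
    n = len(text)
    if n == 0:
        return None
    root = Node(n - 1)          # empty label; stores the shortest suffix
    prev = root                 # node added at the previous step
    for start in range(n - 2, -1, -1):
        a = text[start]         # first letter of the suffix being inserted
        y, below = prev, None   # below: the last ancestor tried before y
        # Climb until y has a dual child on a, i.e. until a + label(y) is a node.
        while y is not root and a not in y.dual:
            below, y = y, y.parent
        if a not in y.dual:     # reached the root and the one-letter string a is new
            new = Node(start, root, 1)
            root.children[a] = new
            root.dual[a] = new
        else:
            ay = y.dual[a]                  # heap parent of the new node
            b = text[start + ay.depth]      # letter that follows a + label(y)
            new = Node(start, ay, ay.depth + 1)
            ay.children[b] = new
            below.dual[a] = new             # label(below) is label(y) + b
        prev = new
    return root


def labels(node, prefix=""):
    """Map each node's path label to the position it stores."""
    out = {prefix: node.pos}
    for letter, child in node.children.items():
        out.update(labels(child, prefix + letter))
    return out
```

Each pass through the `while` loop moves one level up from the previous node, which is the quantity the amortized argument charges against earlier increases in depth.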
When going from H(T_{i-1}) to H(T_i), let a be the first letter of T_i and X_{i-1} the node added at step i-1. (Refer to Figure 8.) The prefixes Y of X_{i-1} in descending order of length are the ancestors encountered on the path from X_{i-1} to the root of the position heap. For each such ancestor Y, we can find whether aY is
already a node of the heap by determining whether Y has a child on an edge labeled a in the dual. This takes O(1) time, since Y is both a node of the heap and of the dual. We stop when we encounter the first one.

Figure 8: Implementing the algorithm of Figure 6 using the position heap and its dual. Starting at the previously added node X_{i-1}, we find the lowest ancestor Y such that aY is already a node. This is accomplished by traversing ancestors in the position heap, and seeing if they have a child on an edge labeled a in the dual. In this case Y is the node labeled 4. Its child on the edge labeled a in the dual is aY, the node labeled 5. It is the parent of the new node X_i = [...]. The last prefix tried before Y was found is the longest node of the dual heap that is a prefix of X_{i-1} and has no child labeled a. It is the parent of X_i in the dual.

By the above algorithm, this takes care of adding the node X_i to H(T_{i-1}), yielding H(T_i) in O(k_i) time. However, we must also add this node to the dual, which requires locating its parent, Yb, and adding it as a child on the edge labeled a. Fortunately, Yb was just the last prefix of X_{i-1} considered before aY was discovered. We already found Yb in the position heap, and since it is also a node of the dual, we have it in the dual. X_i can be added as a child of Yb on the edge labeled a in O(1) additional time over what we have accounted for in adding it in the position heap. This gives the following:

Lemma 7.6. It takes linear time to construct the position heap of a text T.

8. Constructing the Augmented Position Heap in O(n) Time

The augmented position heap differs from the position heap in that the nodes are labeled with depth-first discovery and finishing times and with maximal-reach pointers. Depth-first search on a tree with n nodes takes O(n) time, so it
only remains to describe how to compute the maximal-reach pointers in O(n) time. Once again, the strategy is to amortize the cost. The approach is virtually the same as it is for adding new nodes: instead of searching downward from the root at each iteration, we search upward in the tree, starting at the node pointed to by the maximal-reach pointer at the previous iteration. Even though this is not an O(1) operation, the cost is proportional to the decrease in depth of the node pointed to by the maximal-reach pointer. This depth can increase by at most 1 from one iteration to the next, allowing us to amortize large decreases in depth over many small increases in depth.

Lemma 8.1. For 1 < i ≤ |T|, if X_i is the node pointed to by the maximal-reach pointer of node i and X_{i-1} by the maximal-reach pointer of node i-1, then |X_i| ≤ |X_{i-1}| + 1.

Proof: Let a denote the first letter of T_i. X_{i-1} is the longest prefix of T_{i-1} that is a node of H(T), and X_i is the longest prefix of T_i = aT_{i-1} that is a node of H(T). Let b denote the last letter of X_i. Then X_i can be written as aYb for some string Y. Suppose |X_i| > |X_{i-1}| + 1. Then X_{i-1} is a proper prefix of Yb, which is itself a prefix of T_{i-1}, so Yb is not a node of H(T). By the hereditary property, Yb is a node of H(T), since it is a substring of X_i, which is a node of H(T), a contradiction.

To construct the augmented position heap in O(n) time, our strategy is first to construct the position heap in O(n) time using the algorithm from the previous section. As before, we create the array N[], where N[i] points to the node that contains position i, and this takes O(n) time by trivial methods. We then add the discovery and finishing times and the maximal-reach pointers on a second pass, in O(n) time. We find and test each prefix by starting at X_{i-1} in the position heap and ascending through ancestors until we find the first one, Y, that has a child on the edge labeled a in the dual heap. This child, aY, in the dual is the node to which node i must point.
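The upward search for maximal-reach pointers can be illustrated with the short Python sketch below. To stay compact it makes assumptions that differ from the pointer-based structure in the text: nodes are represented by their path labels in a dictionary, so the parent of `y` is `y[:-1]` and the dual child of `y` on letter `a` is `a + y`. String keys make each lookup cost time proportional to the label length, so the sketch shows the order of the search rather than the O(n) bound. Positions are 0-based start indices counted from the left, and the function names are hypothetical.

```python
def position_heap_labels(text):
    """Path label -> start index, built by indexing from the root."""
    if not text:
        return {}
    nodes = {"": len(text) - 1}          # the shortest suffix sits at the root
    for start in range(len(text) - 2, -1, -1):
        k = 1
        while text[start:start + k] in nodes:
            k += 1
        nodes[text[start:start + k]] = start
    return nodes


def maximal_reach(text):
    """reach[start] = longest prefix of text[start:] that is a node label."""
    n = len(text)
    if n == 0:
        return []
    nodes = position_heap_labels(text)
    reach = [""] * n
    last = text[n - 1]
    y = last if last in nodes else ""    # shortest suffix: depth is at most 1
    reach[n - 1] = y
    for start in range(n - 2, -1, -1):
        a = text[start]
        # Ascend from the previous answer until a + y is a node,
        # that is, until y has a child on the edge labelled a in the dual.
        while a + y not in nodes:
            y = y[:-1]
        y = a + y                        # this node is the answer for start
        reach[start] = y
    return reach
```

The loop always stops, because for every start other than the last the one-letter string `a` is a node (it is a prefix of the node that stores that position).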
The analysis of the linear running time is the same as it is for linear-time construction of the position heap. The current depth is the depth of node X_{i-1} in the position heap. The first two prefixes of X_{i-1} take O(1) time to check for a child on the edge labeled a in the dual heap. Each additional prefix takes O(1) time to check, and decreases the current depth in the position heap. Call this the variable part of the time spent at position i. By Lemma 8.1, the current depth can increase by at most one per iteration. The initial depth is at most 1, since T_1 has length 1. The total decrease in depth can therefore be at most one greater than the total increase in depth, which is O(1) per iteration, hence O(n) overall. The sum of the variable parts of the times spent at the different iterations is therefore O(n). We therefore get the following:

Lemma 8.2. It takes O(n) time to construct the augmented position heap for a text T of length n.

9. Updating the Position Heap when the Text is Edited

When a block of characters is inserted to or deleted from a text T, the position heap must pass through a series of steps in which it is a trie, but has some things wrong with it that must be repaired in order for it to be the position heap of the new text. The goal of this section is to give algorithms for Delete and Insert, which update the position heap when a block of text is deleted from or inserted to the text T.

Since the text is no longer static, it is no longer convenient to label a node of the position heap with its position number in the text; when a position is deleted, the position numbers of all letters to its left decrease by one. To avoid having to update the position-number labels of all those nodes, we instead label the nodes with position pointers to the positions of the text. This requires us to define the analog of the heap property when pointers, rather than integers, are used.

Definition 9.1. If p is a pointer to a position in T, let T_p denote the suffix of T that begins at p. If X is a node in the trie with a pointer to a position of T, let p(X) denote this pointer. The trie has the heap property if whenever Y is a child of X, p(Y) is to the left of p(X) in T. The pointer p(X) is correctly placed if X has an occurrence at position p(X), that is, if X is a prefix of T_{p(X)}.

The constructive definition of the position heap (4.) remains unchanged, except that each time a position is inserted, the new node is labeled with a pointer to the position, rather than its position number. It will be convenient to look up the corresponding position-heap node given a pointer to a position in the text T. This is accomplished by labeling each position p of the text with a pointer N(p) to the node of the position heap that points to it. This serves the same function as the array N[] in the static case.
To avoid the need to mention this pointer each time we move a pointer in the position heap, we will define the operation of moving a position p from one node to another in the position heap as including the operation of making the pointer N(p) point to the new node.

The following lemma is useful for establishing that a procedure for updating the position heap after an edit operation on T has correctly produced the position heap for the modified text.

Lemma 9.2. A trie H where each node is labeled with a pointer to a letter of a text T is the position heap for T if and only if it satisfies the following properties:
1. H has the heap property;
2. Every position of T is pointed to by at most one pointer p(X) for some node X in the trie;
3. Every position of T is pointed to by at least one pointer p(X) for some node X of the trie;
4. For every node X, p(X) is correctly placed.

Proof: By induction on the number of positions inserted by the naive construction algorithm.

9.1. Deleting or inserting a block of text in T

The workhorses of the algorithm for updating the position heap after insertion or removal of a block of text are Remove and Add. Below, we explain how they work, but for now, we define the problems in terms of their preconditions and postconditions so that, for the time being, we can make calls to them in our implementation of Delete and Insert.

Definition 9.3. The problems solved by Remove and Add.
An input to Remove or Add is a trie that satisfies properties 1 and 2 of Lemma 9.2, but might not satisfy properties 3 and 4. An additional input to Remove is a node X that contains a position pointer to be removed from the set of position pointers in the trie. It removes the pointer without disrupting the heap property, without otherwise changing the set of position pointers in the tree, and without creating any new violations of property 4 at any position pointers. An additional input to Add is a position pointer to be inserted to the trie. The position pointer must not already occur in the trie. It correctly places the pointer without disrupting the heap property, without otherwise changing the set of position pointers in the tree, and without creating any new violations of property 4 at any position pointers. A call to Remove or Add must update a variable h that gives the current height of the trie.

Implementation requires shuffling position pointers in the tree in a way that is familiar to anyone who has studied heaps. Details are given below. In the meantime, given the problems solved by Remove and Add, we can now explain the main procedures of the section, Delete and Insert, in terms of calls to Remove and Add. The Delete procedure updates the position heap when a block of characters is deleted from the text so that it is the position heap of the new text.

Definition 9.4.
An algorithm for Delete. Let h be the height of the input position heap. Call Remove and Add, using the modified text, on the h characters that lie to the left of the deleted block.
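The four properties of Lemma 9.2 lend themselves to a direct check, which may be handy when testing update procedures such as Delete. The Python sketch below is an assumption-laden illustration, not part of the authors' algorithms: the trie is represented as a dictionary from each node's path label to a 0-based index counted from the left, which stands in for the pointer p(X), and the name `is_position_heap` is hypothetical.

```python
def is_position_heap(trie, text):
    """Check the properties of Lemma 9.2 for a label -> index dictionary."""
    n = len(text)
    # Properties 2 and 3: every position is pointed to exactly once.
    if sorted(trie.values()) != list(range(n)):
        return False
    for label, p in trie.items():
        # Property 4: the label must occur in the text at position p.
        if not text.startswith(label, p):
            return False
        if label:
            parent = label[:-1]
            # The dictionary must describe a trie (every parent is present),
            # and property 1 requires the child's pointer to lie to the left
            # of its parent's pointer.
            if parent not in trie or not p < trie[parent]:
                return False
    return True
```

For the text "abab", the dictionary `{"": 3, "a": 2, "b": 1, "ab": 0}` passes the check, while exchanging the pointers of "a" and "ab" breaks the heap property and is rejected.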
