Efficient implementation of lazy suffix trees

Size: px

Start display at page:

Download "Efficient implementation of lazy suffix trees"

Janice Mitchell
6 years ago
Views:

SOFTWARE PRACTICE AND EXPERIENCE Softw. Prct. Exper. 2003; 33:1035 1049 (DOI: 10.1002/spe.535) Efficient implementtion of lzy suffix trees R. Giegerich 1,S.Kurtz 2 nd J.

1 SOFTWARE PRACTICE AND EXPERIENCE Softw. Prct. Exper. 2003; 33: (DOI: /spe.535) Efficient implementtion of lzy suffix trees R. Giegerich 1,S.Kurtz 2 nd J. Stoye 1,, 1 Fculty of Technology, University of Bielefeld, Bielefeld, Germny 2 Centre for Bioinformtics, University of Hmurg, Bundesstrße 43, Hmurg, Germny SUMMARY We present n efficient implementtion of write-only top-down construction for suffix trees. Our implementtion is sed on new, spce-efficient representtion of suffix trees tht requires only 12 ytes per input chrcter in the worst cse, nd 8.5 ytes per input chrcter on verge for collection of files of different type. We show how to efficiently implement the lzy evlution of suffix trees such tht sutree is evluted only when it is trversed for the first time. Our experiments show tht for the prolem of serching mny exct ptterns in fixed input string, the lzy top-down construction is often fster nd more spce efficient thn other methods. Copyright c 2003 John Wiley & Sons, Ltd. KEY WORDS: string mtching; suffix tree; spce-efficient implementtion; lzy evlution INTRODUCTION Suffix trees re efficiency oosters in string processing. The suffix tree of text t is n index structure tht cn e computed nd stored in O( t ) time nd spce. Once constructed, it llows ny sustring w of t to e locted in O( w ) steps, independent of the size of t. This instnt ccess to sustrings is most convenient in myrid [1] of situtions, nd in Gusfield s ook [2], out 70 pges re devoted to pplictions of suffix trees. While suffix trees ply prominent role in lgorithmics, their prcticl use hs not een s widespred s one should expect (for exmple, Skien [3] hs oserved tht suffix trees re the dt structure with the highest need for etter implementtions). The following prgmtic considertions mke them pper less ttrctive: Correspondence to: J. Stoye, AG Genominformtik, Fculty of Technology, Universität Bielefeld, Bielefeld, Germny. E-mil: stoye@techfk.uni-ielefeld.de Contrct/grnt sponsor: Deutsche Forschungsgemeinschft; contrct/grnt numer: Ku 1257/1-1 Pulished online 25 June 2003 Copyright c 2003 John Wiley & Sons, Ltd. Received 26 Novemer 2001 Revised 4 Novemer 2002 Accepted 8 Mrch 2003

2 1036 R. GIEGERICH, S. KURTZ AND J. STOYE The liner-time constructions y Weiner [4], McCreight [5] nd Ukkonen [6] re quite intricte to implement. (See lso [7], which reviews these methods nd revels reltionships much closer thn one would think.) Although symptoticlly optiml, their poor loclity of memory reference [8] cuses significnt loss of efficiency on cched processor rchitectures. Although symptoticlly liner, suffix trees hve reputtion of eing greedy for spce. For exmple, the representtion of McCreight [5] requires 28 ytes per input chrcter in the worst cse. Due to these fcts, for mny pplictions, the construction of suffix tree does not mortize. For exmple, if text is to e serched only for very smll numer of ptterns, then it is usully etter to use fst nd simple online method, such s the Boyer Moore Horspool lgorithm [9], to serch the complete text new for ech pttern. However, these concerns re llevited y the following recent developments: In [8], Giegerich nd Kurtz dvocte the use of write-only, top-down construction, referred to here s the wotd-lgorithm. Although its time efficiency is O(nlog n) in the expected cse nd even O(n 2 ) in the worst cse (for text of length n), it is competitive in prctice, due to its simplicity nd good loclity of memory reference. In [10], Kurtz developed spce-efficient representtion tht llows suffix trees to e computed in liner time in 46% less spce thn previous methods. As consequence, suffix trees for lrge texts, e.g. complete genomes, hve een proved to e mngele; see for exmple [11]. The question of mortizing the cost of suffix tree construction is lmost eliminted y incrementlly constructing the tree s demnded y its queries. This possiility ws lredy hinted t in [8], where the wotd-lgorithm ws clled lzytree for this reson. When implementing the wotd-lgorithm in lzy functionl progrmming lnguge, the suffix tree utomticlly ecomes lzy dt structure, ut of course, the generl overhed of using lzy lnguge is incurred. In the present pper, we explin how lzy nd n eger version of the wotd-lgorithm cn efficiently e implemented in n impertive lnguge. Our implementtion technique voids constnt lphet fctor in the running time. It is sed on new spce efficient suffix tree representtion, which requires only 12n ytes of spce in the worst cse. This is n improvement of 8n ytes over the most spce efficient previous representtion, s developed in [10]. Other recently pulished suffix tree representtions require 17n ytes [14] nd 21n ytes [15]. Theoreticl developments in [16,17] descrie suffix tree representtions tht use only n log 2 n + O(n) its of spce. Another one is the compressed suffix tree of [18], which requires only O(n) its of spce. However, these dt structures re confined to text serching nd, to the est of our knowledge, no implementtion exists sed on these techniques. There re other spce-efficient dt structures for string pttern mtching. The suffix rry of [12] requires 9n ytes, including the spce for construction. The level compressed trie of [19] tkes The suffix rry construction of [12] nd the liner time suffix tree construction of [13] lso do not hve the lphet fctor in their running times. For the liner time suffix tree constructions of [4 6] the lphet fctor cn e voided y employing hshing techniques, see [5]; however, t the cost of using considerly more spce, see [10]. Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

3 LAZY SUFFIX TREES 1037 out 12n ytes. The suffix inry serch tree of [20] requires 10n ytes. The suffix cctus of [21] cn e implemented in 10n ytes, nd the suffix orcle [22] in8n ytes. Finlly, the PT-tree of [23] requires n log 2 n + O(n) its. While some of these dt structures require less spce thn suffix trees, we expect tht for moderte numer of pttern serches our lzy construction wotdlzy is more spce efficient thn these methods. Moreover, some of these index structures cn only e computed efficiently y first constructing the complete suffix tree. Experimentl results show tht our implementtion technique leds to progrms tht re superior to previous ones in mny situtions. For exmple, when serching 0.01n ptterns of length etween 10 nd 20 in text of length n, the lzy wotd-lgorithm (wotdlzy, for short) is on verge 43% fster nd lmost 50% more spce efficient thn linked list implementtion of McCreight s liner time suffix tree lgorithm [5]. It is 38% fster nd 65% more spce efficient thn hsh tle implementtion of McCreight s liner time suffix tree lgorithm, ten times fster nd 42% more spce efficient thn progrmsed on suffix rrys [12], nd 40 times fster thn the iterted ppliction of the Boyer Moore Horspool lgorithm [9]. The lzy wotd-lgorithm mkes suffix trees lso pplicle in contexts where the expected numer of queries to the text is smll reltive to the length of the text, with n lmost immesurle overhed compred with its eger vrint wotdeger if the complete tree is uilt. In ddition to its usefulness for serching string ptterns, wotdlzy is interesting for other prolems (see the list in [2]), such s exct set mtching, the sustring prolem for dtse of ptterns, the DNA contmintion prolem, common sustrings of more thn two strings, circulr string lineriztion, or computtion of the q-word distnce of two strings. Documented source code of the progrms wotdeger nd wotdlzy is ville t A preliminry version of this pper ppered in [24]. THE WOTD SUFFIX TREE CONSTRUCTION Terminology Let e finite ordered set of size k, thelphet. is the set of ll strings over, ndε is the empty string. Weuse + to denote the set \{ε} of non-empty strings. We ssume tht t is string over of length n 1ndtht is chrcter not occurring in t. Fornyi [1,n+ 1], let s i = t i...t n denote the ith non-empty suffix of t. A + -tree T is finite rooted tree with edge lels from +. For ech, every node α in T hs t most one -edge α v β for some string v nd some node β. An edge leding to lef is lef edge. Letα e node in T. We denote α y the pth tht leds us there. Formlly, α is denoted y w if nd only if w is the conctention of the edge lels on the pth from the root to α. ε is the root. Astrings occurs in T if nd only if T contins node sv, for some string v. Thesuffix tree for t, denoted y ST(t), isthe + -tree T with the following properties: (i) ech node is either lef or rnching node, nd (ii) string w occurs in T if nd only if w is sustring of t. There is one-to-one correspondence etween the non-empty suffixes of t nd the leves of ST(t). Wedefinethelef set l(α) of node α s follows. For lef s j, l(s j ) ={j}. For rnching node u, l(u) ={j u v uv is n edge in ST(t), j l(uv)}. Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

4 1038 R. GIEGERICH, S. KURTZ AND J. STOYE A review of the wotd-lgorithm The wotd-lgorithm dheres to the recursive structure of suffix tree. The ide is tht for ech rnching node u the sutree elow u is determined y the set of ll suffixes of tththveu s prefix. In other words, if we hve the set R(u) := {s us is suffix of t} of remining suffixes ville, we cn evlute the node u. This works s follows: t first R(u) is divided into groups ccording to the first chrcter of ech suffix. For ny chrcter c, letgroup(u, c) := {w cw R(u)} e the c-group of R(u). Ifforsomec, group(u, c) contins only one string w, then there is lef edge leled cw outgoing from u. Ifgroup(u, c) contins t lest two strings, then there is n edge leled cv leding to rnching node ucv,wherev is the longest common prefix (lcp, for short) of ll strings in group(u, c). The child ucv cn then e evluted from the set R(ucv) ={w vw group(u, c)} of remining suffixes. The wotd-lgorithm strts y evluting the root from the set R(root) of ll suffixes of t. All nodes of ST(t) cn e evluted recursively from the corresponding set of remining suffixes in top-down mnner. For exmple, consider the input string t =. The wotd-lgorithm for t works s follows: At first, the root is evluted from the set R(root) of ll non-empty suffixes of the string t, see the first six columns in Figure 1. The lgorithm recognizes three groups of suffixes. The -group, the -group, nd the -group. The -group contins two nd the -group contins three suffixes, hence we otin two unevluted rnching nodes, which re reched y n -edgendy-edge. The -group is singleton, so we otin lef reched y n edge leled. To evlute the unevluted rnching node corresponding to the -group, one first computes the longest common prefix of the remining suffixes of tht group. This is in our cse. So the -edge from the root is leled y, nd the remining suffixes nd re divided into groups ccording to their first chrcter. We otin two singleton groups of suffixes, nd thus two lef edges outgoing from. These lef edges re leled s nd. The unevluted rnching node corresponding to the -group is evluted recursively in similr wy, see Figure 1. Properties of the wotd-lgorithm The distinctive property of the wotd-lgorithmis tht the construction proceeds top-down. Once node hs een constructed, it need not e revisited in the construction of other prts of the tree (unlike the liner-time constructions of [4 6,13]). As the order of sutree construction is otherwise independent, it my e rrnged in demnd-driven fshion, otining the lzy implementtion detiled in the next section. The top-down construction hs een mentioned severl times in the literture [8,19,25,26],uttfirst glnce, its worst cse running time of O(n 2 ) is disppointing. However, the expected running time is O(nlog k n) (see e.g. [8]), nd experiments in [8] suggest tht the wotd-lgorithm is prcticlly liner for moderte size strings. This cn e explined y the good loclity ehvior: the wotd-lgorithm hs optiml loclity on the tree dt structure. In principle, no more thn current pth of the tree need e in memory. With respect to text ccess, the wotd-lgorithm lso ehves very well: for ech sutree, only the corresponding remining suffixes re ccessed. At certin tree level, the numer of suffixes considered will e smller thn the numer of ville cche entries. As these suffixes re red sequentilly, prcticlly no further cche misses will occur. This point is reched erlier when Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

5 LAZY SUFFIX TREES 1039 ) z } R(root ) z } R() z } R() ) ) z } R() Figure 1. The write-only top-down construction of ST(). the rnching degree of the tree nodes is higher, since the suffixes split up more quickly. Hence, the loclity of the wotd-lgorithm improves for lrger lphets. Aside from the liner constructions lredy mentioned, there re O(nlog n) time suffix tree constructions (e.g. [26,27]), which re sed on Hopcroft s prtitioning technique [28]. While these constructions re fster in terms of worst-cse nlysis, the sutrees re not constructed independently. Hence they do not shre the loclity of the wotd-lgorithm, nor do they llow for lzy implementtion. IMPLEMENTATION TECHNIQUES This section descries how the wotd-lgorithm cn e implemented in n eger lnguge. The simultion of lzy evlution in n eger lnguge is not very common pproch. Unevluted prts of the dt structure hve to e represented explicitly, nd the trversl of the suffix tree ecomes more complicted ecuse it hs to e merged with the construction of the tree. We will show, however, tht y creful considertion of efficiency mtters, one cn end up with progrm tht is not only Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

6 1040 R. GIEGERICH, S. KURTZ AND J. STOYE more efficient nd flexile in specil pplictions, ut the performnce of which is comprle to the est existing implementtions of index-sed exct string mtching lgorithms in generl. We first descrie the dt structure tht stores the suffix tree, nd then we show how to implement the lzy nd eger evlution, including the dditionl dt structures. The suffix tree dt structure To implement suffix tree, we siclly hve to represent three different items: nodes, edges, nd edge lels. To descrie our representtion, we define totl order on the children of rnching node: let uv nd uw e two different nodes in ST(t) tht re children of the sme rnching node u. Then uv uw if nd only if min l(uv) < min l(uw). Note tht lef sets re never empty nd l(uv) l(uw) =. Hence is well defined. Let us first consider how to represent the edge lels. Since n edge lel v is sustring of t, it cn e represented y pir of pointers (i, j) into t = t, such tht v = t i...t j (see Figure 2). In cse the edge is lef edge, we hve j = n + 1, i.e., the right pointer j is redundnt. In cse the edge leds to rnching node, it lso suffices to store only left pointer, if we choose it ppropritely (see Figure 3): let u v uv e n edge in ST(t). Wedefinelp(uv) := min l(uv) + u, theleft pointer of uv. Now suppose tht uv is rnching node nd i = lp(uv). Assume furthermore tht uvw is the smllest child of uv w.r.t. the reltion. Hence we hve min l(uv) = min l(uvw), nd thus lp(uvw) = min l(uvw) + uv = min l(uv) + u + v = lp(uv) + v. Nowletr = lp(uvw). Then v = t i...t i+ v 1 = t i...t lp(uv)+ v 1 = t i...t lp(uvw) 1 = t i...t r 1. In other words, to retrieve edge lels in constnt time, it suffices to store the left pointer for ech node (including the leves). For ech rnching node u we dditionlly need constnt time ccess to the child of uv with the smllest left pointer. This ccess is provided y storing reference firstchild(u) to the first child of u w.r.t.. Thelp- ndfirstchild-vlues re stored in single integer tle T. The vlues for children of the sme node re stored in consecutive positions ordered w.r.t.. Thus, only the edges to the first child re stored explicitly. The edges to ll other children re implicit. They cn e retrieved y scnning consecutive positions in tle T. Any node u is referenced y the index in tle T where lp(u) is stored. To decode the suffix tree representtion, we need two extr its: lef it mrks n entry in T corresponding to lef, nd rightmost child it mrks n entry corresponding to node tht does not hve right rother w.r.t.. Figure 4 shows tle T representing ST(). The evlution process The wotd-lgorithm is est viewed s process evluting the nodes of the suffix tree, strting t the root nd recursively proceeding downwrds into the sutrees. We first descrie how n unevluted node u of ST(t) is stored. For the evlution of u, we need ccess to the set R(u) of remining suffixes. Therefore we employ glol rry suffixes tht contins pointers to suffixes of t. For ech unevluted node u, there is n intervl in suffixes tht stores strting positions of suffixes in R(u), ordered y descending suffix-length from left to right. R(u) is then represented y the two oundries left(u) nd right(u) of the corresponding intervl in suffixes. The oundries re stored in the two integers reserved in tle T for the rnching node u. To distinguish evluted nd unevluted nodes, we use third it, the unevluted it. Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

7 LAZY SUFFIX TREES 1041 (2; 3) (1; 1) (6; 6) (4; 6) (6; 6) (4; 6) (2; 3) (6; 6) (6; 6) Figure 2. The finl suffix tree ST() from Figure 1 with edge lels s pirs (left pointer, right pointer). ST(t) u t u i v r w (i; r 1) uv (r; ) uvw Figure 3. The definition of lp(uv). Now we cn descrie how u is evluted: the edges outgoing from u re otined y simple counting sort [29], using the first chrcter of ech suffix stored in the intervl [left(u), right(u)] of the rry suffixes s the key in the counting phse. Ech chrcter c with count greter thn zero corresponds to c-edge outgoing from u. Moreover, the suffixes in the c-group determine the sutree elow tht edge. The pointers to the suffixes of the c-group re stored in suintervl, in descending order of their length. To otin the complete lel of the c-edge, the lcp of ll suffixes in the c-group is computed. If the c-group contins just one suffix s, then the lcp is s itself. If the c-group contins more thn one suffix, then simple loop tests for equlity of the chrcters t suffixes[i]+j for j = 1, 2,... nd Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

8 1042 R. GIEGERICH, S. KURTZ AND J. STOYE lef it 0 lst child it! 0 z } z } z } z } z } z } z } z } z } Figure 4. A tle T representing ST() (see Figure 2). The input string s well s T re indexed strting with 1. A rnching node occupies two tle entries, lef occupies one tle entry. The first vlue for rnching node u is lp(u), the second is firstchild(u), indicted lso y rrows elow the tle. The lef it nd rightmost (lst) child it re indicted in the first tle entry of ech node. for ll strt positions i of the suffixes in the c-group. As soon s n inequlity is detected, the loop stops nd j is the length of the lcp of the c-group. The children of u re stored in tle T, one for ech non-empty group. A group with count one corresponds to suintervl of width one. It leds to lef, sy s, for which we store lp(s) in the next ville position of tle T. lp(s) is given y the left oundry of the group. A group of size lrger thn one leds to n unevluted rnching node, sy v, for which we store left(v) nd right(v) in the next two ville positions of tle T. In this wy, ll nodes with the sme prent u re stored in consecutive positions. Moreover, since the suffixes of ech intervl re in descending order of their length, the children re ordered w.r.t. the reltion. Thevluesleft(v) nd right(v) re esily otined from the counts in the counting sort phse, nd setting the lef-it nd the rightmost-child it is strightforwrd. To prepre for the (possile) evlution of v, the vlues in the intervl [left(v),right(v)] of the rry suffixes re incremented y the length of the corresponding lcp. Finlly, fter ll successor nodes of u recreted, thevluesofleft(u) ndright(u) in T rereplcedy theintegerslp(u) := suffixes[left(u)] nd firstchild(u), nd the unevluted it for u is deleted. The nodes of the suffix tree cn e evluted in n ritrry order respecting the prent/child reltion. Two strtegies re relevnt in prctice: the eger strtegy evlutes nodes in depth-first nd left-to-right trversl, s long s there re unevluted nodes remining. The progrm implementing this strtegy is clled wotdeger in the sequel. The lzy strtegy evlutes node only when the corresponding sutree is trversed for the first time, for exmple y procedure serching for ptterns in the suffix tree. The progrm implementing this strtegy is clled wotdlzy in the sequel. Spce requirement The suffix tree representtion s descried ove requires 2q + n integers, where q is the numer of non-root rnching nodes. Since q = n 1 in the worst cse, this is n improvement of 2n integers over the est previous representtion, s descried in [10]. However, one hs to e creful when compring the 2q + n integers representtion ove with the results of [10]. The 2q + n integers representtion Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

9 LAZY SUFFIX TREES 1043 is tilored for the wotd-lgorithm nd requires extr working spce of 2.5n integers in the worst cse : The rry suffixes contins n integers, nd the counting sort requires uffer of the width of the intervl tht is to e sorted. In the worst cse, the width of this intervl is n 1. Moreover, wotdeger needs stck of size up to n/2, to hold references to unevluted nodes. Creful memory mngement, however, llows spce to e sved in prctice. Note tht during eger evlution, the rry suffixes is processed from left to right, i.e., it contins completely processed prefix. Simultneously, the spce requirement for the suffix tree grows. By recliming the completely processed prefix of the rry suffixes for the tle T, the extr working spce required y wotdeger is only little more thn one yte per input chrcter, see Tle I.Forwotdlzy, it is not possile to reclim unused spce of the rry suffixes, since this is processed in n ritrry order. As consequence, wotdlzy needs more working spce. EXPERIMENTAL RESULTS For our experiments, we collected set of nine files of different sizes nd types. We restricted ourselves to 7-it ASCII files, since the suffix tree ppliction we consider (serching for string ptterns) is not common for inry files. Our collection consists of the following files. We used three files from the Clgry Corpus: i, ook1, ook2. The first one contins forml text (iliogrphic items), nd the ltter two contin English text. We dded three files (contining English text) from the Cnterury Corpus: lice29, lcet10,ndplrn12. Finlly, we dded three lrge files : E.coli (the complete genome of the cteri Escherichi coli), ile (the Bile), nd world (the Project Gutenerg Edition of The World Fctook 1992). All progrmswe considerre written in C. We used the gcc compiler, version 2.96, with optimizing option O3. The progrms were run on computer with MHz AMD Athlon XP processor, 512 MB RAM, under Linux. On this computer ech integer nd ech pointer occupies 4 ytes. As consequence, we hve 30 its to store left(u) or lp(u) for rnching node or lef u. Moreover, we hve 31 its to store right(u) or firstchild(u) for rnching node u. left(u), right(u), nd lp(u) re in the rnge [0,n]. firstchild(u) is in the rnge [0, 3n]. Hence 3n muste stisfied, i.e. the mximl length of the input string is In first experiment we rn three different progrms constructing suffix trees: wotdeger, mccl, nd mcch. The ltter two implement McCreight s suffix tree construction [5]. mccl computes the improved linked list representtion, nd mcch computes the improved hsh tle representtion of the suffix tree, s descried in [10]. Tle I shows the running times nd the spce requirements. We normlized w.r.t. the length of the files. Tht is, we show the reltive time (in seconds) to process 10 6 chrcters (i.e., rtime = (10 6 time)/n), nd the reltive spce requirement in ytes per input chrcter. For wotdeger we show the spce requirement for the suffix tree representtion (stspce), Moreover, the wotd-lgorithm does not run in liner worst cse time, in contrst to e.g. McCreight s lgorithm [5] which cn e used to construct the 5n integers representtions of [10] in constnt working spce. It is not cler to us whether it is possile to construct the 2q + n representtion of this pper within constnt working spce, or in liner time. In prticulr, it is not possile to construct it with McCreight s [5] or with Ukkonen s lgorithm [6], see [10]. These files, like the two corpor, cn e otined from Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

10 1044 R. GIEGERICH, S. KURTZ AND J. STOYE Tle I. Time nd spce requirement for different progrms constructing suffix trees. wotdeger mccl mcch File n k rtime stspce spce rtime spce rtime spce i ook ook lice lcet plrn E.coli ile world Averge s well s the totl spce requirement including the working spce. mccl nd mcch only require constnt extr working spce. The lst row of Tle I shows the verges of the vlues of the corresponding columns. In ech row, entries in old fce mrk the smllest reltive time nd the smllest reltive spce requirement, respectively. All three progrms hve similr verge running times. wotdeger nd mcch showmorestle running time thn mccl. This my e explined y the fct tht the running time of wotdeger nd mcch is independent of the lphet size. For thorough explntion of the ehvior of mccl nd mcch we refer to [10]. wotdeger gives us runtime dvntge on dt sets smller thn 10 6 chrcters, while for lrger dt, the extr logrithmic fctor in its symptotic construction time ecomes visile grdully. mccl, plgued y its extr lphet fctor k, wins the rce only for n = nd k = 4. mcch is est for lrge dt sizes in connection with lrge lphets. In ny cse, wotdeger is more spce efficient thn the other progrms, using 0.84 nd 5.54 ytes per input chrcter less thn mccl nd mcch, respectively. Note tht the dditionl working spce required for wotdeger is on verge only 1.06 ytes per input chrcter. As lredy oserved in [10], suffix trees for DNA sequences re lrger thn for other types of input strings. This is due to the smll lphet, which leds to denser suffix tree with more rnching nodes. This effect cn e confirmed here for ll lgorithms, compring spce requirements for inputs E.coli nd ile. In second experiment we studied the ehvior of different progrms serching for mny exct ptterns in n input string, scenrio tht occurs for exmple in genome-scle sequencing projects, see [2] (Section7.15). For the progrmsof the previousexperiment, ndforwotdlzy, we implemented serch functions. wotdeger nd mccl require O(km) time to serch for pttern string of length m. mcch requires O(m) time. Since the pttern serch for wotdlzy is merged with the evlution of suffix tree nodes, one cnnot mke generl sttement out the running time of the serch. We lso considered suffix rrys, using the originl progrm code developed y Mner nd Myers [12] (p. 946). The suffix rry progrm, referred to y mmy, constructs suffix rry in O(nlog n) Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

11 LAZY SUFFIX TREES 1045 Tle II. Time nd spce requirement for serching 0.01n exct ptterns. wotdlzy mmy wotdeger mccl mcch mh File n k rtime stspce spce rtime rtime rtime rtime spce rtime i ook ook lice lcet plrn E.coli ile world Averge Construction time Serch time time. Serching is performed in O(m + log n) time. The suffix rry requires 5n ytes of spce. For the construction, dditionlly 4n ytes of working spce re required. Finlly, we lso considered the iterted ppliction of n on-line string serching lgorithm, our own implementtion of the Boyer Moore Horspool lgorithm [9], referred to y mh. The lgorithm tkes O(n + m) expected time per serch, nd uses O(m) working spce. We generted ptterns ccording to the following strtegy: for ech input string t of length n we rndomly smpled ρn sustrings s 1,s 2,...,s ρn of different lengths from t nd tested if they occurred in t or not. The proportionlity fctor ρ ws etween nd 1. The lengths of the sustrings were evenly distriuted over the intervl [10, 20]. For i [1,ρn], the progrms were clled to serch for pttern p i,wherep i = s i,ifi is even, nd p i is the reverse of s i, otherwise. The reson for reversing the pttern string s i in hlf of the cses is to simulte the fct tht pttern serch is often unsuccessful. Tle II shows the reltive running times for ρ = For wotdlzy we show the spce requirement for the suffix tree fter ll ρn pttern serches hve een performed (stspce), nd the totl spce requirement. For mmy, mh, nd the other three progrms the spce requirement is independent of ρ. Thus for the spce requirement of wotdeger, mccl, nd mcch see Tle I. The spce requirement of mh is mrginl, so it is omitted in Tle II. Except for the DNA sequence, wotdlzy is the fstest nd most spce efficient progrm for ρ = This is due to the fct tht the pttern serches only evlute prt of the suffix tree. Compring the stspce columns of Tles I nd II we cn estimte tht for ρ = 0.01 only out 10% of the suffix tree is evluted. Of course, for the other progrms, their tree construction time domintes the serch time. We cn lso deduce tht wotdeger performs pttern serches fster thn mccl. This cn e explined s follows: serching for ptterns mens tht for ech rnching node the list of successors is trversed, to find prticulr edge. However, in our new suffix tree representtion, the successors Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

12 1046 R. GIEGERICH, S. KURTZ AND J. STOYE 10 mmy mccl mcch wotdeger wotdlzy mh Figure 5. Averge reltive running time (in seconds) for different vlues of ρ [0, 0.002]. re found in consecutive positions of tle T. This mens smll numer of cche misses, nd hence the good performnce. It is remrkle tht wotdlzy is more spce efficient nd ten times fster thn mmy. Of course, the spce dvntge of wotdlzy is lost with lrger numer of pttern serches. In prticulr, for ρ 0.3 mmy is the most spce efficient progrm (dt not shown). Figures 5 nd 6 give generl overview, showing how ρ influences the running times. Figure 5 shows the verge reltive running time for ll progrms nd different choices of ρ for ρ Figure 6 shows the vergereltiverunningtime for ll progrmsexceptmh for ll vlues of ρ. We oservetht wotdlzy is the fstest progrm for ρ 0.05, nd wotdeger is the fstest progrm for ρ mh is fster thn wotdlzy only for ρ< Thus the index construction performed y wotdlzy lredy mortizes for very smll numer of pttern serches. We lso performed pttern serches on two very lrge iologicl sequence files sprot39 (relese 39 of the SWISSPROT protein dtse) nd yest (the completegenomeof the ker s yest Scchromyces cerevisie). The results re shown in Tle III. We left out mh, which would tke extremely long running times, nd we could lso not test mcch since this progrm cn only process files of length up to chrcters. Our min oservtions re: Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

13 LAZY SUFFIX TREES mmy mccl mcch wotdeger wotdlzy Figure 6. Averge reltive running time (in seconds) for different vlues of ρ [0, 1]. Tle III. Time nd spce requirements when serching 0.01n ptterns in very lrge files. wotdlzy wotdeger mccl mmy File n k rtime stspce spce rtime stspce spce rtime spce rtime spce sprot yest The reltive running time for ll progrms clerly increses. With ρ pproching 1, the slower suffix tree construction of wotdeger nd wotdlzy is compensted for y fster pttern serch procedure, so tht there is running time dvntge over mccl (dt not shown). The execution time for mmy rises shrply on our lrgest dt set. Here, for serching 0.01n ptterns for n = , the extr logrithmic fctor in the serch procedure of mmy ecomes visile. Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

14 1048 R. GIEGERICH, S. KURTZ AND J. STOYE CONCLUSION We hve developed efficient lzy nd eger implementtions of the write-only top-down suffix tree construction. These construct representtion of the suffix tree tht requires only 12n ytes of spce in the worst cse, plus 10n ytes of working spce. The totl spce requirement in prctice, though, is only 9.38n ytes on verge for collection of files of different types. The time nd spce overhed of the lzy implementtion is very smll. Our experiments show tht for serching mny exct ptterns in n input string, the lzy lgorithm is the most spce nd time efficient lgorithm for wide rnge of input vlues. A specil construction for repetitive strings We end this exposition y discussing specil cse of oth prcticl relevnce nd theoreticl interest. DNA, nd in prticulr humn DNA, lrgely consists of repetitive sequences. Estimtes of repetitive contents go up to 50% nd more [30] (p. 860). In the context of iosequence nlysis, it is unfortunte tht the wotd-lgorithm performs most poorly on repetitive strings. The worst cse running time of O(n 2 ) is chieved on strings of form n. The wotd-lgorithm performs similrly on Fioncci strings (see e.g. [31], exercise ), which re known for their highly repetitive structure. This is prticulrly unfortunte since Fioncci strings, with their itertive pttern tht grows repetitive strings vi recomintion of previously generted strings, my e seen s n idelized model of the effects of DNA cutting, repir, nd recomintion. Let us nlyze why the wotd-lgorithm performs poorly on repetitive strings. Long repets often led to suffix trees with long interior edges, nd significnt effort is spent on determining the longest common prefix of the suffixes contriuting to sutree elow such n edge. To e more precise, the lels of edges leving the nodes w nd w for nd w re often identicl, or the former is prefix of the ltter. This is ecuse the suffixes in R(w) re (prt from the constnt prefix ) suset of the suffixes in R(w). This suggests the following improvement of the wotd-lgorithm: when evluting node w, we cn immeditely skip the first m chrcter comprisons if m is the length of the corresponding lcp computed for w, nd only then continue with the norml computtion of the lcp. In more dvnced lgorithm, one cn pply this ide even recursively, ut it is not the im of this exposition to elorte on such detils. A prototype implementtion of this ide hs een tested, nd in fct, it seems to outperform ll other suffix tree constructions on Fioncci strings. While still eing top-down construction, the lgorithm sketched here is no longer write-only, s it ccesses node w when evluting w. Hence mny of the rguments given in fvor of the wotd-lgorithm, in prticulr loclity of memory reference, must e re-evluted for this vrint. But there re further spects to this ide, interesting in their own right. The dependence of w on w induces prtil ordering on the construction of nodes. In the extreme, the suffix tree is constructed in order of incresing suffix length, nd we hve reverse online construction in perfect nlogy to Weiner s lndmrk lgorithm of 1973 [4]. This surprising oservtion my provide new insights into the reltionships etween different suffix tree constructions. ACKNOWLEDGEMENTS We thnk Gene Myers for providing copy of his suffix rry code. S. Kurtz ws supported y grnt Ku 1257/1-1 from the Deutsche Forschungsgemeinschft. Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

15 LAZY SUFFIX TREES 1049 REFERENCES 1. Apostolico A. The myrid virtues of suword trees. Comintoril Algorithms on Words. Springer: Berlin, 1985; Gusfield D. Algorithms on Strings, Trees, nd Sequences. Cmridge University Press: Cmridge, Skien SS. Who is interested in lgorithms nd why? Lessons from the Stony Brook Algorithms Repository. Proceedings 2nd Workshop on Algorithm Engineering. IEEE Press: New York, 1998; we98/proceedings/. 4. Weiner P. Liner pttern mtching lgorithms. Proceedings 14th Annul IEEE Symposium on Switching nd Automt Theory, 1973; McCreight EM. A spce-economicl suffix tree construction lgorithm. Journl of the ACM 1976; 23(2): Ukkonen E. On-line construction of suffix-trees. Algorithmic 1995; 14(3): Giegerich R, Kurtz S. From Ukkonen to McCreight nd Weiner: unifying view of liner-time suffix tree constructions. Algorithmic 1997; 19(3): Giegerich R, Kurtz S. A comprison of impertive nd purely functionl suffix tree constructions. Science of Computer Progrmming 1995; 25(2 3): Horspool RN. Prcticl fst serching in strings. Softwre Prctice nd Experience 1980; 10(6): Kurtz S. Reducing the spce requirement of suffix trees. Softwre Prctice nd Experience 1999; 29(13): Kurtz S, Choudhuri JV, Ohleusch E, Schleiermcher C, Stoye J, Giegerich R. REPuter: the mnifold pplictions of repet nlysis on genomic scle. Nucleic Acids Reserch 2001; 29(22): Mner U, Myers EW. Suffix rrys: A new method for on-line string serches. SIAM Journl of Computing 1993; 22(5): Frch M. Optiml suffix tree construction with lrge lphets. Proceedings 38th Annul Symposium on the Foundtions of Computer Science. IEEE Press: New York, 1997; Dorohoncenu B, Nevill-Mnning CG. Protein clssifiction using suffix trees. Proceedings of the Eighth Interntionl Conference on Intelligent Systems for Moleculr Biology. AAAI Press: Menlo Prk, CA, 2000; Hunt E, Atkinson MP, Irving RW. A dtse index to lrge iologicl sequences. Proceedings of the 27th Interntionl Conference on Very Lrge Dtses. Morgn Kufmnn: Sn Frncisco, 2001; Clrk DR, Munro JI. Efficient suffix trees on secondry storge. Proceedings of the Seventh Annul ACM SIAM Symposium on Discrete Algorithms. ACM Press: New York, 1996; Munro JI, Rmn V, Ro SS. Spce efficient suffix trees. Journl of Algorithms 2001; 39(2): Grossi R, Vitter JS. Compressed suffix rrys nd suffix trees with pplictions to text indexing nd string mtching. Proceedings of the 32th Annul ACM Symposium on the Theory of Computing. ACM Press: New York, 2000; Andersson A, Nilsson S. Efficient implementtion of suffix trees. Softwre Prctice nd Experience 1995; 25(2): Irving RW, Love L. Suffix inry serch trees nd suffix rrys. Reserch Report TR , Deprtment of Computer Science, University of Glsgow, Kärkkäinen J. Suffix cctus: A cross etween suffix tree nd suffix rry. Proceedings of the 6th Annul Symposium on Comintoril Pttern Mtching (Lecture Notes in Computer Science, vol. 937). Springer: Berlin, 1995; Alluzen C, Crochemore M, Rffinot M. Fctor orcle: new structure for pttern mtching. Proceedings of the 26th Annul Conference on Current Trends in Theory nd Prctice of Informtics (Lecture Notes in Computer Science, vol. 1725). Springer: Berlin, 1999; Colussi L, De Col A. A time nd spce efficient dt structure for string serching on lrge texts. Informtion Processing Letters 1996; 58(5): Giegerich R, Kurtz S, Stoye J. Efficient implementtion of lzy suffix trees. Proceedings of the Third Workshop on Algorithmic Engineering (Lecture Notes in Computer Science, vol. 1668). Springer: Berlin, 1999; Mrtinez HM. An efficient method for finding repets in moleculr sequences. Nucleic Acids Reserch 1983; 11(13): Gusfield D. An increment-y-one pproch to suffix rrys nd trees. Report CSE-90-39, Computer Science Division, University of Cliforni, Dvis, Apostolico A, Iliopoulos C, Lndu GM, Schieer B, Vishkin U. Prllel construction of suffix tree with pplictions. Algorithmic 1988; 3: Hopcroft J. An O(nlog n) lgorithm for minimizing sttes in finite utomton. The Theory of Mchines nd Computtions, Kohvi Z, Pz A (eds.). Acdemic Press: New York, 1971; Cormen TH, Leiserson CE, Rivest RL. Introduction to Algorithms (2nd edn). MIT Press: Cmridge, MA, Interntionl Humn Genome Sequencing Consortium. Initil sequencing nd nlysis of the humn genome. Nture 2001; 409: Knuth DE. The Art of Computer Progrmming. Fundmentl Algorithms (3rd edn), vol. 1. Addison-Wesley: Reding, MA, Copyright c 2003 John Wiley & Sons, Ltd. Softw. Prct. Exper. 2003; 33:

What are suffix trees?

What are suffix trees? Suffix Trees 1 Wht re suffix trees? Allow lgorithm designers to store very lrge mount of informtion out strings while still keeping within liner spce Allow users to serch for new strings in the originl