Suffix Arrays on Words


Paolo Ferragina and Johannes Fischer

Dipartimento di Informatica, University of Pisa
Institut für Informatik, Ludwig-Maximilians-Universität München

Abstract. Surprisingly enough, it is not yet known how to directly build a suffix array that indexes just the k positions at word boundaries of a text T[1,n], taking O(n) time and O(k) space in addition to T. We propose a class-note solution to this problem that achieves these optimal time and space bounds. Word-based versions of indexes achieving the same time/space bounds were already known for suffix trees [, ] and (compact) DAWGs [,]. Our solution inherits the simplicity and efficiency of suffix arrays, with respect to such other word-indexes, and thus it foresees applications in word-based approaches to data compression [] and computational linguistics [6]. To support this, we have run a large set of experiments showing that word-based suffix arrays may be constructed twice as fast as their full-text counterparts, and with a working space as low as 0%. The space reduction of the final word-based suffix array also affects query time (i.e. fewer random-access binary-search steps!), being faster by a factor of up to.

The first author has been partially supported by the Italian MIUR grant Italy-Israel FIRB "Pattern Discovery Algorithms in Discrete Structures, with Applications to Bioinformatics", and by the Yahoo! Research grant on "Data compression and indexing in hierarchical memories". The second author has been partially funded by the German Research Foundation (DFG, Bioinformatics Initiative).

B. Ma and K. Zhang (Eds.): CPM 007, LNCS 80, pp. 8 9, 007. © Springer-Verlag Berlin Heidelberg 007

1 Introduction

One of the most important tasks in classical string matching is to construct an index on the input data in order to answer future queries faster. Well-known examples of such indexes include suffix trees, word graphs and suffix arrays (see e.g. [7]). Despite the extensive research that has been done in the last three or four decades, this topic has recently re-gained popularity with the rise of compressed indexes [8] and new applications such as data compression, text mining and computational linguistics.

However, all of the indexes mentioned so far are full-text indexes, in the sense that they index any position in the text and thus allow searching for occurrences of patterns starting at arbitrary text positions. In many situations, deploying the full-text feature might be like using a cannon to shoot a fly, with undesired negative impacts on both query time and space usage. For example, in European languages, words are separated by special symbols such as spaces or punctuation signs; in a dictionary of URLs, words are separated by dots and slashes. In both cases, the results found by a word-based search with a full-text index have to be filtered by discarding those results that do not occur at word boundaries. Possibly a time-costly step! Additionally, indexing every text position would affect the overall space occupancy of the index, with an increase in the space complexity which could be estimated in practice as a multiplicative factor 6, given the average word length in linguistic texts. Of course, the use of word-based indexes is not limited to pattern searches, as they have been successfully used in many other contexts, like data compression [] and computational linguistics [6], just to cite a few.

Surprisingly enough, word-based indexes have been introduced only recently in the string-matching literature [], although they were very famous in Information Retrieval many years before [9]. The basic idea underlying their design consists of storing just a subset of the text positions, namely the ones that correspond to word beginnings. As we observed above, it is easy to construct such indexes if O(n) additional space is allowed at construction time (n being the text size): simply build the index for every position in the text, and then discard those positions which do not correspond to word beginnings. Unfortunately, such a simple (and common, among practitioners!) approach is not space optimal. In fact O(n) construction time cannot be improved, because this is the time needed to scan the input text. But O(n) additional working space (other than the indexed text and the final suffix array) seems too much because the final index will need O(k) space, where k is the number of words in the indexed text. This is an interesting issue, not only theoretically, because "... we have seen many papers in which the index simply is, without discussion of how it was created. But for an indexing scheme to be useful it must be possible for the index to be constructed in a reasonable amount of time." [0] And in fact, the working-space occupancy of construction algorithms for full-text indexes is still a primary concern and an active field of research [].

The first result addressing this issue in the word-based indexing realm is due to Andersson et al. [] who showed that the word suffix tree can be constructed in O(n) expected time and O(k) working space. In 006, Inenaga and Takeda [] improved this result by providing an on-line algorithm which runs in O(n) time in the worst case and O(k) space in addition to the indexed text. They also gave two alternative indexing structures [, ] which are generalizations of Directed Acyclic Word Graphs (DAWGs) or compact DAWGs, respectively. All three construction methods are variations of the construction algorithms for (usual) suffix trees [], DAWGs [] and CDAWGs [], respectively. The only missing item in this quartet is a word-based analog of the suffix array, a gap which we close in this paper. We emphasize the fact that, as is the case with full-text suffix arrays (see e.g. []), we get a class-note solution which is simple and practically effective, thus surpassing the previous ones by all means.

A comment is in order before detailing our contribution. A more general problem than word-based string matching is that of sparse string matching, where the set of points to be indexed is given as an arbitrary subset of all n text positions, not necessarily coinciding with the word boundaries. Although the authors of [,, ] claim that their indexes can solve this task as well, they did not take into account an exponential factor [6]. To the best of our knowledge, this problem is still open. The only step in this direction has been made by Kärkkäinen and Ukkonen [7] who considered the special case where the indexed positions are evenly spaced.

1.1 Our Contributions

We define a new data structure called the word(-based) suffix array and show how it can be constructed directly in optimal time and space; i.e., without first constructing the sparse suffix tree. The size of the structure is k RAM words, and at no point during its construction is more than O(k) space (in addition to the text) needed. This is interesting in theory because we could compress the text by means of [8] and then build the word-based index in space O(k) + nH_h + o(n) bits and O(n) time, simultaneously over all h = o(log n), where H_h is the h-th order empirical entropy of the indexed text (the alphabet is assumed to have constant size). If the number k of indexed words is relatively small, namely k = o(n/ log n), this index would take the same space as the best compressed indexes (cf. [8]), but it would need less space to be constructed.

As far as pattern-queries are concerned, it is easy to adapt the classical pattern searches over full-text suffix arrays to word-based suffix arrays. For patterns of length m, we then easily show that counting queries take O(m log k) time, or O(m + log k) if an additional array of size k is used. Note that this reduces the number of costly binary search steps by O(log(n/k)) compared with full-text suffix arrays. Reporting queries take O(occ) additional time, where occ is the number of word occurrences reported. We then show that the addition of another data structure, similar to the Enhanced Suffix Array [9], lowers these time bounds to O(m) and O(m + occ), respectively.
In order to highlight the simplicity, and hence practicality, of our word-based suffix array we test it over various datasets, which cover some typical applications of word-based indexes: natural and artificial language, structured data and prefix-search on hosts/domains in URLs. Construction is twice as fast as state-of-the-art algorithms applied to full-text suffix arrays, and the working space is lowered by 0%. As query time is faster by up to factor without post-filtering the word-aligned occurrences, and up to orders of magnitude including post-filtering, we exclude the idea of using a full-text suffix array for finding word-aligned occurrences already at this point.

2 Definitions

Throughout this article we let T be a text of length n over a constant-sized alphabet Σ. We further assume that certain characters from a constant-sized subset W of the alphabet act as word boundaries, thus dividing T in a natural sense into k tokens, hereafter called words. Now let I be the set of positions where new words start: 1 ∈ I and i ∈ I \ {1} ⟺ T_{i-1} ∈ W. (The first position of the text is always taken to be the beginning of a new word.) Similar to [] we define the set of all suffixes of T starting at word boundaries as Suffix_I(T) = {T_{i..n} : i ∈ I}. Then the word suffix array A[1..k] is a permutation of I such that T_{A[i-1]..n} < T_{A[i]..n} for all 1 < i ≤ k; i.e., A represents the lexicographic order of all suffixes in Suffix_I(T).

Definition 1 (Word Aligned String Matching). For a given pattern P of length m let O_P ⊆ I be the set of word-aligned positions where P occurs in T: i ∈ O_P iff T_{i..n} is prefixed by P and i ∈ I. Then the tasks of word aligned string matching are (1) to answer whether or not O_P is empty (decision query), (2) to return the size of O_P (counting query), and (3) to enumerate the members of O_P in some order (reporting query).

3 Optimal Construction of the Word Suffix Array

This section describes the optimal O(n) time and O(k) space algorithm to construct the word suffix array. For simplicity, we describe the algorithm with only one word separator (namely $). The reader should note, however, that all steps are valid and can be computed in the same time bounds if we have more than one (but constantly many) word separators. We also assume that the set I of positions to be indexed is implemented as an increasingly sorted array. As a running example for the algorithm we use the text T = , so I = [,, 6, 9,,, 8,,, 7].

1. The goal of this step is to establish a coarse sorting of the suffixes from Suffix_I(T). In particular, we want to sort these suffixes using their first word as the sort key. To do so, initialize the array A[1..k] = I. Then perform a radix-sort of the elements in A: at each level l ≥ 0, bucket-sort the array A using T_{A[i]+l} as the sort key for A[i]. Stop the recursion when a bucket contains only one element, or when a bucket consists only of suffixes starting with w$ for some w ∈ (Σ \ {$})*. Since each character from T is involved in at most one comparison, this step takes O(n) time. See Fig. 1 for an example.

2. Construct a new text T' = b(I[1]) b(I[2]) ... b(I[k]), where b(I[i]) is the bucket-number (after step 1) of suffix T_{I[i]..n} ∈ Suffix_I(T). In our example, T' = . (We use boldface letters to emphasize the fact that we are using a new alphabet.) This step can clearly be implemented in O(k) time.

3. We now build the (full-text) suffix array SA for T'. Because the linear-time construction algorithms for suffix arrays (e.g., []) work for integer alphabets too, we can employ any of them to take O(k) time. See Fig. 2 for an example. In this figure, we have attached to each position in the new text T' the corresponding position in T as a superscript (i.e., the array I), which will be useful in the next step.

4. This step derives the word suffix array A from SA. Scan SA from left to right and write the corresponding suffix to A: A[i] = I[SA[i]]. This step clearly takes O(k) time. See Fig. 3 for an example.

Fig. 1. The initial radix-sort in step 1.

Fig. 2. The new text T' and its (full-text) suffix array SA.

Fig. 3. The final word suffix array.

     1  LCP[1] ← -1, h ← 0
     2  for i ← 1,...,k do
     3      p ← A^{-1}[i], h ← max{0, h - A[p]}
     4      if p > 1 then
     5          while T_{A[p]+h} = T_{A[p-1]+h} do
     6              h ← h + 1
     7          end
     8          LCP[p] ← h
     9      end
    10      h ← h + A[p]
    11  end

Fig. 4. O(n)-time longest common prefix computation using O(k) space (adapted from [0]).

Theorem 1. Given a text T of length n consisting of k words drawn from a constant-sized alphabet Σ, the word suffix array for T can be constructed in optimal O(n) time and O(k) extra space.

Proof. Time and space bounds have already been discussed in the description of the algorithm; it only remains to prove the correctness. This means that we have to prove T_{A[i-1]..n} < T_{A[i]..n} for all 1 < i ≤ k after step 4. Note that after step 1 we have T_{A[i-1]..x} ≤ T_{A[i]..y}, where x and y are defined so that T_x and T_y are the first $ after T_{A[i-1]} and T_{A[i]}, respectively. We now show that steps 2-4 refine this ordering for buckets of size greater than one. In other words, we wish to show that in step 4, buckets [l : r] sharing a common prefix T_{A[i]..x}, with T_x being the first $ for all l ≤ i ≤ r, are sorted using the lexicographic order of T_{x+1..n} as the sort key. But this is simple: because the newly constructed text T' from step 2 respects the order of T_{A[i]..x}, and because step 3 establishes the correct lexicographic order of T', the I[SA[i]]'s are the correct sort keys for step 4.
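The four construction steps can be illustrated with a small sketch. This is only a minimal illustration under stated assumptions, not the authors' C++ implementation: indices are 0-based, the function name `word_suffix_array` and the parameter `seps` are hypothetical, step 1 uses a comparison sort over first words instead of the radix sort, and step 3 uses a naive suffix sort of T' as a stand-in for a linear-time suffix array construction, so the O(n) time bound is not achieved here.

```python
def word_suffix_array(text, seps="$"):
    """Return (I, A): word-start positions and the word suffix array (0-based).

    Illustrative sketch only; it follows steps 1-4 of the construction but
    replaces the radix sort and the linear-time suffix sorter by Python's
    built-in sort.
    """
    n = len(text)
    # I: position 0 plus every position that directly follows a separator.
    I = [i for i in range(n) if i == 0 or text[i - 1] in seps]

    def first_word(i):
        # First word starting at i, including its terminating separator.
        j = i
        while j < n and text[j] not in seps:
            j += 1
        return text[i:j + 1]

    # Step 1: coarse sort, one bucket per distinct first word.
    keys = [first_word(i) for i in I]
    bucket = {w: rank for rank, w in enumerate(sorted(set(keys)))}

    # Step 2: new text T' over the integer alphabet of bucket numbers.
    t_prime = [bucket[w] for w in keys]

    # Step 3: suffix array of T' (naive stand-in for a linear-time method).
    sa = sorted(range(len(I)), key=lambda j: t_prime[j:])

    # Step 4: map positions in T' back to positions in T.
    A = [I[j] for j in sa]
    return I, A
```

For instance, calling `word_suffix_array("to be or not to be ", seps=" ")` indexes only the six word beginnings rather than all 19 text positions.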

Table 1. Different methods for retrieving all occ occurrences of a pattern at word boundaries. The full-text suffix array would have the same time- and space-bounds, with k substituted by n >> k, and occ by occ' >> occ, where occ' is the number of not necessarily word-aligned occurrences of the pattern.

    method         space usage (words)    time bounds
    bin-naive      k                      O(m log k + occ)
    bin-improved   (1 + C)k               O((m - log_{|Σ|}(Ck)) log k + occ)
    bin-lcp        2k                     O(m + log k + occ)
    esa-search     2k + O(k/ log k)       O(m|Σ| + occ)

To further reduce the required space we can think of compressing T before applying the above construction algorithm, by adopting an entropy-bounded storage scheme [8] which allows constant-time access to any of its O(log n) contiguous bits. This implies the following:

Corollary 1. The word suffix array can be built in k log n + nH_h(T) + o(n) bits and O(n) time, where H_h(T) is the h-th order empirical entropy of T. For any k = o(n/ log n), the space needed to build and store this data structure is nH_h + o(n) bits, simultaneously over all h = o(log n).

This result is interesting because it says that, in the case of a tokenized text with long words on average (e.g. a dictionary of URLs), the word-based suffix array takes the same space as the best compressed indexes (cf. [8]), but it would need less space to be constructed.

4 Searching in the Word Suffix Array

We now consider how to search for the occ word-aligned occurrences of a pattern P[1,m] in the text T[1,n]. As searching the word suffix array can be done with the same methods as in the full-text suffix array we keep the discussion short (see also Table 1); the purpose of this section is the completeness of exposition, and to prepare for the experiments in the following section. Here we also introduce the notion of the word-based LCP-array and show that it can be computed in O(n) time and O(k) space. We emphasize that enhancing the word suffix array with the LCP-array actually yields more functionality than just improved string-matching performance.
As an example, with the LCP-array it is possible to simulate bottom-up traversals of the corresponding word suffix tree, and augmenting this further allows us also to simulate top-down traversals [9]. Additionally, in the vein of [, Section ..], we can derive the word suffix tree from the arrays LCP and A. This yields a simple, space-efficient and memory-friendly (in the sense that nodes tend to be stored in the vicinity of their predecessor/children) alternative to the algorithm in [].

Searching in O(m log k) time. Because A is sorted lexicographically, it can be binary-searched in a similar manner to the original search-algorithm from Manber and Myers []. We can also apply the two heuristics proposed there to speed up the search in practice (though not in theory): the first builds an additional array of size |Σ|^K (K = log_{|Σ|}(Ck) for some C) to narrow down the initial search interval in A, and the second one reduces the number of character comparisons by remembering the number of matching characters from T and P that have been seen so far.

Searching in O(m + log k) time. Like in the original article [] the idea is to pre-compute the longest common prefixes of T_{A[(l+r)/2]..n} with both T_{A[l]..n} and T_{A[r]..n} for all possible search intervals [l : r]. Footnote 6 in [] actually shows that only one of these values needs to be stored, so the additional space needed is one array of size k. Because both the precomputation and the search algorithm are unchanged, we refer the reader to [] for a complete description of the algorithm.

Searching in O(m|Σ|) time. While the previous two searching algorithms have searching time that is independent of the alphabet size, we show in this section how to locate the search interval of P within A in O(m|Σ|) time. We note that for a constant alphabet this actually yields optimal O(m) counting time and optimal O(m + occ) reporting time.

Define the LCP-array LCP[1..k] as follows: LCP[1] = -1 and for i > 1, LCP[i] is the length of the longest common prefix of the suffixes T_{A[i-1]..n} and T_{A[i]..n}. We will now show that this LCP-table can be computed in O(n) time in the order of the inverse word suffix array A^{-1}, which is defined as A[A^{-1}[i]] = I[i]; i.e., A^{-1}[i] tells us where the i'th-longest suffix among all indexed suffixes from T can be found in A. A^{-1} can be computed in O(k) time as a by-product of the construction algorithm (Section 3). In our example, A^{-1} = [7,,,, 8, 0, 6,,, 9]. Figure 4 shows how to compute the LCP-array in O(n) time. It is actually a generalization of the O(n)-algorithm for lcp-computation in (full-text) suffix arrays [0].
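The LCP computation described in this section can be sketched in a few lines. The following is a minimal illustration under assumptions of our own: indices are 0-based (so the sentinel is stored in `lcp[0]`), the function name `word_lcp` is hypothetical, the inverse array is rebuilt from a dictionary rather than obtained as a by-product of the construction, and explicit bounds checks replace an end-of-text sentinel.

```python
def word_lcp(text, I, A):
    """LCP array for a word suffix array A over word starts I (0-based).

    lcp[p] is the length of the longest common prefix of the suffixes
    starting at A[p-1] and A[p]; lcp[0] is set to -1.
    """
    n, k = len(text), len(A)
    # inv[i] = position in A of the suffix that starts at I[i].
    pos_in_A = {start: p for p, start in enumerate(A)}
    inv = [pos_in_A[start] for start in I]

    lcp = [-1] * k
    h = 0
    for i in range(k):              # visit suffixes in text order
        p = inv[i]
        # h held "matched length + previous start"; subtracting the new
        # start lowers the bound by the distance between the two starts.
        h = max(0, h - A[p])
        if p > 0:
            a, b = A[p], A[p - 1]
            while a + h < n and b + h < n and text[a + h] == text[b + h]:
                h += 1
            lcp[p] = h
        h += A[p]
    return lcp
```

Because `I` is visited in increasing text order, the characters already matched for one word suffix are not compared again from scratch for the next one, which is where the linear bound in the text comes from.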
The difference from [0] is that the original algorithm assumes that when going from position p (here A[p] = i) to position p' = A^{-1}[i + 1] (hence A[p'] = i + 1), the difference in length between T_{A[p]..n} and T_{A[p']..n} is exactly one, whereas in our case this difference may be larger, namely A[p'] - A[p]. This means that when going from position p to p' the lcp can decrease by at most A[p'] - A[p] (instead of 1); we account for this fact by adding A[p] to h (line 10) and subtracting A[p'] (i.e. A[p] for the new p) in the next iteration of the loop (line 3). At any iteration, variable h holds the length of the prefix that T_{A[p]..n} and T_{A[p-1]..n} have in common. Since each text character is involved in a constant number of comparisons, the O(n) time bound easily follows.

Now in order to achieve O(m|Σ|) matching time, use the RMQ-based variant of the Enhanced Suffix Array [9] proposed by []. This requires o(k) additional space and can be computed in O(k) time.

5 Experimental Results

The aim is to show the practicality of our method. We implemented the word suffix array in C++ ( {}fischer/wordsa.tgz). Instead of using a linear time algorithm for the construction of suffix arrays, we opted for the method from Larsson and Sadakane []. We implemented the search strategies bin-naive, bin-improved, bin-lcp and esa-search from Table 1. Unfortunately, we could not compare to the other word-based indexes [,,] because there are no publicly available implementations. For bin-improved we chose C = /, so the index occupies .k memory words (apart from T, which takes n bytes). For the RMQ-preprocessing of the esa-search we used the method from Alstrup et al. [] which is fast in practice, while still being relatively space-conscious (about .k words). With the LCP-array and the text this makes a total of .k words.

We tested our algorithms on the files English, XML, and sources from the Pizza&Chili site [6] (some of them truncated), plus one file of URLs from the .eu domain [7]. To test the search algorithms on a small alphabet, we also generated an artificial dataset by taking words of random length (uniformly from 0 to 0) and letters uniformly from Σ = {, }. See Table 2 for the characteristics of the evaluated datasets.

Table 2. Our test files and their characteristics. In the word separator column, LF stands for line feed, SPC for space and TAB for tabulator.

    Columns: dataset, size (MB), |Σ|, word separators used, #words, #different words, avg. length
    English   9 LF,SPC,- 67,868,08,0,8.7
    XML       8 97 SPC, /, <, >,,67,,7,
    sources   0 0 [0 in total],0,6,06,86.98
    URLs      LF, /,6,80,809.0
    random    0 SPC 0,000,00 9,9,9 6.0

Table 3 shows the space consumption and the preprocessing times for the four different search methods. Concerning the space, the first four columns under space consumption denote the space (in MB) of the final index (including the text) for the different search algorithms it can subsequently support. The column labeled peak gives the peak memory usage at construction time for search algorithms ; the peak usage for search algorithm is the same as that of the final index.
Concerning the construction time, most of the preprocessing is needed for the construction of the pure word suffix array (method 1, bin-naive); the times for methods 2-4 are only slightly longer than that for method 1. To see the advantage of our method over the naive algorithm which prunes the full-text suffix array to obtain the word suffix array, Table 4 shows the construction times and peak space consumption of two state-of-the-art algorithms for constructing (full-text) suffix arrays, MSufSort-.0 [8] and deep-shallow [9]. Note that the figures given in Table 4 are pure construction times for the full-text suffix array; pruning this is included in neither time nor space.

First look at the peak space consumption in Table 4. MSufSort needs about 7n bytes if the input text cannot be overwritten (it therefore failed for the largest dataset), and deep-shallow needs about n bytes. These two columns should be compared with the column labeled peak in Table 3, because this column gives the space needed to construct the pure word suffix array (i.e., k + n bytes in our implementation). For all but one data set our method uses significantly less space than both MSufSort (0.9 7.%) and deep-shallow (6. 8.9%). For the construction time, compare the last two columns in Table 4 with the preprocessing time for method 1 in Table 3. Again, our method is almost always faster ( % and % better than deep-shallow and MSufSort, respectively); the difference would be even larger if we did include the time needed for pruning the full-text suffix array.

Table 3. Space consumption (including the text) and preprocessing times for the different search algorithms: bin-naive (1), bin-improved (2), bin-lcp (3), esa-search (4).

    Columns: dataset; space consumption (MB) for methods 1-4 and peak; preprocessing times (in sec) for methods 1-4
    English   ,96.0,
    XML       ,
    sources
    URLs
    random

Table 4. Space consumption (including the text) and construction times for two different state-of-the-art methods to construct (full-text) suffix arrays.

    Columns: dataset; peak space consumption (MB) for MSufSort-.0 and deep-shallow; construction time for MSufSort-.0 and deep-shallow
    English   ,
    XML       ,976.9,
    sources   ,
    URLs
    random    ,

We finally tested the different search strategies. In particular, we posed 00,000 counting queries to each index (i.e., determining the interval of pattern P in A) for patterns of length , 0, 00, ,000, and 0,000. The results can be seen in Fig. 5. We differentiated between random patterns (left hand side of Fig. 5) and occurring patterns (right hand side). There are several interesting points to note. First, the improved O(m log k)-algorithm is almost always the fastest. Second, the O(m|Σ|)-algorithm is not competitive with the other methods, apart from very long patterns or a very small alphabet (Subfig. (h)). And third, the query time for the methods based on binary search can actually be higher for short patterns than for long patterns (Fig. (a)-(b)). This is the effect of narrowing down the search for the right border when searching for the left one. We omit the results for the sources-dataset as they strongly resemble those for the URL-dataset.
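A counting query of the kind measured here determines the interval of P in A. A minimal sketch of the plain binary-search variant (the O(m log k) method without either speed-up heuristic) is given below; the function name `pattern_interval` is hypothetical, indices are 0-based, and the returned interval is half-open, which are choices of this sketch rather than of the paper.

```python
def pattern_interval(text, A, pattern):
    """Return (left, right) such that A[left:right] lists exactly the
    word-aligned positions where `pattern` occurs as a prefix.

    Two binary searches over the word suffix array A, each comparing at
    most len(pattern) characters per step.
    """
    m = len(pattern)

    def prefix(start):
        # First m characters of the suffix starting at `start`.
        return text[start:start + m]

    # Left border: first suffix whose m-prefix is >= pattern.
    lo, hi = 0, len(A)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(A[mid]) < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # Right border: first suffix whose m-prefix is > pattern.
    # The search can start at `left`, since everything before it is smaller.
    hi = len(A)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(A[mid]) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo
```

The count is `right - left`, and a reporting query simply reads `A[left:right]`. Because only word beginnings are stored in A, an occurrence in the middle of a word is never returned, so no post-filtering is needed.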

Fig. 5. Practical performance of the algorithms from Section 4 (average over 00,000 counting queries; time for index construction is not included). Axes have log-scale; the axis label is pattern length. Panels: (a) English, random patterns. (b) English, occurring patterns. (c) XML, random patterns. (d) XML, occurring patterns. (e) URLs, random patterns. (f) URLs, occurring patterns. (g) Random words, random patterns. (h) Random words, occurring patterns.

6 Conclusions

We have seen a space- and time-optimal algorithm to construct suffix arrays on words. The most striking property was the simplicity of our approach, reflected in the good practical performance. This supersedes all the other known approaches based on suffix trees, DAWG and compact DAWG.

As future research issues we point out the following two. In a similar manner as we compressed T (Corollary 1), one could compress the word-based suffix array A, probably by resorting to the ideas on the word-based Burrows-Wheeler Transform [] and alphabet-friendly compressed indexes [8]. This would have an impact not only in terms of space occupancy, but also on the search performance of those indexes because they execute O(1) random memory-accesses per searched/scanned character. With a word-based index this could be turned into O(1) random memory-accesses per searched/scanned word, with a significant practical speed-up in the case of very large texts possibly residing on disk. The second research issue regards the sparse string-matching problem in which the set of points to be indexed is given as an arbitrary set, not necessarily coinciding with word boundaries. As pointed out in the introduction, this problem is still open, though being relevant for texts such as biological sequences where natural word boundaries do not occur.

References

1. Andersson, A., Larsson, N.J., Swanson, K.: Suffix Trees on Words. Algorithmica (), 6 60 (999)
2. Inenaga, S., Takeda, M.: On-Line Linear-Time Construction of Word Suffix Trees. In: Lewenstein, M., Valiente, G. (eds.) CPM 006. LNCS, vol. 009, pp. Springer, Heidelberg (006)
3. Inenaga, S., Takeda, M.: Sparse Directed Acyclic Word Graphs. In: Crestani, F., Ferragina, P., Sanderson, M. (eds.) SPIRE 006. LNCS, vol. 09, pp. Springer, Heidelberg (006)
4. Inenaga, S., Takeda, M.: Sparse compact directed acyclic word graphs. In: Stringology, pp. 97 (006)
5. Yugo, R., Isal, K., Moffat, A.: Word-based block-sorting text compression. In: Australasian Conference on Computer Science, pp. IEEE Press, New York (00)
6. Yamamoto, M., Church, K.W.: Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics 7(), 0 (00)
7. Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Cambridge (997)
8. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys (to appear). Preliminary version available at gnavarro/ps/acmcs06.ps.gz
9. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, nd edn. Morgan Kaufmann, San Francisco (999)
10. Zobel, J., Moffat, A., Ramamohanarao, K.: Guidelines for Presentation and Comparison of Indexing Techniques. SIGMOD Record (), 0 (996)
11. Hon, W.K., Sadakane, K., Sung, W.K.: Breaking a Time-and-Space Barrier in Constructing Full-Text Indices. In: Proc. FOCS, pp. 60. IEEE Computer Society, Los Alamitos (00)
12. Ukkonen, E.: On-line Construction of Suffix Trees. Algorithmica (), 9 60 (99)
13. Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T., Seiferas, J.I.: The Smallest Automaton Recognizing the Subwords of a Text. Theor. Comput. Sci. 0, (98)
14. Inenaga, S., Hoshino, H., Shinohara, A., Takeda, M., Arikawa, S., Mauri, G., Pavesi, G.: On-line construction of compact directed acyclic word graphs. Discrete Applied Mathematics 6(), 6 79 (00)
15. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear Work Suffix Array Construction. J. ACM (6), 9 (006)
16. Inenaga, S.: personal communication (December 006)
17. Kärkkäinen, J., Ukkonen, E.: Sparse Suffix Trees. In: Cai, J.-Y., Wong, C.K. (eds.) COCOON 996. LNCS, vol. 090, pp. Springer, Heidelberg (996)
18. Ferragina, P., Venturini, R.: A Simple Storage Scheme for Strings Achieving Entropy Bounds. Theoretical Computer Science 7(), (007)
19. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing Suffix Trees with Enhanced Suffix Arrays. J. Discrete Algorithms (), 86 (00)
20. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In: Amir, A., Landau, G.M. (eds.) CPM 00. LNCS, vol. 089, pp. Springer, Heidelberg (00)
21. Aluru, S. (ed.): Handbook of Computational Molecular Biology. Chapman & Hall/CRC, Sydney, Australia (006)
22. Manber, U., Myers, E.W.: Suffix Arrays: A New Method for On-Line String Searches. SIAM J. Comput. (), 9 98 (99)
23. Fischer, J., Heun, V.: A new succinct representation of RMQ-information and improvements in the enhanced suffix array. In: Proc. ESCAPE. LNCS (to appear)
24. Larsson, N.J., Sadakane, K.: Faster suffix sorting. Technical Report LU-CS-TR:99-, LUNDFD6/(NFCS-0)/ 0/(999), Department of Computer Science, Lund University, Sweden (May 999)
25. Alstrup, S., Gavoille, C., Kaplan, H., Rauhe, T.: Nearest Common Ancestors: A Survey and a New Distributed Algorithm. In: Proc. SPAA, pp. ACM Press, New York (00)
26. Ferragina, P., Navarro, G.: The Pizza & Chili Corpus. Available at
27. Università degli Studi di Milano, Laboratory for Web Algorithmics: URLs from the .eu domain. Available at
28. Maniscalco, M.A., Puglisi, S.J.: An efficient, versatile approach to suffix sorting. ACM Journal of Experimental Algorithmics (to appear). Available at
29. Manzini, G., Ferragina, P.: Engineering a lightweight suffix array construction algorithm. Algorithmica, 0(), 0 (00). Available at manzini/lightweight


More information

Intermediate Information Structures

Intermediate Information Structures CPSC 335 Intermedite Informtion Structures LECTURE 13 Suffix Trees Jon Rokne Computer Science University of Clgry Cnd Modified from CMSC 423 - Todd Trengen UMD upd Preprocessing Strings We will look t

More information

The Greedy Method. The Greedy Method

The Greedy Method. The Greedy Method Lists nd Itertors /8/26 Presenttion for use with the textook, Algorithm Design nd Applictions, y M. T. Goodrich nd R. Tmssi, Wiley, 25 The Greedy Method The Greedy Method The greedy method is generl lgorithm

More information

Applied Databases. Sebastian Maneth. Lecture 13 Online Pattern Matching on Strings. University of Edinburgh - February 29th, 2016

Applied Databases. Sebastian Maneth. Lecture 13 Online Pattern Matching on Strings. University of Edinburgh - February 29th, 2016 Applied Dtses Lecture 13 Online Pttern Mtching on Strings Sestin Mneth University of Edinurgh - Ferury 29th, 2016 2 Outline 1. Nive Method 2. Automton Method 3. Knuth-Morris-Prtt Algorithm 4. Boyer-Moore

More information

Lecture 10 Evolutionary Computation: Evolution strategies and genetic programming

Lecture 10 Evolutionary Computation: Evolution strategies and genetic programming Lecture 10 Evolutionry Computtion: Evolution strtegies nd genetic progrmming Evolution strtegies Genetic progrmming Summry Negnevitsky, Person Eduction, 2011 1 Evolution Strtegies Another pproch to simulting

More information

Definition of Regular Expression

Definition of Regular Expression Definition of Regulr Expression After the definition of the string nd lnguges, we re redy to descrie regulr expressions, the nottion we shll use to define the clss of lnguges known s regulr sets. Recll

More information

CS481: Bioinformatics Algorithms

CS481: Bioinformatics Algorithms CS481: Bioinformtics Algorithms Cn Alkn EA509 clkn@cs.ilkent.edu.tr http://www.cs.ilkent.edu.tr/~clkn/teching/cs481/ EXACT STRING MATCHING Fingerprint ide Assume: We cn compute fingerprint f(p) of P in

More information

P(r)dr = probability of generating a random number in the interval dr near r. For this probability idea to make sense we must have

P(r)dr = probability of generating a random number in the interval dr near r. For this probability idea to make sense we must have Rndom Numers nd Monte Crlo Methods Rndom Numer Methods The integrtion methods discussed so fr ll re sed upon mking polynomil pproximtions to the integrnd. Another clss of numericl methods relies upon using

More information
