AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures

Size: px

Start display at page:

Download "AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures"

Morgan Russell
6 years ago
Views:

1 AS-Idex: A Structure For Strig Search Usig -grams ad Algebraic Sigatures ABSTRACT Cédric du Mouza CNAM Paris, Frace dumouza@cam.fr Philippe Rigaux INRIA Saclay Orsay, Frace philippe.rigaux@iria.fr AS-Idex is a ew idex structure for exact strig search i disk residet databases. It uses hashig, ulike kow alteratives, whether baesd o trees or tries. It typically idexes every -gram i the database, though o-dese idexig is possible. The hash fuctio uses the algebraic sigatures of -grams. Use of hashig provides for costat idex access time for arbitrarily log patters, ulike other structures whose search cost is at best logarithmic. The storage overhead of AS-Idex is basically %, similar to that of alteratives or smaller. We show the idex structure, our use of algebraic sigatures ad the search algorithm. We preset the theoretical ad experimetal performace aalysis. We compare the AS-Idex to mai alteratives. We coclude that AS-Idex is a attractive structure ad we idicate directios for future work. 1. INTRODUCTION Databases icreasigly store data of various kids such as text, DNA records, ad images. This data is at least partly ustructured, which creates the eed for full text searches (or patter matchig) [17]. I mai memory, matchig a patter P agaist a strig S rus i O( S / P ) at best [4]. Searchig very large data sets requires a idex, despite the storage overhead ad possibly log idex costructio time. We address the problem of searchig arbitrarily log strigs i exteral memory. We assume a database D = {R 1, R 2,, R } of records, viewed as strigs over a alphabet Σ. The database supports record isertio, deletio ad updates, as well a search o ay substrig of the records cotets. We call the iput to the strig the patter. Curretly deployed systems for documets use almost exclusively iverted files idexig words for keyword search. However, our eed for full patter matchig i records where the cocept of word may ot exist rules out this solutio. Suffix trees ad arrays form a class of idexes for patter matchig. Suffix trees work best whe Permissio to make digital or hard copies of all or part of this work for persoal or classroom use is grated without fee provided that copies are ot made or distributed for profit or commercial advatage ad that copies bear this otice ad the full citatio o the first page. To copy otherwise, to republish, to post o servers or to redistribute to lists, requires prior specific permissio ad/or a fee. ACM CIKM 09, Hog Kog, Chia Copyright 200X ACM X-XXXXX-XX-X/XX/XX...$ Witold Litwi Uiv. of Paris Dauphie Paris, Frace witold.litwi@dauphie.fr Thomas Schwarz Sata Clara Uiv. Sata Clara, CA, USA tjschwarz@scu.edu they fit ito RAM. Attempts to create versios that work from disks are recet, experimetal, ad focus typically o specific applicatios [7, 20]. The literature attributes this to bad locality of referece, a ecessarily complex pagig scheme, ad structural deterioratio caused by iserts ito the structure. A suffix array stores poiters o a list of suffixes sorted i lexicographic order, ad uses biary search for patter matchig. However, a stadard database architecture stores records i blocks ad supports dyamic isertios ad deletios that make difficult the maiteace of sequetial storage. A suffix array search eeds O(log 2 N) block accesses, where N is the size (umber of characters) of D. We are ot aware of a geeric solutio to these difficulties i the literature, ad without oe, these costs disqualify suffix arrays for our eeds. Two approaches that explicitly address disk-based idexig for full patter-matchig searches are the Strig B-Tree [7] ad -gram iverted idex [18, 11]. The Strig B-Tree is basically a combiatio of B + -Tree ad Patricia Tries. The global structure is that of a B + -Tree, where keys are poiters to suffixes i the database. Each ode is orgaized as a Patricia Trie, which helps guidig the search ad isert operatios. A Strig B-Tree fids all occurreces of a patter P i O( P + log B N) disk accesses, where B is the block size. Aother directio for text search eeds are idexes ivertig -grams ( cosecutive symbols) istead of etire words. They are disk friedly i that they rely o fast sequetial scas, provide a good locality of referece, ad easily adapt to pagig ad partitioig. However, the search cost is liear, both i the size of the database ad the size of the patter. I the preset paper, we itroduce a ew data structure for idexbased full-text search called Algebraic Sigature Idex (AS-Idex). It follows the path of a iverted file based o -grams. Its ovel attractive property, uique at preset to our kowledge, is costat disk search time, idepedet of the size of the database ad of the legth of the patter. This results from its stadard database approach for large-scale idexig (amely hashig). AS-Idex hash calculus is however specific. As the ame suggests, it relies o algebraic sigatures, (ASs) [15], whose algebraic properties we ca exploit. Our experimets show AS-Idex to be a very fast solutio to patter searches i a database. AS-Idex search oly eeds two disk lookups whe the hash directory fits i mai memory. I our experimets, it proved itself to be up to oe order of magitude faster tha -gram idexes ad twice faster that Strig B-Trees. The basic variat of AS-Idex has a storage overhead of about 5 to 6. A variat of our scheme oly idexes selective -grams ad has lower storage overhead at the costs of slower search times. All

2 these properties should make AS-Idex a practical solutio for text idexig. The paper is orgaized as follows. Sectio 2 gives a bird s eye view of the AS-Idex basic priciples. We recall the theory of algebraic sigatures (Sectio 3). We the discuss the AS-Idex structure i Sectio 4 ad the search algorithm i Sectio 5. Sectio 6 aalyses the scheme s behavior, especially collisio ad false positive probability, as well as performace. Sectio 8 explais the details of our implemetatio of Strig B-Trees, -gram idex ad AS-Idex ad experimets. We review related work i Sectio 9. Fially, we summarize ad give future research directios i Sectio AS-INDEX OVERVIEW AS-Idex is a classical hash file with variable legth disk-residet buckets, (Fig. 1). Buckets are poited to by the hash directory. Simplicity ad performace of such files attracted coutless applicatios. The mai advatage is costat access time, idepedet of file ad patter size. Costat time is ot possible for a tree/trie access method. Each bucket stores a list of etries, each idexig some -gram i the database. The basic AS-Idex is dese, idexig every - gram. The hash fuctio providig the bucket for a etry uses the -gram value as the hash key. The hash fuctio is particular: it relies o algebraic sigatures of -grams, to be described i the ext sectio. I what follows, we oly deal with the static AS-Idex, but the hash structure may use stadard mechaisms for dyamicity, scalabilty ad distributio. h(s1) S1 Hash directory h(s2) Sp Patter P e1 AS(e1, e2, Sp) e2 S2 failure success Figure 1: A matchig attempt with AS-Idex I overview, a search for a patter P proceeds as follows (Fig. 1): First, we preprocess P for three sigatures: (i) of the iitial - gram S 1, (ii) of the fial -gram S 2 ad (iii) of the suffix S p of P after S 1. Hashig o S 1 locates the bucket with every etry e 1 idexig a -gram i the database with the sigature of S 1. Likewise, hashig o S 2 locates the bucket with every etry e 2 idexig a -gram with the sigature of S 2. We oly cosider pairs of etries that are i the same record ad at the right distace amog them. We thus locate ay strig S matchig P o its iitial ad termial -gram, at least by sigature. A algebraic calculatio AS(e 1, e 2, S p) determies whether S p may match the suffix of S as well. The method is probabilistic i ature, with low chaces of false positives. We ca avoid eve a miute possibility of a false match by a symbol for symbol compariso betwee patter ad the relevat part of the record. By limitig disk accesses to the two buckets associated to the first ad last -grams of the P, AS-Idex search rus idepe- Strig B-Trees -Gram AS-Idex Costr. O( D log B D ) O( D ) O( D ) Storage O( D ) O( D ) O( D ) (ratio) ( 6-7) ( 6) ( 5-6) Preproc. oe O( P ) O( P ) Search O( P + log B D ) O( D P ) O(1) Table 1: Disk-based idex structures for searchig a patter P i a database D. B is the block size. detly from P s size. The cost of the search procedure outlied above is reduced to that of readig two buckets. The hash directory itself ca ofte be cached i RAM, or eeds at most two additioal disk accesses, as we will show. With a appropriate dyamic hashig mechaism that evely distributes the etries i the structure ad scales gracefully, the bucket size is expected to remai uiform eough to let the AS-Idex ru i costat time, idepedetly of the database size. Table 1 compares the aalytical behavior of AS-Idex with those of two competitors (Strig B-Trees ad -gram idex) ad summarizes its expected advatages. The size of all structures is liear i the size of the database. The ratio directly depeds o the size of idex etries. We metio i Table 1 the ratio obtaied i our implemetatio, before ay compressio. The asymptotic search time i the database size is liear for -idex, logarithmic for Strig B- Trees ad costat for AS-Idex. Moreover, oce the patter P has bee pre-processed, AS-Idex rus idepedetly from P s, whereas -gram idex cost is liear i P. I summary, AS-Idex efficietly idetifies matches with oly two lookups, for ay patter legth. This efficiecy is achieved through extesive use of properties of algebraic sigatures, described i the ext sectio. 3. ALGEBRAIC SIGNATURES We use a Galois field GF (2 f ) of size 2 f. The elemets of GF are bit strigs of legth f. Selectig f = 8 deals with ASCII records ad f = 16 with Uicode. We recall that a Galois field is a fiite set that supports additio ad multiplicatio. These operatios are associative, commutative ad distributive, have eutral elemets 0 ad 1, ad there exist additive ad multiplicative iverses. I a Galois field GF (2 f ), additio ad subtractio are implemeted as the bitwise XOR. Log/atilog tables provide usually the most practical method for multiplicatio [15]. We adopt the usual mathematical otatios for the operatios i what follows. We use a primitive elemet α of GF (2 f ). This meas that the powers of α eumerate all the o-zero elemets of the Galois field. It is well kow that there always exist primitive elemets. Let R = r 0r 1 r M 1 be a record with M symbols. We iterpret R as a sequece of GF elemets. Idetifyig the character set of the records with a Galois field provides a coveiet mathematical cotext to perform computatios o record cotets. DEFINITION 1. The m-symbols sigature (AS) of a record R is a vector AS m(r) with m coordiates (s 1, s 2,... s m) defied by 8 >< >: s 1 = r 0 + r 1 α + r 2 α r M 1 α M 1 s 2 = r 0 + r 1 α 2 + r 2 α r M 1 α 2(M 1). s m = r 0 + r 1 α m + r 2 α 2m... + r M 1 α m(m 1) (1) We refer the reader to [15] for more details about defiitios ad properties of algebraic sigatures.

3 I our examples, we give the m-symbol AS of R as the cocateatio of the values s m,, s 1 i hexadecimal otatio. For istace if s 1 = #34 ad s 2 = #12, the we write the 2-symbol AS as s 2s 1 = #1234. Symbol f m M K L r 0, r 1, r M 1 s 1, s m S 1, S m Iterpretatio Size (i bits) of a GF elemet i GF (2 f ) (f = 8 or f = 16) Size of -grams Size of sigature vectors, m Size of records Size of patters, K > Number of lies i the AS- Idex Record characters/symbols Oe-symbol sigatures Oe-symbol -gram sigatures Table 2: Table of the symbols used i the paper We use differet partial algebraic sigatures of patter ad database records, as we ow explai. DEFINITION 2. Let l [0, M 1] be ay positio (offset) i R. The Cumulative Algebraic Sigature (CAS) at l, CAS m(r, l), is the algebraic sigature of the prefix of R edig at r l, i.e., CAS m(r, l) = AS m(r 0... r l ). The Partial Algebraic Sigature (PAS) from l to l is the value P AS m(r, l, l) = AS m(r l r l +1 r l ), with 0 l l, Fially, we most ofte use the PAS of substrigs of legth, i.e., of -grams. DEFINITION 3. The -gram Algebraic Sigature (NAS) of R at l is NAS m(r, l) = P AS m(r, l + 1, l), for l 1. With other words: NAS m(r, l) = (r l r l α 1, Record RM r 0 CAS(l) r l r l r l α 2( 1),.., r l r l α m( 1) ) (2) PAS(l, l) NAS(l) r l +1 r l r M 1 Figure 2: Computig CAS(l), P AS(l, l) ad NAS(l) i record R M I all the defiitios, we may drop R wheever it is implicit for brevity s sake. Figure 2 shows the respective parts of the record that defie the CAS, P AS ad NAS at offset l. The followig simple properties of algebraic sigatures, expressed for coordiate i, 1 i m, are useful for what follows. We ote i-th symbol of a CAS, as CAS m(l) i ad proceed similarly for NAS ad PAS. 0 Hash directory D L 1 C 0 C i Buckets etries for CAS c Structure of a bucket etries for CAS c e1 e2 <r1, l1, c> <r2, l2, c>... <ri, li, c >... Figure 3: Structure of the AS-idex Properties 3 ad 4 let us icremetally calculate ext CAS ad NAS while buildig the file, or preprocessig the patter, istead of recomputig the sigature etirely. This speeds up the process cosiderably. Property 5 also speeds up the patter preprocessig, as it will appear. Property 6 fially is fudametal for the match attempt calculus. CAS m(l) i = CAS m(l 1) i + r l α il (3) NAS m(l) i = NASm(l 1)i r l α NAS m(l) i = For 0 l < l: + r l α i( 1) (4) CASm(l)i CASm(l )i α i(l +1) (5) CAS m(l) i = CAS m(l ) i + α i(l +1) P AS m(l + 1, l) i (6) Table 2 summarizes the symbols used i the paper. 4. STRUCTURE Our database cosists of records that are made up of a Record IDetifier (RID) ad some o-key field. (Extesios to databases with more tha oe key ad/or o-key field are straight-forward.) We assume that the o-key field cosists of strigs of symbols from our Galois field. Our search fids all occurreces of a patter i the o-key field of every record i the database. Whe we talk about offsets ad algebraic sigatures, we refer oly to the o-key field. If R is such a field ad r i a character (Galois field elemet) i R, the we call i the offset. A -gram G = r l +1 r l is ay sequece of cosecutive symbols i R. By extesio, we the call l the offset of the -gram. A AS-Idex cosists of etries. DEFINITION 4. Let G be a -gram at offset o i R. The etry idexig G, deoted E(G), is a triplet (rid, o, c) where rid is the RID of R, c is CAS 1(R, o), ad o is o modulo 2 f 1. We assure costat size of the etries by takig the remaider. The choice of the modulus is justified by the idetity χ 2f 1 = χ for all Galois field elemets χ. The idexig is dese, i.e., every -gram i the database is idexed, ad by a differet etry. To costruct the idex, we process all -grams i the database. AS-Idex is a hash file, deoted D[0..L 1], with directory legth L = 2 v beig a power of 2 (Fig. 3). Elemets of D refer to buckets or lies of variable legth, each cotaiig a list of

4 etries. Lies are of variable legth to accommodate a possible ueve distributio of -gram values. Each D[i] cotais the address of the i-th bucket. All together, the AS-Idex lie structure is similar to a postig list i a iverted file, except for the presece of the CAS c i each etry ad a specific represetatio of the offset l. Sice we use a hash file, lies should have a collisio resolutio method such as classical separate chaiig that uses poiters to a overflow space. Such a techique accomodates moderate growth, but if we eed to accomodate large growth, the we eed a dyamic hashig method such as liear hashig. We ow describe how to calculate the idex i of the lie for a etry E(G) = (rid, o, c). We calculate i from the m-symbol NAS NAS m(g) = (s 1,..., s m). The coordiates of the NAS are bit strigs. By cocateatig them, we obtai a bit strig S = s ms m 1... s 1 that we iterpret as a large, usiged iteger. The idex i is: i = h L(S) = S mod L Sice L = 2 v, this amouts to extractig the last v bits of S. It is easy to see that m should be such that m ad m v/f. The choice of AS-Idex parameters m, ad L will be further discussed i Sectio 6. EXAMPLE 1. Cosider a 100 GB database with byte-wide symbols (f = 8). For the sake of example, let L = 2 30, leadig to buckets with /2 30 = 93 etries o the average. Let = 5 ad m = 4. To calculate lie idex i of -gram G we thus cocateate s 4..s 1 of NAS 4(G). The we cut lower 30 bits. Now, cosider the record with RID 73 ad o-key field Uiversity Paris Dauphie. Assume that NAS 4( Uive, 4) is # a. Sice this is the first 5-gram, CAS 1(73, 4) has the same value as the first compoet, i.e., is #9a. For subsequet 5-grams, the first coordiate of the NAS ad the CAS usually differ. The etry is E = (73, 4, #9a). Its lie idex is # a mod 2 30 = # a (we remove the leadig 2 bits). 5. PATTERN SEARCH Let P = p 0... p K 1 be the patter to match. AS-Idex search delivers the RID of every record i the database that cotais P. The search algorithm first prepocesses P by extractig values that become the etries ito AS-Idex to locate matches. 5.1 Preprocessig The preprocessig phase computes three sigatures from the patter P : (a) the m-symbols AS of the 1st -gram i P, called S 1; (b) the 1-symbol PAS of the suffix of P followig the 1st -gram, S p = P AS 1(P,, K 1). (c) the m-symbols AS of the last -gram i P, deoted as S 2; There are several ways to compute these sigatures. For istace, oe may compute S 1, the S p, the extract the m-symbol value of S 2 through property (5). Figure 4 shows the parts of the patter that determie the sigatures S 1, S 2 ad S p o our ruig example. Recall that = 5 ad m = 4. We preprocess the patter P = Uiversity Paris Dauphie ad obtai (a) S 1 = NAS 4(P, 4) (for 5-gram Uive ); (b) S 2 = NAS 4(P, 24) (for 5-gram phie ) ad (c) S p = P AS 1(P, 5, 24). This iformatio is used to fid the occurreces of the patter i the database. Record R Patter P CAS c 1 NAS S 1 l 1 PAS S p CAS c 2 coferece at the Uiversity Paris Dauphie 5.2 Processig 0 5 Uiversity Paris Dauphie Figure 4: The search algorithm NAS S 2 l = 2 l +K 1 Let i = h L(S 1) ad i = h L(S 2). Every etry (R, l 1, c 1) i bucket D[i] idexes a -gram G i a record R whose NAS S G is such that h L(S G) = h L(S 1). Likewise, every etry (R, l 2, c 2) i bucket D[i ] idexes a -gram G such that h L(G ) = h L(S 2). Up to possible collisios, G ad G respectively equal the first ad last -grams i patter P. We search every pair (G, G ) able to characterize a matchig strig S = P. We must have R = R ad l 2 = (l 1 + K ) mod (2 f 1). The last compoet i G, c 2 should match the value implied by c 1 ad S p (see Figure 4). Algebraic property 6 implies that c 2 should be 0 Hash directory D L 1 i i <r1, l1, c1> <r3, l2, c2> c 2 = c 1 + α l 1+1 S p. (7) <r1, l4, c4> <r2, l5, c5> <r3, l6, c6>... <r3, l3, c3> bucket for NAS( phie ) check CAS ad positios bucket for NAS( Uive ) Figure 5: Usig AS-Idex for a search Figure 5 illustrates the process. By hashig o S 1 = Uive, we retrieve the bucket D[i] which cotais, amog others, the etries idexig all occurreces of Uive i the database. Similarly we retrieve the bucket D[i ] which cotais etries for all the occurreces of the -gram phie. A pair of etries [e i(r 1, l 1, c 1), e i(r 1, l 4, c 4)] i (D[i], D[i ]) represets a substrig s of r 1 that begis with Uive ad eds with phie. Checkig whether s matches P ivolves two tests: (i) we compute c 1 + α l1+1.s p ad compare it with c 4 to check whether the sigatures of the middle parts match, ad (ii) we verify as discussed whether l 1 matches l 2 give K, i.e., S ad P have the same legth. If both tests are succesful, we report a probable match. The ext attempt will cosider (r 3, l 2, c 2) i D[i] ad (r 3, l 6, c 6) i D[i ]. Note that (r 2, l 5, c 5) i D[i ] is skipped because there is o possible match o r 2 i D[i]. The pseudo-code of the algorithm is give below. Algorithm AS-SEARCH Iput: a patter P = p 0... p K 1, the -gram size Output: the list of records that cotai P begi // Preprocessig phase S 1 := NAS m(p, 1) S 2 := NAS m(p, K 1)

5 Nr. Collisios Take Expected 0 16,568, , , > Table 3: Actual ad expected umber of collisios usig a 3B sigature o a dictioary of moderately sized Eglish words. S p := P AS 1 (P,, K 1) i := h L (S 1 ) // i is the first lie idex i := h L (S 2 ) // i is the secod lie idex // Processig phase for each etry E(R, l 1, c 1 ) i D[i] c 2 = c 1 + α l 1+1 S p l 2 = (l 1 + K ) mod (2 f 1) if (there exists a etry E (R, l 2, c 2 ) i D[i ]) the Report success for R edif edfor ed The selectivity of the process relies o its ability to maipulate three distict sigatures, S 1, S 2 ad S p. Therefore the patter legth must be at least Collisio Hadlig As hashig i geeral, our method is subject to collisios deliverig false positives. To elimiate ay collisios, it is ecessary to post-process AS-Search by attemptig to actually fid P i every R idetified as a match. This requires a symbol by symbol compariso betwee P ad its presumed match locatio. It will appear however that AS-Search should typically have a egligible probability of a collisio. Hece post-processig may be left to presumably rare applicatios eedig full assurace. Also, it should be RAM-based ad therefore typically egligible with respect to the disk search time. We thus do ot detail it here. 6. ANALYSIS We ow preset a short, theoretical aalysis of the expected performace of AS-Idex. 6.1 Hash uiformity Algebraic sigature values ted to have a more uiform distributio tha the distributio of character values due to the multiplicatios by powers of α i their calculatios. However, the total umber of strigs or -grams i a dataset gives a upper boud for the umber of algebraic sigatures calculated from them. Biological databases ofte store DNA strigs i a ASCII file. Oly the four characters A, C, G, ad T appear as characters i such a file. There are oly 4 6 = 4K differet 6-grams i the file ad the umber of differet NAS of these sigatures caot exceed this value. Icreasig the umber of coordiates i the NAS beyod 5 symbols is ot goig to achieve better uiformity. This cautio applies oly to files usig a small alphabet. For a differet experimet, we chose a list of Eglish words ( words MB) six or more characters loger take from a world list used to perform a dictioary attack o a password file (by a admiistrator tryig to weed out weak passwords). We calculated the 3B three compoet sigature of all words with more tha five characters (65536 words) ad calculated the umber of sigatures attaied by i words as well as this umber for a perfect hash fuctio (Table 3). The χ 2 value of shows very close agreemet. Whe repeatig the test usig 2B two compoet sigatures, we obtaied χ 2 = A much smaller set had χ 2 = , but there were less sigatures attaied by a high umber ( 5) of words. These results give experimetal verificatio for the flatess of sigatures. 6.2 Storage ad Performace Idex costructio time. The properties (2) - (6) of algebraic sigatures allow us to calculate all etries with a liear sweep of all records. We eed to keep a poiter to the symbol just beyod the curret -gram ad to the first symbol of the curret -gram. Usig equatios (3) ad (4), we ca the calculate the NAS of the ext -gram from the old oe ad update the ruig CAS of the record. Sice creatig the etry for a -gram ad isertig it ito the idex take costat time, buildig the idex takes time liear to the size of the database. Storage costs. The storage complexity of AS-Idex is O(N) for N idexed -grams. The actual size of a RID should be 3-4 bytes sice 3 bytes already allow a database with 16M records. The actual storage per etry should be about 5-6 bytes, which results i a storage overhead of about (5 6)N. We ca lower this storage overhead, e.g. to 125%, by o-dese idexig, that we do ot preset here due to space limitatio, at the expese of a proportioal icrease i search time. EXAMPLE 2. We still cosider a 100 GB database with 8b symbols. Assumig a average record size of 100 symbols, we have 1G records ad our record idetifier eeds to be 4B log. We previously set the size of the CAS to 1B. With the 1B offset ito the record, the etry is 6B. AS-Idex should use about 6 times more space tha the origial database. Now assume records of 10KB each. The record idetifiers ca be 3B log. This gives a total etry size of 5B or a storage factor of 5. Each elemet of the hash directory stores a bucket address with at least log 2 (L) bits. I the case of our large 100GB database, with L = 2 30, choosig 4 bytes for the address leads to the required storage of 4.1GB, smaller tha the curret data servers stadard capacity. I most cases, D is expected to fit i mai memory. Patter preprocessig. To preprocess the patter, we eed to calculate a PAS ad two NAS. We calculate both i a similar maer as above ad obtai preprocessig times liear i the size of the patter. Sice the result depeds o all symbols i the patter, we caot do better. Search speed. We assume that etries are close to uiformly distributed. The search algorithm picks up two cells i the hash directory, i order to obtai the umber of etries ad the bucket refereces of, respectively, D[i] ad D[i ]. The the buckets themselves must be read. Each of these accesses may icur a radom disk access, hece a (costat) cost of at most four disk reads. If the hash directory resides i mai memory, the cost reduces to loadig the two buckets. The mai memory cost is a i-ram joi of the two buckets. With l deotig the expected lie legth, ad assumig the lies are ordered, the average complexity of this phase should be (uder our uiformity assumptio) 2l. The worst case is O(l 2 ). This case is highly ulikely, as all the etries o both lies should fit ito the same record. The optioal collisio resolutio (symbol-tosymbol) test adds O( P ). This test is typically performed i RAM ad makes oly a egligible cotributio to the otherwise costat costs. Altogether, the search cost is S = C hash + C buck + C ram + C post (8)

6 Device Processor speed RAM speed Flash disk access Magetic disk Disk trasfert rate Access time 2-3 Gz per core 100s ms 5-7 ms 300 MB/s. Table 4: Hardware characteristics where C hash represets the Hash Directory access cost, C buck the bucket access cost, C ram the RAM processig cost ad C post the post-processig. We ow evaluate the actual search time that may result from the above complexity figures. We take as basis the characteristics of the curret popular hardware show i Table 4 (see also [8] for a recet aalysis). C hash is the cost of fetchig 2 elemets i the hash directory. The trasfer overhead is egligible, ad the (worst case) cost is therefore i the rage [10, 14] ms for magetic disks. The bucket access cost C buck is similar to C hash regardig radom disk accesses, but we fetch far more bytes per access. We eed to trasfer 2l etries. Table 4 suggests that we ca trasfer up to 300 KB per ms (Flash trasfer rate are similar.) Sice the size of a etry is typically 5-6 bytes, each search loads 10 12l. It follows that for l = 1K, the lie trasfer cost is egligible. For l = 16K, 160 to 200 KB must be trasferred. The cost is about 0.5 to 0.7 ms. This is still egligible with respect to disk accesses for AS-Idex o magetic disk, but ot o solid oe. I the latter case, the trasfer cost is equivalet to a additioal disk access. The basic formula (average cost) for C ram, the i-ram joi of the two lies, is 2l. I practice it is l E, where E is a couple of visited etries processig cost. I detail, we have 2 RAM accesses to Rids. I this test is successful, we eed 2 additioal accesses to CASs, 2 to offsets o ad 1 access to the log table for algebraic computatios. A coservative evaluatio of E = 500s seems fair. The cost of the i-ram joi ca thus be estimated as 100µs for l = 200, 500µs for l = 1, 000, ad 8ms for l = 16K. The first cost is egligible whatever the storage media; ext oe is so for disk, but ot for flash, the latter is ot for either. Fially, the postprocessig cost P ca be estimated based o a uit cost of 100s per symbol. Eve for a 1,000 symbols log patter, the postprocessig costs 100µs, which remais egligible for both magetic disk or flash memories. Its importace also depeds o the umber of matches, of course. 6.3 Choice of file parameters The previous aalysis leads to the followig coclusios regardig the choice of AS-Idex parameters. As a geeral rule of thumb, oe must choose parameter L so as to maitai the hash directory D i RAM. Cachig D i RAM, wheever possible, saves two disk access o four. Settig a limit o L may lead to icrease the average lie legth l, but our aalysis shows that this remais beeficial eve whe l reaches hudreds or eve thousads of etries. Uder our assumptios, l should reach 16K to add the equivalet of 1 radom access to the search cost, at which poit oe may cosider elargig the hash directory beyod the RAM limits. Large values for l may also be beeficial with respect to other factors. First, larger lies accommodate a larger database for a fixed. Secod, we may choose a smaller, with a smaller miimal size of + 1 for patters. Let us illustrate this latter impact. We cotiue with our ruig example of a 100 GB database with byte-wide symbols (f = 8), L is 2 30, ad a average load of l = /2 30 = 93 etries per lie. This implies the choice of m, which must be such that 2 mf L, i.e., m (log 2 L)/f. I our example, the lie idex i eeds to cosist of at least 30 bits. Correspodigly, NAS m eeds to be at least that log. Sice each coordiate of NAS m cosists of 8b, the value for m eeds to be at least 4. Each NAS the cotais at least 32 bits. The -grams used eed to cotai at least m symbols. Otherwise, the rage of -gram values is smaller tha L ad certai lies will ot cotai ay etries. If the -grams are reasoably close to uiformly distributed, the rage of values is 256 ad we ca pick = m. Still referrig to our example with L = 2 30, we ca choose = 4. However, the actual character set used is most ofte smaller tha 256, or oly a fractio of the characters appear frequetly. This requires a larger, sice the rage of possible -grams must cotai at least L values. Let v be the umber of values we expect per symbol. I a simple ASCII text, the umber of pritable character codes is v = 96. DNA ecodig represets a extreme case with v = 4. The -gram size must be such that v L. With v = 96 (simple ASCII text) ad L = 2 30, must be set to 5, the smallest value such that 96 L, to geerate all required NAS values. These parameter values were actually used for Example 1. Cosider ow the case of a DNA database where oly 4 of the possible 256 ASCII characters appear i records. We eed to set = 15 i order to obtai the 2 30 possible sigatures. + 1 is the miimal patter legth we allow to search for. Such limit should ot be evertheless a practical costrait for a search over a small alphabet. The eed there is rather for log patters [3]. If evertheless it was a cocer, oe may choose a smaller at the price of fewer, hece loger, lies. For istace choosig = 10 ad L = 2 20 for our DNA database results i the average of 95K etries i each lie. The miimal patter size decreases by five, i.e., to + 1 = 11 symbols. 7. NON-DENSE INDEXING While our basic scheme performs well, the storage overhead of about 500% might be too high for some applicatios. We ow describe a variat that addresses this issue. 7.1 Skewed -gram sigature distributio As previously observed, a skewed distributio ca lead to may etries i a sigle AS-Idex lie. We ow modify our search procedure as follows. I our pre-processig phase, we choose q -grams, q 2, amog them the startig ad fial -gram of the patter. The simplest choice is to have the -grams evely distributed over the patter. The search first determies the shortest lie legth of all lies idexed by ay of the q -grams. If ay of these lies is empty, the we are doe: the patter is ot foud i the database. If the shortest lie happes to be the oe idexed by the iitial -gram i the patter, othig chages. If it is the oe idexed by the fial - gram i the patter, the calculatio of the PAS still uses formula (7) uchaged because subtractio ad additio are the same i the Galois field. Now c 1 is the CAS i the lie idexed by the last -gram ad c 2 the oe i the first lie, while l does ot chage. Otherwise, let a be the offset of the -gram with miimal cout of etries ad S a its NAS. For each etry i lie D[h L(S a)], we search for a matchig etry i the lie of the first or of the last - gram i the patter, depedig o which oe has the smaller cout. This amouts to matchig the part of the patter betwee the selected -gram ad either the begiig or the ed of the patter. If this succeeds, we cotiue to use our calculus to match the other

7 part of the patter. Sice we elimiate most mismatches i the first step, AS-Idex will ow come more quickly to a decisio o average. t t t t t t t t a) 1 < t < b) t = c) t > o aliged patter Figure 6: No-dese AS idexig: (a) overlappig -grams, (b) tilig, (c) lossy Our ext variat lowers the storage overhead by idexig oly some istead of all -grams. The disadvatages are higher search costs ad a larger miimal size for the patters that we ca search for. Figure 6 shows the idea. Startig with the first -gram i a record, we oly idex the -grams that are t > 1 symbols apart, where parameter t is the idexig rate. We thus idex oly -grams startig at the offsets 0, t, 2t,. The size of the idex is ow reduced by a factor of about 1/t. We ca distiguish three cases, lossless idexig if = t, lossy idexig if < t, ad overlappig idexig if > t (Figure 6). I all cases, trailig characters of the patter might ot cotribute to ay idexed NAS. The search procedure is the same i all three cases. It eeds to be modified from the base procedure because the occurrece of a patter i a strig might ot match the tilig of the records by the idexed -grams. Assume that the patter is P = p 0p 1,... p K 1. We defie substrigs P i, i = 0, 1,... t 1 of the patter as P i = p i, p i+1,... p i+lt+ 1 with l = (K )/t. Thus, P 0 is the substrig of the patter that begis with the first -gram i P, eds with the last -gram i P startig at offset lt 1, ad cotais all symbols of P i betwee. P 1 starts at the secod character ad fiishes with the last -gram startig a multiple of t characters after the first oe, etc. Our search procedure ow first tries to match P 0 i the database. More precisely, we process the lie idexed by the first -gram. For a occurrece, the last -gram i the substrig matchig the sub-patter is also i AS-Idex ad our matchig will succeed. We the successively search for sub-patters P 1, P 2,... P t 1. This is guarateed to fid ay occurrece of the patter i the database. Sice we actually match various subpatters, a diagosed match is ot based o all symbols i the patter ad we might eed to verify that the patter actually occurred. Cosider the search for our P = Uiversity Paris Dauphie, Figure 6. Let = 4 ad t = 4. Suppose the use of a odese tilig AS-Idex (Figure 6.b). The search begis, as with the dese idex, by attemptig to match S 1 = Uiv. The last -gram S 2 for the dese idex would be S 2 = hie. Now, S 2 = phi, i subpatter P 0. The matchig -gram hie i the visited record caot be idexed if Uiv is. Provided the match of P 0 succeeds, we read the record ad attempt to match e to the symbol followig P 0 i the record (provided it exists). Next, we attempt matchig with P 1, usig thus S1 = ive ad S 2 = hie. If this attempt succeeds, we attempt to mach U as above. The matchig attempts cotiue with P 2 havig S 1 = ive ad S 2 = auph. The completio requires fidig U ad ie i the record before ad after the P 2 match i the record. The fial roud attempts to match P 3 with S 1 = vers ad S 2 = uphi followed by direct testig of Ui ad of e. The fial result is the uio of all the successful matches. Compared with dese idexig, the storage space of this (tilig) AS-Idex is reduced by factor of four, e.g., falls to (oly) 125% of the database size. Likewise, this is at least four times less tha ay alterative method we discussed. I cotrast, we eed four times more disk accesses, i.e., 16 or 8 at best usually. Fially, the miimal idexed patter is i tur = 8 symbols, istead of five. Clearly, may applicatios may gladly accept the discussed tradeoffs. I particular, sice smaller AS-Idex may the fit oto a flash disk. As this oe is about te times faster tha a magetic oe, all together the AS-idexig gets actually about 2.5 times faster. I the ruig example of our 100 GB database, the reduced AS-Idex size would be about 125 MB, already fittig curret mass-produced flash disks. 7.2 Scalable Distributed AS-Idex This variat targets the maiteace of a AS-Idex of a growig database that might ultimately reach a very large size. A AS- Idex with static legth ad width will evetually see may ad large overflows. We propose to deal with this problem by covertig AS-Idex ito a Scalable Distributed Data Structure (SDDS). Appropriate choices are LH LH [14] or its high-availability variats LH RS [13]. These schemes target especially the distributed RAM (or flash) for storage as they are orders of magitude faster tha disks. I a utshell, a SDDS variat of AS-Idex, let us call it SDAS-Idex, would work as follows. We create the SDAS-Idex as AS-Idex at some ode (site), called ode 0, or LH -bucket 0. Hashig of -grams ito lies i SDAS is dyamic usig the liear hash addressig. The umber of lies grows through splits. These are triggered by iserts that result i a overflow of a AS-Idex lie. If the umber of lies exceeds the capacity of a ode, the LH will split the cotet of a ode, movig about half to a ew ode. Durig this split, every other lie remais at the curret ode ad the remaiig oes are set to the ew ode. A search i SDAS-Idex is essetially the same as i AS-Idex, except that access to a lie ivolves messagig. Network latecy becomes the domiat part of the search cost if the cotets of a ode are stored i memory. Also, processig lie etries is ow doe i parallel. If we apply performace results from implemetatios of LH systems, the SDAS-Idex searches should complete i a few milli secods. The precise aalysis ad experimetal cofirmatio remai to be doe. A SDAS-Idex implemeted i distributed RAM ca easily grow to thousads of odes, each with a capacity of easily 1GB. This gives a total capacity of about 1TB. Depedig o whether we use o-dese idexig or ot, we could idex a database of 200 GB to 1TB with expected search times i the order of a 0.1 msec to a few milli-secods. Usig disks, we ca target PB-scale databases with search times somewhere betwee 0.1 sec to 1 sec. 8. EXPERIMENTAL EVALUATION We describe i this sectio our experimetal settig ad results. We implemeted our AS structure as well as a Strig B-Tree [7] ad a -gram idex, based o iverted lists [24]. The ratioale

8 for Strig B-Tree, discussed more i what follows, is that it appears attractive for disk-based use ad is amog most recet proposals. As Sectio 9 discusses, the iverted list was the commo basis for may variats, e.g., -gram/2l [11]. Notice that all the discussed structures are tree/trie based. Hece, oe offers the costat search performace of AS-Idex. I other words, their search speed must deteriorate beyod the oe of AS-Idex for a sufficietly large database. All structures are coded i C++, ad we ru the experimet uder Liux o a 2.40 GHz duo-core processor with 2GB i mai memory ad two 80GB disks. We coduct our experimets o a database of records idetified by a uique ID ad with cotet cosistig of a sequece of oebyte characters. We treat each character as a elemet i the Galois field GF (2 8 ). The desig of the database is meat to represet a large spectrum of situatios ragig from may small records to large-size protei descriptios. We also cosider databases of DNA sequeces. All records are stored o disk. 8.1 Settigs We implemeted a Strig B-Tree structure [7], makig our best efforts to miimize the storage overhead. Each ode cotais a compact represetatio of a Patricia Trie [19]. I our implemetatio a disk block occupies 4K, ad we ca store at most 550 etries i each leaf. We build the -gram idex i two steps. First, we sca all record cotets i order to extract all -grams with their positio. For each -gram, we obtai a triple <gram, rid, offset> which we isert i a temporary file. The secod step sorts all triples ad creates lists of 6-bytes etries. The fial step builds the B+-tree which allows to access quickly to a list give a -gram. Our simple costructio i bulk is fast ad creates a compact structure. Our implemetatio of AS-Idex is static as well. First, the - grams are collected. For each -gram at offset l i a record R the hash key k = h L(NAS m(r, l)) ad the CAS c = CAS 1(R, l) are computed. The quadruplet < k, c, rid, l > is iserted i a temporary file. Secod, the temporary file is sorted o the hash key k. This groups together etries which must be iserted ito the same bucket. A etry cosists of a CAS (1 byte), a record id (2 bytes) ad a offset. The latter is a 1B iteger obtaied by takig the actual offset of the -gram i the record, modulo 255. Usig the CAS as a secodary sort key, oe places the etries ito the required order for isertio ito the bucket. We use three types of datasets with quite distict characteristics: alpha, da ad text. The alpha(σ) type cosists of sythetic ASCII records, with uiform distributio, ragig over a alphabet Σ which is a subset of the exteded ASCII characters. We cosider two alphabets: Σ 26, with oly 26 characters, ad Σ full with all the 256 symbols that ca be ecoded with f = 8 bits. We call the resultig datasets alpha(26) ad alpha(full). They allow us to compare the behavior of our structure to the theoretical aalysis i Sectio 6. The secod type, da, cosists of real DNA records extracted from the UCSC database 1. For the types alpha(full), alpha(26) ad da, we composed datasets ragig from 10MB to 100MB. The last type, text, cosists of real text records created from ASCII files of large Eglish books. The typical size of a text record is 1-2 MB, ad we created a 100MB database of text files. The -gram size is set to 8 for DNA files, ad to 4 for the other datasets. 8.2 Space occupacy ad build time 1 File AS-idex -gram idex Str. B-Tree 5 MB 23.1 (20+3.1) 35 (30+5) MB 43.1 (40+3.1) 65 (60+5) MB ( ) 185 (180+5) MB ( ) 365 (360+5) MB ( ) 605 (600+5) Table 5: Idex sizes i MB for alpha(26) files File AS-idex -gram idex Str. B-Tree 5 MB 20.4 (20+0.4) 31 (30+1) MB 40.4 (40+0.4) 61 (60+1) MB ( ) 181 (180+1) MB ( ) 361 (360+1) MB ( ) 601 (600+1) Table 6: Idex sizes i MB for DNA files The size of the AS-idex is the sum of the size of cells ad the directory which stores the umber of etries for each bucket. The size of the -gram idex is the sum of the size of iverted lists ad the size of the B+tree. The size of the Strig B-Tree is the size of the B+tree where each ode is a serialized Patricia Trie. Tables 5, 6, ad 7 give respectively the idex sizes for alpha, DNA ad text files. The sizes of idexes are comparable. The size of AS-Idex beefits from its smaller etries (4 bytes), mostly due to the 1B size of the offset. Sophisticated compressio techiques [24] would beefit all structures. -gram idex offsets could be limited to 3B or eve 2B, which would still allow the size of records to grow to up to 2 24 or 2 16 symbols. If the database has a few log records with may occurreces of the same -gram, the we ca save space by storig each rid oly oce i the list, followed by a possibly compressed list of offsets. If, o the cotrary, the database cosists of may, relatively small records, compressio based o delta-codig ca be evisaged. Both implemetatios use space better. The Strig B-Tree requires more space with 7B etries i the leaves ad 11B etries for iteral odes. For real datasets, either da or text files, the distributio is far from beig uiform. Table 8 shows the distributio of the umber of etries from two real 100MB databases. The average umber of etries is 1,534 for the DNA database, ad 372 for the text database, with a importat variace. I the worst case (text files), the largest list has 1,480,008 etries. This fully justify our choice of storig the umber of etries i the directory, ad of usig this iformatio to sca the smallest list durig a search operatio. The buildig time is proportioal to the size of the idex. Our structures are built i bulk after sortig all -gram etries i a temporary file. This leads to comparable performaces. O our machie, the buildig time for a 100 MB file is about 1,000 s, ad the buldig rate is about 120 KB/s. A compariso of dyamic builds remais for future work. For AS-Idex, this would reduce mostly to the stadard techique of maitaiig a dyamic hash file. 8.3 Search time File AS-idex -gram idex Str. B-Tree 100 MB ( ) ( ) Table 7: Idex sizes i MB for text files

9 File Average Mi Max Std. dev. DNA 1, ,599 1,953 text ,480,008 5,543 Table 8: Distributio of etries for the real datasets (100 MB files) K AS gram Spd-up Str BT Spd-up Table 9: Search time i ms for 100 MB alpha(full) files We performed extesively patter searches i our databases. We extracted the patters from the files to guaratee that at least oe result is foud. Patter sizes rage from 10 symbols to 500 symbols. To avoid iitializatio costs ad side effects such as CPU or memory cotetio from other OS processes, we performed each search repeatedly util the search times stabilized. We report the average search time over a ru of oe hudred search operatios. We report the results o search time, i ms, i Tables 9, 10, 11 ad 12, for 100MB files. As expected, the Strig B-Tree ad AS-Idex behavior is costat regardless of the legth of the patter, while -gram idex performace degrades liearly with this legth. For our 100MB files, the height of the Strig B-Tree is 3 idepedetly of the alphabet size, for 10 6 idexed substrigs (recall that the faout if 550, ad that our bulk isertio creates full odes). The root is always i the cache, as well as a sigificat part of the level below the root, depedig o the idexed file size. The Strig B-Tree traversal is geerally reduced to a sigle disk access for loadig a leaf ode. I additio, each lookup i a ode requires a additioal radom disk access to the database i order to fetch the full strig. This leads to a fial cost of 4-5 physical disk accesses. The search time with Strig B-Tree is idepedet of the size of the patter ad of the size of the alphabet. Searchig with the AS-Idex takes about 17 ms for alpha(full), 30 ms for alpha(26) files, 30 ms for da files ad 18 ms for text files (real data). This is cosistet with the aalytical cost discussed i Sectio 6. The alpha datasets are uiformly geerated, ad this results i a almost costat umber of etries per bucket. Accordigly, the search is doe i few operatios. The differece betwee alpha(full) ad alpha(26) is explaied by the size of the buckets which are larger for alpha(26) sice the -gram values rage over the set of 26 4 possibilities compare to for alpha(full). For real data (DNA ad text), buckets are likely to be larger, either because the alphabet is so small that the set of existig -gram values is bouded ad caot fully beefit from the hash fuctio (see Subsectio 6.1 for a discussio), or because of o uiformity. The former case correspods to the DNA, the latter to our real text files. Table 8 shows that, o average, the umber of etries i a bucket is larger for DNA (1,534 etries) tha for text (372). The cost of DNA search is accordigly slightly higher ( 30 ms, agaist 20 ms). Recall that our algorithm chooses the smallest bucket for drivig the search, which limits the impact of skewed datasets ad the variace of search times. Figure 7 summarizes the results discussed above, ad illustrates the liearity of search times for gram-idex ad the costat K AS gram Spd-up Str BT Spd-up search time (ms) Table 10: Search time for 100 MB for alpha(26) files DNA - as_idex DNA - str_btree DNA - gr_idex text - as_idex text - str_btree text - gr_idex patter 200 size Figure 7: Search result o 100MB files, for varyig patter size search times for AS-Idex ad Strig B-Tree. Figure 8 shows the evolutio of search times as the size of the database icreases from 10 MB to 100 MB. Each curve represets the results for a give dataset, with patters cosistig of 10 to 200 symbols. As before, the AS-Idex exhibits a almost costat behavior (appr ms), eve for very large patters (200 symbols) searched i large files (100 MB). The search time with the Strig B-Tree slightly icreases with the size of the file (but remais costat with the patter size). The tree height remais equals to 3, however the umber of odes icreases with the file size. This accouts for a lower probability of a buffer hit durig tree traversal. The logarithmic behavior of the Strig B-Tree is almost blurred here, because of the large ode faout (550). It would appear with larger files. Give a file size greater tha 167MB for istace (i.e., correspodig to more tha strigs), its height would icrease to 4, with two (2) additioal disk accesses o average for a search. The search time evolves (sub)liearly for -gram idex, both i the size of the patter, ad i the size of the database. The slope is steeper for large files. This is explaied by the ecessity to sca a umber of iverted lists which is proportioal to the size of the patter. I additio, larger files imply larger lists, hece the behavior illustrated by Figure 8. However the cost remais subliear (the cost for 100MB is oly 3 times higher tha the cost for 5MB). This is due to the merge process which stops whe the smallest list has bee fully scaed, thereby avoidig a complete access to all lists. Figure 9 ad Figure 10 summarize the ratio of search times, givig the speed-up of AS-Idex over respectively the -gram idex ad the Strig B-Tree. The alpha(full) dataset family is a special case. Because of the uiform distributio that produces a large umber of -grams ragig over all the possible values, fidig the K AS gram Spd-up Str BT Spd-up Table 11: Search time for 100 MB da files

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015. Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Hash Tables xkcd. http://xkcd.com/221/. Radom Number. Used with permissio uder Creative