AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures

Size: px
Start display at page:

Download "AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures"

Transcription

1 AS-Idex: A Structure For Strig Search Usig -grams ad Algebraic Sigatures ABSTRACT Cédric du Mouza CNAM Paris, Frace dumouza@cam.fr Philippe Rigaux INRIA Saclay Orsay, Frace philippe.rigaux@iria.fr AS-Idex is a ew idex structure for exact strig search i disk residet databases. It uses hashig, ulike kow alteratives, whether baesd o trees or tries. It typically idexes every -gram i the database, though o-dese idexig is possible. The hash fuctio uses the algebraic sigatures of -grams. Use of hashig provides for costat idex access time for arbitrarily log patters, ulike other structures whose search cost is at best logarithmic. The storage overhead of AS-Idex is basically %, similar to that of alteratives or smaller. We show the idex structure, our use of algebraic sigatures ad the search algorithm. We preset the theoretical ad experimetal performace aalysis. We compare the AS-Idex to mai alteratives. We coclude that AS-Idex is a attractive structure ad we idicate directios for future work. 1. INTRODUCTION Databases icreasigly store data of various kids such as text, DNA records, ad images. This data is at least partly ustructured, which creates the eed for full text searches (or patter matchig) [17]. I mai memory, matchig a patter P agaist a strig S rus i O( S / P ) at best [4]. Searchig very large data sets requires a idex, despite the storage overhead ad possibly log idex costructio time. We address the problem of searchig arbitrarily log strigs i exteral memory. We assume a database D = {R 1, R 2,, R } of records, viewed as strigs over a alphabet Σ. The database supports record isertio, deletio ad updates, as well a search o ay substrig of the records cotets. We call the iput to the strig the patter. Curretly deployed systems for documets use almost exclusively iverted files idexig words for keyword search. However, our eed for full patter matchig i records where the cocept of word may ot exist rules out this solutio. Suffix trees ad arrays form a class of idexes for patter matchig. Suffix trees work best whe Permissio to make digital or hard copies of all or part of this work for persoal or classroom use is grated without fee provided that copies are ot made or distributed for profit or commercial advatage ad that copies bear this otice ad the full citatio o the first page. To copy otherwise, to republish, to post o servers or to redistribute to lists, requires prior specific permissio ad/or a fee. ACM CIKM 09, Hog Kog, Chia Copyright 200X ACM X-XXXXX-XX-X/XX/XX...$ Witold Litwi Uiv. of Paris Dauphie Paris, Frace witold.litwi@dauphie.fr Thomas Schwarz Sata Clara Uiv. Sata Clara, CA, USA tjschwarz@scu.edu they fit ito RAM. Attempts to create versios that work from disks are recet, experimetal, ad focus typically o specific applicatios [7, 20]. The literature attributes this to bad locality of referece, a ecessarily complex pagig scheme, ad structural deterioratio caused by iserts ito the structure. A suffix array stores poiters o a list of suffixes sorted i lexicographic order, ad uses biary search for patter matchig. However, a stadard database architecture stores records i blocks ad supports dyamic isertios ad deletios that make difficult the maiteace of sequetial storage. A suffix array search eeds O(log 2 N) block accesses, where N is the size (umber of characters) of D. We are ot aware of a geeric solutio to these difficulties i the literature, ad without oe, these costs disqualify suffix arrays for our eeds. Two approaches that explicitly address disk-based idexig for full patter-matchig searches are the Strig B-Tree [7] ad -gram iverted idex [18, 11]. The Strig B-Tree is basically a combiatio of B + -Tree ad Patricia Tries. The global structure is that of a B + -Tree, where keys are poiters to suffixes i the database. Each ode is orgaized as a Patricia Trie, which helps guidig the search ad isert operatios. A Strig B-Tree fids all occurreces of a patter P i O( P + log B N) disk accesses, where B is the block size. Aother directio for text search eeds are idexes ivertig -grams ( cosecutive symbols) istead of etire words. They are disk friedly i that they rely o fast sequetial scas, provide a good locality of referece, ad easily adapt to pagig ad partitioig. However, the search cost is liear, both i the size of the database ad the size of the patter. I the preset paper, we itroduce a ew data structure for idexbased full-text search called Algebraic Sigature Idex (AS-Idex). It follows the path of a iverted file based o -grams. Its ovel attractive property, uique at preset to our kowledge, is costat disk search time, idepedet of the size of the database ad of the legth of the patter. This results from its stadard database approach for large-scale idexig (amely hashig). AS-Idex hash calculus is however specific. As the ame suggests, it relies o algebraic sigatures, (ASs) [15], whose algebraic properties we ca exploit. Our experimets show AS-Idex to be a very fast solutio to patter searches i a database. AS-Idex search oly eeds two disk lookups whe the hash directory fits i mai memory. I our experimets, it proved itself to be up to oe order of magitude faster tha -gram idexes ad twice faster that Strig B-Trees. The basic variat of AS-Idex has a storage overhead of about 5 to 6. A variat of our scheme oly idexes selective -grams ad has lower storage overhead at the costs of slower search times. All

2 these properties should make AS-Idex a practical solutio for text idexig. The paper is orgaized as follows. Sectio 2 gives a bird s eye view of the AS-Idex basic priciples. We recall the theory of algebraic sigatures (Sectio 3). We the discuss the AS-Idex structure i Sectio 4 ad the search algorithm i Sectio 5. Sectio 6 aalyses the scheme s behavior, especially collisio ad false positive probability, as well as performace. Sectio 8 explais the details of our implemetatio of Strig B-Trees, -gram idex ad AS-Idex ad experimets. We review related work i Sectio 9. Fially, we summarize ad give future research directios i Sectio AS-INDEX OVERVIEW AS-Idex is a classical hash file with variable legth disk-residet buckets, (Fig. 1). Buckets are poited to by the hash directory. Simplicity ad performace of such files attracted coutless applicatios. The mai advatage is costat access time, idepedet of file ad patter size. Costat time is ot possible for a tree/trie access method. Each bucket stores a list of etries, each idexig some -gram i the database. The basic AS-Idex is dese, idexig every - gram. The hash fuctio providig the bucket for a etry uses the -gram value as the hash key. The hash fuctio is particular: it relies o algebraic sigatures of -grams, to be described i the ext sectio. I what follows, we oly deal with the static AS-Idex, but the hash structure may use stadard mechaisms for dyamicity, scalabilty ad distributio. h(s1) S1 Hash directory h(s2) Sp Patter P e1 AS(e1, e2, Sp) e2 S2 failure success Figure 1: A matchig attempt with AS-Idex I overview, a search for a patter P proceeds as follows (Fig. 1): First, we preprocess P for three sigatures: (i) of the iitial - gram S 1, (ii) of the fial -gram S 2 ad (iii) of the suffix S p of P after S 1. Hashig o S 1 locates the bucket with every etry e 1 idexig a -gram i the database with the sigature of S 1. Likewise, hashig o S 2 locates the bucket with every etry e 2 idexig a -gram with the sigature of S 2. We oly cosider pairs of etries that are i the same record ad at the right distace amog them. We thus locate ay strig S matchig P o its iitial ad termial -gram, at least by sigature. A algebraic calculatio AS(e 1, e 2, S p) determies whether S p may match the suffix of S as well. The method is probabilistic i ature, with low chaces of false positives. We ca avoid eve a miute possibility of a false match by a symbol for symbol compariso betwee patter ad the relevat part of the record. By limitig disk accesses to the two buckets associated to the first ad last -grams of the P, AS-Idex search rus idepe- Strig B-Trees -Gram AS-Idex Costr. O( D log B D ) O( D ) O( D ) Storage O( D ) O( D ) O( D ) (ratio) ( 6-7) ( 6) ( 5-6) Preproc. oe O( P ) O( P ) Search O( P + log B D ) O( D P ) O(1) Table 1: Disk-based idex structures for searchig a patter P i a database D. B is the block size. detly from P s size. The cost of the search procedure outlied above is reduced to that of readig two buckets. The hash directory itself ca ofte be cached i RAM, or eeds at most two additioal disk accesses, as we will show. With a appropriate dyamic hashig mechaism that evely distributes the etries i the structure ad scales gracefully, the bucket size is expected to remai uiform eough to let the AS-Idex ru i costat time, idepedetly of the database size. Table 1 compares the aalytical behavior of AS-Idex with those of two competitors (Strig B-Trees ad -gram idex) ad summarizes its expected advatages. The size of all structures is liear i the size of the database. The ratio directly depeds o the size of idex etries. We metio i Table 1 the ratio obtaied i our implemetatio, before ay compressio. The asymptotic search time i the database size is liear for -idex, logarithmic for Strig B- Trees ad costat for AS-Idex. Moreover, oce the patter P has bee pre-processed, AS-Idex rus idepedetly from P s, whereas -gram idex cost is liear i P. I summary, AS-Idex efficietly idetifies matches with oly two lookups, for ay patter legth. This efficiecy is achieved through extesive use of properties of algebraic sigatures, described i the ext sectio. 3. ALGEBRAIC SIGNATURES We use a Galois field GF (2 f ) of size 2 f. The elemets of GF are bit strigs of legth f. Selectig f = 8 deals with ASCII records ad f = 16 with Uicode. We recall that a Galois field is a fiite set that supports additio ad multiplicatio. These operatios are associative, commutative ad distributive, have eutral elemets 0 ad 1, ad there exist additive ad multiplicative iverses. I a Galois field GF (2 f ), additio ad subtractio are implemeted as the bitwise XOR. Log/atilog tables provide usually the most practical method for multiplicatio [15]. We adopt the usual mathematical otatios for the operatios i what follows. We use a primitive elemet α of GF (2 f ). This meas that the powers of α eumerate all the o-zero elemets of the Galois field. It is well kow that there always exist primitive elemets. Let R = r 0r 1 r M 1 be a record with M symbols. We iterpret R as a sequece of GF elemets. Idetifyig the character set of the records with a Galois field provides a coveiet mathematical cotext to perform computatios o record cotets. DEFINITION 1. The m-symbols sigature (AS) of a record R is a vector AS m(r) with m coordiates (s 1, s 2,... s m) defied by 8 >< >: s 1 = r 0 + r 1 α + r 2 α r M 1 α M 1 s 2 = r 0 + r 1 α 2 + r 2 α r M 1 α 2(M 1). s m = r 0 + r 1 α m + r 2 α 2m... + r M 1 α m(m 1) (1) We refer the reader to [15] for more details about defiitios ad properties of algebraic sigatures.

3 I our examples, we give the m-symbol AS of R as the cocateatio of the values s m,, s 1 i hexadecimal otatio. For istace if s 1 = #34 ad s 2 = #12, the we write the 2-symbol AS as s 2s 1 = #1234. Symbol f m M K L r 0, r 1, r M 1 s 1, s m S 1, S m Iterpretatio Size (i bits) of a GF elemet i GF (2 f ) (f = 8 or f = 16) Size of -grams Size of sigature vectors, m Size of records Size of patters, K > Number of lies i the AS- Idex Record characters/symbols Oe-symbol sigatures Oe-symbol -gram sigatures Table 2: Table of the symbols used i the paper We use differet partial algebraic sigatures of patter ad database records, as we ow explai. DEFINITION 2. Let l [0, M 1] be ay positio (offset) i R. The Cumulative Algebraic Sigature (CAS) at l, CAS m(r, l), is the algebraic sigature of the prefix of R edig at r l, i.e., CAS m(r, l) = AS m(r 0... r l ). The Partial Algebraic Sigature (PAS) from l to l is the value P AS m(r, l, l) = AS m(r l r l +1 r l ), with 0 l l, Fially, we most ofte use the PAS of substrigs of legth, i.e., of -grams. DEFINITION 3. The -gram Algebraic Sigature (NAS) of R at l is NAS m(r, l) = P AS m(r, l + 1, l), for l 1. With other words: NAS m(r, l) = (r l r l α 1, Record RM r 0 CAS(l) r l r l r l α 2( 1),.., r l r l α m( 1) ) (2) PAS(l, l) NAS(l) r l +1 r l r M 1 Figure 2: Computig CAS(l), P AS(l, l) ad NAS(l) i record R M I all the defiitios, we may drop R wheever it is implicit for brevity s sake. Figure 2 shows the respective parts of the record that defie the CAS, P AS ad NAS at offset l. The followig simple properties of algebraic sigatures, expressed for coordiate i, 1 i m, are useful for what follows. We ote i-th symbol of a CAS, as CAS m(l) i ad proceed similarly for NAS ad PAS. 0 Hash directory D L 1 C 0 C i Buckets etries for CAS c Structure of a bucket etries for CAS c e1 e2 <r1, l1, c> <r2, l2, c>... <ri, li, c >... Figure 3: Structure of the AS-idex Properties 3 ad 4 let us icremetally calculate ext CAS ad NAS while buildig the file, or preprocessig the patter, istead of recomputig the sigature etirely. This speeds up the process cosiderably. Property 5 also speeds up the patter preprocessig, as it will appear. Property 6 fially is fudametal for the match attempt calculus. CAS m(l) i = CAS m(l 1) i + r l α il (3) NAS m(l) i = NASm(l 1)i r l α NAS m(l) i = For 0 l < l: + r l α i( 1) (4) CASm(l)i CASm(l )i α i(l +1) (5) CAS m(l) i = CAS m(l ) i + α i(l +1) P AS m(l + 1, l) i (6) Table 2 summarizes the symbols used i the paper. 4. STRUCTURE Our database cosists of records that are made up of a Record IDetifier (RID) ad some o-key field. (Extesios to databases with more tha oe key ad/or o-key field are straight-forward.) We assume that the o-key field cosists of strigs of symbols from our Galois field. Our search fids all occurreces of a patter i the o-key field of every record i the database. Whe we talk about offsets ad algebraic sigatures, we refer oly to the o-key field. If R is such a field ad r i a character (Galois field elemet) i R, the we call i the offset. A -gram G = r l +1 r l is ay sequece of cosecutive symbols i R. By extesio, we the call l the offset of the -gram. A AS-Idex cosists of etries. DEFINITION 4. Let G be a -gram at offset o i R. The etry idexig G, deoted E(G), is a triplet (rid, o, c) where rid is the RID of R, c is CAS 1(R, o), ad o is o modulo 2 f 1. We assure costat size of the etries by takig the remaider. The choice of the modulus is justified by the idetity χ 2f 1 = χ for all Galois field elemets χ. The idexig is dese, i.e., every -gram i the database is idexed, ad by a differet etry. To costruct the idex, we process all -grams i the database. AS-Idex is a hash file, deoted D[0..L 1], with directory legth L = 2 v beig a power of 2 (Fig. 3). Elemets of D refer to buckets or lies of variable legth, each cotaiig a list of

4 etries. Lies are of variable legth to accommodate a possible ueve distributio of -gram values. Each D[i] cotais the address of the i-th bucket. All together, the AS-Idex lie structure is similar to a postig list i a iverted file, except for the presece of the CAS c i each etry ad a specific represetatio of the offset l. Sice we use a hash file, lies should have a collisio resolutio method such as classical separate chaiig that uses poiters to a overflow space. Such a techique accomodates moderate growth, but if we eed to accomodate large growth, the we eed a dyamic hashig method such as liear hashig. We ow describe how to calculate the idex i of the lie for a etry E(G) = (rid, o, c). We calculate i from the m-symbol NAS NAS m(g) = (s 1,..., s m). The coordiates of the NAS are bit strigs. By cocateatig them, we obtai a bit strig S = s ms m 1... s 1 that we iterpret as a large, usiged iteger. The idex i is: i = h L(S) = S mod L Sice L = 2 v, this amouts to extractig the last v bits of S. It is easy to see that m should be such that m ad m v/f. The choice of AS-Idex parameters m, ad L will be further discussed i Sectio 6. EXAMPLE 1. Cosider a 100 GB database with byte-wide symbols (f = 8). For the sake of example, let L = 2 30, leadig to buckets with /2 30 = 93 etries o the average. Let = 5 ad m = 4. To calculate lie idex i of -gram G we thus cocateate s 4..s 1 of NAS 4(G). The we cut lower 30 bits. Now, cosider the record with RID 73 ad o-key field Uiversity Paris Dauphie. Assume that NAS 4( Uive, 4) is # a. Sice this is the first 5-gram, CAS 1(73, 4) has the same value as the first compoet, i.e., is #9a. For subsequet 5-grams, the first coordiate of the NAS ad the CAS usually differ. The etry is E = (73, 4, #9a). Its lie idex is # a mod 2 30 = # a (we remove the leadig 2 bits). 5. PATTERN SEARCH Let P = p 0... p K 1 be the patter to match. AS-Idex search delivers the RID of every record i the database that cotais P. The search algorithm first prepocesses P by extractig values that become the etries ito AS-Idex to locate matches. 5.1 Preprocessig The preprocessig phase computes three sigatures from the patter P : (a) the m-symbols AS of the 1st -gram i P, called S 1; (b) the 1-symbol PAS of the suffix of P followig the 1st -gram, S p = P AS 1(P,, K 1). (c) the m-symbols AS of the last -gram i P, deoted as S 2; There are several ways to compute these sigatures. For istace, oe may compute S 1, the S p, the extract the m-symbol value of S 2 through property (5). Figure 4 shows the parts of the patter that determie the sigatures S 1, S 2 ad S p o our ruig example. Recall that = 5 ad m = 4. We preprocess the patter P = Uiversity Paris Dauphie ad obtai (a) S 1 = NAS 4(P, 4) (for 5-gram Uive ); (b) S 2 = NAS 4(P, 24) (for 5-gram phie ) ad (c) S p = P AS 1(P, 5, 24). This iformatio is used to fid the occurreces of the patter i the database. Record R Patter P CAS c 1 NAS S 1 l 1 PAS S p CAS c 2 coferece at the Uiversity Paris Dauphie 5.2 Processig 0 5 Uiversity Paris Dauphie Figure 4: The search algorithm NAS S 2 l = 2 l +K 1 Let i = h L(S 1) ad i = h L(S 2). Every etry (R, l 1, c 1) i bucket D[i] idexes a -gram G i a record R whose NAS S G is such that h L(S G) = h L(S 1). Likewise, every etry (R, l 2, c 2) i bucket D[i ] idexes a -gram G such that h L(G ) = h L(S 2). Up to possible collisios, G ad G respectively equal the first ad last -grams i patter P. We search every pair (G, G ) able to characterize a matchig strig S = P. We must have R = R ad l 2 = (l 1 + K ) mod (2 f 1). The last compoet i G, c 2 should match the value implied by c 1 ad S p (see Figure 4). Algebraic property 6 implies that c 2 should be 0 Hash directory D L 1 i i <r1, l1, c1> <r3, l2, c2> c 2 = c 1 + α l 1+1 S p. (7) <r1, l4, c4> <r2, l5, c5> <r3, l6, c6>... <r3, l3, c3> bucket for NAS( phie ) check CAS ad positios bucket for NAS( Uive ) Figure 5: Usig AS-Idex for a search Figure 5 illustrates the process. By hashig o S 1 = Uive, we retrieve the bucket D[i] which cotais, amog others, the etries idexig all occurreces of Uive i the database. Similarly we retrieve the bucket D[i ] which cotais etries for all the occurreces of the -gram phie. A pair of etries [e i(r 1, l 1, c 1), e i(r 1, l 4, c 4)] i (D[i], D[i ]) represets a substrig s of r 1 that begis with Uive ad eds with phie. Checkig whether s matches P ivolves two tests: (i) we compute c 1 + α l1+1.s p ad compare it with c 4 to check whether the sigatures of the middle parts match, ad (ii) we verify as discussed whether l 1 matches l 2 give K, i.e., S ad P have the same legth. If both tests are succesful, we report a probable match. The ext attempt will cosider (r 3, l 2, c 2) i D[i] ad (r 3, l 6, c 6) i D[i ]. Note that (r 2, l 5, c 5) i D[i ] is skipped because there is o possible match o r 2 i D[i]. The pseudo-code of the algorithm is give below. Algorithm AS-SEARCH Iput: a patter P = p 0... p K 1, the -gram size Output: the list of records that cotai P begi // Preprocessig phase S 1 := NAS m(p, 1) S 2 := NAS m(p, K 1)

5 Nr. Collisios Take Expected 0 16,568, , , > Table 3: Actual ad expected umber of collisios usig a 3B sigature o a dictioary of moderately sized Eglish words. S p := P AS 1 (P,, K 1) i := h L (S 1 ) // i is the first lie idex i := h L (S 2 ) // i is the secod lie idex // Processig phase for each etry E(R, l 1, c 1 ) i D[i] c 2 = c 1 + α l 1+1 S p l 2 = (l 1 + K ) mod (2 f 1) if (there exists a etry E (R, l 2, c 2 ) i D[i ]) the Report success for R edif edfor ed The selectivity of the process relies o its ability to maipulate three distict sigatures, S 1, S 2 ad S p. Therefore the patter legth must be at least Collisio Hadlig As hashig i geeral, our method is subject to collisios deliverig false positives. To elimiate ay collisios, it is ecessary to post-process AS-Search by attemptig to actually fid P i every R idetified as a match. This requires a symbol by symbol compariso betwee P ad its presumed match locatio. It will appear however that AS-Search should typically have a egligible probability of a collisio. Hece post-processig may be left to presumably rare applicatios eedig full assurace. Also, it should be RAM-based ad therefore typically egligible with respect to the disk search time. We thus do ot detail it here. 6. ANALYSIS We ow preset a short, theoretical aalysis of the expected performace of AS-Idex. 6.1 Hash uiformity Algebraic sigature values ted to have a more uiform distributio tha the distributio of character values due to the multiplicatios by powers of α i their calculatios. However, the total umber of strigs or -grams i a dataset gives a upper boud for the umber of algebraic sigatures calculated from them. Biological databases ofte store DNA strigs i a ASCII file. Oly the four characters A, C, G, ad T appear as characters i such a file. There are oly 4 6 = 4K differet 6-grams i the file ad the umber of differet NAS of these sigatures caot exceed this value. Icreasig the umber of coordiates i the NAS beyod 5 symbols is ot goig to achieve better uiformity. This cautio applies oly to files usig a small alphabet. For a differet experimet, we chose a list of Eglish words ( words MB) six or more characters loger take from a world list used to perform a dictioary attack o a password file (by a admiistrator tryig to weed out weak passwords). We calculated the 3B three compoet sigature of all words with more tha five characters (65536 words) ad calculated the umber of sigatures attaied by i words as well as this umber for a perfect hash fuctio (Table 3). The χ 2 value of shows very close agreemet. Whe repeatig the test usig 2B two compoet sigatures, we obtaied χ 2 = A much smaller set had χ 2 = , but there were less sigatures attaied by a high umber ( 5) of words. These results give experimetal verificatio for the flatess of sigatures. 6.2 Storage ad Performace Idex costructio time. The properties (2) - (6) of algebraic sigatures allow us to calculate all etries with a liear sweep of all records. We eed to keep a poiter to the symbol just beyod the curret -gram ad to the first symbol of the curret -gram. Usig equatios (3) ad (4), we ca the calculate the NAS of the ext -gram from the old oe ad update the ruig CAS of the record. Sice creatig the etry for a -gram ad isertig it ito the idex take costat time, buildig the idex takes time liear to the size of the database. Storage costs. The storage complexity of AS-Idex is O(N) for N idexed -grams. The actual size of a RID should be 3-4 bytes sice 3 bytes already allow a database with 16M records. The actual storage per etry should be about 5-6 bytes, which results i a storage overhead of about (5 6)N. We ca lower this storage overhead, e.g. to 125%, by o-dese idexig, that we do ot preset here due to space limitatio, at the expese of a proportioal icrease i search time. EXAMPLE 2. We still cosider a 100 GB database with 8b symbols. Assumig a average record size of 100 symbols, we have 1G records ad our record idetifier eeds to be 4B log. We previously set the size of the CAS to 1B. With the 1B offset ito the record, the etry is 6B. AS-Idex should use about 6 times more space tha the origial database. Now assume records of 10KB each. The record idetifiers ca be 3B log. This gives a total etry size of 5B or a storage factor of 5. Each elemet of the hash directory stores a bucket address with at least log 2 (L) bits. I the case of our large 100GB database, with L = 2 30, choosig 4 bytes for the address leads to the required storage of 4.1GB, smaller tha the curret data servers stadard capacity. I most cases, D is expected to fit i mai memory. Patter preprocessig. To preprocess the patter, we eed to calculate a PAS ad two NAS. We calculate both i a similar maer as above ad obtai preprocessig times liear i the size of the patter. Sice the result depeds o all symbols i the patter, we caot do better. Search speed. We assume that etries are close to uiformly distributed. The search algorithm picks up two cells i the hash directory, i order to obtai the umber of etries ad the bucket refereces of, respectively, D[i] ad D[i ]. The the buckets themselves must be read. Each of these accesses may icur a radom disk access, hece a (costat) cost of at most four disk reads. If the hash directory resides i mai memory, the cost reduces to loadig the two buckets. The mai memory cost is a i-ram joi of the two buckets. With l deotig the expected lie legth, ad assumig the lies are ordered, the average complexity of this phase should be (uder our uiformity assumptio) 2l. The worst case is O(l 2 ). This case is highly ulikely, as all the etries o both lies should fit ito the same record. The optioal collisio resolutio (symbol-tosymbol) test adds O( P ). This test is typically performed i RAM ad makes oly a egligible cotributio to the otherwise costat costs. Altogether, the search cost is S = C hash + C buck + C ram + C post (8)

6 Device Processor speed RAM speed Flash disk access Magetic disk Disk trasfert rate Access time 2-3 Gz per core 100s ms 5-7 ms 300 MB/s. Table 4: Hardware characteristics where C hash represets the Hash Directory access cost, C buck the bucket access cost, C ram the RAM processig cost ad C post the post-processig. We ow evaluate the actual search time that may result from the above complexity figures. We take as basis the characteristics of the curret popular hardware show i Table 4 (see also [8] for a recet aalysis). C hash is the cost of fetchig 2 elemets i the hash directory. The trasfer overhead is egligible, ad the (worst case) cost is therefore i the rage [10, 14] ms for magetic disks. The bucket access cost C buck is similar to C hash regardig radom disk accesses, but we fetch far more bytes per access. We eed to trasfer 2l etries. Table 4 suggests that we ca trasfer up to 300 KB per ms (Flash trasfer rate are similar.) Sice the size of a etry is typically 5-6 bytes, each search loads 10 12l. It follows that for l = 1K, the lie trasfer cost is egligible. For l = 16K, 160 to 200 KB must be trasferred. The cost is about 0.5 to 0.7 ms. This is still egligible with respect to disk accesses for AS-Idex o magetic disk, but ot o solid oe. I the latter case, the trasfer cost is equivalet to a additioal disk access. The basic formula (average cost) for C ram, the i-ram joi of the two lies, is 2l. I practice it is l E, where E is a couple of visited etries processig cost. I detail, we have 2 RAM accesses to Rids. I this test is successful, we eed 2 additioal accesses to CASs, 2 to offsets o ad 1 access to the log table for algebraic computatios. A coservative evaluatio of E = 500s seems fair. The cost of the i-ram joi ca thus be estimated as 100µs for l = 200, 500µs for l = 1, 000, ad 8ms for l = 16K. The first cost is egligible whatever the storage media; ext oe is so for disk, but ot for flash, the latter is ot for either. Fially, the postprocessig cost P ca be estimated based o a uit cost of 100s per symbol. Eve for a 1,000 symbols log patter, the postprocessig costs 100µs, which remais egligible for both magetic disk or flash memories. Its importace also depeds o the umber of matches, of course. 6.3 Choice of file parameters The previous aalysis leads to the followig coclusios regardig the choice of AS-Idex parameters. As a geeral rule of thumb, oe must choose parameter L so as to maitai the hash directory D i RAM. Cachig D i RAM, wheever possible, saves two disk access o four. Settig a limit o L may lead to icrease the average lie legth l, but our aalysis shows that this remais beeficial eve whe l reaches hudreds or eve thousads of etries. Uder our assumptios, l should reach 16K to add the equivalet of 1 radom access to the search cost, at which poit oe may cosider elargig the hash directory beyod the RAM limits. Large values for l may also be beeficial with respect to other factors. First, larger lies accommodate a larger database for a fixed. Secod, we may choose a smaller, with a smaller miimal size of + 1 for patters. Let us illustrate this latter impact. We cotiue with our ruig example of a 100 GB database with byte-wide symbols (f = 8), L is 2 30, ad a average load of l = /2 30 = 93 etries per lie. This implies the choice of m, which must be such that 2 mf L, i.e., m (log 2 L)/f. I our example, the lie idex i eeds to cosist of at least 30 bits. Correspodigly, NAS m eeds to be at least that log. Sice each coordiate of NAS m cosists of 8b, the value for m eeds to be at least 4. Each NAS the cotais at least 32 bits. The -grams used eed to cotai at least m symbols. Otherwise, the rage of -gram values is smaller tha L ad certai lies will ot cotai ay etries. If the -grams are reasoably close to uiformly distributed, the rage of values is 256 ad we ca pick = m. Still referrig to our example with L = 2 30, we ca choose = 4. However, the actual character set used is most ofte smaller tha 256, or oly a fractio of the characters appear frequetly. This requires a larger, sice the rage of possible -grams must cotai at least L values. Let v be the umber of values we expect per symbol. I a simple ASCII text, the umber of pritable character codes is v = 96. DNA ecodig represets a extreme case with v = 4. The -gram size must be such that v L. With v = 96 (simple ASCII text) ad L = 2 30, must be set to 5, the smallest value such that 96 L, to geerate all required NAS values. These parameter values were actually used for Example 1. Cosider ow the case of a DNA database where oly 4 of the possible 256 ASCII characters appear i records. We eed to set = 15 i order to obtai the 2 30 possible sigatures. + 1 is the miimal patter legth we allow to search for. Such limit should ot be evertheless a practical costrait for a search over a small alphabet. The eed there is rather for log patters [3]. If evertheless it was a cocer, oe may choose a smaller at the price of fewer, hece loger, lies. For istace choosig = 10 ad L = 2 20 for our DNA database results i the average of 95K etries i each lie. The miimal patter size decreases by five, i.e., to + 1 = 11 symbols. 7. NON-DENSE INDEXING While our basic scheme performs well, the storage overhead of about 500% might be too high for some applicatios. We ow describe a variat that addresses this issue. 7.1 Skewed -gram sigature distributio As previously observed, a skewed distributio ca lead to may etries i a sigle AS-Idex lie. We ow modify our search procedure as follows. I our pre-processig phase, we choose q -grams, q 2, amog them the startig ad fial -gram of the patter. The simplest choice is to have the -grams evely distributed over the patter. The search first determies the shortest lie legth of all lies idexed by ay of the q -grams. If ay of these lies is empty, the we are doe: the patter is ot foud i the database. If the shortest lie happes to be the oe idexed by the iitial -gram i the patter, othig chages. If it is the oe idexed by the fial - gram i the patter, the calculatio of the PAS still uses formula (7) uchaged because subtractio ad additio are the same i the Galois field. Now c 1 is the CAS i the lie idexed by the last -gram ad c 2 the oe i the first lie, while l does ot chage. Otherwise, let a be the offset of the -gram with miimal cout of etries ad S a its NAS. For each etry i lie D[h L(S a)], we search for a matchig etry i the lie of the first or of the last - gram i the patter, depedig o which oe has the smaller cout. This amouts to matchig the part of the patter betwee the selected -gram ad either the begiig or the ed of the patter. If this succeeds, we cotiue to use our calculus to match the other

7 part of the patter. Sice we elimiate most mismatches i the first step, AS-Idex will ow come more quickly to a decisio o average. t t t t t t t t a) 1 < t < b) t = c) t > o aliged patter Figure 6: No-dese AS idexig: (a) overlappig -grams, (b) tilig, (c) lossy Our ext variat lowers the storage overhead by idexig oly some istead of all -grams. The disadvatages are higher search costs ad a larger miimal size for the patters that we ca search for. Figure 6 shows the idea. Startig with the first -gram i a record, we oly idex the -grams that are t > 1 symbols apart, where parameter t is the idexig rate. We thus idex oly -grams startig at the offsets 0, t, 2t,. The size of the idex is ow reduced by a factor of about 1/t. We ca distiguish three cases, lossless idexig if = t, lossy idexig if < t, ad overlappig idexig if > t (Figure 6). I all cases, trailig characters of the patter might ot cotribute to ay idexed NAS. The search procedure is the same i all three cases. It eeds to be modified from the base procedure because the occurrece of a patter i a strig might ot match the tilig of the records by the idexed -grams. Assume that the patter is P = p 0p 1,... p K 1. We defie substrigs P i, i = 0, 1,... t 1 of the patter as P i = p i, p i+1,... p i+lt+ 1 with l = (K )/t. Thus, P 0 is the substrig of the patter that begis with the first -gram i P, eds with the last -gram i P startig at offset lt 1, ad cotais all symbols of P i betwee. P 1 starts at the secod character ad fiishes with the last -gram startig a multiple of t characters after the first oe, etc. Our search procedure ow first tries to match P 0 i the database. More precisely, we process the lie idexed by the first -gram. For a occurrece, the last -gram i the substrig matchig the sub-patter is also i AS-Idex ad our matchig will succeed. We the successively search for sub-patters P 1, P 2,... P t 1. This is guarateed to fid ay occurrece of the patter i the database. Sice we actually match various subpatters, a diagosed match is ot based o all symbols i the patter ad we might eed to verify that the patter actually occurred. Cosider the search for our P = Uiversity Paris Dauphie, Figure 6. Let = 4 ad t = 4. Suppose the use of a odese tilig AS-Idex (Figure 6.b). The search begis, as with the dese idex, by attemptig to match S 1 = Uiv. The last -gram S 2 for the dese idex would be S 2 = hie. Now, S 2 = phi, i subpatter P 0. The matchig -gram hie i the visited record caot be idexed if Uiv is. Provided the match of P 0 succeeds, we read the record ad attempt to match e to the symbol followig P 0 i the record (provided it exists). Next, we attempt matchig with P 1, usig thus S1 = ive ad S 2 = hie. If this attempt succeeds, we attempt to mach U as above. The matchig attempts cotiue with P 2 havig S 1 = ive ad S 2 = auph. The completio requires fidig U ad ie i the record before ad after the P 2 match i the record. The fial roud attempts to match P 3 with S 1 = vers ad S 2 = uphi followed by direct testig of Ui ad of e. The fial result is the uio of all the successful matches. Compared with dese idexig, the storage space of this (tilig) AS-Idex is reduced by factor of four, e.g., falls to (oly) 125% of the database size. Likewise, this is at least four times less tha ay alterative method we discussed. I cotrast, we eed four times more disk accesses, i.e., 16 or 8 at best usually. Fially, the miimal idexed patter is i tur = 8 symbols, istead of five. Clearly, may applicatios may gladly accept the discussed tradeoffs. I particular, sice smaller AS-Idex may the fit oto a flash disk. As this oe is about te times faster tha a magetic oe, all together the AS-idexig gets actually about 2.5 times faster. I the ruig example of our 100 GB database, the reduced AS-Idex size would be about 125 MB, already fittig curret mass-produced flash disks. 7.2 Scalable Distributed AS-Idex This variat targets the maiteace of a AS-Idex of a growig database that might ultimately reach a very large size. A AS- Idex with static legth ad width will evetually see may ad large overflows. We propose to deal with this problem by covertig AS-Idex ito a Scalable Distributed Data Structure (SDDS). Appropriate choices are LH LH [14] or its high-availability variats LH RS [13]. These schemes target especially the distributed RAM (or flash) for storage as they are orders of magitude faster tha disks. I a utshell, a SDDS variat of AS-Idex, let us call it SDAS-Idex, would work as follows. We create the SDAS-Idex as AS-Idex at some ode (site), called ode 0, or LH -bucket 0. Hashig of -grams ito lies i SDAS is dyamic usig the liear hash addressig. The umber of lies grows through splits. These are triggered by iserts that result i a overflow of a AS-Idex lie. If the umber of lies exceeds the capacity of a ode, the LH will split the cotet of a ode, movig about half to a ew ode. Durig this split, every other lie remais at the curret ode ad the remaiig oes are set to the ew ode. A search i SDAS-Idex is essetially the same as i AS-Idex, except that access to a lie ivolves messagig. Network latecy becomes the domiat part of the search cost if the cotets of a ode are stored i memory. Also, processig lie etries is ow doe i parallel. If we apply performace results from implemetatios of LH systems, the SDAS-Idex searches should complete i a few milli secods. The precise aalysis ad experimetal cofirmatio remai to be doe. A SDAS-Idex implemeted i distributed RAM ca easily grow to thousads of odes, each with a capacity of easily 1GB. This gives a total capacity of about 1TB. Depedig o whether we use o-dese idexig or ot, we could idex a database of 200 GB to 1TB with expected search times i the order of a 0.1 msec to a few milli-secods. Usig disks, we ca target PB-scale databases with search times somewhere betwee 0.1 sec to 1 sec. 8. EXPERIMENTAL EVALUATION We describe i this sectio our experimetal settig ad results. We implemeted our AS structure as well as a Strig B-Tree [7] ad a -gram idex, based o iverted lists [24]. The ratioale

8 for Strig B-Tree, discussed more i what follows, is that it appears attractive for disk-based use ad is amog most recet proposals. As Sectio 9 discusses, the iverted list was the commo basis for may variats, e.g., -gram/2l [11]. Notice that all the discussed structures are tree/trie based. Hece, oe offers the costat search performace of AS-Idex. I other words, their search speed must deteriorate beyod the oe of AS-Idex for a sufficietly large database. All structures are coded i C++, ad we ru the experimet uder Liux o a 2.40 GHz duo-core processor with 2GB i mai memory ad two 80GB disks. We coduct our experimets o a database of records idetified by a uique ID ad with cotet cosistig of a sequece of oebyte characters. We treat each character as a elemet i the Galois field GF (2 8 ). The desig of the database is meat to represet a large spectrum of situatios ragig from may small records to large-size protei descriptios. We also cosider databases of DNA sequeces. All records are stored o disk. 8.1 Settigs We implemeted a Strig B-Tree structure [7], makig our best efforts to miimize the storage overhead. Each ode cotais a compact represetatio of a Patricia Trie [19]. I our implemetatio a disk block occupies 4K, ad we ca store at most 550 etries i each leaf. We build the -gram idex i two steps. First, we sca all record cotets i order to extract all -grams with their positio. For each -gram, we obtai a triple <gram, rid, offset> which we isert i a temporary file. The secod step sorts all triples ad creates lists of 6-bytes etries. The fial step builds the B+-tree which allows to access quickly to a list give a -gram. Our simple costructio i bulk is fast ad creates a compact structure. Our implemetatio of AS-Idex is static as well. First, the - grams are collected. For each -gram at offset l i a record R the hash key k = h L(NAS m(r, l)) ad the CAS c = CAS 1(R, l) are computed. The quadruplet < k, c, rid, l > is iserted i a temporary file. Secod, the temporary file is sorted o the hash key k. This groups together etries which must be iserted ito the same bucket. A etry cosists of a CAS (1 byte), a record id (2 bytes) ad a offset. The latter is a 1B iteger obtaied by takig the actual offset of the -gram i the record, modulo 255. Usig the CAS as a secodary sort key, oe places the etries ito the required order for isertio ito the bucket. We use three types of datasets with quite distict characteristics: alpha, da ad text. The alpha(σ) type cosists of sythetic ASCII records, with uiform distributio, ragig over a alphabet Σ which is a subset of the exteded ASCII characters. We cosider two alphabets: Σ 26, with oly 26 characters, ad Σ full with all the 256 symbols that ca be ecoded with f = 8 bits. We call the resultig datasets alpha(26) ad alpha(full). They allow us to compare the behavior of our structure to the theoretical aalysis i Sectio 6. The secod type, da, cosists of real DNA records extracted from the UCSC database 1. For the types alpha(full), alpha(26) ad da, we composed datasets ragig from 10MB to 100MB. The last type, text, cosists of real text records created from ASCII files of large Eglish books. The typical size of a text record is 1-2 MB, ad we created a 100MB database of text files. The -gram size is set to 8 for DNA files, ad to 4 for the other datasets. 8.2 Space occupacy ad build time 1 File AS-idex -gram idex Str. B-Tree 5 MB 23.1 (20+3.1) 35 (30+5) MB 43.1 (40+3.1) 65 (60+5) MB ( ) 185 (180+5) MB ( ) 365 (360+5) MB ( ) 605 (600+5) Table 5: Idex sizes i MB for alpha(26) files File AS-idex -gram idex Str. B-Tree 5 MB 20.4 (20+0.4) 31 (30+1) MB 40.4 (40+0.4) 61 (60+1) MB ( ) 181 (180+1) MB ( ) 361 (360+1) MB ( ) 601 (600+1) Table 6: Idex sizes i MB for DNA files The size of the AS-idex is the sum of the size of cells ad the directory which stores the umber of etries for each bucket. The size of the -gram idex is the sum of the size of iverted lists ad the size of the B+tree. The size of the Strig B-Tree is the size of the B+tree where each ode is a serialized Patricia Trie. Tables 5, 6, ad 7 give respectively the idex sizes for alpha, DNA ad text files. The sizes of idexes are comparable. The size of AS-Idex beefits from its smaller etries (4 bytes), mostly due to the 1B size of the offset. Sophisticated compressio techiques [24] would beefit all structures. -gram idex offsets could be limited to 3B or eve 2B, which would still allow the size of records to grow to up to 2 24 or 2 16 symbols. If the database has a few log records with may occurreces of the same -gram, the we ca save space by storig each rid oly oce i the list, followed by a possibly compressed list of offsets. If, o the cotrary, the database cosists of may, relatively small records, compressio based o delta-codig ca be evisaged. Both implemetatios use space better. The Strig B-Tree requires more space with 7B etries i the leaves ad 11B etries for iteral odes. For real datasets, either da or text files, the distributio is far from beig uiform. Table 8 shows the distributio of the umber of etries from two real 100MB databases. The average umber of etries is 1,534 for the DNA database, ad 372 for the text database, with a importat variace. I the worst case (text files), the largest list has 1,480,008 etries. This fully justify our choice of storig the umber of etries i the directory, ad of usig this iformatio to sca the smallest list durig a search operatio. The buildig time is proportioal to the size of the idex. Our structures are built i bulk after sortig all -gram etries i a temporary file. This leads to comparable performaces. O our machie, the buildig time for a 100 MB file is about 1,000 s, ad the buldig rate is about 120 KB/s. A compariso of dyamic builds remais for future work. For AS-Idex, this would reduce mostly to the stadard techique of maitaiig a dyamic hash file. 8.3 Search time File AS-idex -gram idex Str. B-Tree 100 MB ( ) ( ) Table 7: Idex sizes i MB for text files

9 File Average Mi Max Std. dev. DNA 1, ,599 1,953 text ,480,008 5,543 Table 8: Distributio of etries for the real datasets (100 MB files) K AS gram Spd-up Str BT Spd-up Table 9: Search time i ms for 100 MB alpha(full) files We performed extesively patter searches i our databases. We extracted the patters from the files to guaratee that at least oe result is foud. Patter sizes rage from 10 symbols to 500 symbols. To avoid iitializatio costs ad side effects such as CPU or memory cotetio from other OS processes, we performed each search repeatedly util the search times stabilized. We report the average search time over a ru of oe hudred search operatios. We report the results o search time, i ms, i Tables 9, 10, 11 ad 12, for 100MB files. As expected, the Strig B-Tree ad AS-Idex behavior is costat regardless of the legth of the patter, while -gram idex performace degrades liearly with this legth. For our 100MB files, the height of the Strig B-Tree is 3 idepedetly of the alphabet size, for 10 6 idexed substrigs (recall that the faout if 550, ad that our bulk isertio creates full odes). The root is always i the cache, as well as a sigificat part of the level below the root, depedig o the idexed file size. The Strig B-Tree traversal is geerally reduced to a sigle disk access for loadig a leaf ode. I additio, each lookup i a ode requires a additioal radom disk access to the database i order to fetch the full strig. This leads to a fial cost of 4-5 physical disk accesses. The search time with Strig B-Tree is idepedet of the size of the patter ad of the size of the alphabet. Searchig with the AS-Idex takes about 17 ms for alpha(full), 30 ms for alpha(26) files, 30 ms for da files ad 18 ms for text files (real data). This is cosistet with the aalytical cost discussed i Sectio 6. The alpha datasets are uiformly geerated, ad this results i a almost costat umber of etries per bucket. Accordigly, the search is doe i few operatios. The differece betwee alpha(full) ad alpha(26) is explaied by the size of the buckets which are larger for alpha(26) sice the -gram values rage over the set of 26 4 possibilities compare to for alpha(full). For real data (DNA ad text), buckets are likely to be larger, either because the alphabet is so small that the set of existig -gram values is bouded ad caot fully beefit from the hash fuctio (see Subsectio 6.1 for a discussio), or because of o uiformity. The former case correspods to the DNA, the latter to our real text files. Table 8 shows that, o average, the umber of etries i a bucket is larger for DNA (1,534 etries) tha for text (372). The cost of DNA search is accordigly slightly higher ( 30 ms, agaist 20 ms). Recall that our algorithm chooses the smallest bucket for drivig the search, which limits the impact of skewed datasets ad the variace of search times. Figure 7 summarizes the results discussed above, ad illustrates the liearity of search times for gram-idex ad the costat K AS gram Spd-up Str BT Spd-up search time (ms) Table 10: Search time for 100 MB for alpha(26) files DNA - as_idex DNA - str_btree DNA - gr_idex text - as_idex text - str_btree text - gr_idex patter 200 size Figure 7: Search result o 100MB files, for varyig patter size search times for AS-Idex ad Strig B-Tree. Figure 8 shows the evolutio of search times as the size of the database icreases from 10 MB to 100 MB. Each curve represets the results for a give dataset, with patters cosistig of 10 to 200 symbols. As before, the AS-Idex exhibits a almost costat behavior (appr ms), eve for very large patters (200 symbols) searched i large files (100 MB). The search time with the Strig B-Tree slightly icreases with the size of the file (but remais costat with the patter size). The tree height remais equals to 3, however the umber of odes icreases with the file size. This accouts for a lower probability of a buffer hit durig tree traversal. The logarithmic behavior of the Strig B-Tree is almost blurred here, because of the large ode faout (550). It would appear with larger files. Give a file size greater tha 167MB for istace (i.e., correspodig to more tha strigs), its height would icrease to 4, with two (2) additioal disk accesses o average for a search. The search time evolves (sub)liearly for -gram idex, both i the size of the patter, ad i the size of the database. The slope is steeper for large files. This is explaied by the ecessity to sca a umber of iverted lists which is proportioal to the size of the patter. I additio, larger files imply larger lists, hece the behavior illustrated by Figure 8. However the cost remais subliear (the cost for 100MB is oly 3 times higher tha the cost for 5MB). This is due to the merge process which stops whe the smallest list has bee fully scaed, thereby avoidig a complete access to all lists. Figure 9 ad Figure 10 summarize the ratio of search times, givig the speed-up of AS-Idex over respectively the -gram idex ad the Strig B-Tree. The alpha(full) dataset family is a special case. Because of the uiform distributio that produces a large umber of -grams ragig over all the possible values, fidig the K AS gram Spd-up Str BT Spd-up Table 11: Search time for 100 MB da files

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015. Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Hash Tables xkcd. http://xkcd.com/221/. Radom Number. Used with permissio uder Creative

More information

CS200: Hash Tables. Prichard Ch CS200 - Hash Tables 1

CS200: Hash Tables. Prichard Ch CS200 - Hash Tables 1 CS200: Hash Tables Prichard Ch. 13.2 CS200 - Hash Tables 1 Table Implemetatios: average cases Search Add Remove Sorted array-based Usorted array-based Balaced Search Trees O(log ) O() O() O() O(1) O()

More information

The Magma Database file formats

The Magma Database file formats The Magma Database file formats Adrew Gaylard, Bret Pikey, ad Mart-Mari Breedt Johaesburg, South Africa 15th May 2006 1 Summary Magma is a ope-source object database created by Chris Muller, of Kasas City,

More information

Sorting in Linear Time. Data Structures and Algorithms Andrei Bulatov

Sorting in Linear Time. Data Structures and Algorithms Andrei Bulatov Sortig i Liear Time Data Structures ad Algorithms Adrei Bulatov Algorithms Sortig i Liear Time 7-2 Compariso Sorts The oly test that all the algorithms we have cosidered so far is compariso The oly iformatio

More information

Elementary Educational Computer

Elementary Educational Computer Chapter 5 Elemetary Educatioal Computer. Geeral structure of the Elemetary Educatioal Computer (EEC) The EEC coforms to the 5 uits structure defied by vo Neuma's model (.) All uits are preseted i a simplified

More information

ECE4050 Data Structures and Algorithms. Lecture 6: Searching

ECE4050 Data Structures and Algorithms. Lecture 6: Searching ECE4050 Data Structures ad Algorithms Lecture 6: Searchig 1 Search Give: Distict keys k 1, k 2,, k ad collectio L of records of the form (k 1, I 1 ), (k 2, I 2 ),, (k, I ) where I j is the iformatio associated

More information

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);

More information

Data Structures and Algorithms. Analysis of Algorithms

Data Structures and Algorithms. Analysis of Algorithms Data Structures ad Algorithms Aalysis of Algorithms Outlie Ruig time Pseudo-code Big-oh otatio Big-theta otatio Big-omega otatio Asymptotic algorithm aalysis Aalysis of Algorithms Iput Algorithm Output

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:

More information

Lower Bounds for Sorting

Lower Bounds for Sorting Liear Sortig Topics Covered: Lower Bouds for Sortig Coutig Sort Radix Sort Bucket Sort Lower Bouds for Sortig Compariso vs. o-compariso sortig Decisio tree model Worst case lower boud Compariso Sortig

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

CIS 121 Data Structures and Algorithms with Java Spring Stacks, Queues, and Heaps Monday, February 18 / Tuesday, February 19

CIS 121 Data Structures and Algorithms with Java Spring Stacks, Queues, and Heaps Monday, February 18 / Tuesday, February 19 CIS Data Structures ad Algorithms with Java Sprig 09 Stacks, Queues, ad Heaps Moday, February 8 / Tuesday, February 9 Stacks ad Queues Recall the stack ad queue ADTs (abstract data types from lecture.

More information

CIS 121 Data Structures and Algorithms with Java Fall Big-Oh Notation Tuesday, September 5 (Make-up Friday, September 8)

CIS 121 Data Structures and Algorithms with Java Fall Big-Oh Notation Tuesday, September 5 (Make-up Friday, September 8) CIS 11 Data Structures ad Algorithms with Java Fall 017 Big-Oh Notatio Tuesday, September 5 (Make-up Friday, September 8) Learig Goals Review Big-Oh ad lear big/small omega/theta otatios Practice solvig

More information

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming Lecture Notes 6 Itroductio to algorithm aalysis CSS 501 Data Structures ad Object-Orieted Programmig Readig for this lecture: Carrao, Chapter 10 To be covered i this lecture: Itroductio to algorithm aalysis

More information

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects. The

More information

Computers and Scientific Thinking

Computers and Scientific Thinking Computers ad Scietific Thikig David Reed, Creighto Uiversity Chapter 15 JavaScript Strigs 1 Strigs as Objects so far, your iteractive Web pages have maipulated strigs i simple ways use text box to iput

More information

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time ( 3.1) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step- by- step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Ruig Time Most algorithms trasform iput objects ito output objects. The

More information

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Pseudocode ( 1.1) High-level descriptio of a algorithm More structured

More information

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis Itro to Algorithm Aalysis Aalysis Metrics Slides. Table of Cotets. Aalysis Metrics 3. Exact Aalysis Rules 4. Simple Summatio 5. Summatio Formulas 6. Order of Magitude 7. Big-O otatio 8. Big-O Theorems

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies

More information

1 Graph Sparsfication

1 Graph Sparsfication CME 305: Discrete Mathematics ad Algorithms 1 Graph Sparsficatio I this sectio we discuss the approximatio of a graph G(V, E) by a sparse graph H(V, F ) o the same vertex set. I particular, we cosider

More information

Lecture 5. Counting Sort / Radix Sort

Lecture 5. Counting Sort / Radix Sort Lecture 5. Coutig Sort / Radix Sort T. H. Corme, C. E. Leiserso ad R. L. Rivest Itroductio to Algorithms, 3rd Editio, MIT Press, 2009 Sugkyukwa Uiversity Hyuseug Choo choo@skku.edu Copyright 2000-2018

More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

Big-O Analysis. Asymptotics

Big-O Analysis. Asymptotics Big-O Aalysis 1 Defiitio: Suppose that f() ad g() are oegative fuctios of. The we say that f() is O(g()) provided that there are costats C > 0 ad N > 0 such that for all > N, f() Cg(). Big-O expresses

More information

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved.

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved. Chapter 11 Frieds, Overloaded Operators, ad Arrays i Classes Copyright 2014 Pearso Addiso-Wesley. All rights reserved. Overview 11.1 Fried Fuctios 11.2 Overloadig Operators 11.3 Arrays ad Classes 11.4

More information

The isoperimetric problem on the hypercube

The isoperimetric problem on the hypercube The isoperimetric problem o the hypercube Prepared by: Steve Butler November 2, 2005 1 The isoperimetric problem We will cosider the -dimesioal hypercube Q Recall that the hypercube Q is a graph whose

More information

6.854J / J Advanced Algorithms Fall 2008

6.854J / J Advanced Algorithms Fall 2008 MIT OpeCourseWare http://ocw.mit.edu 6.854J / 18.415J Advaced Algorithms Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 18.415/6.854 Advaced Algorithms

More information

AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures

AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures ABSTRACT Cédric du Mouza CNAM Paris, France dumouza@cnam.fr Philippe Rigaux INRIA Saclay Orsay, France philippe.rigaux@inria.fr

More information

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Virtual Memory Prof. Yajig Li Uiversity of Chicago A System with Physical Memory Oly Examples: most Cray machies early PCs Memory early all embedded systems

More information

CIS 121 Data Structures and Algorithms with Java Spring Stacks and Queues Monday, February 12 / Tuesday, February 13

CIS 121 Data Structures and Algorithms with Java Spring Stacks and Queues Monday, February 12 / Tuesday, February 13 CIS Data Structures ad Algorithms with Java Sprig 08 Stacks ad Queues Moday, February / Tuesday, February Learig Goals Durig this lab, you will: Review stacks ad queues. Lear amortized ruig time aalysis

More information

Heaps. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015

Heaps. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015 Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 201 Heaps 201 Goodrich ad Tamassia xkcd. http://xkcd.com/83/. Tree. Used with permissio uder

More information

3D Model Retrieval Method Based on Sample Prediction

3D Model Retrieval Method Based on Sample Prediction 20 Iteratioal Coferece o Computer Commuicatio ad Maagemet Proc.of CSIT vol.5 (20) (20) IACSIT Press, Sigapore 3D Model Retrieval Method Based o Sample Predictio Qigche Zhag, Ya Tag* School of Computer

More information

Lecture 1: Introduction and Strassen s Algorithm

Lecture 1: Introduction and Strassen s Algorithm 5-750: Graduate Algorithms Jauary 7, 08 Lecture : Itroductio ad Strasse s Algorithm Lecturer: Gary Miller Scribe: Robert Parker Itroductio Machie models I this class, we will primarily use the Radom Access

More information

Speeding-up dynamic programming in sequence alignment

Speeding-up dynamic programming in sequence alignment Departmet of Computer Sciece Aarhus Uiversity Demark Speedig-up dyamic programmig i sequece aligmet Master s Thesis Dug My Hoa - 443 December, Supervisor: Christia Nørgaard Storm Pederse Implemetatio code

More information

How do we evaluate algorithms?

How do we evaluate algorithms? F2 Readig referece: chapter 2 + slides Algorithm complexity Big O ad big Ω To calculate ruig time Aalysis of recursive Algorithms Next time: Litterature: slides mostly The first Algorithm desig methods:

More information

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design College of Computer ad Iformatio Scieces Departmet of Computer Sciece CSC 220: Computer Orgaizatio Uit 11 Basic Computer Orgaizatio ad Desig 1 For the rest of the semester, we ll focus o computer architecture:

More information

. Written in factored form it is easy to see that the roots are 2, 2, i,

. Written in factored form it is easy to see that the roots are 2, 2, i, CMPS A Itroductio to Programmig Programmig Assigmet 4 I this assigmet you will write a java program that determies the real roots of a polyomial that lie withi a specified rage. Recall that the roots (or

More information

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis Outlie ad Readig Aalysis of Algorithms Iput Algorithm Output Ruig time ( 3.) Pseudo-code ( 3.2) Coutig primitive operatios ( 3.3-3.) Asymptotic otatio ( 3.6) Asymptotic aalysis ( 3.7) Case study Aalysis

More information

What are we going to learn? CSC Data Structures Analysis of Algorithms. Overview. Algorithm, and Inputs

What are we going to learn? CSC Data Structures Analysis of Algorithms. Overview. Algorithm, and Inputs What are we goig to lear? CSC316-003 Data Structures Aalysis of Algorithms Computer Sciece North Carolia State Uiversity Need to say that some algorithms are better tha others Criteria for evaluatio Structure

More information

CSE 417: Algorithms and Computational Complexity

CSE 417: Algorithms and Computational Complexity Time CSE 47: Algorithms ad Computatioal Readig assigmet Read Chapter of The ALGORITHM Desig Maual Aalysis & Sortig Autum 00 Paul Beame aalysis Problem size Worst-case complexity: max # steps algorithm

More information

condition w i B i S maximum u i

condition w i B i S maximum u i ecture 10 Dyamic Programmig 10.1 Kapsack Problem November 1, 2004 ecturer: Kamal Jai Notes: Tobias Holgers We are give a set of items U = {a 1, a 2,..., a }. Each item has a weight w i Z + ad a utility

More information

Minimum Spanning Trees

Minimum Spanning Trees Miimum Spaig Trees Miimum Spaig Trees Spaig subgraph Subgraph of a graph G cotaiig all the vertices of G Spaig tree Spaig subgraph that is itself a (free) tree Miimum spaig tree (MST) Spaig tree of a weighted

More information

Ones Assignment Method for Solving Traveling Salesman Problem

Ones Assignment Method for Solving Traveling Salesman Problem Joural of mathematics ad computer sciece 0 (0), 58-65 Oes Assigmet Method for Solvig Travelig Salesma Problem Hadi Basirzadeh Departmet of Mathematics, Shahid Chamra Uiversity, Ahvaz, Ira Article history:

More information

Homework 1 Solutions MA 522 Fall 2017

Homework 1 Solutions MA 522 Fall 2017 Homework 1 Solutios MA 5 Fall 017 1. Cosider the searchig problem: Iput A sequece of umbers A = [a 1,..., a ] ad a value v. Output A idex i such that v = A[i] or the special value NIL if v does ot appear

More information

Octahedral Graph Scaling

Octahedral Graph Scaling Octahedral Graph Scalig Peter Russell Jauary 1, 2015 Abstract There is presetly o strog iterpretatio for the otio of -vertex graph scalig. This paper presets a ew defiitio for the term i the cotext of

More information

Pattern Recognition Systems Lab 1 Least Mean Squares

Pattern Recognition Systems Lab 1 Least Mean Squares Patter Recogitio Systems Lab 1 Least Mea Squares 1. Objectives This laboratory work itroduces the OpeCV-based framework used throughout the course. I this assigmet a lie is fitted to a set of poits usig

More information

1.2 Binomial Coefficients and Subsets

1.2 Binomial Coefficients and Subsets 1.2. BINOMIAL COEFFICIENTS AND SUBSETS 13 1.2 Biomial Coefficiets ad Subsets 1.2-1 The loop below is part of a program to determie the umber of triagles formed by poits i the plae. for i =1 to for j =

More information

Message Integrity and Hash Functions. TELE3119: Week4

Message Integrity and Hash Functions. TELE3119: Week4 Message Itegrity ad Hash Fuctios TELE3119: Week4 Outlie Message Itegrity Hash fuctios ad applicatios Hash Structure Popular Hash fuctios 4-2 Message Itegrity Goal: itegrity (ot secrecy) Allows commuicatig

More information

Operating System Concepts. Operating System Concepts

Operating System Concepts. Operating System Concepts Chapter 4: Mass-Storage Systems Logical Disk Structure Logical Disk Structure Disk Schedulig Disk Maagemet RAID Structure Disk drives are addressed as large -dimesioal arrays of logical blocks, where the

More information

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method A ew Morphological 3D Shape Decompositio: Grayscale Iterframe Iterpolatio Method D.. Vizireau Politehica Uiversity Bucharest, Romaia ae@comm.pub.ro R. M. Udrea Politehica Uiversity Bucharest, Romaia mihea@comm.pub.ro

More information

Examples and Applications of Binary Search

Examples and Applications of Binary Search Toy Gog ITEE Uiersity of Queeslad I the secod lecture last week we studied the biary search algorithm that soles the problem of determiig if a particular alue appears i a sorted list of iteger or ot. We

More information

prerequisites: 6.046, 6.041/2, ability to do proofs Randomized algorithms: make random choices during run. Main benefits:

prerequisites: 6.046, 6.041/2, ability to do proofs Randomized algorithms: make random choices during run. Main benefits: Itro Admiistrivia. Sigup sheet. prerequisites: 6.046, 6.041/2, ability to do proofs homework weekly (first ext week) collaboratio idepedet homeworks gradig requiremet term project books. questio: scribig?

More information

Big-O Analysis. Asymptotics

Big-O Analysis. Asymptotics Big-O Aalysis 1 Defiitio: Suppose that f() ad g() are oegative fuctios of. The we say that f() is O(g()) provided that there are costats C > 0 ad N > 0 such that for all > N, f() Cg(). Big-O expresses

More information

IMP: Superposer Integrated Morphometrics Package Superposition Tool

IMP: Superposer Integrated Morphometrics Package Superposition Tool IMP: Superposer Itegrated Morphometrics Package Superpositio Tool Programmig by: David Lieber ( 03) Caisius College 200 Mai St. Buffalo, NY 4208 Cocept by: H. David Sheets, Dept. of Physics, Caisius College

More information

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a 4. [10] Usig a combiatorial argumet, prove that for 1: = 0 = Let A ad B be disjoit sets of cardiality each ad C = A B. How may subsets of C are there of cardiality. We are selectig elemets for such a subset

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Ruig Time of a algorithm Ruig Time Upper Bouds Lower Bouds Examples Mathematical facts Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite

More information

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence _9.qxd // : AM Page Chapter 9 Sequeces, Series, ad Probability 9. Sequeces ad Series What you should lear Use sequece otatio to write the terms of sequeces. Use factorial otatio. Use summatio otatio to

More information

Chapter 3 Classification of FFT Processor Algorithms

Chapter 3 Classification of FFT Processor Algorithms Chapter Classificatio of FFT Processor Algorithms The computatioal complexity of the Discrete Fourier trasform (DFT) is very high. It requires () 2 complex multiplicatios ad () complex additios [5]. As

More information

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS APPLICATION NOTE PACE175AE BUILT-IN UNCTIONS About This Note This applicatio brief is iteded to explai ad demostrate the use of the special fuctios that are built ito the PACE175AE processor. These powerful

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 19 Query Optimizatio Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Query optimizatio Coducted by a query optimizer i a DBMS Goal:

More information

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 9 Poiters ad Dyamic Arrays Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 9.1 Poiters 9.2 Dyamic Arrays Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Slide 9-3

More information

2. ALGORITHM ANALYSIS

2. ALGORITHM ANALYSIS 2. ALGORITHM ANALYSIS computatioal tractability survey of commo ruig times 2. ALGORITHM ANALYSIS computatioal tractability survey of commo ruig times Lecture slides by Kevi Waye Copyright 2005 Pearso-Addiso

More information

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers * Load balaced Parallel Prime umber Geerator with Sieve of Eratosthees o luster omputers * Soowook Hwag*, Kyusik hug**, ad Dogseug Kim* *Departmet of Electrical Egieerig Korea Uiversity Seoul, -, Rep. of

More information

Fast Fourier Transform (FFT) Algorithms

Fast Fourier Transform (FFT) Algorithms Fast Fourier Trasform FFT Algorithms Relatio to the z-trasform elsewhere, ozero, z x z X x [ ] 2 ~ elsewhere,, ~ e j x X x x π j e z z X X π 2 ~ The DFS X represets evely spaced samples of the z- trasform

More information

Combination Labelings Of Graphs

Combination Labelings Of Graphs Applied Mathematics E-Notes, (0), - c ISSN 0-0 Available free at mirror sites of http://wwwmaththuedutw/ame/ Combiatio Labeligs Of Graphs Pak Chig Li y Received February 0 Abstract Suppose G = (V; E) is

More information

The Closest Line to a Data Set in the Plane. David Gurney Southeastern Louisiana University Hammond, Louisiana

The Closest Line to a Data Set in the Plane. David Gurney Southeastern Louisiana University Hammond, Louisiana The Closest Lie to a Data Set i the Plae David Gurey Southeaster Louisiaa Uiversity Hammod, Louisiaa ABSTRACT This paper looks at three differet measures of distace betwee a lie ad a data set i the plae:

More information

6.851: Advanced Data Structures Spring Lecture 17 April 24

6.851: Advanced Data Structures Spring Lecture 17 April 24 6.851: Advaced Data Structures Sprig 2012 Prof. Erik Demaie Lecture 17 April 24 Scribes: David Bejami(2012), Li Fei(2012), Yuzhi Zheg(2012),Morteza Zadimoghaddam(2010), Aaro Berstei(2007) 1 Overview Up

More information

Sorting 9/15/2009. Sorting Problem. Insertion Sort: Soundness. Insertion Sort. Insertion Sort: Running Time. Insertion Sort: Soundness

Sorting 9/15/2009. Sorting Problem. Insertion Sort: Soundness. Insertion Sort. Insertion Sort: Running Time. Insertion Sort: Soundness 9/5/009 Algorithms Sortig 3- Sortig Sortig Problem The Sortig Problem Istace: A sequece of umbers Objective: A permutatio (reorderig) such that a ' K a' a, K,a a ', K, a' of the iput sequece The umbers

More information

Appendix A. Use of Operators in ARPS

Appendix A. Use of Operators in ARPS A Appedix A. Use of Operators i ARPS The methodology for solvig the equatios of hydrodyamics i either differetial or itegral form usig grid-poit techiques (fiite differece, fiite volume, fiite elemet)

More information

Analysis of Algorithms

Analysis of Algorithms Presetatio for use with the textbook, Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Aalysis of Algorithms Iput 2015 Goodrich ad Tamassia Algorithm Aalysis of Algorithms

More information

Improving Template Based Spike Detection

Improving Template Based Spike Detection Improvig Template Based Spike Detectio Kirk Smith, Member - IEEE Portlad State Uiversity petra@ee.pdx.edu Abstract Template matchig algorithms like SSE, Covolutio ad Maximum Likelihood are well kow for

More information

A Parallel DFA Minimization Algorithm

A Parallel DFA Minimization Algorithm A Parallel DFA Miimizatio Algorithm Ambuj Tewari, Utkarsh Srivastava, ad P. Gupta Departmet of Computer Sciece & Egieerig Idia Istitute of Techology Kapur Kapur 208 016,INDIA pg@iitk.ac.i Abstract. I this

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter The Processor Part A path Desig Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler. CPI ad

More information

Chapter 8. Strings and Vectors. Copyright 2014 Pearson Addison-Wesley. All rights reserved.

Chapter 8. Strings and Vectors. Copyright 2014 Pearson Addison-Wesley. All rights reserved. Chapter 8 Strigs ad Vectors Overview 8.1 A Array Type for Strigs 8.2 The Stadard strig Class 8.3 Vectors Slide 8-3 8.1 A Array Type for Strigs A Array Type for Strigs C-strigs ca be used to represet strigs

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5.

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5. Morga Kaufma Publishers 26 February, 208 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Virtual Memory Review: The Memory Hierarchy Take advatage of the priciple

More information

Massachusetts Institute of Technology Lecture : Theory of Parallel Systems Feb. 25, Lecture 6: List contraction, tree contraction, and

Massachusetts Institute of Technology Lecture : Theory of Parallel Systems Feb. 25, Lecture 6: List contraction, tree contraction, and Massachusetts Istitute of Techology Lecture.89: Theory of Parallel Systems Feb. 5, 997 Professor Charles E. Leiserso Scribe: Guag-Ie Cheg Lecture : List cotractio, tree cotractio, ad symmetry breakig Work-eciet

More information

FREQUENCY ESTIMATION OF INTERNET PACKET STREAMS WITH LIMITED SPACE: UPPER AND LOWER BOUNDS

FREQUENCY ESTIMATION OF INTERNET PACKET STREAMS WITH LIMITED SPACE: UPPER AND LOWER BOUNDS FREQUENCY ESTIMATION OF INTERNET PACKET STREAMS WITH LIMITED SPACE: UPPER AND LOWER BOUNDS Prosejit Bose Evagelos Kraakis Pat Mori Yihui Tag School of Computer Sciece, Carleto Uiversity {jit,kraakis,mori,y

More information

Evaluation scheme for Tracking in AMI

Evaluation scheme for Tracking in AMI A M I C o m m u i c a t i o A U G M E N T E D M U L T I - P A R T Y I N T E R A C T I O N http://www.amiproject.org/ Evaluatio scheme for Trackig i AMI S. Schreiber a D. Gatica-Perez b AMI WP4 Trackig:

More information

CS 683: Advanced Design and Analysis of Algorithms

CS 683: Advanced Design and Analysis of Algorithms CS 683: Advaced Desig ad Aalysis of Algorithms Lecture 6, February 1, 2008 Lecturer: Joh Hopcroft Scribes: Shaomei Wu, Etha Feldma February 7, 2008 1 Threshold for k CNF Satisfiability I the previous lecture,

More information

Algorithm. Counting Sort Analysis of Algorithms

Algorithm. Counting Sort Analysis of Algorithms Algorithm Coutig Sort Aalysis of Algorithms Assumptios: records Coutig sort Each record cotais keys ad data All keys are i the rage of 1 to k Space The usorted list is stored i A, the sorted list will

More information

Data Structures Week #9. Sorting

Data Structures Week #9. Sorting Data Structures Week #9 Sortig Outlie Motivatio Types of Sortig Elemetary (O( 2 )) Sortig Techiques Other (O(*log())) Sortig Techiques 21.Aralık.2010 Boraha Tümer, Ph.D. 2 Sortig 21.Aralık.2010 Boraha

More information

Σ P(i) ( depth T (K i ) + 1),

Σ P(i) ( depth T (K i ) + 1), EECS 3101 York Uiversity Istructor: Ady Mirzaia DYNAMIC PROGRAMMING: OPIMAL SAIC BINARY SEARCH REES his lecture ote describes a applicatio of the dyamic programmig paradigm o computig the optimal static

More information

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #2: Randomized MST and MST Verification January 14, 2015

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #2: Randomized MST and MST Verification January 14, 2015 15-859E: Advaced Algorithms CMU, Sprig 2015 Lecture #2: Radomized MST ad MST Verificatio Jauary 14, 2015 Lecturer: Aupam Gupta Scribe: Yu Zhao 1 Prelimiaries I this lecture we are talkig about two cotets:

More information

CSE 2320 Notes 8: Sorting. (Last updated 10/3/18 7:16 PM) Idea: Take an unsorted (sub)array and partition into two subarrays such that.

CSE 2320 Notes 8: Sorting. (Last updated 10/3/18 7:16 PM) Idea: Take an unsorted (sub)array and partition into two subarrays such that. CSE Notes 8: Sortig (Last updated //8 7:6 PM) CLRS 7.-7., 9., 8.-8. 8.A. QUICKSORT Cocepts Idea: Take a usorted (sub)array ad partitio ito two subarrays such that p q r x y z x y y z Pivot Customarily,

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 22 Database Recovery Techiques Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Recovery algorithms Recovery cocepts Write-ahead

More information

One advantage that SONAR has over any other music-sequencing product I ve worked

One advantage that SONAR has over any other music-sequencing product I ve worked *gajedra* D:/Thomso_Learig_Projects/Garrigus_163132/z_productio/z_3B2_3D_files/Garrigus_163132_ch17.3d, 14/11/08/16:26:39, 16:26, page: 647 17 CAL 101 Oe advatage that SONAR has over ay other music-sequecig

More information

Alpha Individual Solutions MAΘ National Convention 2013

Alpha Individual Solutions MAΘ National Convention 2013 Alpha Idividual Solutios MAΘ Natioal Covetio 0 Aswers:. D. A. C 4. D 5. C 6. B 7. A 8. C 9. D 0. B. B. A. D 4. C 5. A 6. C 7. B 8. A 9. A 0. C. E. B. D 4. C 5. A 6. D 7. B 8. C 9. D 0. B TB. 570 TB. 5

More information

Lecture 6. Lecturer: Ronitt Rubinfeld Scribes: Chen Ziv, Eliav Buchnik, Ophir Arie, Jonathan Gradstein

Lecture 6. Lecturer: Ronitt Rubinfeld Scribes: Chen Ziv, Eliav Buchnik, Ophir Arie, Jonathan Gradstein 068.670 Subliear Time Algorithms November, 0 Lecture 6 Lecturer: Roitt Rubifeld Scribes: Che Ziv, Eliav Buchik, Ophir Arie, Joatha Gradstei Lesso overview. Usig the oracle reductio framework for approximatig

More information

Chapter 8. Strings and Vectors. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 8. Strings and Vectors. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 8 Strigs ad Vectors Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 8.1 A Array Type for Strigs 8.2 The Stadard strig Class 8.3 Vectors Copyright 2015 Pearso Educatio, Ltd..

More information

MATHEMATICAL METHODS OF ANALYSIS AND EXPERIMENTAL DATA PROCESSING (Or Methods of Curve Fitting)

MATHEMATICAL METHODS OF ANALYSIS AND EXPERIMENTAL DATA PROCESSING (Or Methods of Curve Fitting) MATHEMATICAL METHODS OF ANALYSIS AND EXPERIMENTAL DATA PROCESSING (Or Methods of Curve Fittig) I this chapter, we will eamie some methods of aalysis ad data processig; data obtaied as a result of a give

More information

GPUMP: a Multiple-Precision Integer Library for GPUs

GPUMP: a Multiple-Precision Integer Library for GPUs GPUMP: a Multiple-Precisio Iteger Library for GPUs Kaiyog Zhao ad Xiaowe Chu Departmet of Computer Sciece, Hog Kog Baptist Uiversity Hog Kog, P. R. Chia Email: {kyzhao, chxw}@comp.hkbu.edu.hk Abstract

More information

A Study on the Performance of Cholesky-Factorization using MPI

A Study on the Performance of Cholesky-Factorization using MPI A Study o the Performace of Cholesky-Factorizatio usig MPI Ha S. Kim Scott B. Bade Departmet of Computer Sciece ad Egieerig Uiversity of Califoria Sa Diego {hskim, bade}@cs.ucsd.edu Abstract Cholesky-factorizatio

More information

On Infinite Groups that are Isomorphic to its Proper Infinite Subgroup. Jaymar Talledo Balihon. Abstract

On Infinite Groups that are Isomorphic to its Proper Infinite Subgroup. Jaymar Talledo Balihon. Abstract O Ifiite Groups that are Isomorphic to its Proper Ifiite Subgroup Jaymar Talledo Baliho Abstract Two groups are isomorphic if there exists a isomorphism betwee them Lagrage Theorem states that the order

More information

Arithmetic Sequences

Arithmetic Sequences . Arithmetic Sequeces COMMON CORE Learig Stadards HSF-IF.A. HSF-BF.A.1a HSF-BF.A. HSF-LE.A. Essetial Questio How ca you use a arithmetic sequece to describe a patter? A arithmetic sequece is a ordered

More information

Lecture 2: Spectra of Graphs

Lecture 2: Spectra of Graphs Spectral Graph Theory ad Applicatios WS 20/202 Lecture 2: Spectra of Graphs Lecturer: Thomas Sauerwald & He Su Our goal is to use the properties of the adjacecy/laplacia matrix of graphs to first uderstad

More information

EE123 Digital Signal Processing

EE123 Digital Signal Processing Last Time EE Digital Sigal Processig Lecture 7 Block Covolutio, Overlap ad Add, FFT Discrete Fourier Trasform Properties of the Liear covolutio through circular Today Liear covolutio with Overlap ad add

More information

Project 2.5 Improved Euler Implementation

Project 2.5 Improved Euler Implementation Project 2.5 Improved Euler Implemetatio Figure 2.5.10 i the text lists TI-85 ad BASIC programs implemetig the improved Euler method to approximate the solutio of the iitial value problem dy dx = x+ y,

More information

DATA STRUCTURES. amortized analysis binomial heaps Fibonacci heaps union-find. Data structures. Appetizer. Appetizer

DATA STRUCTURES. amortized analysis binomial heaps Fibonacci heaps union-find. Data structures. Appetizer. Appetizer Data structures DATA STRUCTURES Static problems. Give a iput, produce a output. Ex. Sortig, FFT, edit distace, shortest paths, MST, max-flow,... amortized aalysis biomial heaps Fiboacci heaps uio-fid Dyamic

More information

New HSL Distance Based Colour Clustering Algorithm

New HSL Distance Based Colour Clustering Algorithm The 4th Midwest Artificial Itelligece ad Cogitive Scieces Coferece (MAICS 03 pp 85-9 New Albay Idiaa USA April 3-4 03 New HSL Distace Based Colour Clusterig Algorithm Vasile Patrascu Departemet of Iformatics

More information