Speeding-up dynamic programming in sequence alignment


Department of Computer Science
Aarhus University, Denmark

Speeding-up dynamic programming in sequence alignment

Master's Thesis

Dung My Hoang

December, Supervisor: Christian Nørgaard Storm Pedersen

Implementation code:

Abstract

Computing the optimal cost and alignment of two sequences is one of the most fundamental problems in bioinformatics. The standard dynamic programming based algorithm computes the optimal cost and alignment in $O(n^2)$ time and space. Hirschberg gave an algorithm to compute the optimal alignment in $O(n)$ space, but the time remained $O(n^2)$. The first major improvement of the asymptotic running time was the Four-Russians speedup. It reduced the time to $O(n^2/\log n)$ for computing the optimal cost, but it did not address how to compute the optimal alignment within that time. That was given by Kundeti and Rajasekaran, who combined Hirschberg's algorithm with the Four-Russians speedup. However, the Four-Russians speedup is not commonly used in practice, perhaps because of the overhead it incurs.

The motivation of this thesis is to investigate whether there is a practical speedup from using the Four-Russians technique on dynamic programming in sequence alignment. This thesis also discusses issues which arise when implementing the Four-Russians speedup for edit distance and edit script in quadratic as well as linear space. In addition, implementations using the Four-Russians speedup with a theoretical speedup of $O(t)$ rather than $O(\log n)$ are presented, together with experiments showing their performance and an evaluation of the applicability of the Four-Russians speedup to dynamic programming in sequence alignment.

Dung My Hoang - 443

Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Thesis outline
2 Background
  2.1 Sequence alignment
    2.1.1 Interpretation
    2.1.2 Optimal cost and alignment
    2.1.3 Time and space
  2.2 Dynamic programming
    2.2.1 Time and space
  2.3 Four-Russians speedup
    2.3.1 t-block
    2.3.2 Block function
    2.3.3 Offset trick
    2.3.4 Applying the offset block function
    2.3.5 Time and space
3 Implementation and experimental setup
  3.1 Implementation
    3.1.1 Data handling
    3.1.2 Reading data from a representation
    3.1.3 Managing sequences which are not a multiple of t
    3.1.4 Preprocessing
    3.1.5 Offset block function
  3.2 Experiments
    3.2.1 Computer specification
    3.2.2 Test data and parameters
    3.2.3 Expectations
4 Edit distance in quadratic space
  4.1 Four-Russians speedup
    4.1.1 Time and space
  4.2 Experimental results
    4.2.1 Ratio
    4.2.2 Multiple pair-wise alignments
5 Edit script in quadratic space
  5.1 Four-Russians speedup
    5.1.1 Handling padded sequences
    5.1.2 Time and space
  5.2 Experimental results
6 Edit distance in linear space
  6.1 Four-Russians speedup
    6.1.1 Time and space
  6.2 Experimental results
    6.2.1 Ratio
    6.2.2 Multiple pair-wise alignments
7 Edit script in linear space
  7.1 Hirschberg's algorithm
    7.1.1 Finding the sub-path
    7.1.2 Splitting into sub-problems
    7.1.3 Time and space
  7.2 Four-Russians speedup
    7.2.1 Backtracking using offset block function
    7.2.2 Handling padded sequences
    7.2.3 Time and space
  7.3 Experimental results
    7.3.1 Ratio
8 Summary and discussion
  8.1 Quadratic vs. linear space
    8.1.1 Edit distance
    8.1.2 Edit script
9 Future work
  9.1 Local alignment and general cost function
  9.2 Parallel programming
10 Conclusion

1 Introduction

One of the most common methods used for inferring the biological function of genes is sequence similarity search in protein and DNA sequence databases. With the development of rapid methods for sequence alignment, results based solely on sequence homology have become routine. Although dynamic programming based sequence alignment does provide optimal solutions, it is computationally expensive. Therefore, the most commonly used methods are currently based on heuristics, such as BLAST (Basic Local Alignment Search Tool), which are much faster but give up the guarantee of an optimal solution. Speed is important given the size and growth of the sequence databases currently available, so the continuing development of fast and accurate algorithms is appealing.

The first algorithm using dynamic programming in global alignment was introduced by Needleman and Wunsch [3]. Later, algorithms for variations of global alignment, such as local alignment and affine gap cost, were developed in [6], [6] and []. The first major improvement in the asymptotic running time was achieved in [], also known as the Four-Russians speedup or Four Russian algorithm (although only one of them was Russian). The algorithm improves the running time for computing the optimal cost by a factor of $O(\log n)$, but it did not address how to compute the optimal alignment within the same running time. Hirschberg [9] gave an algorithm for computing the optimal alignment in $O(n^2)$ time and $O(n)$ space. The space saving idea in Hirschberg's algorithm was applied in [7] and []. However, the asymptotic running time of computing the optimal alignment remained $O(n^2)$. Parallel algorithms for the optimal cost were studied in [3] and [5]. In [4] linear space parallel algorithms were given, but the asymptotic running time was still assumed to be $O(n^2)$. A survey of all these algorithms can be found in [8]. In [], Hirschberg's algorithm was combined with the Four-Russians speedup, giving an $O(n^2/\log n)$ algorithm for computing both the optimal cost and alignment in $O(n)$ space.

1.1 Motivation

Sequence alignment is a fundamental application in bioinformatics, where it is used to infer functional, structural and evolutionary relationships between sequences. This makes it an interesting area in which to explore speedup possibilities. Even though the Four Russian paradigm has been applied to a number of dynamic programming algorithms, an actual implementation is rare [4], [5].

The motivation of this thesis is to explore practical speedup possibilities of dynamic programming in sequence alignment. I decided to focus on the Four-Russians speedup, as the theoretical speedup is a major improvement in the computation of sequence alignment. I will investigate the applicability of the Four-Russians speedup to dynamic programming in global sequence alignment and then compare the Four-Russians speedup with a standard dynamic programming based algorithm [3] and Hirschberg's algorithm [9].

1.2 Objectives

The main objective of this thesis is to evaluate the applicability of the Four-Russians speedup to dynamic programming in sequence alignment. I also aim to show that the Four-Russians speedup does not just give a theoretical but also a practical speedup in the computation of sequence alignment.

The theoretical running time cannot always be achieved in practice, as the practical running time depends on the implementation choices. To fully understand the practical running time, a discussion is presented of the implementation issues and details that come with an implementation of the Four-Russians speedup, and of how they may affect the practical running time. There are several elements that need to be taken into account: how to handle data, how to speed up the preprocessing part, and how to efficiently apply the Four-Russians speedup to a standard implementation of dynamic programming in sequence alignment. The implementations will be based on edit distance and edit script for global alignment in quadratic as well as linear space.

1.3 Thesis outline

Section 2 will give an introduction to sequence alignment and the algorithms used in the different implementations. Section 3 presents implementation issues and details on procedures and elements which are common to both the quadratic and linear space implementations of the Four-Russians speedup. Experiments used to test the implementations, and the expectations of these, are also presented in section 3. In sections 4, 5, 6 and 7, the implementation issues and details for the different implementations will be presented, followed by the results of the experiments. To summarize the experimental results for the different implementations, an overview and a discussion comparing the results will be presented in section 8. Other results which are not presented in sections 4, 5, 6 and 7 will also be presented there together with a discussion of these. A discussion of other speedup possibilities and some ideas to further improve the practical running time of the Four-Russians speedup in sequence alignment will be presented in section 9. Finally, in section 10, the applicability of the Four-Russians speedup to dynamic programming in sequence alignment will be evaluated.

2 Background

In this section I will give an overview of the algorithms used for the different implementations: a detailed introduction to sequence alignment, dynamic programming and the Four-Russians speedup. I will also cover the terminology, notation and ideas used for the different implementations. Hirschberg's algorithm is only used to compute an optimal alignment in linear space, so it is not covered in this section, but in section 7 along with the implementation issues and details.

In this section the sequences are assumed to be of equal length. How to handle sequences of different length will be discussed in the implementation details for the different implementations.

2.1 Sequence alignment

In bioinformatics, a sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity. Highly similar regions may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.

2.1.1 Interpretation

Two aligned characters correspond to either a match in both sequences or a substitution from sequence A to sequence B, i.e. a point mutation. A gap introduced in sequence A corresponds to an insertion in sequence B, and a gap introduced in sequence B corresponds to a deletion in sequence A, i.e. indels.

2.1.2 Optimal cost and alignment

The similarity of two sequences is measured as a cost for transforming sequence A into sequence B. There are different types of cost measures. A commonly used measure, the edit distance, counts the number of operations required for the transformation. In a generalization of the edit distance, the cost of each operation is summed instead of just counting the number of operations needed to transform one sequence into another.

Sequence alignment is about optimizing the cost of the alignment. To emphasize similarity, the objective is to maximize the number of matches. To explain differences, the objective is to minimize the number of indels and point mutations. Hence this is an optimization problem.

There are different types of edit distances. The Hamming distance only measures substitutions between sequences. The Levenshtein distance, normally referred to as edit distance, measures all the operations mentioned in section 2.1.1. The Damerau-Levenshtein distance includes one more operation which is not mentioned in section 2.1.1, a transposition of two adjacent characters.

Computing the edit distance corresponds to setting the cost of each operation to one. In the general case each operation can have a different cost, which means that substitutions, insertions and deletions can be weighted differently. To represent the cost of different operations, a cost table and a gap function are typically used.

An optimal alignment is an alignment with an optimal cost. Regions of mutations can be identified from an optimal alignment, so often the optimal alignment is more important than the optimal cost.
Henceforth edit distance will refer to the Levenshtein distance, and edit script will refer to an optimal alignment with the optimal edit distance.

2.1.3 Time and space

A naive algorithm to compute the edit distance is shown in figure 1(a). Implementing this naively, without storing the edit distances of sub-sequences, is very time consuming, as the edit distance for the same sub-sequences will be computed multiple times. The space only depends on the size of the sequences, i.e. $O(n)$.

Computing an edit script is almost the same as computing the edit distance. The only difference is that after the optimal edit distance has been found for an entry, we need to backtrack, i.e. compute the alignment which gave rise to the edit distance. An algorithm for computing the edit script is shown in figure 1(b). Finding the edit script is even more time consuming than computing the optimal edit distance because of additional calls to the cost function. Space consumption remains the same as for computing the edit distance.

    optcost(i,j):
        d, v, h, s = undef
        if i > 0 and j > 0
            d = optcost(i-1,j-1) + ed(A[i],B[j])
        if i > 0 and j >= 0
            v = optcost(i-1,j) + 1
        if i >= 0 and j > 0
            h = optcost(i,j-1) + 1
        if i = 0 and j = 0
            s = 0
        return min(d,v,h,s)

    (a) Computing the edit distance naively. Here ed(x,y) is 0 if x = y and 1 otherwise, and undefined values are ignored by min.

    optalign(i,j):
        same as in optcost except the last line
        o = min(d,v,h,s)
        if o = d
            optalign(i-1,j-1)
            align A[i] with B[j]
        if o = v
            optalign(i-1,j)
            align A[i] with a gap
        if o = h
            optalign(i,j-1)
            align B[j] with a gap
        if o = s
            return

    (b) Computing the edit script naively.

Figure 1: Simple implementation.

2.2 Dynamic programming

Dynamic programming is a method of breaking complex optimization problems into smaller optimization sub-problems, whose solutions can be combined to solve the entire optimization problem. To be able to use dynamic programming, the problem has to consist of slightly smaller overlapping sub-problems, and the procedure to solve the problem must repeatedly solve the same sub-problems. The idea is to store solutions to sub-problems, which can be retrieved later to solve bigger sub-problems and thereby solve the entire optimization problem.

Sequence alignment is an optimization problem where the optimal solution can be computed by combining the optimal solutions of slightly smaller sub-problems. The solutions to the sub-problems are used multiple times to compute other sub-problems, i.e. the sub-problems overlap. So applying dynamic programming to sequence alignment will improve the computation time of finding the edit distance and script. The edit distance for each pair of sub-sequences is stored in a distance table, so it can be retrieved for later use.

2.2.1 Time and space

An algorithm for computing the edit distance with dynamic programming is almost the same as the naive version. Iterate over $i = 0, \ldots, n$ and $j = 0, \ldots, n$ and replace each recursive function call with a lookup in the distance table.
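The table-filling procedure just described can be sketched in a few lines. This is a minimal illustration rather than the thesis implementation: the function name `edit_distance` and the use of Python lists for the distance table are assumptions made here, and unit costs (Levenshtein distance) are assumed throughout.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a full (n+1) x (m+1) distance table."""
    n, m = len(a), len(b)
    # D[i][j] holds the edit distance between a[:i] and b[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix with the empty sequence costs one gap per character.
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    # Each recursive call of the naive version becomes a table lookup.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j - 1] + sub,  # diagonal: match or substitution
                D[i - 1][j] + 1,        # vertical: a[i-1] aligned with a gap
                D[i][j - 1] + 1,        # horizontal: b[j-1] aligned with a gap
            )
    return D[n][m]
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3. The sketch also accepts sequences of different length, although this section assumes equal lengths.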
Filling the table in this order means that each value required to compute the edit distance in an entry has already been computed and can be retrieved from the distance table. This is also known as forward dynamic programming. The time complexity is $O(n^2)$, as each entry in the distance table is computed once in constant time. There are $(n+1)^2$ pairs of sub-sequences, as the empty sequence is also treated as a sub-sequence, hence the space consumption is $O(n^2)$.

When the distance table has been filled out, computing the optimal edit script can be done by backtracking in the distance table. The algorithm is almost the same as the naive version, but with each recursive function call replaced with a lookup in the distance table. Making three lookups in the distance table for each entry takes constant time, which means the time consumption only depends on the number of entries we need to visit in the distance table. The worst case is when sequences A and B are only aligned with gaps, hence $O(n)$ time.

2.3 Four-Russians speedup

The Four-Russians speedup is a method to speed up dynamic programming. The general idea is to partition the distance table into t-blocks and compute the essential values in the table one block at a time. The essential values in the distance table are the values required to compute a block. The goal is to use only $O(t)$ time on each block, instead of the normal $\Theta(t^2)$ [8].

2.3.1 t-block

Now consider the standard dynamic programming procedure of computing the edit distance in a block of the distance table (see figure 2). The block D is computed from sub-seq A and sub-seq B, the start value S, the row R and the column C. It is clear that the block D is a function of these, hence the last row and column of the block are a function of sub-seq A and B, start value S, row R and column C.

Figure 2: A block in the distance table, with sub-seq A along the rows, sub-seq B along the columns, start value S in the upper left corner, first row R, first column C and block D.

Let a t-block be a block of size $t \times t$ in the distance table, where the last row in the block is shared with the first row in the block below it (if any), and the last column in the block is shared with the first column in the block to its right (if any).

2.3.2 Block function

Given sub-seq A and B, start value S, row R and column C, the block function computes the last row and column of the block. The computation time is $\Theta(t^2)$ when done naively. The goal is to use only $O(t)$ time on each block. One way is to precompute the output for all possible inputs to the block function, so that the last row and column can be looked up in $O(t)$ time. By definition each entry can hold a distance value from zero to $n$, so there are $n+1$ possible values for each entry and $(n+1)^t$ possible t-length rows or columns. Hence the number of possible input combinations to the block function is $(n+1)^{2t}\sigma^{2t}$, where $\sigma$ is the size of the alphabet.
For each input, the block function takes $\Theta(t^2)$ time to compute the last row and column of a block. So the overall time to precompute the function output is $\Theta((n+1)^{2t}\sigma^{2t}t^2)$. But as $t$ is at least one, this gives an $\Omega(n^2)$ precomputation time. So there is no speedup with this solution.
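To make the inputs and outputs of the block function concrete, the following is a small sketch of the naive $\Theta(t^2)$ version. The name `block_function` and its argument layout are assumptions chosen for illustration: `row` and `col` hold the known distance values of the first row and first column excluding the shared corner `start`, so each sub-sequence has one character per remaining row or column.

```python
def block_function(sub_a, sub_b, start, row, col):
    """Naively compute the last row and last column of one block.

    start : distance value in the upper left corner (S)
    row   : values of the first row to the right of S, one per char of sub_b (R)
    col   : values of the first column below S, one per char of sub_a (C)
    Returns (last_row, last_col), both without the entries on R and C.
    """
    prev = [start] + list(row)          # previous row of the block
    last_col = []
    for i, ca in enumerate(sub_a):
        cur = [col[i]]                  # entry taken from the first column
        for j, cb in enumerate(sub_b):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + cost,   # diagonal
                           prev[j + 1] + 1,  # vertical
                           cur[j] + 1))      # horizontal
        last_col.append(cur[-1])
        prev = cur
    return prev[1:], last_col
```

Calling `block_function("ACG", "AGG", 0, [1, 2, 3], [1, 2, 3])` treats the whole table for ACG against AGG as one block and returns `([2, 1, 1], [2, 2, 1])`. The last entry of either list is the edit distance 1.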

2.3.3 Offset trick

The dominant term in the precomputation time is $(n+1)^{2t}$, as the size of $\sigma$ is assumed fixed. Now consider the values in the distance table. Each $D[i,j]$, with $i, j > 0$, is computed from the values $D[i-1,j-1]$, $D[i-1,j]$ and $D[i,j-1]$. If $A[i] \neq B[j]$ then $D[i,j]$ is equal to the minimum of the three entries plus one. If $A[i] = B[j]$ then $D[i,j]$ is equal to $D[i-1,j-1]$. Hence $D[i,j]$ is less than or equal to each of these entries plus one.

Conversely, for adjacent row entries, the optimal edit distance of $A[1 \ldots i]$ and $B[1 \ldots j]$ is located in $D[i,j]$, and by omitting $B[j]$ from the alignment, the optimal edit distance for $A[1 \ldots i]$ and $B[1 \ldots j-1]$ is located in $D[i,j-1]$. If the alignment matches $B[j]$ with some character in $A[1 \ldots i]$, then omitting $B[j]$ from the alignment increases the distance by at most one. If $B[j]$ is not matched, then its omission will decrease the distance by at most one. Hence $D[i,j-1] \le D[i,j] + 1$. The same goes for adjacent column entries.

For adjacent diagonal entries, if $A[i]$ is aligned with $B[j]$ then it is clear that $D[i-1,j-1] \le D[i,j] + 1$. If $A[i]$ is not aligned with $B[j]$ then either $A[i]$ or $B[j]$ is aligned with a gap, and $D[i-1,j-1] \le D[i,j]$. Hence two adjacent entries can differ by at most one.

With this knowledge it is easy to see that a row or column can be represented as a start value and the difference (offset) of each subsequent entry in the row or column. An offset vector is then a t-length vector of values in $\{-1, 0, 1\}$, where the first entry must be zero. The key to making the Four-Russians speedup efficient is to compute the edit distance using only the offset vectors, because the number of offsets is much less than the number of possible distances. This makes the precomputation time only $O(3^{2t}\sigma^{2t}t^2)$.

Computing the offset vector of the last row and column of a t-block can be done without any actual edit distance. Consider a t-block in the distance table where the upper left corner is $D[i,j] = S$, where $S$ is an unknown edit distance.
Then for a column $k$ in the block, the value in $D[i,k]$ is $S$ plus the total of the offsets in row $i$ from column $j+1$ to $k$. So even though the value of $S$ is unknown, the value of the entry can be expressed as $S$ plus a value which is computed from the row offset vector in row $i$. Each $D[k,j]$ can be expressed the same way.

Now let $D[i,j+1] = S + I$ and $D[i+1,j] = S + J$, where $I$ and $J$ are known from the offset vectors of row $i$ and column $j$. If $A[i+1] = B[j+1]$ then $D[i+1,j+1] = D[i,j]$. If $A[i+1] \neq B[j+1]$ then $D[i+1,j+1]$ is the minimum of $D[i,j] + 1$, $D[i,j+1] + 1$ and $D[i+1,j] + 1$. The comparison can be done by knowing the values of $I$ and $J$, hence $D[i+1,j+1]$ can be expressed as $S$, $S + 1$, $S + I + 1$ or $S + J + 1$. This way every entry in a block can be expressed as an unknown $S$ plus a value that can be determined. Since every entry involves the same variable $S$, the offset vector of the last row and column of a block can be determined with an arbitrary value of $S$.

2.3.4 Applying the offset block function

To use the offset block function, cover the $(n+1) \times (n+1)$ distance table with t-blocks, with overlapping rows and columns. Initialize the first row and column of the distance table and find the offset values for them. Row-wise, determine the last row and column of each block. Because the blocks overlap, the last row in a block provides the first row in the block below it (if any) and the last column in a block provides the first column in the block to its right (if any). If $Q$ is the total of the offset values computed for entries in row $n$, then $D[n,n] = D[n,0] + Q = n + Q$.
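The procedure above can be sketched as follows. Several things here are assumptions made for illustration and are not taken from the thesis: the function names, the use of offset vectors of length $t - 1$ (the leading zero is dropped), and the use of `functools.lru_cache` to fill the block table lazily instead of precomputing every input combination up front. The sketch also requires both lengths to be multiples of $t - 1$ and leaves padding aside.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def offset_block(sub_a, sub_b, row_off, col_off):
    """Offsets of the last row and last column of a block.

    sub_a, sub_b     : strings of length t - 1
    row_off, col_off : tuples of offsets in {-1, 0, 1} for the first row/column
    The start value S is arbitrary, so 0 is used.
    """
    prev = [0]
    for o in row_off:                   # rebuild first row relative to S = 0
        prev.append(prev[-1] + o)
    first_col, v = [], 0
    for o in col_off:                   # rebuild first column relative to S = 0
        v += o
        first_col.append(v)
    last_col = [prev[-1]]               # top entry of the last column
    for i, ca in enumerate(sub_a):
        cur = [first_col[i]]
        for j, cb in enumerate(sub_b):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + cost, prev[j + 1] + 1, cur[j] + 1))
        last_col.append(cur[-1])
        prev = cur
    last_row_off = tuple(prev[j + 1] - prev[j] for j in range(len(sub_b)))
    last_col_off = tuple(last_col[i + 1] - last_col[i] for i in range(len(sub_a)))
    return last_row_off, last_col_off


def four_russians_distance(a, b, t):
    """Edit distance computed block by block using only offset vectors."""
    k = t - 1                           # blocks overlap, so they advance by t - 1
    n, m = len(a), len(b)
    if n % k or m % k:
        raise ValueError("sketch assumes both lengths are multiples of t - 1")
    # Row 0 of the table is 0, 1, 2, ..., so all its offsets are +1.
    top = [(1,) * k for _ in range(m // k)]
    for bi in range(n // k):
        sub_a = a[bi * k:(bi + 1) * k]
        left = (1,) * k                 # column 0 is 0, 1, 2, ... as well
        for bj in range(m // k):
            sub_b = b[bj * k:(bj + 1) * k]
            top[bj], left = offset_block(sub_a, sub_b, top[bj], left)
    # D[n, m] = D[n, 0] + Q, where Q is the sum of the offsets in the last row.
    return n + sum(sum(off) for off in top)
```

With $t = 3$, `four_russians_distance("ACGT", "AGGT", 3)` evaluates to 1, the same value the standard dynamic programming table gives.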

2.3.5 Time and space

The offset block function computes the last row and column, so each block uses only $O(t)$ time. There are $\Theta(n^2/t^2)$ blocks, so the total time used when applying the Four-Russians speedup is $O(n^2/t)$. Setting $t = \Theta(\log n)$ gives a running time of $O(n^2/\log n)$. If the distance table occupies quadratic space then the space usage is $O(n^2 + 3^{2t}\sigma^{2t}t)$, and it is $O(n + 3^{2t}\sigma^{2t}t)$ for a distance table in linear space.

3 Implementation and experimental setup

This section describes the implementation and experimental setup. It exists because some parts of the implementations are the same for both the quadratic and linear space Four-Russians speedup. For data handling, i.e. the representation of the data used, see section 3.1.1. For managing sequences where the offset block function cannot be applied to the whole distance table, see section 3.1.3. For preprocessing of t-blocks, see section 3.1.4.

The experiments are very similar in both quadratic and linear space. The only major difference is the limitation on input size for quadratic space. See section 3.2.1 for the specification of the computer used in the experiments. For test data and parameters, i.e. t-block size and input size, and the reasons for these choices, see section 3.2.2. For a discussion of what to expect from them, see section 3.2.3.

In the rest of the report, sequence A refers to the sequence represented across the rows with size $n$, and variable $i$ corresponds to a row in the distance table. Sequence B refers to the sequence represented across the columns with size $m$, and variable $j$ corresponds to a column in the distance table.

3.1 Implementation

Four different types of implementations were made for sequence alignment. The first two are a standard dynamic programming implementation in quadratic space, and one where the Four-Russians speedup has been applied. See sections 4 and 5 for the implementation of edit distance and edit script respectively. The other two are a standard dynamic programming implementation in linear space, and one where the Four-Russians speedup has been applied. See sections 6 and 7 for the implementation of edit distance and edit script respectively.
The implementations without the Four-Russians speedup will be referred to as standard implementations.

This section describes parts of the implementations with the Four-Russians speedup in more detail. First is the data handling, i.e. the representation of sub-sequences and offset vectors: why the representation is appropriate, how much space it consumes and how it might affect the running time of the computation. Second is the management of sequences when the distance table cannot be partitioned into t-blocks, i.e. when some rows and columns are missing, so that the offset block function cannot be applied to the whole distance table. I will present a solution and discuss how this might affect the running time. Finally I describe the preprocessing: how to compute the offset block function, the implementation choices and how they affect the running time.

3.1.1 Data handling

The size of the sub-sequences and offset vectors is $t - 1$, which differs from what has been described in section 2.3. The first offset in every offset vector is always set to 0, so here there is no need to have an offset vector of size $t$. The first character in each sub-sequence can be omitted from a block, since the offset vectors are given before preprocessing or computing the edit distances. The distance values in the first row and column can be computed from these offset vectors with an arbitrary start value (see section 2.3.3), so the first character in each sub-sequence does not participate in the computation and is therefore not needed for lookups either.

The Four-Russians speedup is about precomputing and storing information about all possible instances of a sub-problem. There are $(3 \cdot 4)^{2(t-1)}$ possible combinations of sub-sequences and offset vectors, as the sub-sequences and offset vectors are of size $t - 1$, and there are two offset vectors associated with each instance. So $(t-1) \cdot \mathrm{sizeof(int)}$ space is used per offset vector if implemented naively. But as $t$ grows, the data structure used to store the precomputed data will explode in size. As an example, when $t = 5$ the size of the structure would be $(3 \cdot 4)^{2 \cdot 4} \cdot 2 \cdot 4 \cdot 4$ B, over 12 GB of space.

The need to pack the data arises, and an idea could be to use only bits to represent the bases and offsets. For this there are the bitset and vector<bool> containers from the Standard Template Library in C++. vector<bool> allows for dynamic resizing whereas bitset is fixed. Boost also provides a variation of bitset which allows dynamic resizing. I decided not to use any of these because with $t = 6$ the structure occupies $(3 \cdot 4)^{2 \cdot 5}$ B, assuming that each entry only uses one byte, which is roughly 60 GB (section 2.3.5). I aim to test with $t = 5$, and therefore the sub-sequences and offset vectors are stored as unsigned integers instead. I could have used signed integers instead, which would not have made any difference. This way I can use bit-wise operations to pack and unpack the data.
Also, when using the offset block function, the offset vector representations it returns can be stored for future lookups without having to unpack the data, see below for more details. Two bits are used to represent a base A, C, G, T and two bits to represent an offset $-1$, $0$, $1$. For $t = 5$ the structure uses $(3 \cdot 4)^{2 \cdot 4} \cdot 4$ B, about 1.6 GB, assuming that each entry only uses four bytes, which is still runnable on a 4GB RAM machine.

    character   2-bit   index   offset
    A           00      0       -1
    C           01      1        0
    G           10      2        1
    T           11      3        -

Table 1: Table of character and offset conversion.

Each base and offset is hardcoded, as the implementation is only intended for DNA sequences with edit distance. These are kept in a char array and an int array, sigma and offset, used for constructing an actual sub-sequence and offset vector from a representation. Furthermore there is a table converting each of the bases into its index value, needed for constructing an unsigned integer representation of a given sub-sequence. For offsets, the index value is the offset value + 1. In this way you can get the offset vector values directly from an unsigned integer representation by using only bit shift operations. You can also construct an offset vector representation from the offset vector values by using the offset value plus one.

Figure 3: Example of a sub-sequence (ACGT) and an offset vector represented as unsigned integers.

There are only three offsets, so there is a combination of bits which will not be used. For that reason a table for converting an offset vector representation to an index value is needed, offbits2int, and one for converting from an index value to an offset vector representation, int2offbits. When allocating the structure to store the preprocessed data, only $(3 \cdot 4)^{2(t-1)}$ entries are needed. But indexing the entries by the unsigned integer representation of an offset vector would increase the size, as if there were four offsets instead of three, giving a size of $(4^2)^{2(t-1)}$ B, assuming that each entry only uses one byte. That would not be the case when packing the data using only the size needed to represent an offset vector. With $t = 5$, 8 bits are needed to represent an offset vector and there are two of them, so the structure takes $(4^2)^{2 \cdot 4} \cdot 2$ B, which is 8GB. With a structure of this size, even on a machine which is capable of testing the implementation, a large part of the structure will not be cached, which leads to poor performance because of RAM latency.

3.1.2 Reading data from a representation

Reading a character from a sub-sequence representation can be done using an index into the sub-sequence. The representation of the sub-sequence is stored in the $2(t-1)$ least significant bits (assumed to be the right side of the unsigned integer). So for a number, subbits, in $\{0, 1, \ldots, 4^{t-1} - 1\}$, bit shift subbits to the left by wordsize $- 2(t-1)$. The sub-sequence representation is now in the $2(t-1)$ most significant bits. Depending on which index $i$ needs to be read, starting at index 0, bit shift subbits further to the left by $2i$, then bit shift subbits to the right by wordsize $- 2$ so there are only two bits left. Using the resulting unsigned integer value as an index into the array sigma, which contains the characters, the character at index $i$ of subbits can be retrieved.

Figure 4: An example of reading a character from a representation (reading index 1 of ACGT gives C).

Reading an offset from an offset vector representation works the same way as reading a character from a sub-sequence. The only difference is that the bit combination 11 will not appear in an offset vector representation, as there are only three offsets, and the array offset is used instead to look up the actual offset value.
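The packing and reading scheme can be illustrated with a short sketch. The names `SIGMA`, `OFFSET`, `pack_subsequence`, `pack_offsets`, `read_base` and `read_offset` are chosen here for illustration and only mirror the roles of the arrays sigma and offset described above. One deliberate difference: Python integers have no fixed word size, so instead of shifting left and then right by wordsize-dependent amounts, the sketch shifts right and masks two bits, which selects the same field.

```python
SIGMA = "ACGT"        # index -> base, as in Table 1
OFFSET = (-1, 0, 1)   # index -> offset, where index = offset + 1


def pack_subsequence(seq):
    """Pack a DNA string into an integer, two bits per base, first base highest."""
    bits = 0
    for ch in seq:
        bits = (bits << 2) | SIGMA.index(ch)
    return bits


def pack_offsets(offsets):
    """Pack offsets from {-1, 0, 1} into an integer, two bits per offset."""
    bits = 0
    for o in offsets:
        bits = (bits << 2) | (o + 1)
    return bits


def read_field(bits, i, length):
    """Return the two-bit field at index i (index 0 is the leftmost entry)."""
    shift = 2 * (length - 1 - i)
    return (bits >> shift) & 0b11


def read_base(bits, i, length):
    return SIGMA[read_field(bits, i, length)]


def read_offset(bits, i, length):
    # The bit pattern 11 never occurs in a valid offset vector representation.
    return OFFSET[read_field(bits, i, length)]
```

For instance, `pack_subsequence("ACGT")` gives `0b00011011` (27), and `read_base(27, 1, 4)` returns `"C"`, matching the example of reading index 1 from ACGT.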

3.1.3 Managing sequences which are not a multiple of t

The idea behind the Four-Russians speedup is to partition the distance table into t-blocks. A t-block has its last row shared with the first row of the t-block below it (if any) and its last column shared with the first column of the t-block on its right (if any). This means that the sequence lengths have to be a multiple of $t$. If not, some rows are missing at the bottom of the table and/or some columns to the right of the table. Consequently the offset block function cannot be applied to the last rows and/or columns of the table.

One way to solve this is to use standard dynamic programming on the rows and columns where the offset block function cannot be applied. In the worst case that means at most $(t-1)n + (t-1)m$ entries that have to be filled in using standard dynamic programming. This is however not very good in practice, as the time used with standard dynamic programming will be $n(t-1) + m(t-1) - (t-1)^2$ in the worst case, although it will not change the asymptotic running time of the program.

A better solution is to make sure the offset block function can be used on the whole table. This is done by padding the sequences without changing the optimal edit distance. Furthermore, one has to make sure that the entry containing the optimal edit distance is in either a row or a column which is a multiple of $t$, so the optimal edit distance is attainable once the offset block function has been applied to the whole table. One also needs to make sure that the optimal edit script can still be obtained with these paddings.

By letting sequence A always be the longer of the two, there are two cases of padding. If $n$ is not a multiple of $t$, then both sequences are padded with As in front, so that the padded sequence A has a size which is a multiple of $t$. Then, if $m$ plus the size padded in front is not a multiple of $t$, As are padded at the back of sequence B. This way the offset block function can be applied to the whole table.
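The two padding cases can be sketched as below. The function name `pad_sequences` and its return layout are assumptions for illustration. The sketch also assumes that the block stride is $t - 1$, because adjacent t-blocks share a row and a column, so lengths are rounded up to a multiple of $t - 1$.

```python
def pad_sequences(a, b, t, pad_char="A"):
    """Pad two sequences so the table can be covered entirely by t-blocks.

    Returns (padded_a, padded_b, pad_front, pad_end).
    """
    k = t - 1                              # overlapping blocks advance by t - 1
    if len(a) < len(b):                    # let A always be the longer sequence
        a, b = b, a
    # Pad both sequences in front until the length of A is a multiple of k.
    pad_front = (k - len(a) % k) % k
    a = pad_char * pad_front + a
    b = pad_char * pad_front + b
    # Pad the back of B only, until its padded length is a multiple of k.
    pad_end = (k - len(b) % k) % k
    b = b + pad_char * pad_end
    return a, b, pad_front, pad_end
```

With $t = 3$, `pad_sequences("ACGTA", "ACGT", 3)` returns `("AACGTA", "AACGTA", 1, 1)`. The identical prefix added to both sequences leaves the edit distance of the original pair unchanged, while the padding at the end of B has to be skipped when the result is read from the table.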
The size to pad in front of each sequence, padfront, is given by $(t-1) - (n \bmod (t-1))$, and the size to pad at the back of sequence B, padmend, is given by $(t-1) - ((m + \mathrm{padfront}) \bmod (t-1))$.

Figure 5: Padding sequences. (a) When n and m are not a multiple of t. (b) Padded As in front. (c) When m is not a multiple of t.

The paddings in front do not change the optimal edit distance of the two sequences, as the optimal path will go diagonally down to where the actual sequences start, taking the edit distance value 0 with it. The paddings at the end of sequence B are ignored, as they are only there to make sure that the offset block function can be used on the whole table. So by knowing how much has been padded in front of both sequences and at the end of sequence B, the optimal edit distance can be read from the last row of the table, in the column that lies padmend entries before the last one, i.e. in entry $(n + \mathrm{padfront}, m + \mathrm{padfront})$.

The optimal alignment can still be obtained with these paddings as follows. The alignment with paddings in front has extra As aligned for both sequences, which are ignored when computing the edit script. Backtracking starts in the entry where the optimal edit distance is located, so the paddings at the end of sequence B will not participate in the backtracking.

Figure 6: Edit distance and script can still be computed with padded sequences.

So what do these paddings do to the running time of the algorithm? The worst case is that only $n$, the size of sequence A, is not a multiple of $t$. Padding As in front of both sequences then means that you will also have to pad As at the back of sequence B, as the padded sequence B is no longer a multiple of $t$. This results in two extra lookups for each row of t-blocks, which is a constant number of extra lookups. Consequently the asymptotic running time of the algorithm remains the same.

3.1.4 Preprocessing

The preprocessing is about making alignments for all possible sub-sequences and offset vectors, and then saving the offset vectors of the alignments in a table for fast retrieval. First, all possible sub-sequences and offset vectors are constructed. These are needed for computing the edit distance for all the sub-problems. They are not needed for anything other than the preprocessing step, so they are only stored temporarily.

Constructing all possible sub-sequences is easy, as they can be generated from their index value. So by going through the numbers $0, \ldots, 4^{t-1} - 1$, the sub-sequences can be generated by reading index $0, 1, \ldots, t-2$ from their index value and concatenating the characters read. For details on how to read a character in a sub-sequence representation see section 3.1.2.

The offset vectors require a bit more work, as there are only three offsets, so not all combinations of two bits are used. Constructing an offset vector is done by keeping a local unsigned integer, offbits, which is modified, once for every offset vector there is, to reflect an offset vector representation.
By going through the numbers $0, \ldots, 3^{t-1} - 1$, one is added to offbits as long as the two least significant bits are not equal to two. If the two least significant bits are equal to two, they are reset to zero by bit shifting offbits to the right by two. If the two new least significant bits are also equal to two, they are reset in the same way. This continues until the two least significant bits are not equal to two, and then one is added to offbits. Finally the two bits that were affected by adding one are shifted back to their original position. This way offbits now represents one of the possible offset vectors.

This is the same as cycling the last offset in an offset vector through all possible offset values. When the end of the possible offset values is reached (10 in bit representation), the last offset value is reset and a carry is added to the next offset in the offset vector. If this offset is also at the last of all possible offset values, it is reset and the carry moves to the next offset value, and so on. In each iteration offbits acts as a counter for offset vectors in their unsigned integer representations. From the constructed offset vector representation the offset vectors can be determined by reading from the representation. For details on how to read an offset value in an offset vector representation see section 3.1.2.

Figure 7: An example of constructing an unsigned integer representation of an offset vector. (Steps shown: add one; add one, which gives a carry, so reset two bits; reset two bits; add the carry; move the bits back to their original position.)

Now all possible sub-sequences and offset vectors have been constructed and can be retrieved by their index value from temporary storage. Finding the offset vectors of all the possible combinations of sub-sequences and offset vectors can then be done by going through all the possible combinations of their index values and fetching each sub-sequence and offset vector. Then a standard dynamic programming computation is performed on each combination using a $(t-1) \times (t-1)$ distance table D. By using the start value t and then applying the offset vectors to this value, the distance table will not contain any negative distances. Not that it matters, as only the offsets are needed.

A distance table of size $(t-1) \times (t-1)$ is sufficient in the preprocessing step, as the first row and column are given by the start value and offset vectors. The only information needed for each entry is the diagonal, vertical and horizontal values relative to the entry. These are kept in the temporary variables d, v and h, along with three extra variables: a to remember the first value in the last column, b to remember the first value in the last row, and bh for storing the first horizontal value for a row, which is used when going from a row to the row below it.

Figure 8: Preprocessing using standard dynamic programming. (a) Start of a standard dynamic programming computation. (b) d, v and h moving along a row.

When v reaches the last column, but is still not within the distance table, then a = v. The same goes for b when h reaches the last row.
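The counter that steps offbits through every offset vector representation, described earlier in this section, can be sketched as below. This is an illustrative reading of the description rather than the thesis code: the function names are hypothetical, and the sketch assumes offsets are stored as offset plus one in two-bit fields, so the field values are 0, 1 and 2 and the value 3 never occurs.

```python
def next_offbits(offbits, length):
    """Advance a packed offset vector to the next one (base-3 counter on 2-bit fields)."""
    shifted = 0
    # Fields already at the maximum value 2 (binary 10) are dropped by shifting right;
    # shifting back left later refills them with zeros, which resets them.
    while shifted < length and (offbits & 0b11) == 2:
        offbits >>= 2
        shifted += 1
    if shifted == length:
        return 0  # every field was at its maximum: wrap around
    # Add the carry (or the plain increment) and move the bits back into place.
    return (offbits + 1) << (2 * shifted)


def enumerate_offbits(t):
    """List all 3^(t-1) offset vector representations in counter order."""
    length = t - 1
    representations = []
    offbits = 0
    for _ in range(3 ** length):
        representations.append(offbits)
        offbits = next_offbits(offbits, length)
    return representations


def decode_offsets(offbits, length):
    """Unpack a representation into offsets in {-1, 0, 1}, first offset in the top bits."""
    return tuple(
        ((offbits >> (2 * (length - 1 - pos))) & 0b11) - 1 for pos in range(length)
    )
```

With this ordering the vector consisting only of 1s is the last one enumerated, which matches the remark in section 4.1 that it sits at the last index of the lookup table.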
bh is set to h each time h starts on a new row. D is filled out the same way as in section 4. With D indexed from 0, the first offset for the last column is $D[0, t-2] - a$ and the remaining offsets are $D[i, t-2] - D[i-1, t-2]$ for $i = 1, \ldots, t-2$. The first offset for the last row is $D[t-2, 0] - b$ and the remaining offsets are $D[t-2, j] - D[t-2, j-1]$ for $j = 1, \ldots, t-2$.

Constructing the offset vector representation from the offset values is done by using a temporary unsigned integer, offbits, initialized to 0. The first offset value plus one is added to offbits; then offbits is bit shifted left by two and the next offset value plus one is added, and so on until all the offset values have been added to offbits.

The offset block function is basically a four-dimensional array. For each possible combination of sub-sequences and offset vectors, the offset block function returns a pair of offset vector representations which correspond to the last row and column of a t-block. Instead of having two offset vectors stored for each entry in the offset block table, a pointer to a pair of offset vector pointers is used, see figure 9. In this way each entry of the offset block table only uses one machine word. The pair of pointers points to the offset vector representations in intoffbits, which is the table used to get an offset vector representation from its index value. After constructing the offset vector representation in the preprocessing phase, the index values are also needed to set a pair of pointers saved in an array, pair ptr. The index values are retrieved from offbitsint, which is the mapping of an offset vector representation to its index value.

Figure 9: Mapping of offset vectors (offset block table, pair ptr, intoffbits).

The preprocessing time is then the time to construct the sub-sequences and offset vectors, plus the time to find the alignments for all combinations of these, plus the time it takes to read the offsets and construct the offset vector representations. The time to construct the sub-sequences and offset vectors is $O\left((4^{t-1} + 3^{t-1})(t-1)\right)$. The time for aligning the combinations and making the offset vector representations is $O\left(4^{2(t-1)} 3^{2(t-1)} \left((t-1)^2 + (t-1)\right)\right)$. But since $(t-1) \le (t-1)^2$ and $(4^{t-1} + 3^{t-1})(t-1) < 4^{2(t-1)} 3^{2(t-1)} (t-1)^2$, the overall time is $O\left(4^{2(t-1)} 3^{2(t-1)} (t-1)^2\right)$. The space consumption is $O\left(4^{2(t-1)} 3^{2(t-1)}\right)$, as each entry in the offset block table only uses four bytes, regardless of the size of t.

3.1.5 Offset block function

To use the offset block function, the index values of the sub-sequences and offset vectors are needed.
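As a small illustration of how an offset vector is turned into a representation and then into an index value, the sketch below packs offsets as described in the preprocessing section and builds the two lookup directions. The names `intoffbits` and `offbitsint` follow the tables named in the text; the helper names, the use of a Python list and dictionary, and the use of `itertools.product` in place of the bit counter are assumptions made to keep the example short.

```python
from itertools import product


def pack_offsets(offsets):
    """Pack offsets in {-1, 0, 1} into two-bit fields holding offset + 1."""
    offbits = 0
    for off in offsets:
        offbits = (offbits << 2) + (off + 1)
    return offbits


def build_index_maps(t):
    """Return (intoffbits, offbitsint) for offset vectors of length t - 1."""
    length = t - 1
    # index value -> representation, in increasing order of representation
    intoffbits = [pack_offsets(vec) for vec in product((-1, 0, 1), repeat=length)]
    # representation -> index value
    offbitsint = {bits: index for index, bits in enumerate(intoffbits)}
    return intoffbits, offbitsint
```

For example, with `t = 3` the vector `(1, 1)` packs to `10` and maps to index `8`, the last of the nine index values.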
To get these, their representations need to be constructed. After constructing the offset vector representation, the index value can be retrieved from offbitsint. For the sub-sequences, the index values correspond to their representations.

3.2 Experiments

The objective of my experiments is to investigate whether the Four-Russians speedup for dynamic programming in sequence alignment does give a speedup in practice. So I need to construct experiments that show whether this occurs with and without preprocessing, since an advantage of the Four-Russians speedup is that the preprocessed data can be reused for multiple pairwise alignments.

Also, for finding an edit script the Four-Russians speedup should be able to give a speedup. In quadratic space, the distance table has to be filled in before an edit script can be computed. With the Four-Russians speedup the distance table is only partially filled in. Backtracking on a partially filled distance table then means going through the blocks that contain the optimal path. When the time used to compute the sub-paths in these blocks is less than the speedup gained by using the Four-Russians speedup, it should yield a better running time. See section 5 for more details.

For linear space, the Four-Russians speedup can be directly applied to Hirschberg's algorithm for computing optimal alignments in linear space. When the preprocessing takes less time than the time used to compute the edit script, there is a good possibility that the overall time for computing an edit script is faster than when using only Hirschberg's algorithm, see section 7 for more details.

The experiments will be run with and without compiler optimization. The reason is that the standard implementations will benefit a lot more from optimization than the ones with the Four-Russians speedup. Each entry in the standard implementations is computed by comparing three other entries and two characters, so constant time is used for each entry. Each block in the implementations with the Four-Russians speedup consists of computing the index values of the sub-problem, applying the offset block function and filling in the last row (and column) of the block, so the time depends on the size of the block. Comparing the workload of each loop iteration, it is clear that the standard implementations have less work per iteration.

Now consider the loop sizes. There are $nm$ entries to be filled in for the standard implementations and $\frac{n}{t-1} \cdot \frac{m}{t-1}$ blocks with the Four-Russians speedup. Although there are fewer iterations with the Four-Russians speedup, the workload of each is heavier. Hence any loop optimization (loop unrolling) will benefit the standard implementations a lot more than the ones with the Four-Russians speedup, because the overhead of using the offset block function, and of non-linear writes to memory (filling in the last column of a block) in the quadratic space version, will dominate the running time with the Four-Russians speedup. So it is interesting to see how the results without optimization (-O0) compare to the results with optimization (-O3).
3.2.1 Computer specification

The experiments are run on a 2.66 GHz Intel Core 2 Quad (Q9450) CPU machine with 4GB RAM and 6MB cache, running Ubuntu.

3.2.2 Test data and parameters

Random sequences are used for testing in general. The reason is that the contents of the sequences do not affect the standard implementations, as they do not use a lookup table like the Four-Russians speedup does. So to be able to compare the implementations, random sequences are used. Padded sequences might have an effect on the running time of the implementations with the Four-Russians speedup, as there will be more lookups with the offset block function. So the length of the sequences will always be a multiple of $t-1$.

The block sizes to be tested are 2, 3, 4 and 5. Block size 1 does not make sense since $t-1$ would be zero. For block sizes larger than 5 the offset block table is too big to be tested on the machines available. With block sizes 2, 3, 4 and 5, sequences whose lengths are a common multiple of 1, 2, 3 and 4 are used.

The first test is to show when a speedup is achieved using the Four-Russians speedup, with and without preprocessing, compared to the standard implementation. For this purpose the input sizes have small intervals between them, up to a fixed maximum size. If the Four-Russians speedup has not achieved a speedup for sequences of this maximum size, the implementation has failed.

Then follows a test to see the advantage of using the preprocessed data on multiple pairwise alignments (only done for edit distance). The input sizes will be determined from a single run of sequence alignment where the Four-Russians speedup outperforms the standard implementation. These tests will be done for computing both the edit distance and the edit script.

Finally, a test where the sequences consist of As only is run, to see how an alignment that performs many identical lookups affects the running time of the implementation.

Block size 5 will not be tested in quadratic space: as the size of the distance table grows, there will not be enough space for the offset block table. The test on sequences with only As will only be run with block sizes 4 and 5, as the test is just to see how much faster it is compared to random sequences. All tests are run multiple times to even out variations in the running time, and the tests for the different block sizes are done with the same two sequences.

3.2.3 Expectations

Tests with block size 2 are expected to be slower than the standard implementation, because with block size 2 the whole distance table will be filled out and the overhead of using the Four-Russians speedup will dominate the running time. To use the offset block function for an entry, the index values of the combination are needed, whereas for the standard implementations filling in an entry means finding the minimum of three other entries, which is basically comparing three values. Between constructing or loading four index values and making a lookup using the offset block function on the one hand, and comparing three values on the other, I would guess the latter to be the faster of the two.

For block size 3 there should be an advantage when using the Four-Russians speedup. There will be at least one entry in each block that is not filled in (intermediate results are not filled in when using the Four-Russians speedup in linear space, see section 6.1 for more details). When the preprocessing time is not counted, the Four-Russians speedup should outperform the standard implementation later than for larger block sizes. When the preprocessing time is counted, the Four-Russians speedup should outperform the standard implementation earlier than for larger block sizes, as the preprocessing time is lower.

For block size 4 the preprocessing time is increased.
So even though the Four-Russians speedup without preprocessing might outperform the standard implementation earlier than for block size 3, this will occur later with preprocessing. The question is then when it is more beneficial to use block size 4 over block size 3. For large sequences there would be an advantage, while for smaller sequences it might turn out to be more appropriate to use block size 3.

For block size 5 the preprocessing time will be very long, and it might turn out to be useless for the sequence sizes that I am capable of testing with the machines I have available. Nevertheless, without the preprocessing it should be faster than block size 4. The interesting part is how much faster it is compared to the other block sizes.

The interesting part of the test with sequences consisting of only As is to see how much faster it is compared to random sequences, as the lookup data is going to be cached. It is expected to be the fastest of all the tests, as it is run without any intermediate results being filled in.

For edit script tests in quadratic space, the backtracking part is slower for the Four-Russians implementation than for the standard implementation, because, besides backtracking on the distance table, some of the entries in the table still need to be filled in to make backtracking possible. But the overall running time should be faster, as the distance table needs to be filled in before backtracking is possible. The backtracking part, even though it takes longer than in the standard implementation, will still be cheap, and combining it with the running time to compute the edit distance will probably not result in a significant increase. For linear space, the running time should be faster than the standard implementation, as it uses the offset block function while backtracking.

A quick summary: without preprocessing, the larger the block size is, the better the speedup is. Only block size 2 might turn out to be slower than the standard implementation. With preprocessing, block size 3 might be best for small sequences and block size 4 for larger sequences, while block size 5 might turn out to be useless. In general the tests in quadratic space are expected to be slower than the tests in linear space, because the distance table in linear space will always fit within the cache for the sequence sizes used, which is not the case in quadratic space.

4 Edit distance in quadratic space

The standard implementation is very straightforward. Here, forward dynamic programming is used to fill in the distance table. For a sequence A of size n and a sequence B of size m, the distance table D is of size $(n+1) \times (m+1)$, with $n+1$ rows and $m+1$ columns. Start by filling out the first row with $0, \ldots, m$ and then the first column with $0, \ldots, n$. Going through the entries row by row starting in $D[1, 1]$, the entries are filled in as follows: the entry $D[i, j]$, where $i = 1, \ldots, n$ and $j = 1, \ldots, m$, is equal to $D[i-1, j-1]$ if $A[i-1]$ equals $B[j-1]$. If $A[i-1]$ is not equal to $B[j-1]$, then $D[i, j]$ is equal to the minimum of $D[i-1, j-1] + 1$, $D[i-1, j] + 1$ and $D[i, j-1] + 1$. When the whole table is filled in, the edit distance can be found in $D[n, m]$. It is easy to see that the time and space are both $O(nm)$.

4.1 Four-Russians speedup

Applying the offset block function on the distance table corresponds to looping over the indexes of the table which are multiples of $t-1$. Then, for each sub-problem, the sub-sequence representations for the sub-problem are constructed. Constructing a sub-sequence representation is very straightforward. For the first character in a sub-sequence, look up its index value in the charint table and add it to a local unsigned integer, subbits, which is initialized to 0. For each subsequent character in the sub-sequence, bit shift subbits left by two and add the index value of the character.
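The construction of subbits can be sketched as follows. The contents of the `CHARINT` table (A, C, G, T mapped to 0 to 3) and the helper `block_indices`, which cuts a sequence into consecutive sub-sequences of length t - 1, are assumptions for illustration; only the shift-and-add step is taken from the description above.

```python
CHARINT = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed contents of the charint table


def pack_subsequence(sub):
    """Build subbits: shift left by two, then add the index value of each character."""
    subbits = 0
    for ch in sub:
        subbits = (subbits << 2) + CHARINT[ch]
    return subbits


def block_indices(seq, t):
    """Pack every consecutive sub-sequence of length t - 1 (hypothetical helper)."""
    k = t - 1
    if len(seq) % k:
        raise ValueError("sequence length must be a multiple of t - 1")
    return [pack_subsequence(seq[i:i + k]) for i in range(0, len(seq), k)]
```

Since the packed value is also the sub-sequence's index value, the results of `block_indices` can be used directly as two of the four coordinates of the offset block table.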
There is no need to construct the offset vector representations, because when either the first row or the first column is part of the sub-problem the offset vectors consist of only 1s. From how the offset vector representations are saved in intoffbits, that offset vector representation can be found at the last index of intoffbits (see section 3.1 for details). Furthermore, the offset block function returns two offset vector representations. The offset vectors of a row can be saved in their unsigned integer representations, so they can be retrieved later for lookups in the row of blocks below. By using this knowledge, there is no need to construct the offset vector representations, see figure 10.

The pair of offset vector representations that the offset block function returns is used to fill in the last row and column of the block. For details on how to read from an offset vector representation, see section 3.1.2. Subtracting one from an offset index value gives the offset value, so there is no need to look it up in offset, which stores them, because the index value is just the offset value plus one (see the table of offset values). When filling in the distance table, the last offset in the column offset vector is not needed, because the last entry of the block will be filled out using the last offset of the row offset vector.
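To tie the pieces of this section together, here is a simplified sketch of the block-wise computation checked against the standard algorithm. It is not the thesis implementation: a memoized function `block_function` stands in for the preprocessed offset block table, offset vectors are kept as tuples rather than packed integers, and only the row offset vectors are stored instead of filling in the quadratic distance table. All names are hypothetical.

```python
from functools import lru_cache


def edit_distance(a, b):
    """Baseline: standard dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur.append(min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


@lru_cache(maxsize=None)
def block_function(x, y, top, left):
    """Stand-in for the offset block table: returns (last column, last row) offsets."""
    k = len(x)
    table = [[0] * (k + 1) for _ in range(k + 1)]
    for j in range(1, k + 1):
        table[0][j] = table[0][j - 1] + top[j - 1]
    for i in range(1, k + 1):
        table[i][0] = table[i - 1][0] + left[i - 1]
        for j in range(1, k + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            table[i][j] = min(table[i - 1][j - 1] + cost,
                              table[i - 1][j] + 1,
                              table[i][j - 1] + 1)
    right = tuple(table[i][k] - table[i - 1][k] for i in range(1, k + 1))
    bottom = tuple(table[k][j] - table[k][j - 1] for j in range(1, k + 1))
    return right, bottom


def four_russians_distance(a, b, t):
    """Edit distance computed block by block with blocks of t - 1 characters."""
    k = t - 1
    if len(a) % k or len(b) % k:
        raise ValueError("sequence lengths must be multiples of t - 1")
    ones = (1,) * k  # offsets along the first row and first column are all 1s
    row_vectors = [ones] * (len(b) // k)  # saved for the row of blocks below
    for bi in range(0, len(a), k):
        left = ones
        for col, bj in enumerate(range(0, len(b), k)):
            right, bottom = block_function(a[bi:bi + k], b[bj:bj + k],
                                           row_vectors[col], left)
            row_vectors[col] = bottom
            left = right
    # D[n, m] equals D[n, 0] = n plus the offsets along the last row.
    return len(a) + sum(sum(vec) for vec in row_vectors)
```

A call such as `four_russians_distance("ACGTAC", "AGGTCC", 3)` should agree with `edit_distance("ACGTAC", "AGGTCC")`; the memoization only pays off when the same combination of sub-sequences and offset vectors recurs, which is the role the preprocessed table plays in the thesis.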
