Microprocessors and Microsystems


Microprocessors and Microsystems 36 (2012) 96-109

Hardware accelerator architecture for simultaneous short-read DNA sequences alignment with enhanced traceback phase

Nuno Sebastão, Nuno Roma, Paulo Flores
INESC-ID/IST, Rua Alves Redol, 9, Lisboa, Portugal

Article history: Available online 30 May 2011

Keywords: Hardware accelerator; DNA; Local sequence alignment; Traceback

Abstract

Dynamic programming algorithms are widely used to find the optimal sequence alignment between any two DNA sequences. This manuscript presents a new, flexible and scalable hardware accelerator architecture to speed up the implementation of the frequently used Smith-Waterman algorithm. When integrated with a general purpose processor, the developed accelerator significantly reduces the computation time and memory space requirements of alignment tasks. Such efficiency mainly comes from two innovative techniques that are proposed. The first is the usage of the maximum score cell coordinates, gathered during the computation of the alignment scores in the matrix-fill phase, in order to significantly reduce the time and memory requirements of the traceback phase. The second is the exploitation of an additional level of parallelism in order to simultaneously align several query sequences with the same reference sequence, targeting the processing of short-read DNA sequences. The results obtained from the implementation of a complete alignment system based on the new accelerator architecture in a Virtex-4 FPGA showed that the proposed techniques are feasible and that the developed accelerator is able to provide speedups as high as 16 for the considered test sequences. Moreover, it was also shown that the proposed approach allows the processing of larger DNA sequences in memory restricted environments.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

The advent of the latest generations of sequencing technologies [1] has opened many new research opportunities in the fields of biology and medicine, including cell Deoxyribonucleic Acid (DNA) sequencing, gene discovery and evolutionary relationships.
These technologies have contributed to the exponential growth of the biological data that is available to researchers. For instance, GenBank [2] has doubled its data size approximately every 18 months, and in its December 2010 release it included over ... base pairs (bps) from several different species. To assist biologists in the extraction of useful information and in the interpretation of the huge sequence databases, a set of alignment algorithms (e.g. the widely used Smith-Waterman (S-W) [3]) have been developed to solve many open problems in the field of bioinformatics, such as (i) DNA re-sequencing, where genome assembly is done against a reference genome; (ii) Multiple Sequence Alignment (MSA), where multiple genomes are aligned to perform genome annotation; and (iii) Gene finding, where Ribonucleic Acid (RNA) sequences (the transcriptome) are aligned against the organism genome to identify new genes.

Corresponding author. E-mail address: Nuno.Sebastao@inesc-id.pt (N. Sebastão).

Currently, a common sequencing approach is based on the application of High Throughput Short Read (HTSR) technologies [4], to reduce the cost of the sequencing process. This technique consists of cutting the DNA fragments under analysis into shorter fragments (reads), which are individually sequenced and aligned against a reference sequence. At present, the three most important HTSR sequencing platforms are: the GS FLX Genome Analyzer (454), the Solexa 1G Sequencer (Illumina) and the SOLiD Sequencer (Applied Biosystems). The biochemistry technology underlying each of these platforms leads to very different characteristics, in terms of read length, throughput and raw errors. However, independently of the adopted platform, the length of the reads produced by these platforms is small when compared to previous generation sequencing technologies and much smaller than the original complete DNA sequence. Nevertheless, the sheer volume of data that is generated and the need to align these reads to large reference genomes limits a direct and naive application of standard Dynamic Programming (DP) techniques.
One simple example of a common challenge comes from the need to align up to 100 million reads against a reference genome that can be as large as 3 Gbp. For the SOLiD sequencer, with reads as short as 30 bps, this corresponds to the computation of 100 million matrices of dimension 30 × (3 × 10^9), which results in a computational task that is unfeasible even for a standard high performance machine.
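As a rough order-of-magnitude check of this example (a back-of-the-envelope sketch, assuming one full score matrix per read, with a 30 bps read and a 3 Gbp reference), the number of matrix cells to be computed is:

$$10^{8} \times 30 \times (3 \times 10^{9}) = 9 \times 10^{18} \ \text{cell updates}$$

Even at a purely hypothetical sustained rate of $10^{9}$ cell updates per second, this would take about $9 \times 10^{9}$ s, i.e. roughly 285 years.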

Hence, the computational demands for the analysis of the biological data produced by the various sequencing technologies have led to the development of several accelerating strategies that aim at parallelizing the execution of the alignment algorithms. Some of these strategies are software based, while others use dedicated hardware implementations. Among the former, an optimized implementation using Single-Instruction Multiple-Data (SIMD) instructions for current CPUs [5] is commonly adopted in sequence alignment programs, like SSEARCH35. Other software implementations make use of the highly parallel execution capabilities presented by Graphics Processing Units (GPUs) to achieve a high alignment throughput [6]. With regard to the hardware implementations, these include both Application Specific Integrated Circuit (ASIC) [7-10] and Field Programmable Gate Array (FPGA) [11-15] implementations. Regardless of the considered implementation, the most common and efficient hardware architectures map the alignment algorithm to a systolic array of Processing Elements (PEs). Furthermore, although some bidimensional arrays have been presented [16], the most common implementations adopt unidimensional (linear) arrays [7-13,15]. In fact, the main differences among the several implementations relate to the design of the individual PE. However, some of these designs oversimplify the implemented algorithm, by only calculating the edit distance between a sequence pair [12,14], and are therefore not suited to accelerate the more generic S-W algorithm. A commercial solution [17], developed by CLC bio and implemented in FPGA, was also made available, but little information is given about its architecture.

Nevertheless, all the previously presented hardware solutions only focus on accelerating the first phase of the S-W algorithm (DP matrix fill), completely disregarding the second phase (traceback), which is typically performed using a General Purpose Processor (GPP) in a post-processing step. Ref. [18] proposed a hardware architecture that also accelerates the traceback phase.
However, only the global alignment problem is addressed there. Furthermore, the previously proposed hardware architectures are not easily optimized to deal with short-read sequences, obtained from current HTSR sequencing platforms (e.g. Illumina).

From another perspective, there has been a growing interest in the development of processing solutions that merge, in a single package, the reconfiguration capabilities offered by FPGAs with the advantages of a hardwired CPU. Such solutions, like the Intel Atom E645C processor [19], make it possible to implement highly specialized hardware accelerators tightly coupled with a general purpose CPU, in order to significantly improve the overall system performance. Furthermore, by making use of the offered reconfiguration capabilities, it is possible to implement a wide range of accelerators according to the specific task that is currently being executed. This task-multiplexing capability over time reduces the total cost of ownership of such a system, due to its adaptability, low initial cost and high performance.

To overcome the limitations of previous accelerator architectures, to improve the overall sequencing performance and to make use of the advantages provided by current FPGAs, a new hardware accelerator architecture, together with a new technique to speed up the sequence alignment, is now proposed. Such an accelerator, targeting an embedded platform, is based on the exploitation of the following two important contributions, which are extensively described in the remaining sections of the manuscript:

An innovative and quite efficient technique that makes use of the information gathered during the computation of the alignment scores in the matrix fill phase (in hardware), in order to significantly reduce the time and memory requirements of the traceback phase (later implemented in software) [20]. To support such a technique, the developed hardware accelerator architecture was tightly integrated with a GPP, to form a complete and quite efficient local alignment system implemented in an FPGA.
The obtained experimental results show that the proposed accelerating structure may provide speedups as high as 16 for the implementation of the whole alignment procedure when compared to an Intel Core2 Duo processor. It is also observed that a significant reduction of the memory resources required by the subsequent traceback phase is achieved.

An additional level of parallelism is also exploited in the proposed accelerating structure, to further increase its performance. With the presented structure, several query sequences may be simultaneously aligned with the same reference sequence, thus allowing a significant acceleration of the alignment of the short reads against the reference genome, as used by HTSR techniques. This is achieved by configuring the developed accelerator in a multiple-stream structure, by including multiple linear arrays that work in parallel. Besides the speedup that is achieved with such an improvement, which is proportional to the number of linear arrays that are implemented (defined through platform parameterization), the accelerator also takes advantage of the temporal locality in the manipulation of the larger reference genome, thus reducing the number of memory and I/O accesses required to perform the alignment.

This manuscript is organized as follows: Section 2 gives a brief overview of the widely adopted S-W algorithm to determine the optimal alignment. The proposed technique to speed up the traceback phase is presented in Section 3. Section 4 introduces the newly developed accelerator architecture that implements the proposed enhancements in the alignment procedure, including the optimizations for short-read sequences. A performance model of the entire alignment system is presented in Section 5. In Section 6, the prototyping platform that integrates the proposed accelerator and a GPP is presented, while in Section 7 the obtained results are discussed and the achieved speedups are presented. The conclusions are drawn in Section 8.

2. Pairwise local sequence alignment

Sequence alignment is the method by which useful information is extracted from the large amounts of sequenced DNA. The alignments can be classified either as local or global.
In global alignments, the complete sequences are aligned from one end to the other, whereas in local alignments only the subsequences that present the highest similarity are considered. In practice, the local alignment is generally preferred when searching for similarities between distantly related biological sequences, since this type of alignment more closely focuses on the subsequences that were conserved during evolution.

One of the most widely adopted algorithms to find the optimal local alignment between a pair of sequences is the S-W algorithm [3]. This algorithm is based on a DP method and is characterized by the smallest runtime among the optimal local alignment algorithms. With a runtime complexity of $O(nm)$, where $n$ and $m$ denote the sizes of the sequences being aligned, the S-W algorithm computes the alignment in two phases: a DP matrix fill phase and a traceback phase.

2.1. Smith-Waterman algorithm

Consider any two strings $S_1$ and $S_2$ of an alphabet $\Sigma$ with sizes $n$ and $m$, respectively. The local alignment of strings $S_1$ and $S_2$ reveals which pair of substrings of $S_1$ and $S_2$ optimally align, such that no other pair of substrings has a higher alignment score. Let $G(i, j)$ represent the best alignment score between a suffix of string $S_1[1..i]$ and a suffix of string $S_2[1..j]$. The S-W algorithm allows the computation of $G(n, m)$ by recursively calculating $G(i, j)$, which will reveal the highest alignment score between the substrings of strings $S_1$ and $S_2$. The recursive relation to calculate the local alignment score $G(i, j)$ is given by Eq. (1), where $Sb(S_1(i), S_2(j))$ denotes the substitution score value obtained by aligning character $S_1(i)$ against character $S_2(j)$ and $\alpha$ represents the gap penalty cost (the cost of aligning a character to a space, also known as gap insertion). An example of a substitution function is shown in Table 1.

$$G(i, j) = \max \begin{cases} G(i-1, j-1) + Sb(S_1(i), S_2(j)) \\ G(i-1, j) - \alpha \\ G(i, j-1) - \alpha \\ 0 \end{cases} \qquad G(i, 0) = G(0, j) = 0 \tag{1}$$

Table 1. Example of a substitution score matrix.

Fig. 1. Obtained local alignment for the considered example sequences.

The alignment scores are usually positive for characters that match, thus denoting a similarity between them. Mismatching characters may have either positive or negative scores, according to the type of alignment that is being performed, denoting the biological proximity between them. Different substitution score matrices may be used to reveal different types of alignments. In fact, the particular score values are usually defined by biologists, according to evolutionary relations. The gap penalty cost $\alpha$ is always a positive value.

As soon as the entire score matrix $G$ is filled, the substrings of $S_1$ and $S_2$ with the best alignment can be found by locating the cell with the highest score in $G$. Then, all matrix cells that lead to this highest score cell are sequentially determined by performing a traceback phase. This last phase concludes when a cell with a zero score is reached, identifying the aligned substrings as well as the corresponding alignment. The path taken at each cell is chosen based on which of the three neighboring cells (left, top-left and top) was used to calculate the current cell value using the recurrence given by Eq. (1).
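The two phases described above can be sketched in a few lines of Python. This is only an illustrative software sketch, not the authors' implementation: the function name `smith_waterman` is hypothetical, a simple match/mismatch substitution function is assumed (defaults of 3 and -1, with a gap penalty of 4, mirroring the worked example of Table 2), and ties in the traceback are broken in the arbitrary order top-left, top, left.

```python
def smith_waterman(s1, s2, match=3, mismatch=-1, gap=4):
    """Fill the score matrix G (Eq. (1)) and trace back the best local alignment.

    Returns (best score, end cell (u, v), aligned s1, aligned s2), 1-based cells.
    """
    n, m = len(s1), len(s2)
    # Extra zero row/column implement the initial conditions G(i,0) = G(0,j) = 0.
    G = [[0] * (m + 1) for _ in range(n + 1)]
    best, end = 0, (0, 0)

    # Phase 1: DP matrix fill, remembering the first cell with the highest score.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sb = match if s1[i - 1] == s2[j - 1] else mismatch
            G[i][j] = max(0,
                          G[i - 1][j - 1] + sb,   # top-left (substitution)
                          G[i - 1][j] - gap,      # top (gap in s2)
                          G[i][j - 1] - gap)      # left (gap in s1)
            if G[i][j] > best:
                best, end = G[i][j], (i, j)

    # Phase 2: traceback from the highest score cell until a zero score is reached.
    i, j = end
    a1, a2 = [], []
    while i > 0 and j > 0 and G[i][j] > 0:
        sb = match if s1[i - 1] == s2[j - 1] else mismatch
        if G[i][j] == G[i - 1][j - 1] + sb:
            a1.append(s1[i - 1])
            a2.append(s2[j - 1])
            i -= 1
            j -= 1
        elif G[i][j] == G[i - 1][j] - gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return best, end, "".join(reversed(a1)), "".join(reversed(a2))


if __name__ == "__main__":
    print(smith_waterman("CAGCCTCGCT", "AATGCCATTGAC"))
```

Note that the whole matrix `G` is kept in memory here, which is exactly the cost that becomes prohibitive when one sequence is a complete reference genome.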
Table 2 shows an example of the calculated score matrix for aligning two sequences ($S_1$ = CAGCCTCGCT and $S_2$ = AATGCCATTGAC) using the substitution score matrix presented in Table 1 (a match has a score of 3 and a mismatch a score of -1). The gap penalty has a value of 4. The shadowed cells represent the traceback path (starting at the highest score cell (8, 10)) that was taken in order to determine the best alignment. The resulting alignment is illustrated in Fig. 1.

Table 2. Example of an alignment score matrix.

3. Tracking the alignment origin and end indexes

As previously mentioned, whenever a sequence pair alignment is required, it is necessary to implement the traceback phase of the S-W algorithm. Most sequence alignment hardware accelerators that have been proposed until now [11,15,16] only implement the score matrix computation (without performing the traceback phase). Therefore, they simply output the calculated alignment score (the highest value of matrix $G$). Afterwards, whenever the obtained score is greater than a given user-defined threshold, the whole $G$ matrix must be recalculated (usually by software, using a GPP). However, in contrast to what happens in the hardware accelerator, in this recalculation all the intermediate data that is required to perform the traceback and retrieve the corresponding alignment must be maintained in the GPP memory. Moreover, this re-computation does not re-use any data from the previous calculation performed by the hardware accelerator.

Such a situation is further aggravated by the fact that typical alignments consider sequences with quite dissimilar sizes, with $m \gg n$ (e.g. HTSR sequencing analysis). Therefore, the size of the subsequences that participate in the alignment is always in the order of $n$, meaning that a large part of the matrix $G$ that must be completely recomputed in the GPP is not even required to obtain the alignment. To overcome this inefficiency, an innovative technique is now proposed to significantly reduce the time and memory space that is required to find the local alignment in the traceback phase of this algorithm.
In fact, assuming that it is possible to know that the local alignment of a given sequence pair $S_1$ and $S_2$ starts at positions $S_1(p)$ and $S_2(q)$, denoted as $(p, q)$, and ends at positions $S_1(u)$ and $S_2(v)$, denoted as $(u, v)$, then the local alignment can be obtained in the traceback phase by just considering the score matrix corresponding to substrings $S_a = S_1[p..u]$ and $S_b = S_2[q..v]$.

To determine the character position where the alignment starts, an auxiliary matrix $C_b$ is proposed. Let $C_b(i, j)$ represent the coordinates of the score matrix cell where the alignment of strings $S_1[1..i]$ and $S_2[1..j]$ starts. Using the same DP method that is used to calculate matrix $G(i, j)$, it is possible to simultaneously build matrix $C_b$, with the same size as $G$, which tracks the cell that originated the score that reached cell $G(i, j)$ (i.e. the start of the alignment ending at cell $(i, j)$). The recursive relations to compute matrix $C_b$ are given by Eq. (2), with initial conditions of $C_b(i, 0) = C_b(0, j) = (0, 0)$.

$$C_b(i, j) = \begin{cases} (i, j), & \text{if } G(i, j) = G(i-1, j-1) + Sb(S_1(i), S_2(j)) \text{ and } C_b(i-1, j-1) = (0, 0) \\ C_b(i-1, j-1), & \text{if } G(i, j) = G(i-1, j-1) + Sb(S_1(i), S_2(j)) \text{ and } C_b(i-1, j-1) \neq (0, 0) \\ C_b(i-1, j), & \text{if } G(i, j) = G(i-1, j) - \alpha \\ C_b(i, j-1), & \text{if } G(i, j) = G(i, j-1) - \alpha \\ (0, 0), & \text{if } G(i, j) = 0 \end{cases} \tag{2}$$
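The joint evaluation of Eqs. (1) and (2) can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware datapath: the function name `fill_with_origin` is hypothetical, a match/mismatch substitution function (3 and -1) and a gap penalty of 4 are assumed, the zero condition is checked first, and ties are resolved in the order top-left, top, left. Only two rows of scores and origin coordinates are kept, to stress that the origin can be tracked without storing the full matrices.

```python
def fill_with_origin(s1, s2, match=3, mismatch=-1, gap=4):
    """Matrix fill with origin tracking (Eqs. (1) and (2)) in O(m) memory.

    Returns (best score, end cell (u, v), origin cell (p, q)), 1-based cells.
    """
    m = len(s2)
    prev_g = [0] * (m + 1)            # row i-1 of G
    prev_c = [(0, 0)] * (m + 1)       # row i-1 of C_b
    best, end, origin = 0, (0, 0), (0, 0)

    for i in range(1, len(s1) + 1):
        cur_g = [0] * (m + 1)         # row i of G; cur_g[0] = G(i, 0) = 0
        cur_c = [(0, 0)] * (m + 1)    # row i of C_b; cur_c[0] = (0, 0)
        for j in range(1, m + 1):
            sb = match if s1[i - 1] == s2[j - 1] else mismatch
            diag = prev_g[j - 1] + sb
            up = prev_g[j] - gap
            left = cur_g[j - 1] - gap
            g = max(0, diag, up, left)
            if g == 0:
                c = (0, 0)            # no alignment ends at this cell
            elif g == diag:
                # A new alignment starts here if the top-left cell had none.
                c = (i, j) if prev_c[j - 1] == (0, 0) else prev_c[j - 1]
            elif g == up:
                c = prev_c[j]
            else:
                c = cur_c[j - 1]
            cur_g[j], cur_c[j] = g, c
            if g > best:
                best, end, origin = g, (i, j), c
        prev_g, prev_c = cur_g, cur_c
    return best, end, origin


if __name__ == "__main__":
    s1, s2 = "CAGCCTCGCT", "AATGCCATTGAC"
    score, (u, v), (p, q) = fill_with_origin(s1, s2)
    # 1-based inclusive ranges S1[p..u] and S2[q..v] as Python slices.
    print(score, (u, v), (p, q), s1[p - 1:u], s2[q - 1:v])
```

With the end cell $(u, v)$ and the origin $(p, q)$ in hand, the traceback only needs the score matrix of the two short substrings, which can be rebuilt with any ordinary S-W routine.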

Hence, by applying the proposed technique, denoted as Alignment Origin and End Indexes (AOEI) tracking, and by knowing the cell where the maximum score ($G(u, v)$) occurred, it is possible to determine from $C_b(u, v) = (p, q)$ the coordinates of the cell where the alignment began. Consequently, to obtain the desired alignment, the traceback phase only has to rebuild the score matrix for the subsequences $S_1[p..u]$ and $S_2[q..v]$, which are usually considerably smaller than the entire $S_1$ and $S_2$ sequences.

The obtained matrix $C_b$ for the alignment example of sequences $S_1$ and $S_2$, whose $G$ matrix was presented in Table 2, is shown in Table 3. In this example, by knowing from the $G$ matrix that the maximum score occurs at cell (8, 10), it is possible to retrieve the coordinates of the beginning of the alignment in cell $C_b(8, 10) = (3, 4)$. With this information, the optimal local alignment between $S_1$ and $S_2$ can be found by processing only the substrings $S_a = S_1[3..8]$ = GCCTCG and $S_b = S_2[4..10]$ = GCCATTG. Such an alignment (between $S_a$ and $S_b$) can now be determined by computing a much smaller $G'$ matrix in the traceback phase, as shown in Table 4.

Table 3. Example of an AOEI tracking matrix.

Table 4. Reduced alignment score matrix.

The major advantage of this technique is the significant reduction of the time and memory space required to recompute matrix $G$ for the subsequences that actually participate in the alignment, when compared to the entire sequences. Therefore, it provides a great reduction of the computational effort (time and space) of the whole alignment algorithm.

4. Alignment core architecture

The local alignment algorithm described in Section 2 is usually applied to process biological sequences with pronouncedly dissimilar sizes $m$ and $n$, where $m \gg n$ (e.g. $m \approx 10^6$ and $n \approx 10^2$). The matrix fill phase of the alignment algorithm is the most computationally intensive part and is, therefore, a good candidate for parallelization. However, the data dependencies that exist in the calculation of each matrix cell highly restrict the parallelization model.
In fact, only the computation of the values along the matrix anti-diagonal direction can be performed in parallel (to calculate the value for cell $G(i, j)$ it is necessary to know the values of $G(i-1, j-1)$, $G(i, j-1)$ and $G(i-1, j)$). Specialized parallel hardware that is capable of performing a great number of simultaneous arithmetic operations is especially suited for this task. In particular, linear systolic arrays with several identical Processing Elements (PEs), as shown in Fig. 2, have proved to be efficient structures to implement this type of computation, by simultaneously computing the values of matrix $G$ that are located in a given anti-diagonal [15].

Fig. 2. Systolic array structure for DNA algorithms.

4.1. Base processor element architecture

The PE architecture proposed in this paper is based on the PE structure described in Ref. [15] and illustrated in Fig. 3. This base PE only implements the basic score matrix calculation and is composed of a two-stage pipelined datapath that calculates each matrix cell value (output in $G(i, j)$). The throughput of each element is one score value per clock cycle. Since the S-W algorithm requires the evaluation of the maximum score value among the set of scores that compose the entire matrix, it is necessary to include an additional datapath that selects the maximum value that was calculated in the whole PE array (output $Max(i, j)$).

Fig. 3. Base architecture of processor element $PE_i$.

The widths of the score bus and of the substitution score bus ($Sb_w$) are constrained by the considered implementation conditions of the accelerator. In particular, the width of the score bus is directly constrained by the maximum size of the query sequence (the shortest of the two sequences) and by the score matrix values. Such substitution score values also determine the size of the corresponding bus ($Sb_w$). With such a datapath, $PE_i$ outputs the maximum score that was computed by PEs 1 through $i$.

The array evolves over time, by shifting the reference sequence characters through the PEs. The query sequence character $S_1(i)$ is allocated to the $i$-th PE, and this PE performs, at every clock cycle, the computations required to determine the score value of a certain matrix cell. After all the reference sequence characters $S_2(j)$ have passed through all the PEs, the alignment score is available at output $Max(i, j)$ of the last PE.

The computation that is performed in each PE requires, among other operations, the selection of the substitution score corresponding to the two characters, i.e. the value of $Sb(S_1(i), S_2(j))$. Since each PE always operates with the same character of $S_1$, it only needs to store the column of the substitution score matrix ($Sb$) that represents the costs of aligning character $S_1(i)$ with the entire alphabet. In the computation of each matrix cell value $G(i, j)$, the evaluation of the maximum value among the results of the three distinct possibilities presented in Eq. (1) is also required. In particular, the zero condition of the S-W algorithm is implemented by controlling the reset input of the registers that store the $G(i, j)$ value. Such a reset makes use of the sign bit of the score value, i.e. if the maximum value among the three partial scores is negative, then the registers that hold that score are cleared.

4.2. Enhanced processor element architecture

The PE architecture that is now proposed implements the AOEI accelerator technique that was described in Section 3. With this technique, the re-computation of the entire $G$ matrix when performing the traceback phase is avoided. It is implemented by propagating, through the PEs, not only the partial maximum scores (as in the base PE), but also the coordinates of their origin (the beginning of the alignment), together with the coordinates where the maximum score occurred. As shown in Section 3, this greatly simplifies the traceback phase, by focusing only on the substrings that are actually involved in the alignment and avoiding the re-computation of the whole matrix $G$.

The architecture of the enhanced PEs is presented in Fig. 4. Each PE features a datapath that implements Eqs. (1) and (2). The additional hardware that is required to implement Eq. (2) (the AOEI technique) is mainly composed of multiplexers and registers. The signals that control these additional multiplexers are generated by the magnitude comparators integrated in the units that were already present in the base PE architecture. The widths of the coordinate buses, $Cq_w$ and $Cr_w$, are constrained by the maximum query sequence size and the maximum reference sequence size, respectively. The width of the $C_w$ bus is the sum of $Cq_w$ and $Cr_w$. The coordinates of the matrix cell under processing are obtained by using the hardwired PE index ($i$) and the symbol coordinate ($j$) that comes alongside the sequence character present at input $S_2(j)$. Regarding the input data signals, the origin coordinates that correspond to the score at input $G(i-1, j)$ are present at input $C_b(i-1, j)$. Likewise, the origin coordinates corresponding to the score at output $G(i, j)$ are present at output $C_b(i, j)$.
Finally, the coordinates of the currently highest score (present at $Max(i, j)$) are output at $MaxC_b(i, j)$.

Fig. 4. Enhanced architecture of processor element $PE_i$.

4.3. Short-read optimizations

When the query sequences under processing are acquired by short-read sequencing platforms, the sample sequences can be extremely short and in some cases they may even have fewer symbols than the number of available PEs in the array. For instance, the reads generated by the Illumina platform can be as short as 35 nucleotides. In such a case, several of the PEs do not perform any useful calculations, because no query sequence symbol is attributed to them. This situation would certainly result in a substantial decrease of the throughput of the array. Therefore, considering that in most practical setups there is a very significant number of short-read sequences that must be aligned with the same reference sequence, alternative arrangements of the available PE resources may be considered in order to make it possible to simultaneously perform the alignment of more than one short query sequence to the same reference sequence.

This optimization can be achieved with the proposed architecture by configuring the hardware accelerator in a multiple-stream processing scheme. In such a configuration, the accelerator includes several coupled linear arrays of PEs that work in parallel and align to the same reference sequence. Hence, while the reference sequence is simultaneously shifted into the multiple arrays, the set of independent query sequences to be processed is distributed and assigned among the PEs of the multiple-stream array, as shown in Fig. 5. The exact number of parallel PE arrays is configurable according to the size of the short-read sequences to be aligned and to the amount of available hardware resources.

Fig. 5. Example configuration of a multiple-stream PE array (several independent streams).

It is worth noting that the implemented multiple-stream array also allows an improvement of the resource usage of the accelerator, since it is possible to share a set of resources that are common among the multiple parallel PEs that are processing the same reference sequence. This is accomplished by using a common set of registers that hold the reference symbol ($S_2(j)$) and the respective coordinate ($j$) for the several elements of the array that work in parallel, as shown in Fig. 6 for a dual-stream configuration.

This optimization can significantly increase the actual throughput of the array, since an increased number of PEs is performing useful computations and the alignment of more than one query sequence may be simultaneously performed, therefore leading to a greater speedup than would be achieved with just a single array. Furthermore, the use of such a configuration also leads to a reduction of the amount of data that is transferred to the accelerator, since the reference sequence is simultaneously aligned with more than one query sequence. This is especially significant when a large number of short-read query sequences are aligned to a large reference genome sequence.

4.4. Array programming

Since each PE compares the reference sequence symbols with a single query sequence character, it will just access the values present in the corresponding column of the substitution matrix. Therefore, each PE will only receive the substitution score matrix column that corresponds to the query sequence character allocated to that PE. Such data is stored in dedicated registers within each PE, since this allows for a fast reprogramming of a new query sequence. In the event of a PE not being used (because the query sequence has a smaller size than the number of available PEs ($N$)), the substitution score data that is stored in such a PE corresponds to a matrix column in which every value is zero.
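The programming scheme just described, together with the anti-diagonal (wavefront) operation of the linear array, can be mimicked in software. The sketch below is only a behavioural illustration under simplifying assumptions: the names `program_array` and `run_array` are hypothetical, a four-symbol DNA alphabet with match/mismatch scores of 3 and -1 and a gap penalty of 4 is assumed, each PE is idealized as producing one cell per clock cycle (the two-stage pipeline and the register-level datapath are not modelled), and the full matrix is stored only for clarity.

```python
ALPHABET = "ACGT"  # assumed DNA alphabet


def program_array(query, num_pes, match=3, mismatch=-1):
    """Return one substitution-score column per PE (unused PEs get all zeros)."""
    if len(query) > num_pes:
        raise ValueError("query is longer than the PE array")
    columns = [{a: (match if a == q else mismatch) for a in ALPHABET}
               for q in query]
    # PEs without a query symbol are loaded with an all-zero column.
    columns += [{a: 0 for a in ALPHABET} for _ in range(num_pes - len(query))]
    return columns


def run_array(columns, reference, gap=4):
    """Wavefront evaluation of Eq. (1); returns (maximum score, clock cycles)."""
    n_pes, m = len(columns), len(reference)
    G = [[0] * (m + 1) for _ in range(n_pes + 1)]
    best, cycles = 0, 0
    for t in range(1, n_pes + m):              # one iteration per clock cycle
        cycles += 1
        # In cycle t, PE i holds reference symbol j = t - i + 1 (if it exists),
        # so all cells of one anti-diagonal are computed together.
        for i in range(max(1, t - m + 1), min(n_pes, t) + 1):
            j = t - i + 1
            sb = columns[i - 1][reference[j - 1]]
            G[i][j] = max(0,
                          G[i - 1][j - 1] + sb,
                          G[i - 1][j] - gap,
                          G[i][j - 1] - gap)
            best = max(best, G[i][j])
    return best, cycles


if __name__ == "__main__":
    cols = program_array("CAGCCTCGCT", num_pes=16)
    print(run_array(cols, "AATGCCATTGAC"))
```

In this idealized model the zero-filled columns of the unused PEs never raise the maximum score, so padding a short query only costs the extra cycles needed for the reference symbols to traverse the idle PEs.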
To program the score values corresponding to query sequence $S_1$, an auxiliary data load structure, composed of an n-bit-wide shift register, was included in the array. This structure allows the preloading of the next query sequence data into this temporary storage shift register, by serially shifting the substitution matrix column, while the array is still processing the data corresponding to the current query sequence. As soon as the array has finished the processing of the current query sequence, the next query sequence data (already stored in the auxiliary shift register) is loaded in parallel (in just one clock cycle) into the respective PEs. In case the proposed accelerator architecture is configured as a multiple-stream structure, each individual PE array has its own auxiliary data load structure for the query sequence, which allows the simultaneous load of the query information into the several PEs. This masks the time that would be required to shift the next query sequence data into the array and therefore significantly reduces its programming time. Furthermore, the use of this shift register also provides a scalable method to program the processor array, as it avoids the use of a common data bus to program the several PEs.

Fig. 6. Example of a multiple-stream PE in dual-stream configuration.

4.5. Interface

To integrate the proposed hardware accelerator with the GPP that will implement the remaining alignment procedure (i.e. the traceback), the systolic array includes an embedded controller that is responsible for decoding seven instructions (required to properly control the array), as well as for receiving the data to be processed. The developed interface, illustrated in Fig. 7, is composed of two input First-In First-Out (FIFO) queues (one for the reference sequence and the other for commands and the query sequence), one output FIFO queue (to return the processed values) and one status register. The two input FIFOs allow the next query sequence to be loaded into the array while the current alignment is being processed, without the increase in control complexity that would arise from having all of the data (query and reference sequence data) input through the same FIFO. In the case of a multiple-stream configuration, the several query sequences are input using the previously mentioned input FIFO and are then sent to a specific PE array, based on the information defined by the main program running on the GPP. Afterwards, as soon as the alignment scores and corresponding AOEI coordinates are calculated, they are serially stored in the output FIFO for later processing in the GPP.

Fig. 7. Accelerator interface and interconnection with the GPP in the prototyping platform.

Each FIFO has a depth of 64 words and is 32 bits wide, to match the bus width adopted by most current GPPs. The status register contains information about the available positions in each of the input FIFOs, allowing the implementation of a flow control mechanism. Furthermore, this status register also contains information regarding the availability of output data in the output FIFO, indicating when the accelerator has concluded the alignment.

The developed interface allows this accelerator to be connected to several types of interconnection buses, requiring only the design of the appropriate logic to decode the specific bus control signals. The input and output FIFOs can be mapped to the GPP memory address space and can therefore be easily accessed using common load/store instructions.
This type of interface can be used either in PCI, PCIe, AMBA APB or other types of interconnections, therefore allowing this accelerator to be used in a wide range of platforms.

5. Performance model

The performance of a complete alignment system composed of several different modules depends on the performance of each individual module and on how they interact. Among these are the CPU performance, the interconnection (bus) throughput and the accelerator performance (if present). To better understand and evaluate the advantages provided by the proposed alignment structure, this section presents a thorough model of the resulting global performance.

Typically, the set of operations that are required to perform an alignment in a system without an accelerator are: (i) database read, (ii) data transfer to the processing device and (iii) computation, which includes the matrix fill and the traceback phases. Assuming that these operations are completely sequential, the total alignment time ($T_s$) can be modeled as the sum of the database read time ($T_{db}$), the data transfer time ($T_{ds}$) and the CPU processing time for the matrix fill phase ($T^M_s$) and the traceback phase ($T^T$):

$$T_s = T_{db} + T_{ds} + T^M_s + T^T \tag{3}$$

The time corresponding to each individual component is given by:

$$\begin{aligned} T_{db} &= \frac{n + m}{f_d \, B^w_d \, C \, \eta_d}, & (0 < \eta_d \le 1) \\ T_{ds} &= \frac{n + m}{f_i \, B^w_i \, C \, \eta_i}, & (0 < \eta_i \le 1) \\ T^c_s &= T^M_s + T^T = \frac{(nm)\,\chi}{f_c \, \eta_c} + \frac{k\,\chi}{f_c \, \eta_c}, & (0 < \eta_c \le 1;\ \chi \ge 1) \end{aligned} \tag{4}$$

where $n$ and $m$ denote the query and reference sequence sizes, $k$ represents the number of cells traversed during the traceback phase, and $f_d$, $f_i$ and $f_c$ denote the database read, interconnection bus and CPU processing frequencies, respectively. $C$ represents the compression factor (how many nucleotides are encoded in an 8-bit word), while $B^w_d$ and $B^w_i$ denote the width (in bytes) of the database read device and of the interconnection (bus), respectively. The $\eta_d$, $\eta_i$ and $\eta_c$ parameters denote efficiency factors, which take into account eventual contention when accessing the database, the interconnection and the CPU, as well as inherent wait states and protocol dependent control operations.
Finally, $\rho$ represents the average number of CPU clock cycles required to process a single cell of the DP matrix (the original symbol is missing from the source; $\rho$ stands in for it here). In sequential single-core CPUs (without any accelerator), $T^c_s \gg T_{db} + T_{ds}$, due to the $O(nm)$ runtime of the matrix fill phase, which leads to the commonly observed total alignment time of $T_s \approx T^c_s$. In contrast, when the proposed accelerator is present, the processing is split between the accelerator and the CPU. Considering (as an example) the architecture of the proposed accelerator in a single-stream configuration, the time it takes to compute the whole DP score matrix in the accelerator ($T_a$) is given by:

$$T_a = \frac{N + m - 1}{f_a}, \quad (n \le N) \qquad (5)$$

where $N$ represents the number of PEs in the array and $f_a$ denotes their operating frequency. In this parallel processing scheme, the accelerator computes the whole score matrix ($G$), of size $n \times m$, while the CPU performs a much simpler matrix fill (time $T^c_{Mr}$) and traceback over the smaller matrix $G'$, totaling a computation time of $T^c_r = T^c_{Mr} + T^c_T$. Typically, an alignment only includes part of the considered sequences. Hence, the number of cells traversed during the traceback ($k$) can be used as an upper bound on the size of the subsequences used to compute the smaller matrix $G'$, which will thus have a maximum size of $k \times k$. By using the proposed accelerator, it is possible to parallelize some operations. In this case, the accelerator performs the DP matrix fill phase of the current sequence pair alignment, while the CPU implements the traceback of the previous sequence pair. Therefore, the accelerator and the CPU work in a pipelined way. Furthermore, it is also possible to read the next query sequence (as well as the reference sequence, if necessary) from the database in parallel with the processing of both the accelerator and the CPU. This type of processing involves three distinct data transfers, with the respective durations: (i) from the database to the system's main memory, $T_{ds}$; (ii) from the system's main memory to the accelerator, $T_{sa}$; and (iii) from the accelerator to the system's main memory, $T_{as}$.
The time to transfer the score and coordinates output from the accelerator to the CPU ($T_{as}$), which consists of no more than five 32-bit values, is quite small and can thus be neglected when compared to the other terms ($T_{as} \ll T_{sa}$). The data transfers between the several components can occur in parallel with the remaining processing (time $T^c_r \gg T_{ds} + T_{sa}$). Therefore, the total alignment time can be modeled as:

$$T_p \approx \max\left\{ T_a,\ T_{ds} + T_{sa},\ T^c_{Mr} + T^c_T,\ T_{db} \right\}$$

Assuming that a 32-bit wide data-parallel bus is used to interconnect the accelerator, then $B^i_w = 4$. The same width is also typically used in the database device interface, making $B^d_w = 4$. Furthermore, the 2-bit encoding used per nucleotide leads to $C = 4$. Therefore, the alignment time for these particular conditions becomes:

$$T_p \approx \max\left\{ \frac{N + m - 1}{f_a},\ \frac{n + m}{4 \cdot 4\, f_i \eta_i} + \frac{n + m}{4 \cdot 4\, f_i \eta_i},\ \frac{(k \cdot k)\,\rho}{f_c \eta_c} + \frac{k\,\rho}{f_c \eta_c},\ \frac{n + m}{4 \cdot 4\, f_d \eta_d} \right\} \qquad (6)$$

where $\rho$ is the average number of CPU clock cycles per matrix cell.
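The sequential and pipelined models of Eqs. (3) to (6) can be compared numerically. The sketch below is a direct transcription of those equations; the function and argument names are hypothetical, `rho` is the stand-in name for the cycles-per-cell factor, and the parameter values are illustrative choices rather than measurements.

```python
def sequential_time(n, m, k, rho, f_c, eta_c, f_i, eta_i, f_d, eta_d,
                    bw_i=4, bw_d=4, comp=4):
    """Eqs. (3)-(4): database read + transfer + CPU fill + CPU traceback (seconds)."""
    t_db = (n + m) / (f_d * bw_d * comp * eta_d)
    t_ds = (n + m) / (f_i * bw_i * comp * eta_i)
    t_fill = n * m * rho / (f_c * eta_c)
    t_traceback = k * rho / (f_c * eta_c)
    return t_db + t_ds + t_fill + t_traceback


def pipelined_time(n, m, k, n_pe, rho, f_a, f_c, eta_c, f_i, eta_i, f_d, eta_d,
                   bw_i=4, bw_d=4, comp=4):
    """Eq. (6): the slowest of the four concurrent stages sets the period (seconds)."""
    if n > n_pe:
        raise ValueError("Eq. (5) assumes n <= N")
    t_a = (n_pe + m - 1) / f_a                            # accelerator matrix fill
    t_ds = t_sa = (n + m) / (f_i * bw_i * comp * eta_i)   # two bus transfers
    t_cpu = (k * k + k) * rho / (f_c * eta_c)             # reduced fill + traceback
    t_db = (n + m) / (f_d * bw_d * comp * eta_d)          # database read
    return max(t_a, t_ds + t_sa, t_cpu, t_db)


# Illustrative values (assumptions): 100 MHz everywhere, 50 cycles per cell.
params = dict(rho=50, f_c=100e6, eta_c=0.8, f_i=100e6, eta_i=0.2,
              f_d=100e6, eta_d=0.5)
t_s = sequential_time(n=128, m=1_280_000, k=128, **params)
t_p = pipelined_time(n=128, m=1_280_000, k=128, n_pe=128, f_a=100e6, **params)
print(f"T_s = {t_s:.2f} s, T_p = {t_p * 1e3:.2f} ms, speedup = {t_s / t_p:.0f}")
```

With these values the sequential time is dominated by the $O(nm)$ matrix fill, whereas the pipelined period is set by the accelerator term, which grows only linearly with $m$.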

Multiplying out the constants, Eq. (6) reduces to:

$$T_p \approx \max\left\{ \frac{N + m - 1}{f_a},\ \frac{n + m}{16 f_i \eta_i} + \frac{n + m}{16 f_i \eta_i},\ \frac{(k^2 + k)\,\rho}{f_c \eta_c},\ \frac{n + m}{16 f_d \eta_d} \right\} \qquad (7)$$

[Fig. 8: panel (a) plots alignment time (ms) and panel (b) plots alignment time (ms) and speedup, both against the reference/query size relation.]
Fig. 8. Variation of the alignment time ($T_p$) and speedup ($T_s/T_p$) according to the model described in Eq. (7).

Hence, in an alignment scenario where the same reference sequence is aligned to a large number of query sequences ($Q$), and assuming that the reference sequence can be permanently stored in the system's main memory while aligning all the respective query sequences, the database reading time and the corresponding data transfer time to the system's memory are reduced, leading to an average alignment time per query sequence ($T_1$):

$$T_1 \approx \max\left\{ \frac{N + m - 1}{f_a},\ \frac{n + m/Q}{16 f_i \eta_i} + \frac{n + m}{16 f_i \eta_i},\ \frac{(k^2 + k)\,\rho}{f_c \eta_c},\ \frac{n + m/Q}{16 f_d \eta_d} \right\} \qquad (8)$$

Moreover, considering that the accelerator may be able to perform $b$ simultaneous alignments by using the multiple-stream feature, the average alignment time for each sequence pair ($T_b$) is given by

$$T_b \approx \max\left\{ \frac{N + m - 1}{f_a\, b},\ \frac{n + m/Q}{16 f_i \eta_i} + \frac{n + m/b}{16 f_i \eta_i},\ \frac{(k^2 + k)\,\rho}{f_c \eta_c},\ \frac{n + m/Q}{16 f_d \eta_d} \right\} \qquad (9)$$

As an example, Fig. 8 depicts the total alignment time according to the model described by Eq. (7), in which the reference sequence is read from the database for each query sequence (worst case). The considered model parameters are: $n = k = 128$, $\rho = 50$ CPU cycles per cell, $N = 128$, $f_a = f_i = f_c = f_d = 100$ MHz, $\eta_i = 0.2$, $\eta_c = 0.8$ and $\eta_d = 0.5$. One interesting observation that can be extracted from the presented model concerns the role of the accelerator in the resulting performance of the whole alignment system. In fact, as the ratio between the reference and the query sequence sizes increases, the accelerator becomes the most limiting performance factor, as it has the highest workload. Therefore, increasing the CPU performance above a given minimum value does not significantly influence the performance of the alignment system, leading to a quasi-stationary speedup value.
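The amortized and multiple-stream variants of the model, Eqs. (8) and (9), can be explored with a short script that also reports which stage limits the period. This is a sketch under stated assumptions: the function name and its defaults are hypothetical, `rho` stands for the cycles-per-cell factor, and the efficiency factors are assigned as $\eta_i = 0.2$ and $\eta_c = 0.8$. With that assignment the accelerator term overtakes the CPU term at a reference/query ratio slightly above 8000.

```python
def per_query_time(n, m, k, n_pe, q=1, b=1, rho=50,
                   f_a=100e6, f_i=100e6, f_c=100e6, f_d=100e6,
                   eta_i=0.2, eta_c=0.8, eta_d=0.5):
    """Eq. (9); q=1 and b=1 give Eq. (7), b=1 alone gives Eq. (8).

    Returns (period in seconds, name of the limiting stage).
    """
    stages = {
        "accelerator": (n_pe + m - 1) / (f_a * b),
        "bus": ((n + m / q) + (n + m / b)) / (16 * f_i * eta_i),
        "cpu": (k * k + k) * rho / (f_c * eta_c),
        "database": (n + m / q) / (16 * f_d * eta_d),
    }
    limiting = max(stages, key=stages.get)
    return stages[limiting], limiting


# Sweep the reference/query size ratio for a 128-nucleotide query.
for ratio in (1_000, 8_000, 9_000, 100_000):
    period, stage = per_query_time(n=128, m=128 * ratio, k=128, n_pe=128)
    print(f"ratio {ratio:>7}: {period * 1e3:8.3f} ms, limited by {stage}")

# Three streams sharing one reference that is read once for 1000 queries.
period, stage = per_query_time(n=128, m=12_800_000, k=128, n_pe=128, q=1000, b=3)
print(f"b = 3: {period * 1e3:.3f} ms per sequence pair, limited by {stage}")
```

Once the accelerator is the limiting stage, a faster CPU no longer shortens the period, which is the quasi-stationary behaviour described in the text.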
In the presented example, the threshold value is about 8000, which corresponds to the relation between the reference and query sequence sizes frequently adopted in bioinformatics applications.

6. Prototyping platform

To validate the functionality and to assess the performance of the proposed hardware accelerator in a practical realization, a complete local alignment system based on the S-W algorithm was developed and implemented. The basic configuration of this system, used as a proof-of-concept, consists of a Leon3 GPP [21] that executes all operations of the S-W algorithm, except those concerning the score matrix computation phase. That phase is executed by the proposed hardware accelerator, acting as a specialized functional unit of the GPP. The software implementation of the S-W algorithm includes some optimizations in order to achieve more efficient applications in embedded systems. In particular, all memory accesses were optimized by using a static memory allocation mechanism. Special attention was also devoted to the data transfers of both the reference and query sequences from the GPP to the proposed hardware accelerator, so that a high level of efficiency is achieved.

Leon3 processor

The Leon3 processor [21] is one of the most used processor cores that are freely available. It was specifically designed for embedded applications by the European Space Agency, although nowadays it is maintained by Gaisler Research. It consists of a highly configurable and fully synthesizable core, described in VHDL, implementing a RISC architecture conforming to the SPARC v8 definition. This freely available VHDL description allows this GPP to be implemented in several different platforms (e.g. ASIC), unlike other proprietary GPPs (e.g. Xilinx's MicroBlaze). Furthermore, the availability of reliable software development tools (e.g. compiler and debugger) for the Leon3 processor makes it an adequate choice for the proof-of-concept system. The Leon3 32-bit core is based on a Harvard micro-architecture with a 7-stage instruction pipeline and 32-bit internal registers. The core functionality can be easily extended by means of the AMBA 2.0 AHB/APB on-chip buses.
The AMBA 2.0 AHB is used to connect the Leon3 processor with high-speed controllers, such as the cache and memory controllers. On the other hand, the AMBA 2.0 APB is used to access most on-chip peripherals and is connected to the Leon3 processor via the AHB/APB bridge. External memory access and memory-mapped I/O operations are provided by a programmable memory controller with interfaces to PROM, SRAM and SDRAM chips.

DNA alignment peripheral

A new peripheral, consisting of the proposed hardware accelerator for DNA alignment, was developed and embedded in the Leon3 processor (see Fig. 7). This alignment peripheral was connected to the AMBA 2.0 APB as a slave device. This bus was selected not only because it has enough bandwidth for all of the sequence data transfers, but also because it offers a simple interface and low power consumption. Some additional wrapper logic, responsible for adapting the accelerator to the AMBA 2.0 APB, was also included, consisting mostly of multiplexers, decoders and a simple control unit that implements the bus protocol. The I/O FIFOs and the status register of the alignment core are mapped in the Leon3 memory address space. Hence, write and read operations on this peripheral can be easily implemented using simple load and store operations.

FPGA implementation

The implementation of the proposed local alignment system was realized in an FPGA device by using a GR-CPCI-XC4V development board from Pender Electronic Design. This development system includes a Virtex-4 XC4VLX100 FPGA device from Xilinx, a 133 MHz 256 MB SRAM memory bank, and several peripherals for control, communication and storage purposes. The adopted Leon3 processor is based on version gpl b3403 of GRLIB. This soft-processor core was configured to incorporate a hardware divide and multiply unit, an interrupt controller, separate data and instruction cache controllers and an SRAM memory controller, all interconnected with the AMBA 2.0 AHB interface. Moreover, the core also encompasses two 32-bit timers and the proposed DNA alignment peripheral, which were all connected to the system AMBA 2.0 APB.

7. Experimental results

The previously presented accelerator architecture, described using parameterizable VHDL code, was synthesized using Xilinx ISE 10.1 (SP3) software tools and implemented in the previously described FPGA.
This reconfigurable embedded system, used fundamentally as a proof-of-concept prototyping platform, is composed of the Leon3 GPP and the alignment accelerator core with an array of at most 128 PEs. Although the maximum operating frequency of the accelerator core is 120 MHz, the actual operating frequency of the entire system is 60 MHz, as a consequence of a limitation imposed by the considered Leon3 processor implementation. However, as explained in Section 5, for the usual ranges of the relation between the reference and query sequence sizes this GPP limitation does not significantly constrain the overall performance of the system.

Single-stream configuration

The obtained resource allocation results of the entire alignment system, when considering the single-stream array configurations, are presented in Table 5. The resources solely occupied by the Leon3 processor are also presented as a reference. These results show that the Leon3 processor alone occupies 18% of the available logic resources of the used FPGA device. Regarding the resource allocation for the systolic array using the enhanced PEs, it is possible to observe that it is 77% larger than the corresponding base configuration, without the AOEI tracking functionality. However, the exact increase in the amount of used hardware depends on the considered operating environment, namely the size of the sequences to be aligned (which determines the bit-width of the coordinate representation) and the adopted scoring scheme (which influences the bit-width of the score calculations). To validate and assess the performance of the proposed system, a set of real DNA sequences was used for the reference sequence. These sequences were obtained from the GenBank database [2] and their size ranges from about […] to […] nucleotides.

Table 5
FPGA resource allocation of a single-stream array. The PE count (#), score width and maximum reference/query size columns, and the leading digits of some register counts, are missing from the source and are shown as "…".

PE type    Registers        LUTs
Leon3      … (6%)           17,788 (18%)
Base       … (8%)           19,818 (20%)
Base       …,736 (12%)      28,148 (29%)
Base       …,031 (16%)      34,130 (35%)
Enh.       … (10%)          22,168 (23%)
Enh.       …,625 (23%)      36,114 (37%)
Enh.       …,024 (41%)      56,541 (58%)

Table 6
Processing time results for the alignment system when using a single-stream array with 128 PEs and a query sequence of 128 nucleotides. The numeric entries are missing from the source; only the column structure is recoverable.

Columns: reference size; processing time using only the Leon3 processor (ms): matrix fill ($T^c_{Ms}$), traceback ($T^c_T$), total ($T_s$); processing time using the Leon3 processor and the proposed accelerator (ms): score and coordinates (HW), $\max\{T_a, T_{sa}\}$, reduced matrix fill (Leon3) ($T^c_{Mr}$), reduced traceback (Leon3) ($T^c_T$), cycle period ($T_p$); speedup.

Regarding the query sequences, their maximum size is limited by the number of available PEs in the array. Consequently, for the implemented configuration, they must not be longer than 128 nucleotides (a size entirely compatible with the latest Next-Generation Sequencing technologies [1]). In this particular instantiation, the size of the chosen query sequences is 128 nucleotides. For larger query sequences, the number of PEs in the array has to be increased and, if necessary, the array can be expanded by connecting another FPGA device. The advantages provided by the proposed AOEI technique, as well as the performance of the developed hardware accelerator, were assessed using the previously selected sequences, which were aligned using two different methods: (i) pure software implementation, where the alignment between each sequence pair is obtained using a straightforward implementation of the S-W algorithm running exclusively on the GPP (keeping the entire score matrix in memory) and (ii) hardware accelerated implementation, where the alignment is obtained by using the developed accelerator (with the enhanced PEs) and the GPP. The obtained execution time results for both of these methods are presented in Table 6. While the total processing time for the pure software implementation ($T_s$) is the sum of the partial times, the total time of the hardware accelerated implementation ($T_p$) reflects the fact that the accelerator and the GPP work concurrently in a pipelined scheme: the accelerator determines the score and the alignment coordinates of a given sequence pair while the GPP is performing the matrix recomputation and traceback for the previously processed pair. Therefore, in the concurrent configuration, the presented total time ($T_p$) is the maximum of the hardware accelerator time, $\max\{T_a, T_{sa}\}$, and the GPP execution time, $T^c_{Mr} + T^c_T$ (see Eqs. (6) and (7)).
It should be noted that the presented results for the accelerator processing time already consider the communication between the GPP and the accelerator, $\max\{T_a, T_{sa}\}$, and that the database reading and corresponding data transfer times are not considered, since the queries and reference sequence were pre-loaded into the system's main memory. The obtained speedup was determined by comparing the time required to obtain each whole alignment using the pure software sequential implementation of the S-W algorithm with the time required to obtain the same alignment with the aid of the proposed AOEI technique and the corresponding hardware accelerator. According to the obtained results, the attained speedups may be as high as […]. These speedups are in accordance with the trends predicted in Section 5 (see Fig. 8) and are the consequence of a twofold contribution: on the one hand, the parallelization of the whole matrix fill phase by the systolic array; on the other hand, the reduction of the processing time required to perform the traceback in the GPP, due to the significant reduction of the size of the score matrix that must be recomputed in this phase. In this respect, it is worth noting that the time complexity of the $G$ matrix computation during the matrix fill phase implemented in the GPP is $O(nm)$, whereas in the accelerator this complexity is reduced to $O(m)$, due to the parallel processing in the $n$ PEs. These two factors justify the significant speedup that is attained in determining the local alignment score, as described in Section 5. Regarding the traceback phase, the time complexity is the same in both cases ($O(n + m)$). In fact, in order to perform the traceback in the GPP it is necessary to recompute the whole $G$ matrix. Nevertheless, this recomputation time is significantly reduced when the proposed AOEI technique is adopted. As an example, considering the alignment of the 128-nucleotide query sequence with the 1,311,701-nucleotide reference sequence, the obtained local alignment spans only a 124-nucleotide subsequence of the reference sequence and a 123-nucleotide subsequence of the query sequence.
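The reduced recomputation described above can be illustrated in software. The sketch below is not the hardware design: it assumes a simple linear gap penalty and hypothetical function names. It shows a matrix fill that keeps only two rows while carrying the origin coordinates of each path alongside the score, followed by a traceback over the small sub-matrix delimited by the reported origin and end cells.

```python
def fill_with_origin(query, ref, match=2, mismatch=-1, gap=-1):
    """Matrix fill that keeps two rows only and carries each path's origin.

    Returns (best score, origin cell, end cell), 1-based (query, ref) indices.
    """
    n = len(query)
    prev_score, prev_origin = [0] * (n + 1), [None] * (n + 1)
    best = (0, None, None)
    for j, r in enumerate(ref, start=1):
        cur_score, cur_origin = [0] * (n + 1), [None] * (n + 1)
        for i, q in enumerate(query, start=1):
            diag = prev_score[i - 1] + (match if q == r else mismatch)
            up = cur_score[i - 1] + gap     # query symbol against a gap
            left = prev_score[i] + gap      # reference symbol against a gap
            score = max(0, diag, up, left)
            if score == 0:
                origin = None
            elif score == diag:
                origin = prev_origin[i - 1] or (i, j)   # a fresh path starts here
            elif score == up:
                origin = cur_origin[i - 1]
            else:
                origin = prev_origin[i]
            cur_score[i], cur_origin[i] = score, origin
            if score > best[0]:
                best = (score, origin, (i, j))
        prev_score, prev_origin = cur_score, cur_origin
    return best


def traceback_reduced(query, ref, origin, end, match=2, mismatch=-1, gap=-1):
    """Recompute only the sub-matrix between origin and end, then trace back."""
    (i0, j0), (i1, j1) = origin, end
    q, r = query[i0 - 1:i1], ref[j0 - 1:j1]
    h = [[0] * (len(r) + 1) for _ in range(len(q) + 1)]
    for i in range(1, len(q) + 1):
        for j in range(1, len(r) + 1):
            sub = match if q[i - 1] == r[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap, h[i][j - 1] + gap)
    i, j, out_q, out_r = len(q), len(r), [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if q[i - 1] == r[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            out_q.append(q[i - 1])
            out_r.append(r[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_q.append(q[i - 1])
            out_r.append("-")
            i -= 1
        else:
            out_q.append("-")
            out_r.append(r[j - 1])
            j -= 1
    return "".join(reversed(out_q)), "".join(reversed(out_r))


query, ref = "GGACGTACGG", "TTTTACGTACGTTTTT"
score, origin, end = fill_with_origin(query, ref)
aligned = traceback_reduced(query, ref, origin, end)
full_cells = len(query) * len(ref)
reduced_cells = (end[0] - origin[0] + 1) * (end[1] - origin[1] + 1)
print(score, origin, end, aligned, full_cells, reduced_cells)
```

The fill phase never stores the full matrix, and the traceback only allocates the sub-matrix spanned by the alignment, which is where the memory saving comes from.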
If the entire $G$ matrix had to be recomputed to obtain the alignment, it would have approximately […] cells, which contrasts sharply with the situation provided by the proposed AOEI technique, where the size of the $G'$ matrix that needs to be recomputed in the GPP is reduced to only […] cells. This significant reduction (of about four orders of magnitude) is particularly important when the implementation of the alignment procedure is considered in embedded systems, with strict memory and power consumption restrictions. Hence, not only does the proposed technique significantly reduce the time required to obtain the alignment, but it also makes it possible to process larger sequences, as it significantly reduces the amount of memory used by the GPP (e.g. the 2,623,402 nucleotide long reference sequence, whose memory requirements prevent it from being aligned using the pure software approach on the GPP). Finally, it is also important to note that the obtained throughput results of the proposed systolic PE array are in line with the results corresponding to similar architectures presented in the past [11,15,16]. However, those architectures focused only on accelerating the matrix-fill phase of the S-W algorithm. In contrast, besides accelerating the matrix-fill phase, the presented accelerator architecture also implements the new AOEI method and therefore returns additional information that is subsequently used to further reduce the computational requirements. This feature is not included in any other proposal, which makes it a differentiating characteristic of this work but also hinders a direct and fair comparison.

Multiple-stream configuration

To evaluate the developed multiple-stream capability, several different configurations of the alignment system were implemented. The corresponding resource usage results are presented in Table 7. The maximum number of implemented streams was 3, since the resources of the considered FPGA device do not allow for additional streams. However, any number of processing streams is supported if there are enough resources available to implement them.
All of the considered configurations have a maximum reference sequence size of […]. As expected, the results in Table 7 show a slight reduction in the amount of used resources when compared to an identically sized single-stream array (i.e. when the number of PEs of the single-stream array is equal to the number of PEs of the multiple-stream array multiplied by the number of streams). This reduction is due to the resources shared among the multiple-stream PEs, as well as to the reduction in the bit-width required to represent the AOEI coordinates, since the query sequence being aligned is smaller. Therefore, in terms of used resources, a triple-stream configuration is more advantageous for aligning several short-read sequences than three completely independent arrays.

Table 7
FPGA resource usage of the multiple-stream arrays. The PE count (#), n-stream and score width columns, and the leading digits of the register counts, are missing from the source and are shown as "…".

PE type    Registers        LUTs
Leon3      … (6%)           17,788 (18%)
Enh.       …,024 (41%)      56,541 (58%)
Enh.       …,625 (23%)      36,114 (37%)
Enh.       …,349 (39%)      54,183 (55%)
Enh.       …,427 (16%)      28,071 (29%)
Enh.       …,149 (25%)      38,169 (39%)
Enh.       …,687 (33%)      48,205 (49%)
Enh.       …,299 (35%)      50,143 (51%)

Table 8
Processing time results using multiple-stream array configurations, to align three query sequence streams, each with 35 nucleotides. The numeric entries are missing from the source; only the column structure is recoverable.

Columns: #PE; n-stream; reference size; processing time using the Leon3 processor and the proposed accelerator (ms): score and coordinates (HW), reduced matrix fill (Leon3), reduced traceback (Leon3); PE occupancy rate (%).

Table 9
Performance comparison with an Intel Core2 Duo CPU. The numeric entries are missing from the source; only the table structure is recoverable.

Columns (by query size and device): Intel CPU, Accelerator (128 × 1); Intel CPU, Accelerator (35 × 1), Accelerator (35 × 3). Rows: time (ms); speedup; equivalent MCUPS.

To evaluate the performance of the alignment task, three streams of short-read query sequences, each with 35 nucleotides and obtained with the Illumina sequencing platform, were aligned to the same 2,623,402 nucleotides long reference sequence. Six different accelerator configurations, all with the proposed AOEI functionality, were used to obtain the alignments: (i) the single-stream configuration with 128 PEs that was used in the previous section; (ii) a single-stream and (iii) a dual-stream configuration with 64 PEs each; (iv) a single-stream, (v) a dual-stream and (vi) a triple-stream configuration with 35 PEs each. The 35-PE arrays are adequately fitted to the size of the short reads produced by this sequencing technology. The achieved processing time results for aligning the three query streams using the previously described accelerator configurations are presented in Table 8. As can be observed, the alignment task is considerably faster when the accelerator is configured as a multiple-stream array with the number of PEs in each array identical to the query sequence size, since in this configuration all the PEs perform useful calculations, i.e. the PE occupancy ratio is 100%. If the single-stream array with 128 PEs is used to align the three streams of 35 nucleotides long query sequences, the PE occupancy ratio of the array is significantly decreased (down to 27%). This means that a significant part of the PEs would be performing null operations, since only 35 of them would have a query sequence symbol assigned, therefore decreasing the actual throughput of the array.
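The occupancy figures quoted above follow from a simple ratio, sketched here for the six configurations. The helper names are hypothetical, and the pass count assumes that queries which do not fit in the available streams are processed in successive runs.

```python
def pe_occupancy(query_len, pes_per_stream):
    """Fraction of the PEs in one linear array that hold a query symbol."""
    if query_len > pes_per_stream:
        raise ValueError("query longer than the PE array")
    return query_len / pes_per_stream


def passes_needed(num_queries, streams):
    """Sequential runs needed when `streams` queries are aligned at once."""
    return -(-num_queries // streams)   # ceiling division


# (PEs per stream, number of streams) for the six evaluated configurations.
configs = [(128, 1), (64, 1), (64, 2), (35, 1), (35, 2), (35, 3)]
for pes, streams in configs:
    print(f"{pes:>3} PEs x {streams}: occupancy {pe_occupancy(35, pes):.0%}, "
          f"{passes_needed(3, streams)} pass(es) for three 35-nucleotide queries")
```

Only the triple-stream array with 35 PEs per stream combines full occupancy with a single pass over the reference for the three queries.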
With a single-stream array, the time required to obtain the score and the index coordinates for the three streams of query sequences is therefore roughly three times the time required to obtain the same information for a single stream (see Table 6). However, using a triple-stream array where the number of PEs is adequately fitted to the query sequence size (35 nucleotides), it is possible to simultaneously align three different query sequences using the same hardware resources, as presented in Table 8. Therefore, not only is the overall efficiency of the system significantly increased, but an additional speedup is also achieved, proportional to the number of implemented linear arrays.

Comparison and discussion

To complete the evaluation of the presented architecture, the performance of the proposed accelerator was also compared to the performance of a pure-software implementation running on a common CPU. The SSEARCH35 software program from the FASTA framework was used for this purpose, since it is one of the most used programs to determine the local alignment. This program implements the state-of-the-art SIMD optimizations proposed in Ref. [22] and was executed on a 2.4 GHz Intel Core2 Duo processor. The execution times were obtained by aligning the same query and reference sequences adopted in the previous evaluations. The obtained execution times, presented in Table 9, show that the speedup attained with the conceived accelerator when compared with a pure software implementation running on the Core2 Duo may be as high as 16. In particular, the lower processing time obtained for the short sequences is due to the better usage of the available hardware resources provided by the accelerator, which enabled a triple-stream configuration using the same FPGA device. Table 9 also includes the equivalent million cell updates per second (MCUPS) metric, which is commonly used to compare the performance of alignment algorithms across different platforms. However, this metric only takes into account the throughput of the matrix fill phase of the S-W algorithm, without considering the traceback phase requirements.
Nevertheless, the performance obtained using the developed system still achieves a significant speedup compared to the SSEARCH35 program. The decrease in performance of the software-based solution for the smaller query sequences reveals its inability to maintain the performance levels with such short sequences. Furthermore, it is important to recall that the overall performance of the accelerator is proportional to the total number of PEs, which explains the apparently smaller equivalent performance of the triple-stream array with 35 PEs per stream (35 × 3 = 105 PEs) when compared to the single-stream array with 128 PEs. Regarding the database read rate, the implemented accelerator requires one reference sequence nucleotide in each clock cycle. As previously mentioned, four nucleotides are encoded in each byte, thus at the 60 MHz system clock the accelerator requires an input transfer rate of at least 15 MB/s. In the worst-case scenario, in which the reference sequence is not stored in the main memory and needs to be read from the database at each alignment, the database reading rate (which also needs to account for the much smaller query sequence reads) has to be higher than 15 MB/s in order to sustain the operation of the accelerator at its maximum performance. Current mainstream storage devices (hard disk drives) have a sustained throughput above 100 MB/s. Even when considering the accelerator running at 120 MHz, the database reading rate would double to 30 MB/s, still well below the throughput of the storage devices.
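The read-rate figures above are a one-line calculation, sketched here together with a peak MCUPS bound. The function names are hypothetical, 1 MB is taken as 10^6 bytes, and the MCUPS bound assumes that every PE updates one cell per clock cycle, which is what Eq. (5) implies when n = N and m is much larger than N.

```python
def required_read_rate_mb_s(clock_hz, nucleotides_per_byte=4):
    """One reference nucleotide per clock cycle, packed four per byte."""
    return clock_hz / nucleotides_per_byte / 1e6


def peak_mcups(total_pes, clock_hz):
    """Upper bound on million cell updates per second for a fully occupied array."""
    return total_pes * clock_hz / 1e6


for clock in (60e6, 120e6):
    print(f"{clock / 1e6:.0f} MHz -> {required_read_rate_mb_s(clock):.0f} MB/s")

# Single-stream 128-PE array versus triple-stream 35-PE arrays at 60 MHz.
print(peak_mcups(128, 60e6), peak_mcups(35 * 3, 60e6))
```

The bound scales with the total PE count, which is consistent with the 105-PE triple-stream array showing a lower equivalent MCUPS than the 128-PE single-stream array.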

To further demonstrate the implementation alternatives offered by the proposed accelerator, the processing core was also synthesized for the FPGA device available in the Intel Atom E645C processor, an Altera Arria II GX device [19]. The synthesis was performed using the Quartus II v10.1 software from Altera. The obtained results demonstrated that the processor is capable of operating at a clock frequency of 120 MHz. Synthesis results also revealed that the available hardware resources of this device allow the implementation of accelerators with 128 PEs in a dual-stream configuration and with 35 PEs in a 6-stream configuration. According to the model derived in Section 5, these configurations significantly improve the overall performance of the accelerator, allowing for the concurrent alignment of two 128 nucleotides long query sequences (N = 128, b = 2) or six 35 nucleotides long query sequences (N = 35, b = 6), respectively. In this case, only the accelerator is implemented in the FPGA, while the Intel Atom processor performs the role of the GPP. Finally, a last observation concerning the system cost is in order. The acquisition cost of a system based on a hybrid platform, like the Intel Atom E645C, is similar to the cost of current off-the-shelf computing systems, like those based on the Intel Core2 Duo processors. However, if the higher throughput provided by the accelerator implemented in the FPGA is taken into account, the alignment system based on this new platform will achieve a much smaller cost per alignment than current implementations. Moreover, the memory size reduction provided by the proposed accelerator also allows a further reduction of the total system cost.

8. Conclusions

A highly efficient hardware accelerator architecture that significantly speeds up the implementation of DNA local alignment algorithms is presented. This accelerator is based on the exploitation of an innovative and quite efficient technique to significantly reduce the computational time and memory requirements of the traceback phase that is executed as part of the widely used Smith-Waterman algorithm.
Furthermore, the developed structure also exploits an additional level of parallelism, in order to simultaneously align several query sequences with the same reference sequence, by adopting a multi-stream processing flow. This feature is particularly useful in the processing of short-read DNA sequences obtained from current HTSR sequencing technologies. The developed accelerator was integrated with a Leon3 general purpose processor, in order to prototype a complete embedded alignment system for DNA processing. The conceived platform was implemented in a Virtex-4 FPGA. The obtained results demonstrate that the developed accelerator provides speedups as high as 6042, when compared with a pure software version of the Smith-Waterman algorithm running on the Leon3 processor. Speedups up to 16 were also achieved when compared to a highly optimized SIMD software implementation running on an Intel Core 2 Duo processor. The obtained results also reveal that the proposed multiple-stream configurations favor the exploitation of the available FPGA resources and aid in maintaining the array running at maximum performance in different alignment scenarios. Moreover, it was shown that the use of the proposed accelerator enables the alignment of larger DNA sequences, even in memory-restricted environments.

Acknowledgments

The presented research was performed in the scope of project HELIX: Heterogeneous Multi-Core Architecture for Biological Sequence Analysis, funded by the Portuguese Foundation for Science and Technology (FCT) with reference PTDC/EEA-ELC/113999/2009, and partially supported by FCT (INESC-ID multiannual funding) through the PIDDAC Program funds and through the Ph.D. grant with reference SFRH/BD/43497/2008.

References

[1] J. Shendure, H. Ji, Next-generation DNA sequencing, Nat. Biotechnol. 26 (2008).
[2] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, E.W. Sayers, GenBank, Nucleic Acids Res. 38 (2010) D46-D51.
[3] T.F. Smith, M.S. Waterman, Identification of common molecular subsequences, J. Mol. Biol. 147 (1981).
[4] M.J. Chaisson, P.A. Pevzner, Short read fragment assembly of bacterial genomes, Genome Res. 18 (2008).
[5] M.S. Farrar, Striped Smith-Waterman speeds database searches six times over other SIMD implementations, Bioinformatics 23 (2007).
[6] L. Ligowski, W. Rudnicki, An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases, in: IEEE Int. Symp. Parallel & Distributed Processing, IPDPS 2009, IEEE, 2009.
[7] E.T. Chow, J.C. Peterson, M.S. Waterman, T. Hunkapiller, B.A. Zimmermann, A systolic array processor for biological information signal processing, in: Proc. of the 5th International Conf. on Supercomputing, ICS '91, ACM, New York, NY, USA, 1991.
[8] P. Guerdoux-Jamet, D. Lavenier, SAMBA: hardware accelerator for biological sequence comparison, Bioinformatics 13 (1997).
[9] T. Han, S. Parameswaran, Swasad: an ASIC design for high speed DNA sequence matching, in: Proc. of the 2002 Asia and South Pacific Design Automation Conf., ASP-DAC '02, IEEE Computer Society, Washington, DC, USA, 2002.
[10] C. White, R. Singh, P. Reintjes, J. Lampe, B. Erickson, W. Dettloff, V. Chi, S. Altschul, BioSCAN: a VLSI-based system for biosequence analysis, in: Proc. IEEE Int. Conf. on Computer Design: VLSI in Computers and Processors, ICCD '91.
[11] K. Benkrid, Y. Liu, A. Benkrid, A highly parameterized and efficient FPGA-based skeleton for pairwise biological sequence alignment, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 17 (2009).
[12] G. Caffarena, C. Pedreira, C. Carreras, S. Bojanic, O. Nieto-Taladriz, FPGA acceleration for DNA sequence alignment, J. Circuits Syst. Comput. 16 (2007).
[13] M. Gokhale, B. Holmes, A. Kopser, D. Kunze, D. Lopresti, S. Lucas, R. Minnich, P. Olsen, Splash: a reconfigurable linear logic array, in: Int. Conf. on Parallel Processing, 1990.
[14] S.A. Guccione, E. Keller, Gene matching using JBits, in: Proc. 12th Int. Conf. Field-Programmable Logic and Applications, FPL '02, Springer-Verlag, London, UK, 2002.
[15] T. Oliver, B. Schmidt, D. Maskell, Hyper customized processors for biosequence database scanning on FPGAs, in: Proc. 13th Int. Symp. Field-Programmable Gate Arrays, FPGA '05, ACM, 2005.
[16] L. Hasan, Z. Al-Ars, Z. Nawaz, K. Bertels, Hardware implementation of the Smith Waterman algorithm using recursive variable expansion, in: 3rd Int. Design and Test Workshop, IDT 2008, IEEE, 2008.
[17] CLC Bio, White paper on CLC Bioinformatics Cube 1.03, Technical Report, CLC Bio, Finlandsgade, Aarhus N, Denmark.
[18] S. Lloyd, Q. Snell, Sequence alignment with traceback on reconfigurable hardware, in: Int. Conf. Reconfigurable Computing and FPGAs, ReConFig '08, IEEE, 2008.
[19] Intel Atom Processor E6x5C Series Product Preview Datasheet, Intel Corporation.
[20] N. Sebastião, T. Dias, N. Roma, P. Flores, Integrated accelerator architecture for DNA sequences alignment with enhanced traceback phase, in: International Conference on High Performance Computing and Simulation, HPCS, 2010.
[21] Aeroflex Gaisler, SPARC V8 32-bit Processor LEON3/LEON3-FT CompanionCore Data Sheet, Version 1.0.3.
[22] M. Farrar, Striped Smith Waterman speeds database searches six times over other SIMD implementations, Bioinformatics 23 (2007).

Nuno Sebastião was born in Lisbon, Portugal, in […]. Since 2007, he has held an M.S. degree in Electrical and Computer Engineering from Instituto Superior Técnico (IST), Technical University of Lisbon, Lisbon, Portugal. In 2007 he joined the Instituto de Engenharia de Sistemas e Computadores R&D (INESC-ID) as a researcher of the Signal Processing Group (SPS), where he is currently working towards his PhD degree, also in Electrical and Computer Engineering. His main research interests are focused on Dedicated Multi-Core Computer Architectures and High-Performance Systems for Biological Sequence Alignment (DNA, RNA and proteins). He is a member of the IEEE Circuits and Systems Society.

Nuno Roma was born in Entroncamento, Portugal, in 1975. He received the Ph.D. degree in electrical and computer engineering from Instituto Superior Técnico (IST), Universidade Técnica de Lisboa, Lisbon, Portugal, in […]. He is currently an Assistant Professor with the Department of Computer Science and Engineering at IST, and a Senior Researcher of the Signal Processing Systems Group (SPS) of Instituto de Engenharia de Sistemas e Computadores R&D (INESC-ID). His research interests include specialised computer architectures for digital signal processing (including biological sequences processing and image and video coding/transcoding), embedded systems design and compressed-domain video processing algorithms. He has contributed more than 40 papers to journals and international conferences. He is a member of the IEEE Circuits and Systems Society and a member of ACM.

Paulo Flores received the five-year engineering degree, M.S. and Ph.D. degrees in electrical and computer engineering from the Instituto Superior Técnico, Technical University of Lisbon, Lisbon, Portugal, in 1989, 1993, and 2001, respectively. Since 1990, he has been teaching at Instituto Superior Técnico, Technical University of Lisbon, where he is currently an Assistant Professor in the Department of Electrical and Computer Engineering. He has also been with the Instituto de Engenharia de Sistemas e Computadores R&D (INESC-ID), Lisbon, since 1988, where he is currently a Senior Researcher in the Algorithms for Optimization and Simulation Group (ALGOS). His research interests are computer architecture and CAD for VLSI circuits in the area of embedded systems, test and verification of digital systems, and computer algorithms, with particular emphasis on optimization of hardware/software problems using satisfiability (SAT) models. He is a member of the IEEE Circuits and Systems Society.
