Solving Planted Motif Problem on GPU


Naga Shailaja Dasari
Old Dominion University, Norfolk, VA, USA

Ranjan Desh
Old Dominion University, Norfolk, VA, USA

Zubair M
Old Dominion University, Norfolk, VA, USA
zubair@cs.odu.edu

ABSTRACT

The (l,d) planted motif problem is defined as follows: given a set of n DNA sequences, each of length L, find M, the set of sequences (or motifs) of length l which have at least one d-neighbor in each of the n sequences. The planted motif problem is an important and well-studied problem in computational biology. Motif finding is useful for developing methods to obtain transcription factor binding sites, for sequence classification, and for building phylogenetic trees. The planted motif problem is difficult to solve, especially for the challenging instance sizes (15,5), (17,6), (19,7), and (21,8). The challenging instances are computationally intensive and require a large amount of memory. Several serial implementations have been proposed for solving this problem. The time required by these methods for solving large challenge instances is prohibitive. In this paper, we propose a parallel implementation on GPU that solves the challenge instance (21,8) in 1.1 hours. We are not aware of any sequential or parallel method that solves this challenge instance in better time. Additionally, to the best of our knowledge, there is no previous implementation of a parallel method to solve the planted motif problem on GPU.

1. INTRODUCTION

Motif finding is an important and well-studied problem in computational biology [18] [6]. Motif finding is useful for developing methods to obtain transcription factor binding sites, for sequence classification, and for building phylogenetic trees. Finding motifs is a computationally expensive and challenging task. Many variants of the motif finding problem can be found in the literature. One set of variants concentrates on finding repeated patterns in a single sequence, and the other set concentrates on finding patterns that appear in multiple sequences. The planted motif problem (PMP) falls in the second category.

An (l,d) planted motif problem can be defined as follows: given a set of n DNA sequences, each of length L, find M, the set of sequences (or motifs) of length l which have at least one d-neighbor in each of the n sequences. A d-neighbor of an l-mer (a sequence of length l) p is defined as an l-mer that is at a Hamming distance of d or less from p. In the rest of the paper, we refer to l as the enumeration length and d as the enumeration distance.

A number of approaches have been proposed to solve the motif finding problem, including PMP. Some of these approaches find approximate motifs [12], [2], [14] and others find exact motifs [9], [16], [17], [5], [15], [11], [3], [13], [10], [8]. These approaches can be classified into two types: iterative approaches and combinatorial approaches. Iterative approaches like Gibbs sampling and expectation maximization are based on position weight matrices, while combinatorial approaches like MITRA and WINNOWER are based on Hamming distances. The planted motif problem defined in this paper is based on Hamming distances. Most approaches to solve PMP are serial in nature and are difficult to parallelize. We recently proposed a new parallel approach to solve PMP called the BitBased approach [7]. BitBased is a simple, easily parallelizable approach. It outperforms all the approaches proposed so far to solve the planted motif problem. In this paper, we show how to implement BitBased on the GPU architecture. Iterative approaches like Gibbs sampling [19] and MEME [4] have been implemented on GPU, while no combinatorial approaches have been implemented on GPU so far.

BitBased is an enumeration based approach to solving the planted motif problem. It uses $n'$ bit arrays, $n' \le n$, of size $4^l$ bits each to find the planted motifs. Each bit in a bit array corresponds to an l-mer. The key idea of BitBased is to enumerate all the l-mers in the input sequences to find their d-neighbors and set the bits corresponding to the d-neighbors in the bit arrays. It then uses the bit arrays to find the planted motifs. BitBased therefore has a high memory requirement.

To reduce the memory requirement one can use the iterative BitBased approach at the expense of increased execution time. The iterative approach works by virtually partitioning the bit arrays into chunks such that a chunk fits in the available memory. We then make multiple passes of the original algorithm to find motifs. The number of passes is determined by the number of virtual partitions. A small chunk size results in more virtual partitions and thus increases the overall time to find motifs.
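The problem definition given above can be restated as a direct brute-force check. The following is a minimal sketch for illustration only, not the BitBased method; the function names `hamming`, `has_d_neighbor` and `is_planted_motif` are hypothetical and do not come from the paper.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def has_d_neighbor(motif, seq, d):
    """True if some l-mer of seq is within Hamming distance d of motif."""
    l = len(motif)
    return any(hamming(motif, seq[j:j + l]) <= d
               for j in range(len(seq) - l + 1))


def is_planted_motif(motif, seqs, d):
    """True if motif has at least one d-neighbor in every input sequence."""
    return all(has_d_neighbor(motif, s, d) for s in seqs)


# Toy input (made up for this sketch): three short sequences.
seqs = ["ACGTAC", "TTGTTT", "CCGAGG"]
print(is_planted_motif("CGT", seqs, 1))
print(is_planted_motif("CGT", seqs, 0))
```

Checking every one of the $4^l$ possible l-mers this way is what makes the challenging instances expensive, which motivates the bit-array formulation.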

GPUs are becoming increasingly popular in the world of parallel computing. GPUs, which were once used only for graphics, are now being used for different types of applications to achieve high performance. With the advent of CUDA, the task of programming for GPU has become much simpler. A GPU is a massively parallel, multi-threaded, manycore processor with hundreds of cores and huge computation power. It can execute thousands of threads concurrently. The programmer must carefully design her application to map to the GPU and effectively utilize the hardware.

In this paper we parallelize the BitBased approach [7] for GPU. Though the BitBased approach is easily parallelizable, it is challenging to implement it effectively on GPU because of its high memory requirement. BitBased uses bit arrays of size $4^l$ bits each to find planted motifs, and access to the bit arrays is very scattered. For example, to solve a (15,5) instance, BitBased needs bit arrays of size 128MB each. Such an amount of memory is only available in the GPU's global memory. But global memory has very high latency, especially when the access pattern is scattered. In such cases it is highly recommended to use the GPU's shared memory. But the shared memory is too small (16KB for Tesla C1060 and S1070) to accommodate the bit arrays. So we use the iterative BitBased approach and partition the bit arrays into chunks that fit in shared memory. We then optimize the approach by decreasing the register usage, which increases the occupancy of the GPU. We also reorder shared memory to avoid bank conflicts.

We have implemented BitBased on NVIDIA Tesla C1060, which has one GPU device, and NVIDIA Tesla S1070, which has four GPU devices. Tesla C1060 has 30 multiprocessors with 8 streaming processor cores each, while Tesla S1070 has 960 cores. We tested the (15,5), (17,6), (19,7), and (21,8) challenging instances. Tesla C1060 took 8 seconds, 1.52 minutes, 19.7 minutes and 4.5 hours respectively, and Tesla S1070 took 3 seconds, 23.9 seconds, 5 minutes and 69 minutes respectively.
These are the best timings obtained for the planted motif problem so far. We also compare with the results on a multicore architecture. We found that a single GPU shows a 13 to 14 times speed-up and 4 GPU devices show a 40 to 60 times speed-up compared to a single-core CPU.

2. THE BITBASED APPROACH

The BitBased approach is a simple, easily parallelizable approach to solving PMP. It is based on exhaustive enumeration of l-mers in the input sequences. Let $S = \{S_0, S_1, \dots, S_{n-1}\}$ be the set of n input sequences. An l-mer in $S_i$ starting at location j, $0 \le j \le L - l$, is represented as $S_i^l\{j\}$. The set of d-neighbors of all the l-mers in $S_i$ is represented by $N^i_{l,d}$. It is easy to see that the set of planted motifs is $M = \bigcap_{i=0}^{n-1} N^i_{l,d}$. Therefore, to find the planted motifs we first need to generate the sets $N^i_{l,d}$, $0 \le i \le n-1$, and then find the motifs, i.e. the l-mers that are present in all $N^i_{l,d}$, $0 \le i \le n-1$.

The main issue here is the memory requirement. To see the issue, consider the (15,5) instance. For a 15-mer, there can be $\sum_{k=0}^{5} \binom{15}{k} 3^k = 853{,}570$ 5-neighbors. For a sequence of length 600, which contains 586 15-mers, the size of $N^i_{l,d}$ can reach $586 \times 853{,}570 \approx 5 \times 10^8$ integers, which requires approximately 2GB of memory for a single sequence. To reduce the memory requirement we use bit arrays of size $4^l$ bits. Each bit in a bit array corresponds to an l-mer. For example, when l = 4, bit 0 represents AAAA, bit 1 represents AAAC, and bit 255 represents TTTT, assuming A=0, C=1, G=2, T=3. For the (15,5) instance we now require only $4^{15}$ bits, i.e. 128MB of memory, for each input sequence. The memory requirement can be reduced further using the approaches described in sections 2.1.1 and 2.1.2.

2.1 The basic BitBased approach

The basic BitBased approach consists of two phases, setting bits and finding motifs. In the setting bits phase, $N^i_{l,d}$, $0 \le i \le n-1$, is generated. $N^i_{l,d}$ is represented using bit arrays. A bit array $B_i$ is assigned to each input sequence $S_i$, $0 \le i \le n-1$. Each l-mer in sequence $S_i$ is enumerated to generate all its d-neighbors, and the bits are set in the bit array $B_i$ at the indexes corresponding to the d-neighbors. The index corresponding to an l-mer can be obtained by replacing A by 00, C by 01, G by 10 and T by 11.
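The index mapping and the memory arithmetic described above can be checked with a short sketch. The function names below are hypothetical, and `math.comb` assumes Python 3.8 or later.

```python
from math import comb

# 2-bit code per residue, as stated in the text: A=00, C=01, G=10, T=11.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def lmer_to_index(lmer):
    """Bit-array index of an l-mer: concatenate the 2-bit residue codes."""
    index = 0
    for residue in lmer:
        index = (index << 2) | CODE[residue]
    return index


def neighborhood_size(l, d):
    """Number of l-mers within Hamming distance d of a fixed l-mer."""
    return sum(comb(l, k) * 3 ** k for k in range(d + 1))


def bit_array_bytes(l):
    """Bytes needed for one bit array with one bit per possible l-mer."""
    return 4 ** l // 8


print(lmer_to_index("AAAC"), lmer_to_index("TTTT"))
print(neighborhood_size(15, 5))
print(bit_array_bytes(15) // 2 ** 20, "MB")
```

Storing explicit neighbor lists grows with the neighborhood size times the number of l-mers per sequence, while the bit array has a fixed size of $4^l$ bits regardless of d.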
For example, the index corresponding to the 4-mer GACT is 10000111 in binary. After the setting bits phase, a bit array $B_i$ has a bit set only if the l-mer corresponding to its index is present in $N^i_{l,d}$.

In the finding motifs phase, the equivalent of $M = \bigcap_{i=0}^{n-1} N^i_{l,d}$ is performed. We perform a logical AND operation on the bit arrays to generate a single bit array which can be used to obtain the planted motifs. The final bit array B is obtained by $B = B_0 \wedge B_1 \wedge \dots \wedge B_{n-1}$. A bit is set at index j in B only if the bit is set at index j in all the bit arrays $B_i$, $0 \le i \le n-1$. In other words, the l-mer corresponding to the index j is present in all $N^i_{l,d}$, $0 \le i \le n-1$, making the l-mer a planted motif. Therefore the planted motifs are exactly the l-mers corresponding to the indexes in B at which a bit is set.

To reduce the memory requirement further, we use two modifications to the basic approach: Increment Motifs and Filter Motifs. These modifications, where applicable, not only reduce the memory requirement but also improve the performance.

2.1.1 Increment Motifs

This modification is based on the observation that, given the set of motifs for the $(l-1, d)$ instance together with their d-neighbors and the corresponding distances in all the n sequences, we can find the motifs for the $(l, d)$ instance in O(n) time. Let p be a motif for the $(l-1, d)$ instance. Let $(j_0, j_1, \dots, j_{n-1})$ and $(d_0, d_1, \dots, d_{n-1})$ be the locations of its d-neighbors in the n sequences and their distances respectively. We can say that $p \circ R$, where $R \in \{A, C, G, T\}$ and $\circ$ is the append operation, has a d-neighbor in sequence $S_i$ if it satisfies any of the following conditions:

1. the residue at location $j_i + l$ is R.
2. $d_i < d$.

For each motif p for the $(l-1, d)$ instance, we check whether $p \circ A$, $p \circ C$, $p \circ G$ and $p \circ T$ are motifs for the $(l, d)$ instance using the above conditions. Therefore, to find $(l, d)$ motifs, we can first find $(l', d)$ motifs for some $l' < l$ and then use the above logic incrementally to find $(l, d)$ motifs. With decreasing values of $l'$, the number of $(l', d)$ motifs increases exponentially, and so does the time spent in increment motifs.
Therefore the value of $l'$ must be carefully chosen.

2.1.2 Filter Motifs

Instead of setting bits and finding motifs for all n sequences, this modification first finds the motifs for $n'$ sequences, where $n' \le n$. These motifs are called candidate motifs. The candidate motifs are then filtered to find the final planted motifs. This is done by checking, for each candidate motif, whether it is present in all the remaining $n - n'$ input sequences. This modification reduces the memory requirement because we now require only $n'$ buffers instead of n buffers. Decreasing the value of $n'$ reduces not only the space requirement but also the time, because the time taken by the BitBased approach is dominated by the setting bits phase: with a smaller $n'$ we need to set the bits for fewer sequences. But if the value of $n'$ is chosen too low, the time spent in filtering motifs increases, and so does the overall time. So it is important to choose an optimum value for $n'$.

2.2 The Iterative BitBased Approach

This is a crucial modification to the basic BitBased approach and is also the basis for implementing BitBased on GPU. As we have seen previously, BitBased has a high memory requirement, and it might not always be possible to satisfy it. In such cases, we can use the iterative BitBased approach. The iterative BitBased approach solves the planted motif problem with a much lower memory requirement, but at the expense of increased time due to the increased number of operations. The iterative approach works by reusing the available memory to accomplish the required task, which is to find planted motifs.

Let $l_{max} = \max\{\, i : 4^i \text{ bits of memory can be allocated} \,\}$. We virtually partition the bit array of size $4^l$ bits into $4^{l - l_{max}}$ chunks, each of size $4^{l_{max}}$ bits. In the i-th iteration, the l-mers of the input sequences are enumerated in such a way that the bits are only set in the i-th chunk. After finding motifs in the i-th chunk, the same memory is reused for the (i+1)-th iteration. Note that when a bit array of size $4^l$ bits is partitioned into $4^{l - l_{max}}$ chunks, the first $l - l_{max}$ residues corresponding to the indexes in a chunk are all the same.
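The relation between a chunk and the shared leading residues can be made concrete with a small sketch. The helper names are hypothetical; the 2-bit encoding (A=0, C=1, G=2, T=3) is the one used in the text.

```python
ALPHABET = "ACGT"  # position in this string is the 2-bit residue code


def lmer_to_index(lmer):
    """Bit-array index of an l-mer under the 2-bit encoding."""
    index = 0
    for residue in lmer:
        index = (index << 2) | ALPHABET.index(residue)
    return index


def index_to_lmer(index, length):
    """Inverse of lmer_to_index for an l-mer of the given length."""
    residues = []
    for _ in range(length):
        residues.append(ALPHABET[index & 3])
        index >>= 2
    return "".join(reversed(residues))


def split_index(index, l_max):
    """Split a full index into (chunk number, offset inside the chunk)."""
    return index >> (2 * l_max), index & ((1 << (2 * l_max)) - 1)


def chunk_prefixes(l, l_max):
    """Leading residues shared by all l-mers of each chunk, in chunk order."""
    l_diff = l - l_max
    return [index_to_lmer(c, l_diff) for c in range(4 ** l_diff)]


prefixes = chunk_prefixes(17, 15)
print(len(prefixes), prefixes[:3])
print(split_index(lmer_to_index("ACGTT"), 3))
```

Because the chunk number is simply the high-order bits of the index, restricting an iteration to one chunk is the same as fixing the first $l - l_{max}$ residues of every l-mer considered in that iteration.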
For example, when we partition $4^{17}$ bits into 16 partitions, all the 17-mers corresponding to the indexes in the first chunk start with AA, those in the second chunk start with AC, and so on. To effectively enumerate the l-mers, we reduce the enumeration length from l to $l_{max}$ as shown in Algorithm 1. Note that the more chunks the bit array is partitioned into, the smaller the enumeration length.

Algorithm 1 IterativeApproach
Input: n, l, $l_{max}$
Output: M, the set of (l, d) planted motifs
Let $l_{diff} = l - l_{max}$
$M = \emptyset$
for $idx = 0$ to $4^{l_{diff}} - 1$ do
    get the sequence p of length $l_{diff}$ that corresponds to $idx$
    {setting the bits in the $idx$-th chunk}
    for $i = 0$ to $n - 1$ do
        for $j = 0$ to $L - l$ do
            get distance $d'$ between p and $S_i^{l_{diff}}\{j\}$
            generate $N^i_{l_{max},\,d-d'}\{j + l_{diff}\}$
            for each $l_{max}$-mer q in $N^i_{l_{max},\,d-d'}\{j + l_{diff}\}$ do
                get index $idx'$ corresponding to q
                set $B_i[idx'] = 1$
            end for
        end for
    end for
    {finding motifs in the $idx$-th chunk}
    $B = B_0 \wedge B_1 \wedge \dots \wedge B_{n-1}$
    for $i = 0$ to $4^{l_{max}} - 1$ do
        if $B[i] = 1$ then
            Let r be the $l_{max}$-mer corresponding to $i$
            Append r to p and add the appended sequence to M
        end if
    end for
    clear all the bit arrays $B_0$ to $B_{n-1}$
end for

3. OVERVIEW OF GPU

A GPU is a massively parallel, multi-threaded, manycore processor. Each GPU device is an array of streaming multiprocessors, each of which in turn consists of a number of scalar processor cores. A GPU is capable of running thousands of threads concurrently. It is able to do so by employing the SIMT (single-instruction, multiple-thread) architecture. The threads are created, scheduled and executed in groups called warps. All the threads in a warp share a single instruction unit. The threads in a GPU are extremely lightweight, and they can be created and executed with zero scheduling overhead.

CUDA is a parallel programming model that enables programmers to develop scalable applications to be executed on GPU. It exposes a set of extensions to C and C++. A CUDA program is organized into sequential host code, which is executed on the CPU, and calls to functions called kernels, which are executed on the GPU. A kernel contains the device code that is executed by the GPU threads in parallel. CUDA threads can be grouped into thread blocks. Using CUDA one can define the number of blocks and the number of threads per block that execute a kernel.

3.1 Memory organization

The device RAM is virtually and physically divided into different types of memory: global, local, constant and texture memory. Apart from device RAM, the threads can also access on-chip shared memory and registers, as shown in Figure 1. Global memory and texture memory have the highest latency compared to the other types of memory. A thread has exclusive access to its local memory. All the threads in a block can access on-chip shared memory. All the threads across all thread blocks have access to global, texture and constant memory. Constant and texture memories are read-only, while global memory is both readable and writable.

Figure 1: GPU Memory

3.2 Performance considerations

A CUDA program should be designed to take advantage of the available resources for better performance. Since the GPU uses a SIMT architecture in which all the threads in a warp use a single instruction unit, the best results are achieved when all the threads in a warp execute without diverging. When threads diverge they are executed serially, thus decreasing performance.

Global memory has very high latency. But by coalescing the global memory accesses, high throughput can be achieved. For example, if the threads in a warp access contiguous addresses, then only two transactions are issued. But if the threads access separate addresses, then 32 transactions are issued.

Shared memory is divided into equally sized blocks called banks. If two threads in a half warp access the same bank, this results in a bank conflict and the accesses are serialized, thus reducing the effective bandwidth. In order to avoid this, the programmer should try to make sure that the threads in a half warp access different banks.

The memory latencies can be hidden by executing other warps when a warp is paused. So to keep the hardware busy there should be enough active warps. Occupancy is the ratio of the number of active warps per multiprocessor to the maximum possible number of active warps. If the occupancy is too low, the memory latency cannot be hidden, resulting in performance degradation. So the programmer should try to increase the occupancy to use the hardware effectively.

4. PARALLELIZING BITBASED ON GPU

Though BitBased is an easily parallelizable approach, it is not straightforward to implement it on the GPU. The main issue is that BitBased has high memory requirements. As we have seen in section 2, it requires $4^l$ bits of memory for each bit array. Such a large amount of memory is only available in global memory. But global memory has the drawback of high latency. Furthermore, the access pattern of the bit arrays is very scattered, making it difficult to use the coalescing feature of global memory. So to avoid using global memory, we partition the bit arrays into smaller chunks that fit in shared memory. This is similar to the iterative approach discussed in section 2.2. The only difference is that instead of iterating, we assign the task of each iteration to a GPU thread block. Let t be the number of threads in each block. To solve an (l,d) instance we first find $l'$ and $n'$ as explained in [7].

Let $l_s = \max\{\, i : 4^i \cdot n' \text{ bits of memory can be allocated in shared memory} \,\}$. The bit arrays are partitioned into chunks of $4^{l_s}$ bits of memory. Each chunk is assigned to a single block. Thus the number of blocks is $4^{l - l_s}$. The threads in each block enumerate the l-mers in such a way that they generate the d-neighbors only in the chunk of bit arrays assigned to the block. We use the same logic as in the iterative approach. Note that the enumeration length here is $l_s$.

The t threads in a block are responsible for setting bits in the chunk of bit arrays assigned to the block. The l-mers are distributed among the t threads, with consecutive l-mers assigned to consecutive threads. After all the threads have finished enumerating the l-mers and setting bits, the threads enter the find motifs phase. After finding the candidate motifs, they must be filtered by checking whether they are present in the remaining $n - n'$ input sequences. We perform this step in a separate kernel called FilterMotifs to avoid divergence of threads. So a thread, after finding a candidate motif, does not perform the filtering phase itself but writes the candidate motif to global memory so that it can be accessed in the FilterMotifs kernel. To write to global memory, we use a variable called gindex. When a thread finds a candidate motif, it first atomically increments gindex and then writes the candidate motif to global memory at the index returned by the atomic operation. This prevents different threads in different blocks from writing to the same index in global memory.

After finding the candidate motifs, filtering them is straightforward. Let c be the number of candidate motifs. For the FilterMotifs kernel, we need c/t blocks. The c candidate motifs are equally distributed among the blocks. Within a block, the candidate motifs are further distributed among the threads. Each thread is assigned a candidate motif and checks whether the candidate motif has d-neighbors in the remaining $n - n'$ input sequences, which were not considered during the FindCandidateMotifs kernel. If a thread finds that the candidate motif is a planted motif, it writes it to global memory using the same logic explained previously. We improve this implementation by using two modifications: bit representation, and repartitioning and reordering.
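The two-kernel structure described above can be sketched sequentially. This is a minimal illustration under stated assumptions, not the authors' CUDA code: each iteration of the outer loop stands in for one thread block, Python sets stand in for the shared-memory bit-array chunks, appending to a list stands in for the atomic increment of gindex, and the names `neighbors`, `find_candidates` and `filter_motifs` are hypothetical.

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def neighbors(lmer, d):
    """All strings within Hamming distance d of lmer."""
    d = min(d, len(lmer))
    found = {lmer}
    for k in range(1, d + 1):
        for positions in combinations(range(len(lmer)), k):
            for residues in product(ALPHABET, repeat=k):
                chars = list(lmer)
                for pos, residue in zip(positions, residues):
                    chars[pos] = residue
                found.add("".join(chars))
    return found


def find_candidates(seqs, l, d, l_s):
    """Candidate motifs of the given sequences, computed chunk by chunk."""
    l_diff = l - l_s
    candidates = []  # stands in for the global-memory output array
    for prefix in map("".join, product(ALPHABET, repeat=l_diff)):
        chunks = []  # one chunk (set of l_s-mers) per input sequence
        for s in seqs:
            bits = set()
            for j in range(len(s) - l + 1):
                used = hamming(prefix, s[j:j + l_diff])
                if used <= d:
                    # Only the last l_s residues are enumerated.
                    bits |= neighbors(s[j + l_diff:j + l], d - used)
            chunks.append(bits)
        # Logical AND of the chunks, then emit prefix + suffix.
        for suffix in sorted(set.intersection(*chunks)):
            candidates.append(prefix + suffix)
    return candidates


def filter_motifs(candidates, remaining, d):
    """Keep candidates that have a d-neighbor in every remaining sequence."""
    def present(c, s):
        return any(hamming(c, s[j:j + len(c)]) <= d
                   for j in range(len(s) - len(c) + 1))
    return [c for c in candidates if all(present(c, s) for s in remaining)]


# Toy input (made up for this sketch): n = 4 sequences, n' = 2.
seqs = ["ACGTAC", "TTGTTT", "CCGAGG", "GCGTAA"]
candidates = find_candidates(seqs[:2], l=3, d=1, l_s=2)
print(filter_motifs(candidates, seqs[2:], d=1))
```

On a GPU the chunks are independent, so the outer loop can be distributed over thread blocks without any communication other than the shared output index.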
4.1 Bit Representation

As we have seen in section 3, each multiprocessor has a limited number of registers, and this implementation is limited by the number of registers. Since each thread consumes a large number of registers, the number of threads per block is small, and hence so is the occupancy of the GPU. To improve the occupancy and performance, we need to reduce the register usage as much as possible.

Each input sequence of length L has $L - l + 1$ l-mers. If the input sequence is represented using a character array, then an l-mer requires l bytes of memory. Instead we can represent an l-mer using an integer, with 2 bits for each residue [1] [15]. For example, the 4-mer CGGA can be represented using an integer whose binary representation is 01101000. By doing so, an l-mer with $l \le 16$ needs only 4 bytes, and one with $l \le 32$ needs 8 bytes of memory. So we convert the input character array into an integer array in which the integer at index i represents the l-mer starting at location i in the input sequence. After this conversion, a GPU thread only needs to read one integer rather than l bytes. This not only reduces the register usage but also reduces the I/O time, as only one integer needs to be read. We use texture binding to read the input sequences.

4.2 Repartitioning and reordering

Figure 2: (a) The integer array is partitioned into 16 chunks so that the i-th thread in a half warp only accesses the i-th chunk. (b) The integer array is reordered such that the i-th thread in a half warp only accesses the i-th bank.

Table 1: Comparison with multicore. Rows: number of GPU devices. For each of the instances (15,5), (17,6), (19,7) and (21,8), the columns give the time (in seconds, seconds, minutes and hours respectively), the speed-up over a 1-core CPU, and the speed-up over a 16-core CPU.

We have seen in section 3 that the shared memory is organized into banks. Successive 32-bit words are assigned to successive banks. We implement a bit array using a 32-bit integer array, so successive integers are assigned to successive banks. Each thread executing the kernel enumerates $l_s$-mers in the input sequence and may set bits in any of the integers, and therefore in any bank, resulting in bank conflicts. In order to avoid bank conflicts we repartition the integer array and then reorder it.

The integer array, which was already partitioned to fit in shared memory, is repartitioned into 16 chunks (as there are 16 banks in Tesla). The i-th thread in a half warp enumerates the $l_s$-mers so as to set bits only in the i-th chunk. We then reorder the integer array such that the i-th thread in a half warp only accesses the integers in the i-th bank. For example, when $l_s = 6$, each bit array has $4^6$ bits and is implemented using an integer array of size 128. We partition the integer array into 16 chunks, each of size 8 integers. Figure 2(a) shows the partitioned bit array. The first thread in a half warp (threads 0, 16, 32, ...) only accesses the first chunk, i.e. integers 0 to 7. Now we reorder the integers in the bit array such that the integers 0, 1, ..., 7 belong to the same bank. Figure 2(b) shows the reordered integer array. It can be seen from the figure that threads 0 and 16 only access the integers in bank 0, and threads 15 and 31 only access the integers in bank 15. Therefore there will be no bank conflicts after reordering the integer array.
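One index mapping consistent with this description can be sketched as follows. The exact layout in Figure 2(b) is not recoverable from the text, so the formula below is an assumption: it places the integers of chunk i at stride 16 so that they all fall in bank i, given that successive 32-bit words map to successive banks. The function names are hypothetical.

```python
NUM_BANKS = 16  # banks per multiprocessor, as stated in the text


def reordered_position(k, n_ints):
    """Physical position of logical integer k after reordering.

    Logical integers are grouped into NUM_BANKS chunks of equal size;
    the integers of chunk c are placed at positions c, c + 16, c + 32, ...
    """
    chunk_size = n_ints // NUM_BANKS
    chunk, offset = divmod(k, chunk_size)
    return offset * NUM_BANKS + chunk


def bank(position):
    """Bank holding the 32-bit word at the given position."""
    return position % NUM_BANKS


# l_s = 6: 4**6 bits stored in 128 32-bit integers, 8 integers per chunk.
n_ints = 4 ** 6 // 32
print([reordered_position(k, n_ints) for k in range(8)])
print({bank(reordered_position(k, n_ints)) for k in range(8)})
```

With this layout, thread i of a half warp touches only bank i, so two threads of the same half warp never address the same bank.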
In addition to avoiding bank conflicts, repartitioning and reordering has another advantage. Partitioning a bit array into chunks reduces the enumeration length. Because we partition the integer array into 16 chunks, the enumeration length reduces from $l_s$ to $l_s - 2$. Note that the maximum enumeration distance is equal to the enumeration length. For example, when the enumeration length is 4, the maximum enumeration distance is 4. So the maximum enumeration distance also decreases by 2. Thus we only need to enumerate to generate $(l_s - 2)$-neighbors instead of $l_s$-neighbors. This reduces the register consumption of each thread, and hence we can increase the number of threads per block. Having more threads per block increases the occupancy, resulting in better performance.

5. EXPERIMENTAL RESULTS

We have implemented BitBased on NVIDIA Tesla C1060 and NVIDIA Tesla S1070, both running at 1.3GHz. C1060 has 30 multiprocessors with 8 scalar processor cores each. S1070 has four GPU devices with 240 cores each. We have tested our code with 20 input sequences of length 600 each. We tested it on random sequences with motifs planted at random positions in the 20 sequences. We have used $n' = 6$ for all our experiments. C1060 and S1070 both have a shared memory of 16KB per multiprocessor. As described in section 4, we need to find the value of $l_s$, where $l_s = \max\{\, i : 4^i \cdot n' \text{ bits of memory can be allocated in shared memory} \,\}$. We have found that 6 is the most suitable value for $l_s$. Table 1 shows the performance results obtained on 1 to 4 GPUs.

We have also experimented with the approach using 1 to 120 multiprocessors on Tesla S1070, with only one active block for each multiprocessor and the load distributed equally among the multiprocessors. It can be seen from Figure 3 that the approach scales well with the number of multiprocessors. We have also collected results using different numbers of GPU devices. Figure 4 shows the speed-up of the approach with respect to the number of GPU devices. It can be seen clearly that the approach scales well with the increase in the number of GPU devices.

Figure 3: Plot showing the speed-up of the approach with respect to the number of multiprocessors (x-axis: number of multiprocessors; y-axis: speed-up).

Figure 4: Plot showing the speed-up of the approach with respect to the number of GPU devices (x-axis: number of GPU devices; y-axis: speed-up; series: (15,5), (17,6), (19,7)).

5.1 Comparison with multicore

The BitBased approach was implemented on a machine with four quad-core 2.67 GHz Intel Xeon X5550 processors, for a total of 16 cores, using 1GB of memory. The basic BitBased approach was used for (15,5) and lower instances, and the iterative BitBased approach was used for (17,6) and higher instances. Table 1 shows the results obtained on the multicore machine. It shows the speed-up obtained on GPU with respect to a 1-core CPU and a 16-core CPU. The actual results for multicore are discussed in [7]. It can be seen that a single GPU device is 13 to 14 times faster than a single core of the Xeon X5550 machine. It also performs better than the 16-core Xeon machine. 4 GPU devices are 40 to 60 times faster than a single-core CPU and 4 to 6 times faster than the 16-core CPU.

6. CONCLUSION

We presented an efficient parallel approach for solving the planted motif problem on GPU. This approach is a modification of the BitBased approach that was originally proposed for Intel-based multicore architectures. The BitBased approach had to be modified for the GPU architecture. The proposed implementation solves the challenge instance (21,8) of the planted motif problem in 1.1 hours. We are not aware of any sequential or parallel method that solves this challenge instance in better time. Additionally, to the best of our knowledge, there is no previous implementation of a parallel method to solve the planted motif problem on GPU.

7. REFERENCES

[1] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3).
[2] J. Buhler and M. Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2).
[3] A. M. Carvalho, A. T. Freitas, A. L. Oliveira, and M.-F. Sagot. A highly scalable algorithm for the extraction of cis-regulatory regions. In APBC.
[4] C. Chen, B. Schmidt, W. Liu, and W. Müller-Wittig. GPU-MEME: Using graphics hardware to accelerate motif finding in DNA sequences. In PRIB.
[5] F. Y. L. Chin and H. C. M. Leung. Voting algorithms for discovering long motifs. In APBC.
[6] M. K. Das and H.-K. Dai. A survey of DNA motif finding algorithms. BMC Bioinformatics, 8(S-7).
[7] N. S. Dasari, R. Desh, and Z. M. An efficient multicore implementation of planted motif problem. In Proceedings of the International Conference on High Performance Computing and Simulation, pages 9-15.
[8] J. Davila, S. Balla, and S. Rajasekaran. Space and time efficient algorithms for planted motif search. In International Conference on Computational Science (2).
[9] J. Davila, S. Balla, and S. Rajasekaran. Fast and practical algorithms for planted (l, d) motif search. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4.
[10] E. Eskin and P. A. Pevzner. Finding composite regulatory patterns in DNA sequences. In ISMB.
[11] L. Marsan and M.-F. Sagot. Extracting structured motifs using a suffix tree - algorithms and application to promoter consensus identification. In RECOMB.
[12] P. A. Pevzner and S.-H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. In ISMB.
[13] N. Pisanti, A. M. Carvalho, L. Marsan, and M.-F. Sagot. Risotto: Fast extraction of motifs with mismatches. In LATIN.
[14] A. L. Price, S. Ramabhadran, and P. A. Pevzner. Finding subtle motifs by branching from sample strings. In ECCB.
[15] S. Rajasekaran, S. Balla, and C.-H. Huang. Exact algorithms for planted motif problems. Journal of Computational Biology, 12(8).
[16] M.-F. Sagot. Spelling approximate repeated or common motifs using a suffix tree. In LATIN.
[17] M. Tompa. An exact method for finding short motifs in sequences, with application to the ribosome binding site problem. In ISMB.
[18] M. Tompa, N. Li, T. Bailey, G. Church, B. De Moor, E. Eskin, A. Favorov, M. Frith, Y. Fu, W. Kent, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology, 23(1).
[19] L. Yu and Y. Xu. A parallel Gibbs sampling algorithm for motif finding on GPU. Parallel and Distributed Processing with Applications, International Symposium on, 2009.
