Achieving High Throughput Sequencing with Graphics Processing Units
Su Chen 1, Chaochao Zhang 1, Feng Shen 1, Ling Bai 1, Hai Jiang 1, and Damir Herman 2

1 Department of Computer Science, Arkansas State University, Jonesboro, AR 72467, USA
2 Department of Internal Medicine, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA

Abstract

High throughput sequencing has become a powerful technique for genome analysis since the concept emerged in recent years. Currently, there is a huge demand from patients with genetic diseases that cannot be satisfied, owing to limited computational power. Although several software packages implement the most efficient current algorithms for various types of sequencing problems, CPUs are too expensive to process the endless stream of data economically, because they are not designed for data-parallel problems. The latest Fermi architecture released by NVIDIA provides a considerable number of streaming processors, a larger register file, and 1 MB of cache, which makes it very competitive for data-parallel processing. This paper implements a simple sequence alignment method on the GPU and compares real-world performance between CPU and GPU. Experiments show that the GPU has good potential for problems of this kind.

Keywords: High Throughput Sequencing, Graphics Processing Unit

1. Introduction

Nowadays, people pay more and more attention to health care, and advanced devices are designed to analyze samples from patients. At the molecular level, the amount of data becomes extremely large, which demands more computational power. The emerging High Throughput Sequencing (HTS) technology [6], [7] offers bioinformaticians a better way to deal with this problem, and many multithreaded programs, such as Bowtie [3], BWA [4], and SOAP2 [7], have been developed for practical use.
However, sequence alignment is in some sense too easy a job for sequential CPUs, which makes smart chips like CPUs too expensive for it. Since NVIDIA released its new Fermi architecture, which provides 512 cores on one chip and gigabytes of memory, the GPU appears to have great potential to take over this job and do it faster and more economically. In this paper, a simple way is proposed to do exact matching between massive numbers of DNA target fragments and mRNA reference sequences, and performance comparisons between its CPU and GPU versions are discussed.

The paper is organized as follows: Section 2 gives our method, including the indexing and searching phases. Section 3 discusses the detailed designs with respect to architecture. In Section 4, we discuss the experimental results. Section 5 covers related work, and conclusions are drawn in Section 6.

2. Algorithm Design

The algorithm idea used in this paper comes from the Burrows-Wheeler Transformation (BWT), which was first proposed for data compression and was later developed into an efficient index for sequence alignment. Fig. 1 illustrates how the original transformation works.

Fig. 1: Burrows-Wheeler Transformation. The seven rotations of "acaacg$" (caacg$a, aacg$ac, acg$aca, cg$acaa, g$acaac, $acaacg, acaacg$) are sorted into $acaacg, aacg$ac, acaacg$, acg$aca, caacg$a, cg$acaa, g$acaac; the last column, "gc$aaac", is the transform.

The concept of BWT is to index the reference sequence by hashing its elements into a special order, which benefits the later searching phase and reduces the search time complexity from the O(n lg(n)) of the brute-force method to O(lg(n)). A concrete implementation of BWT can be described as follows:

1) Put a $ at the end of the reference sequence.
2) Copy the current sequence, shift the new sequence right by one position, and put it below the previous one; repeat n times, given that the original sequence length is n.
3) Sort the rows of the generated block lexicographically, using the alphabet order $, a, c, g, t.
4) Take the last column of the sorted matrix.
2.1 A New Indexing Method

Inspired by BWT, we designed another way to build the index. The procedure of the new method is shown in Fig. 2 and explained in more detail below.

1) We still add a $ at the end of the reference sequence.
2) Generate the same block as BWT does, but this time assign order numbers to a, c, g, and t separately in the first column.
3) In this approach, we only sort the first column of the matrix, making sure the smaller order numbers of a, c, g, and t stay above the larger ones.
4) We take the last column as the new index.

Fig. 2: New indexing method. The rows of the block are reordered by first-column letter and order number; the last column is the new index.

2.2 Searching Algorithm

The newly proposed method has a brute-force searching nature, but by using the index well, several improvements can be achieved. The searching procedure, illustrated in Fig. 3 for the query "aac", is very straightforward.

Fig. 3: Searching with the new index (the query "aac" is matched character by character against the indexed rows).

2.3 Making a Secondary Index

Now we turn to improving the performance of our searching algorithm. We can build a secondary index on top of the first-level index generated by the method above. For the first column, since we will refer to the beginning and end of the a, c, g, and t runs many times, we can save space and record only these boundary positions for the four letter types. This saves not only searching time but also a lot of space in the index file. For the last column, since the a, c, g, and t entries there are not clustered, we can create four arrays, one per letter, and record the positions at which each letter occurs in the last column. This prevents the searching algorithm from visiting positions of wrong letters; for example, if we want an a, we go only to the positions recorded for a in the last column and skip letters of other types.

Fig. 4: Secondary index generation (boundary ranges per letter for the first column; occurrence-position lists per letter for the last column).

Generally, although it does not fundamentally reduce the time complexity of the searching algorithm, this indexing method avoids much unnecessary work while generating a simple index in O(n) time. The CPU and GPU performances discussed in the experiment section are based on this algorithm.
3. I/O Involved Program Design

3.1 Single-threaded Code Design for CPU

Since the indexing phase of our algorithm costs little time compared with the searching phase, in which an unpredictable number of target sequences arrive as input, we include the indexing time in the total searching time in this paper. Another important advantage of this is that we avoid the I/O cost of loading indices from the hard disk, which takes much more time than the indexing phase when the reference sequence file is very large. When everything is done in memory and the hard disk is never touched, searching usually becomes faster. Fig. 5 illustrates how the data pertaining to our program flows between memory and the hard disk.

1) Load the reference sequence file from the hard disk.
2) Generate the index for the reference sequence in memory.
3) Remove the original sequence file from memory, leaving only the index.
4) Load the next target sequence file from the hard disk into memory.
5) Do the searching for the current batch of target sequences and save the results.
6) Remove the current batch of target sequences.
7) Repeat 4) to 6) for all target files.

3.2 CUDA C Code Design for a Single GPU

Fig. 6 shows the procedure for a machine with a GPU dealing with our problem. There are altogether thirteen steps of
execution and data transfer for the indexing and searching phases, which are explained below.

Fig. 5: Data locality control for the CPU implementation.

Fig. 6: Work and data scheduling for the GPU implementation.

1) Load the reference sequence file from the hard disk to CPU memory.
2) Copy the reference sequences from CPU memory to GPU memory.
3) Remove the reference sequences from CPU memory.
4) Generate the index for the reference sequence using the GPU.
5) Remove the original sequence file from GPU memory, leaving only the index.
6) Load a target sequence file from the hard disk to CPU memory.
7) Copy the current batch of target sequences from CPU memory to the GPU.
8) Remove the present target sequences from CPU memory and load the next batch of target sequences.
9) The GPU does the searching and saves the results in its memory.
10) Remove the current batch of target sequences from the GPU.
11) Repeat 6) to 10) for all target files.
12) Copy the results back to CPU memory and save them to disk.
13) Remove the results from GPU and CPU memory.

3.3 Noteworthy Differences between CPU and GPU Implementations

1) The GPU version has an initialization time for the first booting of the device, usually up to 2-3 seconds, which the CPU version does not. So for small cases that run very fast on CPUs, GPUs have no advantage.
2) Data transfer time between host and device memory must be considered, since the data volume in our case is usually very large.
3) GPUs can do simple calculations very fast if the programs are well designed, so the indexing phase, like the searching phase, can also be considered for the GPU if the data transfer time can be ignored. If indexing takes only a little time, there is not much need to do it on the GPU. The searching phase is usually handled well by GPUs, since the number of target sequences is always very large. Speedups of dozens to hundreds of times can be expected for the searching phase if GPUs are adopted.
4. Experimental Results

The sequential code was written in C and tested on a machine with two Intel Xeon E5504 quad-core CPUs (2.0 GHz, 4 MB cache); the GPU code was written in CUDA C and tested on the same machine with two NVIDIA Tesla 20-Series C2050 GPUs. In the following, a performance comparison between the two is given and the GPU speedup rate is calculated. The time proportion of each part of the whole algorithm on CPUs and GPUs is also illustrated and discussed separately.

4.1 CPU vs. GPU Searching Time

Block sorting is the most time-consuming part of building the index for reference strings. Fig. 7 gives the curves relating time cost to the number of reference string combinations (one reference string length = 3,). From Fig. 7 we can see that, for the algorithm proposed in this paper, searching takes a large portion of total execution time on the CPU side, while on the GPU side it takes a relatively smaller portion. This is because the GPU runs the searching part much faster than the CPU; given that I/O and data transfer time grow proportionally as the number of target sequences increases, the GPU saves more absolute time as the problem scale becomes larger.
Fig. 7: CPU & GPU timing with and without I/O (execution time in seconds versus number of target sequences, length = 87).

Fig. 9: GPU speedup rate with I/O.

4.2 Speedup on GPU

Fig. 8 and Fig. 9 illustrate two speedup curves: pure searching time, and searching time including I/O and data transfer. We can see that for the pure searching algorithm, the GPU version beats the CPU version by up to 14 times, while about 5 times speedup is achieved when I/O and data transfer are taken into consideration. Since the algorithm is not fully optimized, there should still be potential for GPUs to speed this problem up further.

Fig. 8: GPU speedup rate without I/O.

4.3 Overhead Breakdown for the CPU & GPU Approaches

1) I/O from the hard disk. For both the CPU and GPU implementations, this part takes the same time and is unavoidable. The bandwidth from hard disk to memory has always been a bottleneck for similar problems. However, if instead of the local hard disk we use InfiniBand to load data from a remote database in parallel, the performance of both the CPU and GPU versions can be improved; the GPU version might benefit more, because it processes data much faster and needs more data in a given time to feed its greater computational power.

2) Data transfer between host and device memory. Currently, NVIDIA GPUs use the PCIe bus to transfer data between host and device memory, with a capacity of up to 4 GB/s for one-way transmission and 8 GB/s for two-way. This speed can usually keep up with the GPU's computational power and is not a bottleneck for now.
A noteworthy point here is that the asynchronous memory copy technique should be used when the target sequence data is too large for the GPU to load at once. Asynchronous copies between host and device memory can overlap with GPU computation, so either the copy time or the computing time can be hidden by this overlapping; which portion is hidden depends on their respective costs.

3) Time for indexing. For the algorithm presented in this paper, the indexing time can almost be ignored, since I/O and searching time dominate. However, in real applications, such as BWT, the index is usually made more efficient to use, but building it also takes more time, and that overhead cannot be ignored. In such cases, indexing time should be treated as an important portion of the whole system.

4) Time for searching. This portion of time depends on many factors, including indexing efficiency, I/O speed, choice of device, and task partitioning design. Basically, more efficient indexing reduces searching time, and higher I/O speed positively influences performance. For the choice of device, the GPU is better than the CPU from an economic standpoint, since it provides more powerful tools for searching. However, whether a partitioning design is good is hard to tell from the surface of a specific problem; calculations should be done carefully to find the optimal selection.
5. Related Work

RNA sequencing was one of the earliest forms of nucleotide sequencing. The major landmarks of RNA sequencing are the sequence of the first complete gene and the complete genome of bacteriophage MS2, identified and published by Walter Fiers et al. in 1972 [8] and 1976 [2]. In the late 2000s, high-throughput sequencing (HTS) emerged. Li R. (2008, 2009) published several papers on BWT applications for short read alignment [6], [7]. Li H. (2008, 2009) [5], [4] and Langmead (2009) [3] also published several works on memory-efficient alignment. In recent years, several alignment programs, such as Bowtie [3], BWA [4], and SOAP2 [7], have been released. In 2009, Sinnott-Armstrong et al. presented a paper on accelerating epistasis analysis in human genetics with the NVIDIA GeForce GTX 280 and the PyCUDA programming tool [9]. Davis et al. (2011) made a real-world performance comparison of SNPrank across programming platforms (Python, Java, and Matlab) and hardware environments (single-threaded, multithreaded, and GPU), where the GPU languages were restricted to Matlab and Python [1] and the GPU was an NVIDIA Tesla. They reported that for small cases the CPU always performs better, because of the data transfer to and from the device.

6. Conclusions and Future Work

This paper proposes a way to implement fast sequence alignment on the latest generation of NVIDIA GPUs. From the experimental results, we can see that the GPU speeds up the searching phase considerably compared with the CPU, but incurs a constant delay for its necessary data transfer phase. This shows that the GPU has good potential for high throughput sequencing. If the bandwidth bottleneck of loading data from the hard disk can be relieved, the performance still has great potential to keep growing, whereas for a single-threaded CPU the computational power cannot guarantee that.
In the future, we will try to parallelize the most advanced sequence alignment algorithms on GPUs and keep investigating the GPU's capability in more applications of urgent concern to the medical and biological fields.

References

[1] Nicolas A. Davis, Ahwan Pandey, and B. A. McKinney. Real-world comparison of CPU and GPU implementations of SNPrank: a network analysis tool for GWAS. Bioinformatics, 27, 2011.
[2] W. Fiers, R. Contreras, and F. Duerinck. Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature, 260:500–507, 1976.
[3] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25, 2009.
[4] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[5] H. Li, J. Ruan, and R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18:1851–1858, 2008.
[6] R. Li et al. SOAP: short oligonucleotide alignment program. Bioinformatics, 24(5):713–714, 2008.
[7] R. Li et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, 2009.
[8] W. Min Jou, G. Haegeman, M. Ysebaert, and W. Fiers. Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein. Nature, 237:82–88, 1972.
[9] Nicolas A. Sinnott-Armstrong, Casey S. Greene, Fabio Cancare, and Jason H. Moore. Accelerating epistasis analysis in human genetics with consumer graphics hardware. Technical report, Dartmouth Medical School, NH, USA and Politecnico di Milano, Milano, Italy, 2009.
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationGPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27
1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution
More informationAccelerated Machine Learning Algorithms in Python
Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationA TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE
A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA
More informationSlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching
SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching Ilya Y. Zhbannikov 1, Samuel S. Hunter 1,2, Matthew L. Settles 1,2, and James
More informationHowdah. a flexible pipeline framework and applications to analyzing genomic data. Steven Lewis PhD
Howdah a flexible pipeline framework and applications to analyzing genomic data Steven Lewis PhD slewis@systemsbiology.org What is a Howdah? A howdah is a carrier for an elephant The idea is that multiple
More informationEfficient Computation of Radial Distribution Function on GPUs
Efficient Computation of Radial Distribution Function on GPUs Yi-Cheng Tu * and Anand Kumar Department of Computer Science and Engineering University of South Florida, Tampa, Florida 2 Overview Introduction
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More informationThe Optimal CPU and Interconnect for an HPC Cluster
5. LS-DYNA Anwenderforum, Ulm 2006 Cluster / High Performance Computing I The Optimal CPU and Interconnect for an HPC Cluster Andreas Koch Transtec AG, Tübingen, Deutschland F - I - 15 Cluster / High Performance
More informationParallelism. Parallel Hardware. Introduction to Computer Systems
Parallelism We have been discussing the abstractions and implementations that make up an individual computer system in considerable detail up to this point. Our model has been a largely sequential one,
More informationHigh-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs
High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea
More informationBlueDBM: An Appliance for Big Data Analytics*
BlueDBM: An Appliance for Big Data Analytics* Arvind *[ISCA, 2015] Sang-Woo Jun, Ming Liu, Sungjin Lee, Shuotao Xu, Arvind (MIT) and Jamey Hicks, John Ankcorn, Myron King(Quanta) BigData@CSAIL Annual Meeting
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationBuilding NVLink for Developers
Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized
More informationINTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA
INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA SEQUENCING AND MOORE S LAW Slide courtesy Illumina DRAM I/F
More informationAccelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture
Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Da Zhang Collaborators: Hao Wang, Kaixi Hou, Jing Zhang Advisor: Wu-chun Feng Evolution of Genome Sequencing1 In 20032: 1 human genome
More informationMathematical computations with GPUs
Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing
More informationGPU for HPC. October 2010
GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,
More informationDELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE
WHITEPAPER DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE A Detailed Review ABSTRACT While tape has been the dominant storage medium for data protection for decades because of its low cost, it is steadily
More informationOn the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters
1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk
More informationLam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM.
Title High throughput short read alignment via bi-directional BWT Author(s) Lam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM Citation The IEEE International Conference on Bioinformatics and Biomedicine
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationAccelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors
Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte
More informationAccelerating the Prediction of Protein Interactions
Accelerating the Prediction of Protein Interactions Alex Rodionov, Jonathan Rose, Elisabeth R.M. Tillier, Alexandr Bezginov October 21 21 Motivation The human genome is sequenced, but we don't know what
More informationArchitectures for Scalable Media Object Search
Architectures for Scalable Media Object Search Dennis Sng Deputy Director & Principal Scientist NVIDIA GPU Technology Workshop 10 July 2014 ROSE LAB OVERVIEW 2 Large Database of Media Objects Next- Generation
More informationIntroduction to Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015
Introduction to Read Alignment UCD Genome Center Bioinformatics Core Tuesday 15 September 2015 From reads to molecules Why align? Individual A Individual B ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGGTGTTATAATCGCCGCATGACATGATCAATGG
More informationSchool of Computer and Information Science
School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast
More informationParalization on GPU using CUDA An Introduction
Paralization on GPU using CUDA An Introduction Ehsan Nedaaee Oskoee 1 1 Department of Physics IASBS IPM Grid and HPC workshop IV, 2011 Outline 1 Introduction to GPU 2 Introduction to CUDA Graphics Processing
More informationOPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT
OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT Asif Ali Khan*, Laiq Hassan*, Salim Ullah* ABSTRACT: In bioinformatics, sequence alignment is a common and insistent task. Biologists align
More informationA Comprehensive Study on the Performance of Implicit LS-DYNA
12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four
More informationN-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo
N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational
More informationProcessing Technology of Massive Human Health Data Based on Hadoop
6th International Conference on Machinery, Materials, Environment, Biotechnology and Computer (MMEBC 2016) Processing Technology of Massive Human Health Data Based on Hadoop Miao Liu1, a, Junsheng Yu1,
More informationHalvade: scalable sequence analysis with MapReduce
Bioinformatics Advance Access published March 26, 2015 Halvade: scalable sequence analysis with MapReduce Dries Decap 1,5, Joke Reumers 2,5, Charlotte Herzeel 3,5, Pascal Costanza, 4,5 and Jan Fostier
More informationBetter Security Tool Designs: Brainpower, Massive Threading, and Languages
Better Security Tool Designs: Brainpower, Massive Threading, and Languages Golden G. Richard III Professor and University Research Professor Department of Computer Science University of New Orleans Founder
More informationShort Read Alignment Algorithms
Short Read Alignment Algorithms Raluca Gordân Department of Biostatistics and Bioinformatics Department of Computer Science Department of Molecular Genetics and Microbiology Center for Genomic and Computational
More informationComparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA
Comparative Analysis of Protein Alignment Algorithms in Parallel environment using BLAST versus Smith-Waterman Shadman Fahim shadmanbracu09@gmail.com Shehabul Hossain rudrozzal@gmail.com Gulshan Jubaed
More informationGPGPU introduction and network applications. PacketShaders, SSLShader
GPGPU introduction and network applications PacketShaders, SSLShader Agenda GPGPU Introduction Computer graphics background GPGPUs past, present and future PacketShader A GPU-Accelerated Software Router
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationJULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING
JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING Larson Hogstrom, Mukarram Tahir, Andres Hasfura Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 18.337/6.338
More information