Sequence Alignment Using Graphics Processing Units

Dzivi PS

This report is submitted as partial fulfilment of the requirements for the Honours Programme of the School of Computer Science and Software Engineering, The University of Western Australia, 2008

Abstract

In bioinformatics, global sequence alignment algorithms such as Needleman-Wunsch (NW) are time-consuming because they consider all possible matches between two sequences. Heuristic approaches such as BLAST and FASTA are often used instead for their faster processing, at the cost of less optimal alignments. Sequence alignment algorithms have been implemented on many hardware architectures and with different algorithmic approaches. The advent of CUDA-compatible graphics cards has given a new dimension to the implementation of general-purpose applications. We explore the efficiency of GPUs for dynamic programming problems by implementing the NW algorithm on the GeForce 8800 GTX. We focus on the scorefill step of the NW algorithm as it is considered the most computationally expensive part. Our implementation follows the wave-front algorithmic approach used in most past implementations of sequence alignment algorithms. The efficiency of the GPU for the NW algorithm was determined by comparing the execution times on GPU and CPU for varying lengths of sequences. Our first approach proved the suitability of GPUs for aligning long sequences; however, CPU processing time was faster for small sequence lengths. A slight modification of the approach resulted in speed improvements for all sequence lengths. A speedup of about 84 times was obtained for the maximum possible length of sequences. The maximum lengths of the sequences that could be aligned at once were found to depend on the device memory size. The results revealed that the efficiency of a GPU for a particular problem depends on the utilization of the allocated hardware and the amount of parallelism exhibited by the problem. Similarly, the efficiency depended on the hardware configuration used, such as the number of threads, blocks, and memory spaces. Our implementations confirm that problems based on dynamic programming can benefit from the use of graphics cards. The producer-consumer relationship can be managed via multiple kernel calls. The best efficiency was achieved by experimenting with varying GPU configurations.

Keywords: Sequence Alignment, Needleman-Wunsch Algorithm, GeForce 8800 GTX

CR Categories: D.1 [Programming Techniques], D.1.3 [Concurrent Programming]

Acknowledgements

Firstly, I would like to thank my research supervisor, Associate Professor Amitava Datta at the School of Computer Science and Software Engineering, for all his guidance and encouragement throughout the whole year in completing this research project. He was always there to give consultation, and was ever boosting my morale. This project would not have been completed without his guidance and support. I would like to acknowledge the efforts of my husband, Mr. Ashis Parajuli, for proofreading this thesis and correcting grammatical mistakes despite his busy schedule. He was always there to lift my spirits in achieving this goal. I would like to thank my family, as without their constant support throughout my entire education I would never have made it here. Finally, a special thanks to my colleagues in the Master of Computer Science for their help and support.

Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Overview of Sequence Alignment
  1.2 Needleman-Wunsch Sequence Alignment Algorithm
    1.2.1 Initialization Step
    1.2.2 Scorefill Step
    1.2.3 Traceback
  1.3 Thesis Outline
2 Review of the Literature
  2.1 Algorithmic Approach
    2.1.1 Wavefront Method
  2.2 Hardware Approaches
    2.2.1 Central Processing Units
    2.2.2 Single Instruction Multi Data Streams (SIMD) and other architectures
    2.2.3 Graphical Processing Units (GPUs)
  2.3 Implication to Thesis
3 Methodology
  3.1 CUDA
    3.1.1 Threads Model
    3.1.2 Memory Model
  3.2 GeForce 8800 GTX
  3.3 Our Proposed Algorithm
  3.4 Experimental Approach
    3.4.1 CPU Implementation
    3.4.2 GPU Implementation
4 Experimental Results and Analysis of First Approach
  4.1 Environmental Details
  4.2 Performance Analysis of CPU version and GPU version
  4.3 Analysis of GPU version by varying the configuration
  4.4 Discussion
5 Modification to the NW Algorithm
  5.1 Modified Approach
  5.2 Pseudocode
6 Analysis of the Modified NW Algorithm
  6.1 Performance of the new GPU Implementation
  6.2 Changing the Configuration
  6.3 Compare with Traceback Step
  6.4 Discussion
7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work
A Original Honours Proposal
B Timing Results for First Approach
C Timing Results for Modified NW Algorithm

List of Tables

2.1 Execution Times (s) and Speedups for 3 Sequences
2.2 GPU Implementations of SW Algorithm
6.1 New GPU vs CPU
6.2 Effect of changing number of threads
6.3 Maximum length of Reference Sequence to Corresponding Query Sequence Length
6.4 Comparison of CPU with GPU including Traceback step
A.1 Tasks along with Targeted Deadlines
B.1 Performance Analysis of CPU version
B.2 Performance Analysis of GPU version
B.3 Comparison of CPU and GPU
B.4 Results for GPU for using Different Memory
B.5 Timing Results of GPU for varying Number of Threads
C.1 Performance Analysis of GPU version using Global Memory
C.2 Performance Analysis of GPU version using Shared Memory
C.3 Comparison of new GPU and previous GPU versions

List of Figures

1.1 Initialization and Dependency
1.2 Traceback Step and Anti-Diagonal Parallelism
2.1 Similarity Matrix
3.1 CUDA Thread Model
3.2 CUDA Memory Model
3.3 GTX Hardware Model
3.4 Data dependencies in a Scorematrix
3.5 Scorematrix and Block
4.1 CPU performance
4.2 GPU performance
4.3 CPU vs GPU performance
4.4 Shared vs Global performance
4.5 Timing Results by Varying number of threads
5.1 Mapping of blocks
6.1 Timing for new GPU
6.2 Timing for new GPU compared to old GPU
6.3 Ratio of old GPU to new GPU
6.4 Effect of changing Memory
A.1 Matrix Fill Step for Sequences AAA and AAAC

CHAPTER 1

Introduction

Sequence alignment is a routine task in computational biology where two sequences of amino acids or DNA base pairs are compared, with the aid of computers, to find evolutionary relationships. A query sequence is aligned with a reference sequence to identify regions of similarity. Sequence similarity helps in the classification of genes and proteins along with the detection of mutation points. Similarly, it is used to predict biological function and secondary and tertiary protein structure, and to construct evolutionary trees [5]. It is the procedure of comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences. It is important to obtain the best possible or optimal alignment to discover the evolutionary information [13].

1.1 Overview of Sequence Alignment

There are two basic types of sequence alignment algorithms, local and global. Both local and global alignment algorithms are based on the dynamic programming approach. Smith-Waterman (SW) local alignment matches regions of two sequences for which they show a high degree of similarity. The Needleman-Wunsch (NW) algorithm, in contrast, is a global alignment algorithm which compares two sequences over their entire lengths. It seeks to match as many nucleotides or amino acids as possible between the sequences to give the optimal alignment. It is considered suitable for finding the best alignment of two sequences of similar length [9].

The February 2004 release of the GenBank molecular database contained 32,549,000 DNA sequences, composed of approximately 37,893,844,733 bases [15]. The GenBank statistics on DNA sequences demonstrate exponential growth rates [16]. As a result, the process of aligning a query sequence with each of the database sequences incurs a significant amount of time, i.e. a high computational cost [14].

Both algorithms are based on the idea of dynamic programming and have quadratic complexity which depends on the lengths of the sequences. Alignment of long sequences demands more computational power and space than a single computational resource can offer [8, 22]. Because of this cost, heuristic approaches like FASTA [6] and BLAST [1] tend to be preferred for their faster speed, at the cost of less optimal alignments. Both FASTA and BLAST are less sensitive than the SW algorithm, and may fail to report distant sequence relationships [11, 13]. The use of multiple processors and special-purpose architectures for high-performance solutions has always been a field of research in bioinformatics [21]. Different approaches to parallelizing sequence alignment algorithms on different architectures have resulted in interesting speed improvements. Most implementations of sequence alignment are based on the Smith-Waterman (SW) local alignment algorithm [11, 21]. For long sequences, the use of the NW algorithm on a single processor (CPU) becomes very time-consuming as it compares two sequences over their entire lengths [14]. In this dissertation, we aim to align long sequences using the NW algorithm on GPUs to observe its performance. The steps involved in the NW algorithm are discussed in Section 1.2.

CUDA (Compute Unified Device Architecture) compatible graphics cards such as the GeForce 8800 GTX give users both read and write access to device memory, and are considered suitable for general-purpose applications requiring heavy computation [17]. Previous cards, programmed using OpenGL, lacked the ability to randomly access any memory cell for writing. The suitability of graphics cards for sequence alignment depends on various factors such as the amount of parallel computation involved, the hardware configuration, and the available memory size [11, 20, 21]. Provided that the best hardware configuration is used and the algorithm involves parallel computation, an implementation on GPUs should be faster than on Central Processing Units (CPUs). For the experimental analysis, we implement the NW algorithm on both CPU and GPU. The execution times of these implementations are compared for varying query sequence and reference sequence lengths.

1.2 Needleman-Wunsch Sequence Alignment Algorithm

The NW algorithm [13] consists of three steps - initialization, scorefill, and traceback. The four nucleotides used in DNA sequencing are represented as A (Adenine), G (Guanine), C (Cytosine), and T (Thymine). The following example illustrates how the algorithm works.

Let the two sequences to be globally aligned be -

AGC (sequence 1, s, with length m = 3)
AAAC (sequence 2, t, with length n = 4)

If the symbol at position i of s is the same as the symbol at position j of t, +1 is given as a match score; otherwise, -1 is given as a mismatch score and -2 as a gap penalty, i.e. for aligning with a gap (-). Gaps are inserted in either of the sequences to match the maximum number of nucleotides between them. In bioinformatics, gaps represent points of mutation where individual bases in a DNA sequence get mutated or changed.

1.2.1 Initialization Step

The first step is to create a matrix (namely the Scorematrix) with m columns and n rows, where sequence s is written horizontally above the matrix and sequence t vertically beside it. It is assumed that there is always a possibility of starting an alignment with a gap. So, the scores in the top row are 0 (aligning a gap with a gap), -2 (aligning A with a single gap), -4 (aligning AG with two gaps) and -6 (aligning AGC with three gaps). Similarly, the scores in the first column are calculated as shown in Figure 1.1(a).

Figure 1.1: Initialization and Dependency. (a) Initialization and Calculation of M(1,1). (b) Calculation of M(1,2) and M(2,1)
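As a small illustration of this step (a sketch only; the array name score and the row-major (m+1) x (n+1) layout are illustrative, not taken from our implementation), the boundary row and column can be filled in C as -

    /* Row 0 and column 0 hold cumulative gap penalties (0, -2, -4, ...),
     * so that an alignment may begin with any number of gaps. The matrix
     * is (m+1) x (n+1), stored row-major with row width m+1. */
    for (int i = 0; i <= m; i++)
        score[i] = -2 * i;              /* top row */
    for (int j = 0; j <= n; j++)
        score[j * (m + 1)] = -2 * j;    /* first column */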

1.2.2 Scorefill Step

In this step, the maximum global alignment score M(i,j) for each position (i,j) in the Scorematrix is calculated. The starting point in the matrix is the upper left-hand corner. In order to find M(i,j) for any pair of i and j, the scores M(i-1,j), M(i,j-1) and M(i-1,j-1) must be known. For each position, M(i,j) is defined to be the maximum score attainable at position (i,j), i.e.

M(i,j) = maximum of:
    M(i-1,j) - 2     (aligning s[i] with a gap)
    M(i,j-1) - 2     (aligning t[j] with a gap)
    M(i-1,j-1) +/- 1 (aligning s[i] with t[j]: +1 on a match, -1 on a mismatch)

Using this information, the score at position (1,1) in the matrix can be calculated. For example, in the matrix shown in Figure 1.1(a),

M(1,1) = max{ M(0,1) - 2, M(1,0) - 2, M(0,0) + 1 } = max{ -4, -4, 1 } = 1

Further, the scores for the positions M(1,2) and M(2,1) are calculated as in Figure 1.1(b). The final score table is as shown in Figure 1.2(a).

Figure 1.2: Traceback Step and Anti-Diagonal Parallelism. (a) Complete Scorematrix. (b) Pattern of Parallelism
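The recurrence can be written directly in C. The following is a minimal sequential sketch (the names scorefill and max3 and the row-major layout are illustrative, not our implementation); it assumes the boundary row and column have already been initialized as above -

    #define MATCH     1
    #define MISMATCH -1
    #define GAP      -2

    static int max3(int a, int b, int c)
    {
        int m = a > b ? a : b;
        return m > c ? m : c;
    }

    /* s has length m (columns), t has length n (rows); score is the
     * (m+1) x (n+1) matrix of Section 1.2.1, stored row-major. */
    void scorefill(const char *s, int m, const char *t, int n, int *score)
    {
        int w = m + 1;                          /* row width */
        for (int j = 1; j <= n; j++)
            for (int i = 1; i <= m; i++) {
                int sub = (s[i - 1] == t[j - 1]) ? MATCH : MISMATCH;
                score[j * w + i] =
                    max3(score[(j - 1) * w + (i - 1)] + sub, /* M(i-1,j-1): s[i] with t[j]  */
                         score[j * w + (i - 1)] + GAP,       /* M(i-1,j):   s[i] with a gap */
                         score[(j - 1) * w + i] + GAP);      /* M(i,j-1):   t[j] with a gap */
            }
    }

For the example above, scorefill("AGC", 3, "AAAC", 4, score) reproduces the table of Figure 1.2(a); the bottom-right entry M(3,4) holds the optimal alignment score, -1 for this example.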

1.2.3 Traceback

The traceback step involves tracing back through the score table from M(m,n), following arrows that point to whichever of the three entries gave the optimal score for each position. The three entries can be M(i,j-1), M(i-1,j) and M(i-1,j-1). The traversal continues until we reach the upper left-most position of the table. Tracing back gives the optimal alignment of the two sequences. For the best alignment, the arrows are tracked from M(m,n); a gap is inserted in the reference sequence where there is a vertical arrow, and a gap is inserted in the query sequence where there is a horizontal arrow. In this example, there are three possible optimal alignments:

Sequence 1:  -AGC    A-GC    AG-C
Sequence 2:  AAAC    AAAC    AAAC

Similarly, the optimal sequence alignments can be calculated for any length of query sequence and reference sequence, for example:

Sequence 1: AGGT-ACGTACTACTACAATACATATAAATCCATATATACGTACGT
Sequence 2: AACTGACATAG-ACAATGCAAAGT-TCCGTACACAGTACAGTAACT

1.3 Thesis Outline

Chapter 1 provided the general background information and thesis outline. Chapter 2 presents a summary of previous related work, and we describe the methodology in Chapter 3. Chapter 4 covers the analysis of the different implementations of the NW algorithm on both CPU and GPU. In Chapter 5, we recommend some modifications to the NW algorithm to enhance the performance observed in Chapter 4. We evaluate the performance of the modified implementation in Chapter 6. Chapter 7 presents a summary of the thesis work, and suggests some future research directions.

CHAPTER 2

Review of the Literature

This chapter reviews previous research in sequence alignment on the basis of the methods undertaken, the hardware architecture, and the speed enhancement achieved.

2.1 Algorithmic Approach

The sequence alignment algorithms Needleman-Wunsch (NW) and Smith-Waterman (SW) [13] consist of three steps - Initialization, Matrix Fill, and Traceback. The steps in both algorithms are the same, with some differences in the scoring scheme. Computation of the score matrix on a Central Processing Unit (CPU) takes time and space complexity of O(mn), where m and n are the lengths of the query and reference sequences respectively. FASTA and BLAST share the same complexity but are 40 times faster than a CPU implementation of the SW algorithm as they follow heuristic approaches [7, 11]. The concept behind the computer implementation of these algorithms is to compare two sequences as two strings, since highly similar sequences share the same sub-strings [20].

2.1.1 Wavefront Method

The computation of a score in the Scorematrix depends on previous scores. Given the data dependencies presented by the matrix fill step, the scores can be calculated column by column, row by row, or anti-diagonal by anti-diagonal. The problem with the first two approaches is that most of the scores are dependent on other scores in the same row or column. Thus, these approaches limit the number of scores that can be calculated independently, making it hard to parallelize the algorithms. As illustrated in Figure 1.1(b) and Figure 2.1, the matrix fill step shows a pattern of parallelism where the scores in each anti-diagonal can be calculated simultaneously. This approach of calculating all the scores in each anti-diagonal at once for faster computation is termed the wave-front or wave-level method [7].
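In loop form, the traversal can be sketched as follows (a sketch only; cell() stands for the recurrence of Section 1.2.2 and the names are illustrative). Every cell on anti-diagonal d satisfies i + j = d and depends only on diagonals d-1 and d-2, so the inner loop carries no dependency and its iterations can run in parallel, one thread per cell -

    /* Wavefront order over an (m+1) x (n+1) matrix: d indexes the
     * anti-diagonal, and the j loop visits its cells, which are
     * mutually independent. */
    for (int d = 2; d <= m + n; d++) {
        int jlo = (d - m > 1) ? d - m : 1;   /* clip the diagonal */
        int jhi = (d - 1 < n) ? d - 1 : n;   /* to the matrix     */
        for (int j = jlo; j <= jhi; j++) {   /* parallelizable    */
            int i = d - j;
            score[j * (m + 1) + i] = cell(i, j);  /* the recurrence */
        }
    }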

Almost all implementations of sequence alignment are based on this method [3, 4, 7, 11, 14, 19, 20].

Figure 2.1: Similarity Matrix [12]

2.2 Hardware Approaches

This section discusses various hardware realizations of sequence alignment on Central Processing Units (CPUs), GPUs, and other architectures. The focus is on their performance based on the length of sequence, the speed gain, and the algorithmic approach undertaken.

2.2.1 Central Processing Units

In 2005, Boukerchea et al. [3] implemented a parallel version of the SW algorithm in C using the Distributed Shared Memory (DSM) software JIAJIA v.2.1 on a cluster of eight workstations connected via Ethernet. The calculation of scores was divided among the processors such that each of them computed a set of columns in a row and waited for the other processors to read the scores before proceeding again. Synchronisation among the processors was managed by means of locks and conditional variables in the JIAJIA system. Communication between processors was managed through the shared memory abstraction provided by DSM. The first attempt, without dividing the sequences into bands, resulted in some increase in speed. However, it proved to be insignificant for small sequences, as the overhead of communication time among processors dominated the processing time.

Dividing the sequences into bands and allocating them to individual processors produced significant results for all sequences, as shown in Table 2.1.

Table 2.1: Execution Times (s) and Speedups for 3 Sequences [3]

Table 2.1 illustrates that breaking the sequences into bands of different sizes affects the runtime. An increase in the number of processors implies a near-linear increase in speed. The dependency of the calculation of a block (input to a processor) on previous blocks (output from other processors) can be managed on parallel processors by means of synchronization. Overall performance can be improved by using shared memory and by minimizing the communication time among parallel processes.

Similarly, Naveed et al. [14] implemented the NW algorithm on a cluster of CPUs using the Alchemi framework for grid computing. The processing time was reduced from O(mn) to O(m+n), where m and n are the lengths of the query and reference sequences respectively. Their implementation faced the problem of increased network traffic and, for long sequences, a need for more processors was felt. Both of the above implementations prove that more processors can improve performance. However, the use of many workstations is an expensive approach.

2.2.2 Single Instruction Multi Data Streams (SIMD) and other architectures

In a search for a re-configurable solution, the SW algorithm was executed on Field Programmable Gate Arrays (FPGAs) by replacing the scorefill portion with a custom FPGA circuit [7]. The processing run time was improved by 287 percent. However, the configuration has to be changed for each algorithm, which was considered more complicated than writing new code. SIMD implementations of the SW algorithm on the Micro-Grained Array Processor (MGAP) and the Kestrel parallel processor provided some speed gain [7]. SIMD machines are multiple processing units which operate on multiple sets of operands in the same instruction cycle. However, it is hard to compare the results as they are not run against any baseline, and the lengths of the sequences differ between cases.

The architectures discussed above are accelerators for computationally intensive problems like sequence alignment, but they are expensive and hard to upgrade [21].

2.2.3 Graphical Processing Units (GPUs)

GPUs are based on the SIMD architecture and are basically used for graphics-related applications. GPUs are commodity components compared to most of the hardware mentioned above. Being low-cost hardware capable of high throughput, they are considered a good match for bioinformatics applications involving many calculations [7, 11, 20]. Table 2.2 lists different GPU implementations of the SW algorithm. The wavefront method was followed in all of these realizations, and the performance is measured against popular approaches in practice, such as OSEARCH [18], FASTA [6], and BLAST [1].

Table 2.2: GPU Implementations of SW Algorithm

Graphics Card                 Sequence Length  Language  Speed wrt  Ratio          Reference
GeForce 6800 GTO              64 to 4096       Cg        OSEARCH    2.8 to 7.46    Liu et al. [10]
GeForce 6800 GTO              64 to 4096       Cg        FASTA      1.5 to 4.55    Liu et al. [10]
GeForce 7800 GTX              64 to 4096       Cg        OSEARCH    4.66 to 9.45   Liu et al. [10]
GeForce 7800 GTX              64 to 4096       Cg        FASTA      2.49 to 9.42   Liu et al. [10]
GeForce 6800 GTO              64 to 2048       GLSL      CPU        2.1 to 4.0     Voss et al. [21]
Radeon X850XT                 64 to 2048       GLSL      CPU        2.3 to 4.4     Voss et al. [21]
GeForce 8800 GTX (one card)   64 to 362        CUDA      FASTA                     Manavski et al. [11]
GeForce 8800 GTX (one card)   64 to 362        CUDA      BLAST      1.24 to 0.93   Manavski et al. [11]
GeForce 8800 GTX (two cards)  64 to 362        CUDA      FASTA                     Manavski et al. [11]
GeForce 8800 GTX (two cards)  64 to 362        CUDA      BLAST      2.39 to 1.79   Manavski et al. [11]

Implementations on Previous Versions of Graphics Cards

The graphics cards GeForce 6800 GTO, GeForce 7800 GTX, and Radeon X850XT required the reformulation of the algorithm in terms of graphics primitives. For this purpose, Liu et al. [10] and Voss et al. [21] made use of the graphics programming languages Cg and the OpenGL Shading Language (GLSL) respectively. Both GPU computations had to be programmed in terms of the graphics pipeline with three steps - vertex processing, rasterization, and fragment processing. The sequences were stored in texture cache memory to minimize read access times, as threads take longer to access global memory than cached texture memory [17].

The maximum sequence length that could be aligned at once was limited by the texture buffer size. From the first two graphics card implementations, we can observe that the memory size, the GPU clock speed, and the number of vertex and fragment processors influence the performance. Further, the results from Voss et al. [21] showed that the optimum performance was obtained when the ratio of query length to reference length equalled one. This shows that, for efficient GPU computation, the implemented algorithm should be able to maximise the utilisation of the available or allocated hardware configuration. Although reformulation of the algorithms in graphical terms is a tedious task, the results show that programming GPU cards makes a better alternative compared to other SIMD systems and high-cost supercomputers.

Implementations on CUDA Compatible Graphics Cards

Sequence alignment is a process of matching common sub-strings between two strings. Schatz et al. [19] implemented a string matching algorithm, Cmatch, on the G80; it was based on a seed-and-extend algorithm. Suffix trees of each reference sequence were created, and the kernel code matched the query sequence against each of the trees. The Compute Unified Device Architecture (CUDA) language was used to program the GeForce 8800 GTX card for this purpose. The language eliminates the need to restate algorithms in graphics primitives and facilitates coding for general-purpose applications [17]. As in the previous GPU versions, texture memory was used for reads. The performance for smaller queries was improved by 34 times compared to the CPU version. However, a comparatively small gain of about 2 times was achieved for longer reads. Cmatch, when utilized for sequence alignment, boosted speed by ten times compared to its CPU counterpart [20]. The authors [19, 20] explained that the use of conditional statements in kernel code causes threads to follow different execution paths. This divergence makes the processors serialize the instructions, and each thread has to run more than one version of the code, thus increasing the run time. Similarly, they pointed out that the lack of parallel computation in suffix trees could not fully utilise the available hardware.

Manavski et al. [11] explored the use of two GeForce 8800 GTX cards for the SW algorithm, using the substitution matrix BLOSUM-50 [2]. For optimal performance, the best hardware configuration was found to be 64 threads per block and 450 blocks per grid. As in the previous graphics card versions [7, 19-21], they made use of texture memory for the storage of query sequences, and shared memory to overcome the overhead of read access times.

The reference sequences were pre-ordered by length to localise thread reads, as in Cmatch. The results showed that using two graphics cards doubled the overall performance. Compared to Liu's version, the performance enhancement is significant for small sequences. However, the possibility of longer sequences is not considered. As sequence length increases, the performance ratio compared to BLAST decreases.

2.3 Implication to Thesis

All the GPU implementations discussed above, except for Cmatch, make use of the wave-front method in their algorithms. The matrix fill step, which involves many calculations, has been executed in parallel in an attempt to reduce computation time. In all cases, the results show that the use of more processors improves performance. Similarly, the use of shared memory and texture memory to minimise read access times and processor communication times is common. However, limited research has been conducted so far on refining the wave-front method, and additional research seems likely to benefit parallel architecture implementations. At present, the G80 presents a good platform for general-purpose applications requiring intensive calculation. Algorithms need to be optimized to fully utilise the available hardware configuration. The algorithmic approach followed on graphics cards until now can be considered equivalent to a CPU version which does not use blocking. The blocked method used by Boukerchea et al. [3] on CPUs is a worthy alternative to experiment with on the G80 architecture, considering its speedup and suitability for all sequence lengths. At the same time, migration of the method from CPU to GPU requires adjusting the algorithm to the GPU configuration. The new implementation should take into account programming issues, such as minimal use of conditions in kernel code and minimisation of thread divergence, and hardware issues, such as the maximum number of threads and the on-chip memory size. Similarly, the suitability of the method for different sequence lengths remains to be explored.

CHAPTER 3

Methodology

This chapter briefly describes the architecture of the GeForce 8800 GTX graphics card, and explains our proposed approach for sequence alignment on it.

3.1 CUDA

CUDA (Compute Unified Device Architecture) is an extension of the C programming language and an API used for programming general-purpose applications on the GeForce 8 series, Quadro FX 5600/4600, and Tesla products [4]. Access to the GPU hardware is managed by the operating system, which runs CUDA and graphics applications concurrently. CUDA eliminates the need to learn domain-specific languages to program the GPU, resulting in a minimal learning curve.

3.1.1 Threads Model

In CUDA, the GPU is treated as a compute device capable of executing a high number of threads concurrently. A data-independent and compute-intensive portion of an application which has to be executed many times on the CPU (host) can be isolated as a function. The function, also known as a kernel, is compiled to the instruction set of the device and offloaded to the device (GPU), where it is executed by many threads. Data transfer between host memory and device memory takes place via the device's DMA (Direct Memory Access) engines. The batch of threads that executes a kernel is organized as a grid of blocks of threads. Threads in each block can communicate efficiently by using shared memory, and can synchronize their execution to coordinate memory accesses. Each thread is identified by a unique id (threadId), which is its thread number within the block.

Each block can be specified as a two- or three-dimensional array of threads, with each thread having a two- or three-component index. The maximum number of threads in a block is limited to 512. Similarly, all the thread blocks executing the same kernel are grouped into a grid of thread blocks. Each block is identified by its blockId, i.e. its block number within the grid. Blocks may be indexed in a one-dimensional or two-dimensional array. The thread model of CUDA is shown in Figure 3.1. Each thread block is assigned in its entirety to one streaming multiprocessor (SM) and runs as a unit to completion without pre-emption. All threads in the thread block execute the kernel in a fine-grained, time-sliced manner. The resources of the block cannot be reclaimed until the complete block finishes execution. Threads in different blocks within the same grid cannot communicate or synchronize with each other during execution. A device may run all the blocks in a grid in parallel if it has high parallel capability, sequentially if it has low parallel capability, or as a combination of both. If the number of thread blocks significantly exceeds the available hardware, waiting thread blocks are assigned to SMs as previous ones finish executing.

Figure 3.1: CUDA Thread Model [17]
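As an illustration of this model (a generic sketch, not our implementation; the kernel name and parameters are hypothetical), a grid of two-dimensional blocks is launched and each thread derives its global position from blockIdx, blockDim, and threadIdx -

    __global__ void fill(int *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  /* global column */
        int y = blockIdx.y * blockDim.y + threadIdx.y;  /* global row    */
        if (x < width && y < height)       /* guard threads past the edge */
            out[y * width + x] = x + y;    /* placeholder per-element work */
    }

    /* Host side: 16 x 16 threads per block, one thread per element. */
    dim3 threads(16, 16);
    dim3 blocks((width + 15) / 16, (height + 15) / 16);
    fill<<<blocks, threads>>>(d_out, width, height);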

3.1.2 Memory Model

Each thread executing on the device can only access the device's DRAM and the on-chip memory through a set of memory spaces. The memory model is shown in Figure 3.2.

Figure 3.2: CUDA Memory Model [17]

Data can be shared between thread blocks using global memory. However, the cost of accessing global memory is high. Each thread block also has per-block shared memory (PBSM), which allows the threads in a thread block to communicate efficiently with low latency. PBSM is implemented using SRAM, and its size for each block is limited to 16 KB. The local and global memory spaces are not cached, so each access to global or local memory generates an explicit memory transaction. A multiprocessor takes four clock cycles to issue one memory instruction, and incurs an extra 400 to 600 clock cycles of memory latency to access global memory. Therefore, CUDA developers prefer to use shared memory to minimize global memory accesses and to utilize the data within the local multiprocessor memories.

The read and write capabilities of the local multiprocessor memory spaces are summarised as follows -

1. Registers
The fastest form of memory. Can only be accessed by a thread. Has the lifetime of the thread.

2. Shared Memory
As fast as a register, depending on whether there are bank conflicts. Can only be accessed by the threads in the block from which it was created. Has the lifetime of the block.

3. Global Memory
About 150 times slower than shared memory or registers. Can be accessed from both the device and the host. Has the lifetime of the application.

4. Local Memory
A part of global memory, and slower than shared memory or registers. Can only be accessed by the owning thread. Has the lifetime of the thread.

Synchronization within each thread block is managed by the hardware itself. However, synchronization among blocks requires the completion of one kernel and the launch of a new kernel. The order in which blocks are executed is non-deterministic, so an appropriate global barrier mechanism should be implemented where blocks have a producer-consumer relationship, to prevent deadlock; a sketch of such a barrier follows.
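The sketch below uses illustrative names (process_step, numSteps, d_data are not part of our implementation). Each kernel launch begins only after all blocks of the previous launch have retired, so a sequence of launches acts as a chain of global barriers -

    /* Host-side global barrier: step k may safely consume, from global
     * memory, the results produced by all blocks in step k-1. */
    for (int step = 0; step < numSteps; step++)
        process_step<<<blocks, threads>>>(d_data, step);
    /* Launches on the same stream execute in order; synchronize once
     * at the end, before copying results back to the host. */
    cudaThreadSynchronize();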

3.2 GeForce 8800 GTX

The G80 architecture is designed to support the implementation of both graphics and general-purpose applications. The device is a set of 16 streaming multiprocessors (SMs) with a SIMD (Single Instruction Multiple Data) architecture. Figure 3.3 shows the hardware model of the device.

Figure 3.3: GTX Hardware Model [17]

Each SM has 8 processing elements called stream processors. The threads in each thread block are time-sliced onto these processing elements in groups of 32 threads known as warps. Each warp is executed by the multiprocessor in a SIMD fashion, and a thread scheduler periodically switches from one warp to another to utilize the available computational resources. Hardware masking is used to handle divergent threads. If threads in the same warp diverge due to conditional statements, only the threads following the same path can execute concurrently. If all 32 threads in a warp diverge without reconverging, then each of the 32 threads executes sequentially, resulting in 32 times more computational time. So, optimizing the algorithm to minimize SIMD divergence certainly benefits performance; the sketch below contrasts a divergent branch with a uniform one.
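The following hypothetical kernel is for illustration only (not our implementation) -

    __global__ void branches(int *out)
    {
        int tid = threadIdx.x;
        /* Divergent: threads of the same warp take different paths,
         * so the two sides are executed one after the other. */
        if (tid % 2 == 0)
            out[tid] = 2 * tid;
        else
            out[tid] = 3 * tid;
        /* Uniform: the condition is identical for every thread in the
         * block, so the warp stays converged and nothing is serialized. */
        if (blockIdx.x == 0)
            out[tid] += 1;
    }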

As soon as a kernel is launched, the device driver notifies the GPU's work distributor of the kernel's grid configuration. Each thread block is then assigned to an SM when sufficient thread and shared memory resources are available. The pattern in which the hardware scheduler allocates a block is random, and the hardware controller of the assigned SM initializes the state of all the threads in the block. The architecture does not provide large hardware caches shared among multiple cores as in modern CPUs. Variables that do not fit in a thread's registers are spilled to global memory. Each SM has two small private data caches, a texture cache and a constant cache, which hold only read-only data. Data must be explicitly allocated to the constant and texture memory spaces. Shared memory of 16 KB is available to each block. It has both read and write access, which facilitates fast thread communication. The texture cache allows arbitrary access patterns, and is useful for coalesced access patterns with random offsets. The constant cache is optimized for broadcasting values to all the processing elements in an SM; however, its performance degrades if threads in the same block request multiple addresses in the same cycle. The card requires pre-allocation of hardware in terms of the number of threads and blocks to be utilized for the computation. Similarly, all input and output data structures must be allocated and transferred to device memory before any operation can be carried out on them.

3.3 Our Proposed Algorithm

As discussed in previous chapters, the Needleman-Wunsch algorithm is based on dynamic programming and consists of three steps - initialization, scorefill, and traceback. In this dissertation, we concentrate on the scorefill step as it is the most computationally expensive part [10-12, 21]. The new approach needs to handle the dependencies presented by the score calculation, and perform the operations as independently as possible. The scores in the scorefill matrix can be calculated row by row, column by column, or anti-diagonal by anti-diagonal. The first two approaches limit the number of scores that can be calculated at once to only one, as the scores in any column or row depend on other scores in the same row or column. However, the elements in the same anti-diagonal are independent of each other, and depend only on the values in the previous two anti-diagonals, as shown in Figure 3.4. The number of scores that can be calculated in one step keeps increasing until the main anti-diagonal of the matrix is reached, and then starts decreasing again. Although the anti-diagonals have uneven sizes, we implement the wave-front method for the NW algorithm with some novel modifications.

Figure 3.4: Data dependencies in a Scorematrix

In our approach, we divide the complete Scorematrix into square blocks as shown in Figure 3.5. Each small block inside the Scorematrix can be viewed as equivalent to the block in Figure 3.4, which has dependencies among its elements. Further, we can observe a similar pattern of dependency among the small blocks inside the bigger block (the Scorematrix). As with the elements inside each block, we exploit the anti-diagonal pattern among the blocks. In the first step, only the elements of Block 1 can be calculated. The calculation of the elements in Block 2 and Block 5 depends on the availability of the boundary scores of Block 1. Block 6 also needs to read the last element of Block 1 for the calculation of the first element in the block. The complete Scorematrix thus presents two levels of parallelism, element level and block level. Our GPU implementation, discussed in Section 3.4.2, implements the element-level parallelism using threads (via threadIds), and the block-level parallelism using blocks (via blockIds). If each block has q rows and r columns, then the computation of a single block requires q+r+1 steps. For all elements in an anti-diagonal, the sum of the row index and the column index is constant.

Figure 3.5: Scorematrix and Block

For blocks of size 16 by 16, the computation of 256 elements requires 33 parallel steps. Each block takes 33 boundary elements as input and computes 256 elements, so the communication-to-computation ratio becomes 33:256. All the blocks in the Scorematrix exhibit the same anti-diagonal pattern of parallelism as the elements inside each block.

3.4 Experimental Approach

The NW algorithm is implemented on both GPU and CPU. The implementations are evaluated on the basis of the processing time taken up to the end of the scorefill step, for different lengths of reference sequence and query sequence. On the CPU, only the initialization and the scorefill steps are timed. On the GPU, the processing time includes the time taken to transfer the data structures to and from device memory along with the time to execute the kernels. In all the implementations, the inputs (reference and query sequences) and outputs (final aligned sequences) are read from and written to files. The processing time is taken as the average of running the same program ten times.

3.4.1 CPU Implementation

The CPU implementation is sequential: only one score of the matrix is calculated at a time. It takes qlength x dlength iterations to calculate the final Scorematrix, where qlength and dlength are the lengths of the query sequence and the reference sequence respectively.

3.4.2 GPU Implementation

Both the initialization and the scorefill steps are implemented on the GPU. Two kernels, one for each step, are launched. In the initialization step, the values of the boundary elements and the match or mismatch scores for all the remaining elements of the Scorematrix are computed in one execution of the initialization kernel. For the scorefill step, we adopt the approach discussed above. Each thread is allocated to compute one element of the Scorematrix. The kernel code implements the anti-diagonal parallelism inside each block. However, the anti-diagonal parallelism among the blocks has to be implemented from the host function itself. The latter involves providing an appropriate global-barrier synchronization mechanism for the blocks.

The synchronization is necessary because there is a producer-consumer relationship among the blocks, and the end results of the producer blocks need to be communicated to the consumer blocks before they can start doing any useful computation. Similarly, the threads in each block need to be synchronized, as there is a dependency among the anti-diagonals; this is managed using the GPU function syncthreads. The effect of using global memory versus shared memory for thread communication, and of changing the number of threads and blocks, is analysed.

Pseudocode

Let the two sequences to be aligned be Q[qlength] and R[dlength], where qlength and dlength are the lengths of the query and reference sequences respectively. The Scorematrix is denoted by Score[qlength][dlength], and the x- and y-dimensions of each block are denoted by LEN. The number of blocks (numblocks) in the largest anti-diagonal is given by length/LEN, where length is the maximum of qlength and dlength. Appropriate padding is used for the cases where qlength and dlength are not multiples of LEN.

CPU Part

1. Allocate and transfer Q, R, and Score to device memory.
2. Define the threads and number of blocks for the initialization kernel.
   dim3 threads(LEN, LEN)
   dim3 blocks(qlength/LEN, dlength/LEN)
3. Call the initialization kernel (GPU Part).
   initialize<<<blocks, threads>>>(Q, R, Score, qlength, dlength)
4. For loop1 from 0 to 2 x (numblocks - 1), call the Scorematrix kernel (GPU Part).
   calculate<<<blocks, threads>>>(Q, R, Score, qlength, dlength, loop1)
   Note: the same hardware configuration as for the initialization kernel is used.

5. Transfer Score from device to host memory.
6. Traceback Step.

GPU Part

Let tx and ty be the x-index and y-index of a thread inside its block. Similarly, let bx and by denote the x-index and y-index of a block in the grid. Let txgrid and tygrid be the variables giving the x and y positions of a thread in the grid, calculated as -

txgrid = LEN x bx + tx
tygrid = LEN x by + ty

Based on these variables, the position in Score of the element the thread is allocated to calculate is

idx = txgrid + tygrid x qlength

1. Initialization kernel
   (a) If txgrid < qlength and tygrid < dlength, do
       If idx < qlength, Score[idx] = txgrid x (-2)
       Else if idx is divisible by qlength, Score[idx] = tygrid x (-2)
       Else if Q[txgrid] matches R[tygrid], Score[idx] = match score
       Else Score[idx] = mismatch score

2. Scorematrix kernel
   (a) If txgrid < qlength and tygrid < dlength, do
       For loop2 from 0 to 2 x (LEN - 1), do
       i. Calculate the sum of the block indices as bsum = bx + by
       ii. If bsum equals loop1, do
           Calculate the sum of the thread indices as tsum = tx + ty
           If txgrid > 0, tygrid > 0, and tsum equals loop2, do
               Data1 = Score[idx - qlength - 1] + Score[idx]
               Data2 = Score[idx - qlength] - 2
               Data3 = Score[idx - 1] - 2
               Score[idx] = maximum of (Data1, Data2, Data3)
       iii. Synchronize threads
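In CUDA, the global-memory Scorematrix kernel above might look as follows. This is a sketch under the stated assumptions (LEN x LEN blocks, row stride qlength, and the match/mismatch scores pre-stored in Score by the initialization kernel), not our exact code; the unused Q and R parameters are omitted -

    #define LEN 16

    __global__ void calculate(int *Score, int qlength, int dlength, int loop1)
    {
        int tx = threadIdx.x, ty = threadIdx.y;
        int txgrid = blockIdx.x * LEN + tx;
        int tygrid = blockIdx.y * LEN + ty;
        int idx = txgrid + tygrid * qlength;   /* element this thread owns */

        for (int loop2 = 0; loop2 <= 2 * (LEN - 1); loop2++) {
            /* Only the blocks on block anti-diagonal loop1, and within
             * them the threads on element anti-diagonal loop2, do work. */
            if (blockIdx.x + blockIdx.y == loop1 &&
                txgrid > 0 && tygrid > 0 &&
                txgrid < qlength && tygrid < dlength &&
                tx + ty == loop2) {
                int data1 = Score[idx - qlength - 1] + Score[idx]; /* diagonal + stored match score */
                int data2 = Score[idx - qlength] - 2;              /* gap in one sequence  */
                int data3 = Score[idx - 1] - 2;                    /* gap in the other     */
                Score[idx] = max(data1, max(data2, data3));
            }
            __syncthreads();   /* reached by every thread in the block */
        }
    }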

In the above Scorematrix kernel, global memory is used for thread communication: threads read and write all scores from and to global memory, and the same score is read three times while calculating the complete Scorematrix. Shared memory has both read and write access, and it should take less time to access shared memory than global memory, depending on the nature of the problem. Shared memory gives us the opportunity to read and store only the required boundary scores from the previous blocks, so only the boundary elements need to be accessed from global memory, and only once. The use of shared memory should therefore improve the performance of the implementation compared to global memory. The Scorematrix kernel is modified as below to use shared memory.

3. Scorematrix kernel (using shared memory)
   (a) Allocate shared memory for the block as Scoreshared[LEN x LEN]
   (b) If txgrid < qlength and tygrid < dlength, do
       If bsum equals loop1, do
           For loop2 from 0 to 2 x (LEN - 1), do
               Calculate the sum of the thread indices as tsum = tx + ty
               Calculate the shared index as sharedidx = tx + ty x LEN
               i. If txgrid > 0, tygrid > 0, and tsum equals loop2, do
                   If tx equals 0 or ty equals 0, do
                       Data1 = Score[idx - qlength - 1] + Score[idx]
                       Data2 = Score[idx - qlength] - 2
                       Data3 = Score[idx - 1] - 2
                   Else
                       Data1 = Scoreshared[sharedidx - LEN - 1] + Score[idx]
                       Data2 = Scoreshared[sharedidx - LEN] - 2
                       Data3 = Scoreshared[sharedidx - 1] - 2
               ii. Scoreshared[sharedidx] = maximum of (Data1, Data2, Data3)
               iii. Score[idx] = Scoreshared[sharedidx]
   (c) Synchronize threads
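Under the same conventions as the previous sketch (again illustrative, not our exact code), the shared-memory variant can be written as a complete kernel: each block stages its LEN x LEN tile in on-chip memory, and only the elements on the block's top row or left column fall back to global memory for their neighbours -

    __global__ void calculate_shared(int *Score, int qlength, int dlength, int loop1)
    {
        __shared__ int tile[LEN * LEN];
        int tx = threadIdx.x, ty = threadIdx.y;
        int txgrid = blockIdx.x * LEN + tx;
        int tygrid = blockIdx.y * LEN + ty;
        int idx  = txgrid + tygrid * qlength;
        int sidx = tx + ty * LEN;              /* position inside the tile */

        for (int loop2 = 0; loop2 <= 2 * (LEN - 1); loop2++) {
            if (blockIdx.x + blockIdx.y == loop1 &&
                txgrid > 0 && tygrid > 0 &&
                txgrid < qlength && tygrid < dlength &&
                tx + ty == loop2) {
                /* Boundary elements read their neighbours from global
                 * memory; interior elements read them from the tile. */
                int up   = (ty == 0) ? Score[idx - qlength] : tile[sidx - LEN];
                int left = (tx == 0) ? Score[idx - 1]       : tile[sidx - 1];
                int diag = (tx == 0 || ty == 0) ? Score[idx - qlength - 1]
                                                : tile[sidx - LEN - 1];
                tile[sidx] = max(diag + Score[idx], max(up - 2, left - 2));
                Score[idx] = tile[sidx];  /* publish for consumer blocks */
            }
            __syncthreads();
        }
    }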

4. Traceback Step

The traceback step does not exhibit any pattern of parallelism that can be exploited on the GPU. The straightforward implementation of the traceback step on the GPU is the same as on the CPU; one block with one thread is sufficient for this step. In Chapter 6, we analyse the effect of using only one thread and one block, as this hardware allocation is unlikely to benefit from the available hardware configuration.

Let Q1 and R1 be the final aligned sequences, and let q1index and r1index be the variables that track the lengths of the final sequences.

(a) Initialize q1index and r1index to 0; initialize i to qlength - 1 and j to dlength - 1.
(b) While i and j are greater than 0, do
    i. Calculate the match score for the position.
    ii. If Score[idx] equals Score[idx - qlength - 1] + match score, then
        Copy Q[i] to Q1[q1index], and R[j] to R1[r1index].
        Increment q1index and r1index by 1. Decrement i and j by 1.
    iii. Else if Score[idx] equals Score[idx - qlength] + gap penalty, then
        Copy Q[i] to Q1[q1index], and insert a gap (-) in R1[r1index].
        Increment q1index and r1index by 1. Decrement i by 1.
    iv. Else if Score[idx] equals Score[idx - 1] + gap penalty, then
        Insert a gap (-) in Q1[q1index], and copy R[j] to R1[r1index].
        Increment q1index and r1index by 1. Decrement j by 1.
(c) While i is greater than 0,
    Copy Q[i] to Q1[q1index], and insert a gap (-) in R1[r1index].
    Increment q1index and r1index by 1. Decrement i by 1.
(d) While j is greater than 0,
    Copy R[j] to R1[r1index], and insert a gap (-) in Q1[q1index].
    Increment q1index and r1index by 1. Decrement j by 1.
(e) Reverse both strings Q1 and R1.

CHAPTER 4

Experimental Results and Analysis of First Approach

In this chapter, we analyse the results obtained by running both the CPU and the GPU versions, and the effect of varying the hardware configuration of the GPU. The timing results used to plot the graphs in this chapter are presented in Appendix B.

4.1 Environmental Details

The CPU version is executed on an Intel(R) Core(TM)2 Quad CPU 2.40 GHz processor running Fedora Core 6, and the C programming language (gcc compiler) is used. The GPU implementation of the algorithm is executed on the GeForce 8800 GTX, and the programming language used is CUDA in the Linux environment.

4.2 Performance Analysis of CPU version and GPU version

The CPU version is a straightforward implementation of the NW algorithm where only one element of the Scorematrix is calculated in each iteration. Since we are interested in long sequences, only sequence lengths greater than 1000 are considered, and the processing time is measured in milliseconds (ms). For the GPU, we used 256 threads per block, and the number of blocks allocated was (qlength/16) x (dlength/16), where qlength and dlength are the lengths of the query and reference sequences respectively. Both the threads and the blocks were indexed using two-dimensional arrays. From the graphs in Figure 4.1 and Figure 4.2, we can observe that the time for sequence alignment increases as we increase the lengths of the query sequence and the reference sequence, i.e. as the number of elements in the Scorematrix increases.

The graphs show the time taken for aligning a query sequence with varying lengths of reference sequence on the CPU and GPU respectively.

Figure 4.1: CPU performance

In Figure 4.3, the graph illustrates the processing time taken by the CPU and the GPU for aligning a query sequence and a reference sequence of equal length. For sequence lengths less than 9600, the GPU takes a longer processing time, up to twice the CPU time. However, for lengths of approximately 9600 or greater, the GPU performs up to almost nine times faster. The maximum length of sequences that could be aligned at once was limited by the device memory size.

4.3 Analysis of GPU version by varying the configuration

Figure 4.4 illustrates the effect of using shared memory and global memory to read the score values. With shared memory, a new consumer block needs to read only the boundary values of the previous producer blocks, whereas with global memory, the threads have to read all the elements from global memory. From the timings, we observed that global memory outperforms shared memory in the current configuration (256 threads and (qlength/16) x (dlength/16) blocks).

Figure 4.2: GPU performance

Figure 4.3: CPU vs GPU performance

Figure 4.4: Shared vs Global performance

Figure 4.5: Timing Results by Varying number of threads


More information

QR Decomposition on GPUs

QR Decomposition on GPUs QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

GPU-Supercomputer Acceleration of Pattern Matching

GPU-Supercomputer Acceleration of Pattern Matching CHAPTER GPU-Supercomputer Acceleration of Pattern Matching 13 Ali Khajeh-Saeed, J. Blair Perot This chapter describes the solution of a single very large pattern-matching search using a supercomputing

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

MD-CUDA. Presented by Wes Toland Syed Nabeel

MD-CUDA. Presented by Wes Toland Syed Nabeel MD-CUDA Presented by Wes Toland Syed Nabeel 1 Outline Objectives Project Organization CPU GPU GPGPU CUDA N-body problem MD on CUDA Evaluation Future Work 2 Objectives Understand molecular dynamics (MD)

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads

A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads Sardar Anisul Haque Marc Moreno Maza Ning Xie University of Western Ontario, Canada IBM CASCON, November 4, 2014 ardar

More information

Accelerating CFD with Graphics Hardware

Accelerating CFD with Graphics Hardware Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Sequence analysis Pairwise sequence alignment

Sequence analysis Pairwise sequence alignment UMF11 Introduction to bioinformatics, 25 Sequence analysis Pairwise sequence alignment 1. Sequence alignment Lecturer: Marina lexandersson 12 September, 25 here are two types of sequence alignments, global

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Auto-tunable GPU BLAS

Auto-tunable GPU BLAS Auto-tunable GPU BLAS Jarle Erdal Steinsland Master of Science in Computer Science Submission date: June 2011 Supervisor: Anne Cathrine Elster, IDI Norwegian University of Science and Technology Department

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

Harnessing Associative Computing for Sequence Alignment with Parallel Accelerators

Harnessing Associative Computing for Sequence Alignment with Parallel Accelerators Harnessing Associative Computing for Sequence Alignment with Parallel Accelerators Shannon I. Steinfadt Doctoral Research Showcase III Room 17 A / B 4:00-4:15 International Conference for High Performance

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs. Greg Blanton John T. Lubia GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

More information

CUDA (Compute Unified Device Architecture)

CUDA (Compute Unified Device Architecture) CUDA (Compute Unified Device Architecture) Mike Bailey History of GPU Performance vs. CPU Performance GFLOPS Source: NVIDIA G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX G70 = GeForce 7800 GTX NV40 = GeForce

More information

Overview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size

Overview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size Overview Videos are everywhere But can take up large amounts of resources Disk space Memory Network bandwidth Exploit redundancy to reduce file size Spatial Temporal General lossless compression Huffman

More information

Notes on Dynamic-Programming Sequence Alignment

Notes on Dynamic-Programming Sequence Alignment Notes on Dynamic-Programming Sequence Alignment Introduction. Following its introduction by Needleman and Wunsch (1970), dynamic programming has become the method of choice for rigorous alignment of DNA

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information

Hardware Accelerator for Biological Sequence Alignment using Coreworks Processing Engine

Hardware Accelerator for Biological Sequence Alignment using Coreworks Processing Engine Hardware Accelerator for Biological Sequence Alignment using Coreworks Processing Engine José Cabrita, Gilberto Rodrigues, Paulo Flores INESC-ID / IST, Technical University of Lisbon jpmcabrita@gmail.com,

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Performance Analysis of Parallelized Bioinformatics Applications

Performance Analysis of Parallelized Bioinformatics Applications Asian Journal of Computer Science and Technology ISSN: 2249-0701 Vol.7 No.2, 2018, pp. 70-74 The Research Publication, www.trp.org.in Dhruv Chander Pant 1 and OP Gupta 2 1 Research Scholar, I. K. Gujral

More information

GRAPHICS PROCESSING UNITS

GRAPHICS PROCESSING UNITS GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

More information

Sequence Alignment. part 2

Sequence Alignment. part 2 Sequence Alignment part 2 Dynamic programming with more realistic scoring scheme Using the same initial sequences, we ll look at a dynamic programming example with a scoring scheme that selects for matches

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Directive-based General-Purpose GPU Programming. Tian Yi David Han

Directive-based General-Purpose GPU Programming. Tian Yi David Han Directive-based General-Purpose GPU Programming by Tian Yi David Han A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Graduate Department of Electrical

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux

More information

Lecture 10. Sequence alignments

Lecture 10. Sequence alignments Lecture 10 Sequence alignments Alignment algorithms: Overview Given a scoring system, we need to have an algorithm for finding an optimal alignment for a pair of sequences. We want to maximize the score

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations

More information

Dynamic Programming Part I: Examples. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, / 77

Dynamic Programming Part I: Examples. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, / 77 Dynamic Programming Part I: Examples Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 1 / 77 Dynamic Programming Recall: the Change Problem Other problems: Manhattan

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Revisiting the Speed-versus-Sensitivity Tradeoff in Pairwise Sequence Search

Revisiting the Speed-versus-Sensitivity Tradeoff in Pairwise Sequence Search Revisiting the Speed-versus-Sensitivity Tradeoff in Pairwise Sequence Search Ashwin M. Aji and Wu-chun Feng The Synergy Laboratory Department of Computer Science Virginia Tech {aaji,feng}@cs.vt.edu Abstract

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12

More information

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations: Lecture Overview Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

Accelerating Smith Waterman (SW) Algorithm on Altera Cyclone II Field Programmable Gate Array

Accelerating Smith Waterman (SW) Algorithm on Altera Cyclone II Field Programmable Gate Array Accelerating Smith Waterman (SW) Algorithm on Altera yclone II Field Programmable Gate Array NUR DALILAH AHMAD SABRI, NUR FARAH AIN SALIMAN, SYED ABDUL MUALIB AL JUNID, ABDUL KARIMI HALIM Faculty Electrical

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

High Performance Technique for Database Applications Using a Hybrid GPU/CPU Platform

High Performance Technique for Database Applications Using a Hybrid GPU/CPU Platform High Performance Technique for Database Applications Using a Hybrid GPU/CPU Platform M. Affan Zidan, Talal Bonny, and Khaled N. Salama Electrical Engineering Program King Abdullah University of Science

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

A GPU Algorithm for Comparing Nucleotide Histograms

A GPU Algorithm for Comparing Nucleotide Histograms A GPU Algorithm for Comparing Nucleotide Histograms Adrienne Breland Harpreet Singh Omid Tutakhil Mike Needham Dickson Luong Grant Hennig Roger Hoang Torborn Loken Sergiu M. Dascalu Frederick C. Harris,

More information

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units Abstract A very popular discipline in bioinformatics is Next-Generation Sequencing (NGS) or DNA sequencing. It specifies

More information

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

More information

Sequencee Analysis Algorithms for Bioinformatics Applications

Sequencee Analysis Algorithms for Bioinformatics Applications Zagazig University Faculty of Engineering Computers and Systems Engineering Department Sequencee Analysis Algorithms for Bioinformatics Applications By Mohamed Al sayed Mohamed Ali Issa B.Sc in Computers

More information

Unrolling parallel loops

Unrolling parallel loops Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:

More information

Darwin: A Genomic Co-processor gives up to 15,000X speedup on long read assembly (To appear in ASPLOS 2018)

Darwin: A Genomic Co-processor gives up to 15,000X speedup on long read assembly (To appear in ASPLOS 2018) Darwin: A Genomic Co-processor gives up to 15,000X speedup on long read assembly (To appear in ASPLOS 2018) Yatish Turakhia EE PhD candidate Stanford University Prof. Bill Dally (Electrical Engineering

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

A Scalable Coprocessor for Bioinformatic Sequence Alignments

A Scalable Coprocessor for Bioinformatic Sequence Alignments A Scalable Coprocessor for Bioinformatic Sequence Alignments Scott F. Smith Department of Electrical and Computer Engineering Boise State University Boise, ID, U.S.A. Abstract A hardware coprocessor for

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

Offloading Java to Graphics Processors

Offloading Java to Graphics Processors Offloading Java to Graphics Processors Peter Calvert (prc33@cam.ac.uk) University of Cambridge, Computer Laboratory Abstract Massively-parallel graphics processors have the potential to offer high performance

More information

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy Jan Verschelde joint with Genady Yoffe and Xiangcheng Yu University of Illinois at Chicago Department of Mathematics, Statistics,

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

SEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi

SEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi SEASHORE SARUMAN Summary 1 / 24 SEASHORE / SARUMAN Short Read Matching using GPU Programming Tobias Jakobi Center for Biotechnology (CeBiTec) Bioinformatics Resource Facility (BRF) Bielefeld University

More information