Sequence Alignment Using Graphics Processing Units

Dzivi PS

This report is submitted as partial fulfilment of the requirements for the Honours Programme of the School of Computer Science and Software Engineering, The University of Western Australia, 2008

Abstract

In bioinformatics, global sequence alignment algorithms such as Needleman-Wunsch (NW) are time-consuming because they consider all possible matches between two sequences. Heuristic approaches such as BLAST and FASTA are often used instead for their faster processing, at the cost of less optimal alignments. Sequence alignment algorithms have been implemented on many hardware architectures and with different algorithmic approaches. The advent of CUDA-compatible graphics cards has given a new dimension to the implementation of general-purpose applications. We explore the efficiency of GPUs for dynamic programming problems by implementing the NW algorithm on the GeForce 8800 GTX. We focus on the scorefill step of the NW algorithm as it is considered the most computationally expensive part. Our implementation follows the wave-front algorithmic approach used in most past implementations of sequence alignment algorithms. The efficiency of the GPU for the NW algorithm was determined by comparing the execution times on GPU and CPU for varying lengths of sequences. Our first approach proved the suitability of GPUs for aligning long sequences; however, CPU processing time was faster for small sequence lengths. A slight modification of the approach resulted in speed improvements for all sequence lengths. A speedup of about 84 times was obtained for the maximum possible length of sequences. The maximum lengths of the sequences that could be aligned at once were found to depend on the device memory size. The results revealed that the efficiency of a GPU for a particular problem depends on the utilization of the allocated hardware and the amount of parallelism exhibited by the problem. Similarly, the efficiency depended on the hardware configuration used, such as the number of threads, blocks, and memory spaces. Our implementations confirm that problems based on dynamic programming can benefit from the use of graphics cards. The producer-consumer relationship can be managed via multiple kernel calls. The best efficiency was achieved by experimenting with varying GPU configurations.

Keywords: Sequence Alignment, Needleman-Wunsch Algorithm, GeForce 8800 GTX

CR Categories: D.1 [Programming Techniques], D.1.3 [Concurrent Programming]

Acknowledgements

Firstly, I would like to thank my research supervisor, Associate Professor Amitava Datta at the School of Computer Science and Software Engineering, for all his guidance and encouragement throughout the whole year in completing this research project. He was always there to give consultation, and was ever boosting my morale. This project would not have been completed without his guidance and support. I would like to acknowledge the efforts of my husband, Mr. Ashis Parajuli, for proofreading this thesis and correcting grammatical mistakes despite his busy schedule. He was always there to lift my spirits in achieving this goal. I would like to thank my family, as without their constant support throughout my entire education I would never have made it here. Finally, a special thanks to my colleagues in the Master of Computer Science for their help and support.

Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Overview of Sequence Alignment
  1.2 Needleman-Wunsch Sequence Alignment Algorithm
    1.2.1 Initialization Step
    1.2.2 Scorefill Step
    1.2.3 Traceback
  1.3 Thesis Outline
2 Review of the Literature
  2.1 Algorithmic Approach
    2.1.1 Wavefront Method
  2.2 Hardware Approaches
    2.2.1 Central Processing Units
    2.2.2 Single Instruction Multi Data Streams (SIMD) and other architectures
    2.2.3 Graphical Processing Units (GPUs)
  2.3 Implication to Thesis
3 Methodology
  3.1 CUDA
    3.1.1 Threads Model
    3.1.2 Memory Model
  3.2 GeForce 8800 GTX
  3.3 Our Proposed Algorithm
  3.4 Experimental Approach
    3.4.1 CPU Implementation
    3.4.2 GPU Implementation
4 Experimental Results and Analysis of First Approach
  4.1 Environmental Details
  4.2 Performance Analysis of CPU version and GPU version
  4.3 Analysis of GPU version by varying the configuration
  4.4 Discussion
5 Modification to the NW Algorithm
  5.1 Modified Approach
  5.2 Pseudocode
6 Analysis of the Modified NW Algorithm
  6.1 Performance of the new GPU Implementation
  6.2 Changing the Configuration
  6.3 Compare with Traceback Step
  6.4 Discussion
7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work
A Original Honours Proposal
B Timing Results for First Approach
C Timing Results for Modified NW Algorithm

List of Tables

2.1 Execution Times (s) and Speedups for 3 Sequences
2.2 GPU Implementations of SW Algorithm
6.1 New GPU vs CPU
6.2 Effect of changing number of threads
6.3 Maximum length of Reference Sequence to Corresponding Query Sequence Length
6.4 Comparison of CPU with GPU including Traceback step
A.1 Tasks along with Targeted Deadlines
B.1 Performance Analysis of CPU version
B.2 Performance Analysis of GPU version
B.3 Comparison of CPU and GPU
B.4 Results for GPU for using Different Memory
B.5 Timing Results of GPU for varying Number of Threads
C.1 Performance Analysis of GPU version using Global Memory
C.2 Performance Analysis of GPU version using Shared Memory
C.3 Comparison of new GPU and previous GPU versions

List of Figures

1.1 Initialization and Dependency
1.2 Traceback Step and Anti-Diagonal Parallelism
2.1 Similarity Matrix
3.1 CUDA Thread Model
3.2 CUDA Memory Model
3.3 GTX Hardware Model
3.4 Data dependencies in a Scorematrix
3.5 Scorematrix and Block
4.1 CPU performance
4.2 GPU performance
4.3 CPU vs GPU performance
4.4 Shared vs Global performance
4.5 Timing Results by Varying number of threads
5.1 Mapping of blocks
6.1 Timing for new GPU
6.2 Timing for new GPU compared to old GPU
6.3 Ratio of old GPU to new GPU
6.4 Effect of changing Memory
A.1 Matrix Fill Step for Sequences AAA and AAAC

CHAPTER 1

Introduction

Sequence alignment is a routine task in computational biology where two sequences of amino acids or DNA base pairs are compared, with the aid of computers, to find evolutionary relationships. A query sequence is aligned with a reference sequence to identify regions of similarity. Sequence similarity helps in the classification of genes and proteins along with the detection of mutation points. Similarly, it is used to predict biological function and secondary and tertiary protein structure, and to construct evolutionary trees [5]. It is the procedure of comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences. It is important to obtain the best possible or optimal alignment to discover the evolutionary information [13].

1.1 Overview of Sequence Alignment

There are two basic types of sequence alignment algorithms, local and global. Both local and global alignment algorithms are based on the dynamic programming approach. Smith-Waterman (SW) local alignment matches regions of two sequences for which they show a high degree of similarity. The Needleman-Wunsch (NW) algorithm, in contrast, is a global alignment algorithm which compares two sequences over their entire lengths. It seeks to match as many nucleotides or amino acids as possible between the sequences to give the optimal alignment. It is considered suitable for finding the best alignment of two sequences of similar length [9].

The February 2004 release of the GenBank molecular database contained 32,549,000 DNA sequences, composed of approximately 37,893,844,733 bases [15]. The GenBank statistics on DNA sequences demonstrate exponential growth rates [16]. As a result, the process of aligning a query sequence with each of the database sequences incurs a significant amount of time, i.e. a high computational cost [14].

Both algorithms are based on the idea of dynamic programming and have quadratic complexity which depends on the lengths of the sequences. Alignment of long sequences demands more computational power and space than a single computational resource can offer [8, 22]. Because of this cost, heuristic approaches like FASTA [6] and BLAST [1] tend to be preferred for their faster speed, at the cost of less optimal alignments. Both FASTA and BLAST are less sensitive than the SW algorithm, and may fail to report distant sequence relationships [11, 13]. The use of multiple processors and special-purpose architectures for high-performance solutions has always been a field of research in bioinformatics [21]. Different approaches to parallelizing sequence alignment algorithms on different architectures have resulted in interesting speed improvements. Most implementations of sequence alignment are based on the Smith-Waterman (SW) local alignment algorithm [11, 21]. For long sequences, the use of the NW algorithm on a single processor (CPU) becomes very time-consuming as it compares two sequences over their entire lengths [14]. In this dissertation, we aim to align long sequences using the NW algorithm on GPUs to observe its performance. The steps involved in the NW algorithm are discussed in Section 1.2.

CUDA (Compute Unified Device Architecture) compatible graphics cards such as the GeForce 8800 GTX give users both read and write access to device memory, and are considered suitable for general-purpose applications requiring heavy computation [17]. Previous cards, programmed using OpenGL, lacked the ability to randomly access any memory cell for writing. The suitability of graphics cards for sequence alignment depends on various factors such as the amount of parallel computation involved, the hardware configuration, and the available memory size [11, 20, 21]. Provided that the best hardware configuration is used and the algorithm involves parallel computation, an implementation on GPUs should be faster than on Central Processing Units (CPUs). For the experimental analysis, we implement the NW algorithm on both CPU and GPU. The execution times of these implementations are compared for varying query sequence and reference sequence lengths.

1.2 Needleman-Wunsch Sequence Alignment Algorithm

The NW algorithm [13] consists of three steps - initialization, scorefill, and traceback. The four nucleotides used in DNA sequencing are represented as A (Adenine), G (Guanine), C (Cytosine), and T (Thymine). The following example illustrates how the algorithm works.

Let the two sequences to be globally aligned be -

AGC (sequence 1, s, with length m = 3)
AAAC (sequence 2, t, with length n = 4)

If the symbol at position i of s is the same as the symbol at position j of t, +1 is given as a match score; otherwise, -1 is given as a mismatch score and -2 as a gap penalty, i.e. for aligning with a gap (-). Gaps are inserted in either of the sequences to match the maximum number of nucleotides between them. In bioinformatics, gaps represent points of mutation where individual bases in a DNA sequence get mutated or changed.

1.2.1 Initialization Step

The first step is to create a matrix (namely the Scorematrix) with m columns and n rows, where sequence s is written horizontally above the matrix and sequence t vertically beside it. It is assumed that there is always a possibility of starting an alignment with a gap. So, the scores in the top row are 0 (aligning a gap with a gap), -2 (aligning A with a single gap), -4 (aligning AG with two gaps) and -6 (aligning AGC with three gaps). Similarly, the scores in the first column are calculated as shown in Figure 1.1(a).

Figure 1.1: Initialization and Dependency. (a) Initialization and Calculation of M(1,1). (b) Calculation of M(1,2) and M(2,1)
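As a small illustration of this step (a sketch only; the array name score and the row-major (m+1) x (n+1) layout are illustrative, not taken from our implementation), the boundary row and column can be filled in C as -

    /* Row 0 and column 0 hold cumulative gap penalties (0, -2, -4, ...),
     * so that an alignment may begin with any number of gaps. The matrix
     * is (m+1) x (n+1), stored row-major with row width m+1. */
    for (int i = 0; i <= m; i++)
        score[i] = -2 * i;              /* top row */
    for (int j = 0; j <= n; j++)
        score[j * (m + 1)] = -2 * j;    /* first column */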

1.2.2 Scorefill Step

In this step, the maximum global alignment score M(i,j) for each position (i,j) in the Scorematrix is calculated. The starting point in the matrix is the upper left-hand corner. In order to find M(i,j) for any pair of i and j, the scores M(i-1,j), M(i,j-1) and M(i-1,j-1) must be known. For each position, M(i,j) is defined to be the maximum score attainable at position (i,j), i.e.

M(i,j) = maximum of:
    M(i-1,j) - 2     (aligning s[i] with a gap)
    M(i,j-1) - 2     (aligning t[j] with a gap)
    M(i-1,j-1) +/- 1 (aligning s[i] with t[j]: +1 on a match, -1 on a mismatch)

Using this information, the score at position (1,1) in the matrix can be calculated. For example, in the matrix shown in Figure 1.1(a),

M(1,1) = max{ M(0,1) - 2, M(1,0) - 2, M(0,0) + 1 } = max{ -4, -4, 1 } = 1

Further, the scores for the positions M(1,2) and M(2,1) are calculated as in Figure 1.1(b). The final score table is as shown in Figure 1.2(a).

Figure 1.2: Traceback Step and Anti-Diagonal Parallelism. (a) Complete Scorematrix. (b) Pattern of Parallelism
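The recurrence can be written directly in C. The following is a minimal sequential sketch (the names scorefill and max3 and the row-major layout are illustrative, not our implementation); it assumes the boundary row and column have already been initialized as above -

    #define MATCH     1
    #define MISMATCH -1
    #define GAP      -2

    static int max3(int a, int b, int c)
    {
        int m = a > b ? a : b;
        return m > c ? m : c;
    }

    /* s has length m (columns), t has length n (rows); score is the
     * (m+1) x (n+1) matrix of Section 1.2.1, stored row-major. */
    void scorefill(const char *s, int m, const char *t, int n, int *score)
    {
        int w = m + 1;                          /* row width */
        for (int j = 1; j <= n; j++)
            for (int i = 1; i <= m; i++) {
                int sub = (s[i - 1] == t[j - 1]) ? MATCH : MISMATCH;
                score[j * w + i] =
                    max3(score[(j - 1) * w + (i - 1)] + sub, /* M(i-1,j-1): s[i] with t[j]  */
                         score[j * w + (i - 1)] + GAP,       /* M(i-1,j):   s[i] with a gap */
                         score[(j - 1) * w + i] + GAP);      /* M(i,j-1):   t[j] with a gap */
            }
    }

For the example above, scorefill("AGC", 3, "AAAC", 4, score) reproduces the table of Figure 1.2(a); the bottom-right entry M(3,4) holds the optimal alignment score, -1 for this example.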

1.2.3 Traceback

The traceback step involves tracing back through the score table from M(m,n), following arrows that point to whichever of the three entries gave the optimal score for each position. The three entries can be M(i,j-1), M(i-1,j) and M(i-1,j-1). The traversal continues until we reach the upper left-most position of the table. Tracing back gives the optimal alignment of the two sequences. For the best alignment, the arrows are tracked from M(m,n); a gap is inserted in the reference sequence where there is a vertical arrow, and a gap is inserted in the query sequence where there is a horizontal arrow. In this example, there are three possible optimal alignments:

Sequence 1:  -AGC    A-GC    AG-C
Sequence 2:  AAAC    AAAC    AAAC

Similarly, the optimal sequence alignments can be calculated for any length of query sequence and reference sequence, for example:

Sequence 1: AGGT-ACGTACTACTACAATACATATAAATCCATATATACGTACGT
Sequence 2: AACTGACATAG-ACAATGCAAAGT-TCCGTACACAGTACAGTAACT

1.3 Thesis Outline

Chapter 1 provided the general background information and thesis outline. Chapter 2 presents a summary of previous related work, and we describe the methodology in Chapter 3. Chapter 4 covers the analysis of the different implementations of the NW algorithm on both CPU and GPU. In Chapter 5, we recommend some modifications to the NW algorithm to enhance the performance observed in Chapter 4. We evaluate the performance of the modified implementation in Chapter 6. Chapter 7 presents a summary of the thesis work, and suggests some future research directions.

CHAPTER 2

Review of the Literature

This chapter reviews previous research in sequence alignment on the basis of the methods undertaken, the hardware architecture, and the speed enhancement achieved.

2.1 Algorithmic Approach

The sequence alignment algorithms Needleman-Wunsch (NW) and Smith-Waterman (SW) [13] consist of three steps - Initialization, Matrix Fill, and Traceback. The steps in both algorithms are the same, with some differences in the scoring scheme. Computation of the score matrix on a Central Processing Unit (CPU) takes time and space complexity of O(mn), where m and n are the lengths of the query and reference sequences respectively. FASTA and BLAST share the same complexity but are 40 times faster than a CPU implementation of the SW algorithm as they follow heuristic approaches [7, 11]. The concept behind the computer implementation of these algorithms is to compare two sequences as two strings, since highly similar sequences share the same sub-strings [20].

2.1.1 Wavefront Method

The computation of a score in the Scorematrix depends on previous scores. Given the data dependencies presented by the matrix fill step, the scores can be calculated column by column, row by row, or anti-diagonal by anti-diagonal. The problem with the first two approaches is that most of the scores are dependent on other scores in the same row or column. Thus, these approaches limit the number of scores that can be calculated independently, making it hard to parallelize the algorithms. As illustrated in Figure 1.1(b) and Figure 2.1, the matrix fill step shows a pattern of parallelism where the scores in each anti-diagonal can be calculated simultaneously. This approach of calculating all the scores in each anti-diagonal at once for faster computation is termed the wave-front or wave-level method [7].
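In loop form, the traversal can be sketched as follows (a sketch only; cell() stands for the recurrence of Section 1.2.2 and the names are illustrative). Every cell on anti-diagonal d satisfies i + j = d and depends only on diagonals d-1 and d-2, so the inner loop carries no dependency and its iterations can run in parallel, one thread per cell -

    /* Wavefront order over an (m+1) x (n+1) matrix: d indexes the
     * anti-diagonal, and the j loop visits its cells, which are
     * mutually independent. */
    for (int d = 2; d <= m + n; d++) {
        int jlo = (d - m > 1) ? d - m : 1;   /* clip the diagonal */
        int jhi = (d - 1 < n) ? d - 1 : n;   /* to the matrix     */
        for (int j = jlo; j <= jhi; j++) {   /* parallelizable    */
            int i = d - j;
            score[j * (m + 1) + i] = cell(i, j);  /* the recurrence */
        }
    }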

Almost all implementations of sequence alignment are based on this method [3, 4, 7, 11, 14, 19, 20].

Figure 2.1: Similarity Matrix [12]

2.2 Hardware Approaches

This section discusses various hardware realizations of sequence alignment on Central Processing Units (CPUs), GPUs, and other architectures. The focus is on their performance based on the length of sequence, the speed gain, and the algorithmic approach undertaken.

2.2.1 Central Processing Units

In 2005, Boukerchea et al. [3] implemented a parallel version of the SW algorithm in C using the Distributed Shared Memory (DSM) software JIAJIA v.2.1 on a cluster of eight workstations connected via Ethernet. The calculation of scores was divided among the processors such that each of them computed a set of columns in a row and waited for the other processors to read the scores before proceeding again. Synchronisation among the processors was managed by means of locks and conditional variables in the JIAJIA system. Communication between processors was managed through the shared memory abstraction provided by DSM. The first attempt, without dividing the sequences into bands, resulted in some increase in speed. However, it proved to be insignificant for small sequences, as the overhead of communication time among processors dominated the processing time.

Dividing the sequences into bands and allocating them to individual processors produced significant results for all sequences, as shown in Table 2.1.

Table 2.1: Execution Times (s) and Speedups for 3 Sequences [3]

Table 2.1 illustrates that breaking the sequences into bands of different sizes affects the runtime. An increase in the number of processors implies a near-linear increase in speed. The dependency of the calculation of a block (input to a processor) on previous blocks (output from other processors) can be managed on parallel processors by means of synchronization. Overall performance can be improved by using shared memory and by minimizing the communication time among parallel processes.

Similarly, Naveed et al. [14] implemented the NW algorithm on a cluster of CPUs using the Alchemi framework for grid computing. The processing time was reduced from O(mn) to O(m+n), where m and n are the lengths of the query and reference sequences respectively. Their implementation faced the problem of increased network traffic and, for long sequences, a need for more processors was felt. Both of the above implementations prove that more processors can improve performance. However, the use of many workstations is an expensive approach.

2.2.2 Single Instruction Multi Data Streams (SIMD) and other architectures

In a search for a re-configurable solution, the SW algorithm was executed on Field Programmable Gate Arrays (FPGAs) by replacing the scorefill portion with a custom FPGA circuit [7]. The processing run time was improved by 287 percent. However, the configuration has to be changed for each algorithm, which was considered more complicated than writing new code. SIMD implementations of the SW algorithm on the Micro-Grained Array Processor (MGAP) and the Kestrel parallel processor provided some speed gain [7]. SIMD machines are multiple processing units which operate on multiple sets of operands in the same instruction cycle. However, it is hard to compare the results as they are not run against any baseline, and the lengths of the sequences differ between cases.

The architectures discussed above are accelerators for computationally intensive problems like sequence alignment, but they are expensive and hard to upgrade [21].

2.2.3 Graphical Processing Units (GPUs)

GPUs are based on the SIMD architecture and are basically used for graphics-related applications. GPUs are commodity components compared to most of the hardware mentioned above. Being low-cost hardware capable of high throughput, they are considered a good match for bioinformatics applications involving many calculations [7, 11, 20]. Table 2.2 lists different GPU implementations of the SW algorithm. The wavefront method was followed in all of these realizations, and the performance is measured against popular approaches in practice, such as OSEARCH [18], FASTA [6], and BLAST [1].

Table 2.2: GPU Implementations of SW Algorithm

Graphics Card                 Sequence Length  Language  Speed wrt  Ratio          Reference
GeForce 6800 GTO              64 to 4096       Cg        OSEARCH    2.8 to 7.46    Liu et al. [10]
GeForce 6800 GTO              64 to 4096       Cg        FASTA      1.5 to 4.55    Liu et al. [10]
GeForce 7800 GTX              64 to 4096       Cg        OSEARCH    4.66 to 9.45   Liu et al. [10]
GeForce 7800 GTX              64 to 4096       Cg        FASTA      2.49 to 9.42   Liu et al. [10]
GeForce 6800 GTO              64 to 2048       GLSL      CPU        2.1 to 4.0     Voss et al. [21]
Radeon X850XT                 64 to 2048       GLSL      CPU        2.3 to 4.4     Voss et al. [21]
GeForce 8800 GTX (one card)   64 to 362        CUDA      FASTA                     Manavski et al. [11]
GeForce 8800 GTX (one card)   64 to 362        CUDA      BLAST      1.24 to 0.93   Manavski et al. [11]
GeForce 8800 GTX (two cards)  64 to 362        CUDA      FASTA                     Manavski et al. [11]
GeForce 8800 GTX (two cards)  64 to 362        CUDA      BLAST      2.39 to 1.79   Manavski et al. [11]

Implementations on Previous Versions of Graphics Cards

The graphics cards GeForce 6800 GTO, GeForce 7800 GTX, and Radeon X850XT required the reformulation of the algorithm in terms of graphics primitives. For this purpose, Liu et al. [10] and Voss et al. [21] made use of the graphics programming languages Cg and the OpenGL Shading Language (GLSL) respectively. Both GPU computations had to be programmed in terms of the graphics pipeline with three steps - vertex processing, rasterization, and fragment processing. The sequences were stored in texture cache memory to minimize read access times, as threads take longer to access global memory than cached texture memory [17].

The maximum sequence length that could be aligned at once was limited by the texture buffer size. From the first two graphics card implementations, we can observe that the memory size, the GPU clock speed, and the number of vertex and fragment processors influence the performance. Further, the results from Voss et al. [21] showed that the optimum performance was obtained when the ratio of query length to reference length equalled one. This shows that, for efficient GPU computation, the implemented algorithm should be able to maximise the utilisation of the available or allocated hardware configuration. Although reformulation of the algorithms in graphical terms is a tedious task, the results show that programming GPU cards makes a better alternative compared to other SIMD systems and high-cost supercomputers.

Implementations on CUDA Compatible Graphics Cards

Sequence alignment is a process of matching common sub-strings between two strings. Schatz et al. [19] implemented a string matching algorithm, Cmatch, on the G80; it was based on a seed-and-extend algorithm. Suffix trees of each reference sequence were created, and the kernel code matched the query sequence against each of the trees. The Compute Unified Device Architecture (CUDA) language was used to program the GeForce 8800 GTX card for this purpose. The language eliminates the need to restate algorithms in graphics primitives and facilitates coding for general-purpose applications [17]. As in the previous GPU versions, texture memory was used for reads. The performance for smaller queries was improved by 34 times compared to the CPU version. However, a comparatively small gain of about 2 times was achieved for longer reads. Cmatch, when utilized for sequence alignment, boosted speed by ten times compared to its CPU counterpart [20]. The authors [19, 20] explained that the use of conditional statements in kernel code causes threads to follow different execution paths. This divergence makes the processors serialize the instructions, and each thread has to run more than one version of the code, thus increasing the run time. Similarly, they pointed out that the lack of parallel computation in suffix trees could not fully utilise the available hardware.

Manavski et al. [11] explored the use of two GeForce 8800 GTX cards for the SW algorithm, using the substitution matrix BLOSUM-50 [2]. For optimal performance, the best hardware configuration was found to be 64 threads per block and 450 blocks per grid. As in the previous graphics card versions [7, 19-21], they made use of texture memory for the storage of query sequences, and shared memory to overcome the overhead of read access times.

The reference sequences were pre-ordered by length to localise thread reads, as in Cmatch. The results showed that using two graphics cards doubled the overall performance. Compared to Liu's version, the performance enhancement is significant for small sequences. However, the possibility of longer sequences is not considered. As sequence length increases, the performance ratio compared to BLAST decreases.

2.3 Implication to Thesis

All the GPU implementations discussed above, except for Cmatch, make use of the wave-front method in their algorithms. The matrix fill step, which involves many calculations, has been executed in parallel in an attempt to reduce computation time. In all cases, the results show that the use of more processors improves performance. Similarly, the use of shared memory and texture memory to minimise read access times and processor communication times is common. However, limited research has been conducted so far on refining the wave-front method, and additional research seems likely to benefit parallel architecture implementations. At present, the G80 presents a good platform for general-purpose applications requiring intensive calculation. Algorithms need to be optimized to fully utilise the available hardware configuration. The algorithmic approach followed on graphics cards until now can be considered equivalent to a CPU version which does not use blocking. The blocked method used by Boukerchea et al. [3] on CPUs is a worthy alternative to experiment with on the G80 architecture, considering its speedup and suitability for all sequence lengths. At the same time, migration of the method from CPU to GPU requires adjusting the algorithm to the GPU configuration. The new implementation should take into account programming issues, such as minimal use of conditions in kernel code and minimisation of thread divergence, and hardware issues, such as the maximum number of threads and the on-chip memory size. Similarly, the suitability of the method for different sequence lengths remains to be explored.

CHAPTER 3

Methodology

This chapter briefly describes the architecture of the GeForce 8800 GTX graphics card, and explains our proposed approach for sequence alignment on it.

3.1 CUDA

CUDA (Compute Unified Device Architecture) is an extension of the C programming language and an API used for programming general-purpose applications on the GeForce 8 series, Quadro FX 5600/4600, and Tesla products [4]. Access to the GPU hardware is managed by the operating system, which runs CUDA and graphics applications concurrently. CUDA eliminates the need to learn domain-specific languages to program the GPU, resulting in a minimal learning curve.

3.1.1 Threads Model

In CUDA, the GPU is treated as a compute device capable of executing a high number of threads concurrently. A data-independent and compute-intensive portion of an application which has to be executed many times on the CPU (host) can be isolated as a function. The function, also known as a kernel, is compiled to the instruction set of the device and offloaded to the device (GPU), where it is executed by many threads. Data transfer between host memory and device memory takes place via the device's DMA (Direct Memory Access) engines. The batch of threads that executes a kernel is organized as a grid of blocks of threads. Threads in each block can communicate efficiently by using shared memory, and can synchronize their execution to coordinate memory accesses. Each thread is identified by a unique id (threadId), which is its thread number within the block.

Each block can be specified as a two- or three-dimensional array of threads, with each thread having a two- or three-component index. The maximum number of threads in a block is limited to 512. Similarly, all the thread blocks executing the same kernel are grouped into a grid of thread blocks. Each block is identified by its blockId, i.e. its block number within the grid. Blocks may be indexed in a one-dimensional or two-dimensional array. The thread model of CUDA is shown in Figure 3.1. Each thread block is assigned in its entirety to one streaming multiprocessor (SM) and runs as a unit to completion without pre-emption. All threads in the thread block execute the kernel in a fine-grained, time-sliced manner. The resources of the block cannot be reclaimed until the complete block finishes execution. Threads in different blocks within the same grid cannot communicate or synchronize with each other during execution. A device may run all the blocks in a grid in parallel if it has high parallel capability, sequentially if it has low parallel capability, or as a combination of both. If the number of thread blocks significantly exceeds the available hardware, waiting thread blocks are assigned to SMs as previous ones finish executing.

Figure 3.1: CUDA Thread Model [17]
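As an illustration of this model (a generic sketch, not our implementation; the kernel name and parameters are hypothetical), a grid of two-dimensional blocks is launched and each thread derives its global position from blockIdx, blockDim, and threadIdx -

    __global__ void fill(int *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  /* global column */
        int y = blockIdx.y * blockDim.y + threadIdx.y;  /* global row    */
        if (x < width && y < height)       /* guard threads past the edge */
            out[y * width + x] = x + y;    /* placeholder per-element work */
    }

    /* Host side: 16 x 16 threads per block, one thread per element. */
    dim3 threads(16, 16);
    dim3 blocks((width + 15) / 16, (height + 15) / 16);
    fill<<<blocks, threads>>>(d_out, width, height);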

3.1.2 Memory Model

Each thread executing on the device can only access the device's DRAM and the on-chip memory through a set of memory spaces. The memory model is shown in Figure 3.2.

Figure 3.2: CUDA Memory Model [17]

Data can be shared between thread blocks using global memory. However, the cost of accessing global memory is high. Each thread block also has per-block shared memory (PBSM), which allows the threads in a thread block to communicate efficiently with low latency. PBSM is implemented using SRAM, and its size for each block is limited to 16 KB. The local and global memory spaces are not cached, so each access to global or local memory generates an explicit memory transaction. A multiprocessor takes four clock cycles to issue one memory instruction, and incurs an extra 400 to 600 clock cycles of memory latency to access global memory. Therefore, CUDA developers prefer to use shared memory to minimize global memory accesses and to utilize the data within the local multiprocessor memories.

The read and write capabilities of the local multiprocessor memory spaces are summarised as follows -

1. Registers
The fastest form of memory. Can only be accessed by a thread. Has the lifetime of the thread.

2. Shared Memory
As fast as a register, depending on whether there are bank conflicts. Can only be accessed by the threads in the block from which it was created. Has the lifetime of the block.

3. Global Memory
About 150 times slower than shared memory or registers. Can be accessed from both the device and the host. Has the lifetime of the application.

4. Local Memory
A part of global memory, and slower than shared memory or registers. Can only be accessed by the owning thread. Has the lifetime of the thread.

Synchronization within each thread block is managed by the hardware itself. However, synchronization among blocks requires the completion of one kernel and the launch of a new kernel. The order in which blocks are executed is non-deterministic, so an appropriate global barrier mechanism should be implemented where blocks have a producer-consumer relationship, to prevent deadlock; a sketch of such a barrier follows.
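The sketch below uses illustrative names (process_step, numSteps, d_data are not part of our implementation). Each kernel launch begins only after all blocks of the previous launch have retired, so a sequence of launches acts as a chain of global barriers -

    /* Host-side global barrier: step k may safely consume, from global
     * memory, the results produced by all blocks in step k-1. */
    for (int step = 0; step < numSteps; step++)
        process_step<<<blocks, threads>>>(d_data, step);
    /* Launches on the same stream execute in order; synchronize once
     * at the end, before copying results back to the host. */
    cudaThreadSynchronize();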

3.2 GeForce 8800 GTX

The G80 architecture is designed to support the implementation of both graphics and general-purpose applications. The device is a set of 16 streaming multiprocessors (SMs) with a SIMD (Single Instruction Multiple Data) architecture. Figure 3.3 shows the hardware model of the device.

Figure 3.3: GTX Hardware Model [17]

Each SM has 8 processing elements called stream processors. The threads in each thread block are time-sliced onto these processing elements in groups of 32 threads known as warps. Each warp is executed by the multiprocessor in a SIMD fashion, and a thread scheduler periodically switches from one warp to another to utilize the available computational resources. Hardware masking is used to handle divergent threads. If threads in the same warp diverge due to conditional statements, only the threads following the same path can execute concurrently. If all 32 threads in a warp diverge without reconverging, then each of the 32 threads executes sequentially, resulting in 32 times more computational time. So, optimizing the algorithm to minimize SIMD divergence certainly benefits performance; the sketch below contrasts a divergent branch with a uniform one.
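The following hypothetical kernel is for illustration only (not our implementation) -

    __global__ void branches(int *out)
    {
        int tid = threadIdx.x;
        /* Divergent: threads of the same warp take different paths,
         * so the two sides are executed one after the other. */
        if (tid % 2 == 0)
            out[tid] = 2 * tid;
        else
            out[tid] = 3 * tid;
        /* Uniform: the condition is identical for every thread in the
         * block, so the warp stays converged and nothing is serialized. */
        if (blockIdx.x == 0)
            out[tid] += 1;
    }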

As soon as a kernel is launched, the device driver notifies the GPU's work distributor of the kernel's grid configuration. Each thread block is then assigned to an SM when sufficient thread and shared memory resources are available. The pattern in which the hardware scheduler allocates a block is random, and the hardware controller of the assigned SM initializes the state of all the threads in the block. The architecture does not provide large hardware caches shared among multiple cores as in modern CPUs. Variables that do not fit in a thread's registers are spilled to global memory. Each SM has two small private data caches, a texture cache and a constant cache, which hold only read-only data. Data must be explicitly allocated to the constant and texture memory spaces. Shared memory of 16 KB is available to each block. It has both read and write access, which facilitates fast thread communication. The texture cache allows arbitrary access patterns, and is useful for coalesced access patterns with random offsets. The constant cache is optimized for broadcasting values to all the processing elements in an SM; however, its performance degrades if threads in the same block request multiple addresses in the same cycle. The card requires pre-allocation of hardware in terms of the number of threads and blocks to be utilized for the computation. Similarly, all input and output data structures must be allocated and transferred to device memory before any operation can be carried out on them.

3.3 Our Proposed Algorithm

As discussed in previous chapters, the Needleman-Wunsch algorithm is based on dynamic programming and consists of three steps - initialization, scorefill, and traceback. In this dissertation, we concentrate on the scorefill step as it is the most computationally expensive part [10-12, 21]. The new approach needs to handle the dependencies presented by the score calculation, and perform the operations as independently as possible. The scores in the scorefill matrix can be calculated row by row, column by column, or anti-diagonal by anti-diagonal. The first two approaches limit the number of scores that can be calculated at once to only one, as the scores in any column or row depend on other scores in the same row or column. However, the elements in the same anti-diagonal are independent of each other, and depend only on the values in the previous two anti-diagonals, as shown in Figure 3.4. The number of scores that can be calculated in one step keeps increasing until the main anti-diagonal of the matrix is reached, and then starts decreasing again. Although the anti-diagonals have uneven sizes, we implement the wave-front method for the NW algorithm with some novel modifications.

Figure 3.4: Data dependencies in a Scorematrix

In our approach, we divide the complete Scorematrix into square blocks as shown in Figure 3.5. Each small block inside the Scorematrix can be viewed as equivalent to the block in Figure 3.4, which has dependencies among its elements. Further, we can observe a similar pattern of dependency among the small blocks inside the bigger block (the Scorematrix). As with the elements inside each block, we exploit the anti-diagonal pattern among the blocks. In the first step, only the elements of Block 1 can be calculated. The calculation of the elements in Block 2 and Block 5 depends on the availability of the boundary scores of Block 1. Block 6 also needs to read the last element of Block 1 for the calculation of the first element in the block. The complete Scorematrix thus presents two levels of parallelism, element level and block level. Our GPU implementation, discussed in Section 3.4.2, implements the element-level parallelism using threads (via threadIds), and the block-level parallelism using blocks (via blockIds). If each block has q rows and r columns, then the computation of a single block requires q+r+1 steps. For all elements in an anti-diagonal, the sum of the row index and the column index is constant.

Figure 3.5: Scorematrix and Block

For blocks of size 16 by 16, the computation of 256 elements requires 33 parallel steps. Each block takes 33 boundary elements as input and computes 256 elements, so the communication-to-computation ratio becomes 33:256. All the blocks in the Scorematrix exhibit the same anti-diagonal pattern of parallelism as the elements inside each block.

3.4 Experimental Approach

The NW algorithm is implemented on both GPU and CPU. The implementations are evaluated on the basis of the processing time taken up to the end of the scorefill step, for different lengths of reference sequence and query sequence. On the CPU, only the initialization and the scorefill steps are timed. On the GPU, the processing time includes the time taken to transfer the data structures to and from device memory along with the time to execute the kernels. In all the implementations, the inputs (reference and query sequences) and outputs (final aligned sequences) are read from and written to files. The processing time is taken as the average of running the same program ten times.

3.4.1 CPU Implementation

The CPU implementation is sequential: only one score of the matrix is calculated at a time. It takes qlength x dlength iterations to calculate the final Scorematrix, where qlength and dlength are the lengths of the query sequence and the reference sequence respectively.

3.4.2 GPU Implementation

Both the initialization and the scorefill steps are implemented on the GPU. Two kernels, one for each step, are launched. In the initialization step, the values of the boundary elements and the match or mismatch scores for all the remaining elements of the Scorematrix are computed in one execution of the initialization kernel. For the scorefill step, we adopt the approach discussed above. Each thread is allocated to compute one element of the Scorematrix. The kernel code implements the anti-diagonal parallelism inside each block. However, the anti-diagonal parallelism among the blocks has to be implemented from the host function itself. The latter involves providing an appropriate global-barrier synchronization mechanism for the blocks.

The synchronization is necessary because there is a producer-consumer relationship among the blocks, and the end results of the producer blocks need to be communicated to the consumer blocks before they can start doing any useful computation. Similarly, the threads in each block need to be synchronized, as there is a dependency among the anti-diagonals; this is managed using the GPU function syncthreads. The effect of using global memory versus shared memory for thread communication, and of changing the number of threads and blocks, is analysed.

Pseudocode

Let the two sequences to be aligned be Q[qlength] and R[dlength], where qlength and dlength are the lengths of the query and reference sequences respectively. The Scorematrix is denoted by Score[qlength][dlength], and the x- and y-dimensions of each block are denoted by LEN. The number of blocks (numblocks) in the largest anti-diagonal is given by length/LEN, where length is the maximum of qlength and dlength. Appropriate padding is used for the cases where qlength and dlength are not multiples of LEN.

CPU Part

1. Allocate and transfer Q, R, and Score to device memory.
2. Define the threads and number of blocks for the initialization kernel.
   dim3 threads(LEN, LEN)
   dim3 blocks(qlength/LEN, dlength/LEN)
3. Call the initialization kernel (GPU Part).
   initialize<<<blocks, threads>>>(Q, R, Score, qlength, dlength)
4. For loop1 from 0 to 2 x (numblocks - 1), call the Scorematrix kernel (GPU Part).
   calculate<<<blocks, threads>>>(Q, R, Score, qlength, dlength, loop1)
   Note: the same hardware configuration as for the initialization kernel is used.

5. Transfer Score from device to host memory.
6. Traceback Step.

GPU Part

Let tx and ty be the x-index and y-index of a thread inside its block. Similarly, let bx and by denote the x-index and y-index of a block in the grid. Let txgrid and tygrid be the variables giving the x and y positions of a thread in the grid, calculated as -

txgrid = LEN x bx + tx
tygrid = LEN x by + ty

Based on these variables, the position in Score of the element the thread is allocated to calculate is

idx = txgrid + tygrid x qlength

1. Initialization kernel
   (a) If txgrid < qlength and tygrid < dlength, do
       If idx < qlength, Score[idx] = txgrid x (-2)
       Else if idx is divisible by qlength, Score[idx] = tygrid x (-2)
       Else if Q[txgrid] matches R[tygrid], Score[idx] = match score
       Else Score[idx] = mismatch score

2. Scorematrix kernel
   (a) If txgrid < qlength and tygrid < dlength, do
       For loop2 from 0 to 2 x (LEN - 1), do
       i. Calculate the sum of the block indices as bsum = bx + by
       ii. If bsum equals loop1, do
           Calculate the sum of the thread indices as tsum = tx + ty
           If txgrid > 0, tygrid > 0, and tsum equals loop2, do
               Data1 = Score[idx - qlength - 1] + Score[idx]
               Data2 = Score[idx - qlength] - 2
               Data3 = Score[idx - 1] - 2
               Score[idx] = maximum of (Data1, Data2, Data3)
       iii. Synchronize threads
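In CUDA, the global-memory Scorematrix kernel above might look as follows. This is a sketch under the stated assumptions (LEN x LEN blocks, row stride qlength, and the match/mismatch scores pre-stored in Score by the initialization kernel), not our exact code; the unused Q and R parameters are omitted -

    #define LEN 16

    __global__ void calculate(int *Score, int qlength, int dlength, int loop1)
    {
        int tx = threadIdx.x, ty = threadIdx.y;
        int txgrid = blockIdx.x * LEN + tx;
        int tygrid = blockIdx.y * LEN + ty;
        int idx = txgrid + tygrid * qlength;   /* element this thread owns */

        for (int loop2 = 0; loop2 <= 2 * (LEN - 1); loop2++) {
            /* Only the blocks on block anti-diagonal loop1, and within
             * them the threads on element anti-diagonal loop2, do work. */
            if (blockIdx.x + blockIdx.y == loop1 &&
                txgrid > 0 && tygrid > 0 &&
                txgrid < qlength && tygrid < dlength &&
                tx + ty == loop2) {
                int data1 = Score[idx - qlength - 1] + Score[idx]; /* diagonal + stored match score */
                int data2 = Score[idx - qlength] - 2;              /* gap in one sequence  */
                int data3 = Score[idx - 1] - 2;                    /* gap in the other     */
                Score[idx] = max(data1, max(data2, data3));
            }
            __syncthreads();   /* reached by every thread in the block */
        }
    }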

In the above Scorematrix kernel, global memory is used for thread communication: threads read and write all scores from and to global memory, and the same score is read three times while calculating the complete Scorematrix. Shared memory has both read and write access, and it should take less time to access shared memory than global memory, depending on the nature of the problem. Shared memory gives us the opportunity to read and store only the required boundary scores from the previous blocks, so only the boundary elements need to be accessed from global memory, and only once. The use of shared memory should therefore improve the performance of the implementation compared to global memory. The Scorematrix kernel is modified as below to use shared memory.

3. Scorematrix kernel (using shared memory)
   (a) Allocate shared memory for the block as Scoreshared[LEN x LEN]
   (b) If txgrid < qlength and tygrid < dlength, do
       If bsum equals loop1, do
           For loop2 from 0 to 2 x (LEN - 1), do
               Calculate the sum of the thread indices as tsum = tx + ty
               Calculate the shared index as sharedidx = tx + ty x LEN
               i. If txgrid > 0, tygrid > 0, and tsum equals loop2, do
                   If tx equals 0 or ty equals 0, do
                       Data1 = Score[idx - qlength - 1] + Score[idx]
                       Data2 = Score[idx - qlength] - 2
                       Data3 = Score[idx - 1] - 2
                   Else
                       Data1 = Scoreshared[sharedidx - LEN - 1] + Score[idx]
                       Data2 = Scoreshared[sharedidx - LEN] - 2
                       Data3 = Scoreshared[sharedidx - 1] - 2
               ii. Scoreshared[sharedidx] = maximum of (Data1, Data2, Data3)
               iii. Score[idx] = Scoreshared[sharedidx]
   (c) Synchronize threads
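Under the same conventions as the previous sketch (again illustrative, not our exact code), the shared-memory variant can be written as a complete kernel: each block stages its LEN x LEN tile in on-chip memory, and only the elements on the block's top row or left column fall back to global memory for their neighbours -

    __global__ void calculate_shared(int *Score, int qlength, int dlength, int loop1)
    {
        __shared__ int tile[LEN * LEN];
        int tx = threadIdx.x, ty = threadIdx.y;
        int txgrid = blockIdx.x * LEN + tx;
        int tygrid = blockIdx.y * LEN + ty;
        int idx  = txgrid + tygrid * qlength;
        int sidx = tx + ty * LEN;              /* position inside the tile */

        for (int loop2 = 0; loop2 <= 2 * (LEN - 1); loop2++) {
            if (blockIdx.x + blockIdx.y == loop1 &&
                txgrid > 0 && tygrid > 0 &&
                txgrid < qlength && tygrid < dlength &&
                tx + ty == loop2) {
                /* Boundary elements read their neighbours from global
                 * memory; interior elements read them from the tile. */
                int up   = (ty == 0) ? Score[idx - qlength] : tile[sidx - LEN];
                int left = (tx == 0) ? Score[idx - 1]       : tile[sidx - 1];
                int diag = (tx == 0 || ty == 0) ? Score[idx - qlength - 1]
                                                : tile[sidx - LEN - 1];
                tile[sidx] = max(diag + Score[idx], max(up - 2, left - 2));
                Score[idx] = tile[sidx];  /* publish for consumer blocks */
            }
            __syncthreads();
        }
    }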

4. Traceback Step

The traceback step does not exhibit any pattern of parallelism that can be exploited on the GPU. The straightforward implementation of the traceback step on the GPU is the same as on the CPU; one block with one thread is sufficient for this step. In Chapter 6, we analyse the effect of using only one thread and one block, as this hardware allocation is unlikely to benefit from the available hardware configuration.

Let Q1 and R1 be the final aligned sequences, and let q1index and r1index be the variables that track the lengths of the final sequences.

(a) Initialize q1index and r1index to 0; initialize i to qlength - 1 and j to dlength - 1.
(b) While i and j are greater than 0, do
    i. Calculate the match score for the position.
    ii. If Score[idx] equals Score[idx - qlength - 1] + match score, then
        Copy Q[i] to Q1[q1index], and R[j] to R1[r1index].
        Increment q1index and r1index by 1. Decrement i and j by 1.
    iii. Else if Score[idx] equals Score[idx - qlength] + gap penalty, then
        Copy Q[i] to Q1[q1index], and insert a gap (-) in R1[r1index].
        Increment q1index and r1index by 1. Decrement i by 1.
    iv. Else if Score[idx] equals Score[idx - 1] + gap penalty, then
        Insert a gap (-) in Q1[q1index], and copy R[j] to R1[r1index].
        Increment q1index and r1index by 1. Decrement j by 1.
(c) While i is greater than 0,
    Copy Q[i] to Q1[q1index], and insert a gap (-) in R1[r1index].
    Increment q1index and r1index by 1. Decrement i by 1.
(d) While j is greater than 0,
    Copy R[j] to R1[r1index], and insert a gap (-) in Q1[q1index].
    Increment q1index and r1index by 1. Decrement j by 1.
(e) Reverse both strings Q1 and R1.

CHAPTER 4

Experimental Results and Analysis of First Approach

In this chapter, we analyse the results obtained by running both the CPU and the GPU versions, and the effect of varying the hardware configuration of the GPU. The timing results used to plot the graphs in this chapter are presented in Appendix B.

4.1 Environmental Details

The CPU version is executed on an Intel(R) Core(TM)2 Quad CPU 2.40 GHz processor running Fedora Core 6, and the C programming language (gcc compiler) is used. The GPU implementation of the algorithm is executed on the GeForce 8800 GTX, and the programming language used is CUDA in the Linux environment.

4.2 Performance Analysis of CPU version and GPU version

The CPU version is a straightforward implementation of the NW algorithm where only one element of the Scorematrix is calculated in each iteration. Since we are interested in long sequences, only sequence lengths greater than 1000 are considered, and the processing time is measured in milliseconds (ms). For the GPU, we used 256 threads per block, and the number of blocks allocated was (qlength/16) x (dlength/16), where qlength and dlength are the lengths of the query and reference sequences respectively. Both the threads and the blocks were indexed using two-dimensional arrays. From the graphs in Figure 4.1 and Figure 4.2, we can observe that the time for sequence alignment increases as we increase the lengths of the query sequence and the reference sequence, i.e. as the number of elements in the Scorematrix increases.

The graphs show the time taken for aligning a query sequence with varying lengths of reference sequence on the CPU and GPU respectively.

Figure 4.1: CPU performance

In Figure 4.3, the graph illustrates the processing time taken by the CPU and the GPU for aligning a query sequence and a reference sequence of equal length. For sequence lengths less than 9600, the GPU takes a longer processing time, up to twice the CPU time. However, for lengths of approximately 9600 or greater, the GPU performs up to almost nine times faster. The maximum length of sequences that could be aligned at once was limited by the device memory size.

4.3 Analysis of GPU version by varying the configuration

Figure 4.4 illustrates the effect of using shared memory and global memory to read the score values. With shared memory, a new consumer block needs to read only the boundary values of the previous producer blocks, whereas with global memory, the threads have to read all the elements from global memory. From the timings, we observed that global memory outperforms shared memory in the current configuration (256 threads and (qlength/16) x (dlength/16) blocks).

Figure 4.2: GPU performance

Figure 4.3: CPU vs GPU performance

Figure 4.4: Shared vs Global performance

Figure 4.5: Timing Results by Varying number of threads


More information

QR Decomposition on GPUs

QR Decomposition on GPUs QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

GPU-Supercomputer Acceleration of Pattern Matching

GPU-Supercomputer Acceleration of Pattern Matching CHAPTER GPU-Supercomputer Acceleration of Pattern Matching 13 Ali Khajeh-Saeed, J. Blair Perot This chapter describes the solution of a single very large pattern-matching search using a supercomputing

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

MD-CUDA. Presented by Wes Toland Syed Nabeel

MD-CUDA. Presented by Wes Toland Syed Nabeel MD-CUDA Presented by Wes Toland Syed Nabeel 1 Outline Objectives Project Organization CPU GPU GPGPU CUDA N-body problem MD on CUDA Evaluation Future Work 2 Objectives Understand molecular dynamics (MD)

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads

A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads Sardar Anisul Haque Marc Moreno Maza Ning Xie University of Western Ontario, Canada IBM CASCON, November 4, 2014 ardar

More information

Accelerating CFD with Graphics Hardware

Accelerating CFD with Graphics Hardware Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Sequence analysis Pairwise sequence alignment

Sequence analysis Pairwise sequence alignment UMF11 Introduction to bioinformatics, 25 Sequence analysis Pairwise sequence alignment 1. Sequence alignment Lecturer: Marina lexandersson 12 September, 25 here are two types of sequence alignments, global

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Auto-tunable GPU BLAS

Auto-tunable GPU BLAS Auto-tunable GPU BLAS Jarle Erdal Steinsland Master of Science in Computer Science Submission date: June 2011 Supervisor: Anne Cathrine Elster, IDI Norwegian University of Science and Technology Department

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

Harnessing Associative Computing for Sequence Alignment with Parallel Accelerators

Harnessing Associative Computing for Sequence Alignment with Parallel Accelerators Harnessing Associative Computing for Sequence Alignment with Parallel Accelerators Shannon I. Steinfadt Doctoral Research Showcase III Room 17 A / B 4:00-4:15 International Conference for High Performance

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs. Greg Blanton John T. Lubia GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

More information

CUDA (Compute Unified Device Architecture)

CUDA (Compute Unified Device Architecture) CUDA (Compute Unified Device Architecture) Mike Bailey History of GPU Performance vs. CPU Performance GFLOPS Source: NVIDIA G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX G70 = GeForce 7800 GTX NV40 = GeForce

More information

Overview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size

Overview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size Overview Videos are everywhere But can take up large amounts of resources Disk space Memory Network bandwidth Exploit redundancy to reduce file size Spatial Temporal General lossless compression Huffman

More information

Notes on Dynamic-Programming Sequence Alignment

Notes on Dynamic-Programming Sequence Alignment Notes on Dynamic-Programming Sequence Alignment Introduction. Following its introduction by Needleman and Wunsch (1970), dynamic programming has become the method of choice for rigorous alignment of DNA

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information

Hardware Accelerator for Biological Sequence Alignment using Coreworks Processing Engine

Hardware Accelerator for Biological Sequence Alignment using Coreworks Processing Engine Hardware Accelerator for Biological Sequence Alignment using Coreworks Processing Engine José Cabrita, Gilberto Rodrigues, Paulo Flores INESC-ID / IST, Technical University of Lisbon jpmcabrita@gmail.com,

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Performance Analysis of Parallelized Bioinformatics Applications

Performance Analysis of Parallelized Bioinformatics Applications Asian Journal of Computer Science and Technology ISSN: 2249-0701 Vol.7 No.2, 2018, pp. 70-74 The Research Publication, www.trp.org.in Dhruv Chander Pant 1 and OP Gupta 2 1 Research Scholar, I. K. Gujral

More information

GRAPHICS PROCESSING UNITS

GRAPHICS PROCESSING UNITS GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

More information

Sequence Alignment. part 2

Sequence Alignment. part 2 Sequence Alignment part 2 Dynamic programming with more realistic scoring scheme Using the same initial sequences, we ll look at a dynamic programming example with a scoring scheme that selects for matches

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Directive-based General-Purpose GPU Programming. Tian Yi David Han

Directive-based General-Purpose GPU Programming. Tian Yi David Han Directive-based General-Purpose GPU Programming by Tian Yi David Han A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Graduate Department of Electrical

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux

More information

Lecture 10. Sequence alignments

Lecture 10. Sequence alignments Lecture 10 Sequence alignments Alignment algorithms: Overview Given a scoring system, we need to have an algorithm for finding an optimal alignment for a pair of sequences. We want to maximize the score

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations

More information

Dynamic Programming Part I: Examples. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, / 77

Dynamic Programming Part I: Examples. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, / 77 Dynamic Programming Part I: Examples Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 1 / 77 Dynamic Programming Recall: the Change Problem Other problems: Manhattan

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Revisiting the Speed-versus-Sensitivity Tradeoff in Pairwise Sequence Search

Revisiting the Speed-versus-Sensitivity Tradeoff in Pairwise Sequence Search Revisiting the Speed-versus-Sensitivity Tradeoff in Pairwise Sequence Search Ashwin M. Aji and Wu-chun Feng The Synergy Laboratory Department of Computer Science Virginia Tech {aaji,feng}@cs.vt.edu Abstract

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12

More information

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations: Lecture Overview Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

Accelerating Smith Waterman (SW) Algorithm on Altera Cyclone II Field Programmable Gate Array

Accelerating Smith Waterman (SW) Algorithm on Altera Cyclone II Field Programmable Gate Array Accelerating Smith Waterman (SW) Algorithm on Altera yclone II Field Programmable Gate Array NUR DALILAH AHMAD SABRI, NUR FARAH AIN SALIMAN, SYED ABDUL MUALIB AL JUNID, ABDUL KARIMI HALIM Faculty Electrical

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

High Performance Technique for Database Applications Using a Hybrid GPU/CPU Platform

High Performance Technique for Database Applications Using a Hybrid GPU/CPU Platform High Performance Technique for Database Applications Using a Hybrid GPU/CPU Platform M. Affan Zidan, Talal Bonny, and Khaled N. Salama Electrical Engineering Program King Abdullah University of Science

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

A GPU Algorithm for Comparing Nucleotide Histograms

A GPU Algorithm for Comparing Nucleotide Histograms A GPU Algorithm for Comparing Nucleotide Histograms Adrienne Breland Harpreet Singh Omid Tutakhil Mike Needham Dickson Luong Grant Hennig Roger Hoang Torborn Loken Sergiu M. Dascalu Frederick C. Harris,

More information

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units Abstract A very popular discipline in bioinformatics is Next-Generation Sequencing (NGS) or DNA sequencing. It specifies

More information

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

More information

Sequencee Analysis Algorithms for Bioinformatics Applications

Sequencee Analysis Algorithms for Bioinformatics Applications Zagazig University Faculty of Engineering Computers and Systems Engineering Department Sequencee Analysis Algorithms for Bioinformatics Applications By Mohamed Al sayed Mohamed Ali Issa B.Sc in Computers

More information

Unrolling parallel loops

Unrolling parallel loops Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:

More information

Darwin: A Genomic Co-processor gives up to 15,000X speedup on long read assembly (To appear in ASPLOS 2018)

Darwin: A Genomic Co-processor gives up to 15,000X speedup on long read assembly (To appear in ASPLOS 2018) Darwin: A Genomic Co-processor gives up to 15,000X speedup on long read assembly (To appear in ASPLOS 2018) Yatish Turakhia EE PhD candidate Stanford University Prof. Bill Dally (Electrical Engineering

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

A Scalable Coprocessor for Bioinformatic Sequence Alignments

A Scalable Coprocessor for Bioinformatic Sequence Alignments A Scalable Coprocessor for Bioinformatic Sequence Alignments Scott F. Smith Department of Electrical and Computer Engineering Boise State University Boise, ID, U.S.A. Abstract A hardware coprocessor for

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

Offloading Java to Graphics Processors

Offloading Java to Graphics Processors Offloading Java to Graphics Processors Peter Calvert (prc33@cam.ac.uk) University of Cambridge, Computer Laboratory Abstract Massively-parallel graphics processors have the potential to offer high performance

More information

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy Jan Verschelde joint with Genady Yoffe and Xiangcheng Yu University of Illinois at Chicago Department of Mathematics, Statistics,

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

SEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi

SEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi SEASHORE SARUMAN Summary 1 / 24 SEASHORE / SARUMAN Short Read Matching using GPU Programming Tobias Jakobi Center for Biotechnology (CeBiTec) Bioinformatics Resource Facility (BRF) Bielefeld University

More information