hsa-ds: A Heterogeneous Suffix Array Construction Using D-Critical Substrings for Burrows-Wheeler Transform


146 Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'16

hsa-ds: A Heterogeneous Suffix Array Construction Using D-Critical Substrings for Burrows-Wheeler Transform

Yu-Cheng Liao, Yarsun Hsu
Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan 30013, R.O.C.

Abstract

The Burrows-Wheeler Transform (BWT) is widely used in data compression and bioinformatics. Mathematically, the BWT can be derived from a constructed suffix array. In this work, we analyze current parallel implementations of suffix array construction algorithms (SACAs) and introduce the first heterogeneous implementation of the SA-DS algorithm on the GPU. To achieve better performance, we also optimize the radix sort on the GPU for our platform. As a result, the optimized radix sort on the GPU significantly decreases processing time compared with the latest Thrust library when sorting millions of keys. Our heterogeneous SA-DS demonstrates up to a 4x speedup over the sequential version of SA-DS and up to a 2x performance gain over the parallel BWT provided by the CUDPP library.

Keywords: GPGPU, CUDA, Burrows-Wheeler Transform, Compression, SA-DS

1 Introduction

1.1 Motivation

The Burrows-Wheeler Transform (BWT) [1] is an algorithm used in data compression tools such as bzip2 [2]. Mathematically, the transform can be obtained from a constructed suffix array [3] in linear time [1]. The many previous studies on optimizing suffix array construction algorithms (SACAs) in both time and space therefore also greatly improve the BWT. On the hardware side, the prevalence of flexible, programmable, and inexpensive general-purpose graphics processing units (GPGPUs) has opened a new era of SIMD programming, and heterogeneous architectures with GPGPUs have been widely adopted in high-performance computing. Reviewing recent works [4] [5] [6], we find that these parallel SACAs all adopt a well-known linear-time SACA called the skew algorithm.
The best-known linear-time sequential SACAs are the skew algorithm, the KA algorithm [7], and Ge Nong et al.'s SA-IS and SA-DS [8]. The skew algorithm has the worst time/space performance among these, so it is interesting to compare the parallel skew algorithm with a parallel version of Ge Nong et al.'s work. Furthermore, concerning appropriate BWT block sizes for efficient compression, the conventional bzip2 selects block sizes from 100K to 900K characters for the BWT [2] to obtain a high compression rate along with moderate transforming time. We are therefore motivated to find a better implementation for block sizes between 1K and 2M characters.

1.2 Goal and Contribution

The objective of this work is to present a heterogeneous version of the SA-DS algorithm accelerated by an NVIDIA GPU using the CUDA programming model. We batch the memory transactions between host and device to obviate the data-transfer overhead of the heterogeneous platform. The heterogeneous SA-DS is also appended with a kernel that computes the final encoded string of the Burrows-Wheeler Transform. An additional contribution is a custom radix sort on the GPU based on the Thrust library [9]. Since our heterogeneous platform is equipped with a Tesla K20c graphics card, rather than calling the existing Thrust primitives we simplify the redundant sorting procedures, optimize the kernels, and improve the load balance on the device to achieve higher throughput. After these optimizations, our radix sort kernels outperform the Thrust library's by 6%. Compared with the latest Thrust library on full character and integer sorting, our method shows up to a 77% decrease in time for small sequences of thousands of elements, and up to a 23% decrease for large sequences of millions of elements.
Our heterogeneous SA-DS demonstrates up to a 4x speedup over the C++ sequential version and is up to 2x faster than the BWT implementation in the CUDPP library [6] for block sizes ranging from 1K to 2M characters.

1.3 Organization

This paper is organized as follows. Section 2 discusses related work. Section 3 introduces the Burrows-Wheeler Transform and suffix array construction algorithms. Section 4 states the design and implementation. The performance evaluation is presented in Section 5. Finally, Section 6 concludes.

2 Related Work

As part of a greater ambition to research the feasibility of lossless data compression on the GPU, R. A. Patel et al. provide an approach to suffix array construction based on merge sort [10]. They first use a bitonic sort to sort eight suffixes

within a thread on the GPU. Each thread fetches four characters of each suffix per comparison. If two suffixes residing in one thread share the same prefix and cannot yet be ordered, the thread fetches the next four characters on the fly from global memory. Once all threads have sorted their suffixes, the threads in a block work cooperatively to merge the partitioned suffix arrays into one complete suffix array. They report severe performance degradation while merging large sequences, caused by branch divergence and frequent global memory accesses. Furthermore, the approach cannot exploit the relationships between suffixes. Their GPU implementation is reported to be 3x slower than the single-threaded CPU implementation by Seward [11].

In 2013, M. Deo et al. brought the parallel DC3 algorithm to the GPU. Their work is implemented on a discrete GPU and an APU respectively using OpenCL. It is inspired by pDC3 but is considered the first implementation on a modern GPU architecture. They resolve several issues encountered while adapting the original pDC3 from distributed systems to a heterogeneous platform, and they optimize the performance of pDC3 on the GPU. Their paper also briefly explains that one can safely set the BWT aside and discuss only the suffix array and its implementation, since the BWT can be derived trivially in one parallel pass alongside computing the SA.

In 2014, the CUDPP library added a new primitive to compute the suffix array of a string. Like M. Deo et al., it uses the recursive skew algorithm for suffix array construction on the GPU with CUDA. The primitive's sorting procedure is analogous to M. Deo et al.'s work, but in the final merging step it adopts a different merging technique, called Merge Path, presented by O. Green et al. [12].
According to the authors' notes, their parallel skew algorithm is 1.35x faster than the previously fastest GPU implementation. Their work is the latest implementation we compare against. In conclusion, to the best of our knowledge, no implementation of the other linear-time suffix array construction algorithms, such as the KA, SA-IS, and SA-DS algorithms, has yet utilized the computational power of GPGPUs.

3 Background

3.1 Burrows-Wheeler Transform

The Burrows-Wheeler Transform (BWT) was discovered by Wheeler in 1983. The BWT first produces a list, also called a block, of strings consisting of all cyclic rotations of the original string. The block is then sorted lexicographically, and the last character of each rotation forms the permuted string. $ is the terminal symbol denoting the end of the current string and is the lexicographically smallest character. The BWT aims at gathering identical characters together, and the transformed string must be reversible. For the serial implementation of the BWT, Burrows and Wheeler suggested performing a radix sort on the first and second characters of every rotation to obtain a preliminary order [1]. By their observation, most rotations are fully sorted within this preliminary order; it is then refined by a quick sort to distinguish rotations sharing the same prefix.

3.2 BWT and Suffix Array

We first review the content of a suffix array (SA). Consider a size-n string S = s_1 s_2 ... s_{n-1} $, where $ is the terminal symbol. Let S_i denote the suffix of S ranging from the i-th character to the ending character $. The suffix array stores the starting indexes i representing S_i for all suffixes in lexicographical order. That is, if the entry SA[j] is i, then S_i is the j-th smallest suffix of S, and hence for all k in [1, j]: S_{SA[k]} <= S_i. From a given SA, the BWT result can be obtained simply as the equation BWT[i] = S[SA[i] - 1]. The process is trivial, and the BWT can be derived in one pass, in parallel for each entry.
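The relation BWT[i] = S[SA[i] - 1] can be checked against the rotation-based definition with a minimal Python sketch (didactic only; the paper's implementation uses CUDA kernels, and the naive suffix sort here stands in for a linear-time SACA):

```python
def suffix_array(s):
    # Naive O(n^2 log n) construction; a linear-time SACA replaces this step.
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt_from_sa(s):
    sa = suffix_array(s)
    # BWT[i] = S[SA[i] - 1]; Python's index -1 wraps to the terminal '$'.
    return "".join(s[i - 1] for i in sa)

def bwt_from_rotations(s):
    # Reference definition: sort all cyclic rotations, take the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

s = "banana$"
assert bwt_from_sa(s) == bwt_from_rotations(s)
print(bwt_from_sa(s))  # -> annb$aa
```

Because the terminal $ is unique and smallest, sorting rotations and sorting suffixes give the same order, which is why the two derivations agree.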
This relationship allows us to set the BWT aside and focus on SA construction.

3.3 Suffix Array Construction Algorithms

Given the property above, it is safe to study only the implementations of SACAs and their parallel counterparts on GPGPUs. Linear-time SACAs easily outperform the original implementation on large-scale strings. The up-to-date, well-known linear-time SACAs follow two genres of framework: the skew algorithm and two-stage induction.

3.3.1 Skew Algorithm

The skew algorithm recursively constructs the SA in linear time. It follows the pattern proposed for suffix tree construction by Farach et al. in 1997 [13]. The following is a brief review of the skew algorithm, also called the DC3 algorithm, by J. Kärkkäinen et al. The skew algorithm consists of three steps. First, given a string S, the problem is reduced by excluding the suffixes starting at positions i mod 3 = 0, so the new problem size is 2/3 of the original input. To construct the SA of this sample, sorting is performed by scanning the first three characters of each suffix in the reduced problem and renaming the sorted suffixes with their ranks. If all the names are distinct, step one is finished and we obtain the sample SA; otherwise, the skew algorithm is applied recursively to the reduced array. Secondly, every remaining suffix is the concatenation of its starting character and the suffix starting at the next position, so the remaining SA is obtained by radix sorting on the first character followed by the entry in the constructed SA that represents the following suffix. The last step merges the two SAs: the skew algorithm compares the lexicographical order of each suffix in the two SAs and places them into the final complete suffix array.
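The sampling-and-renaming of step one can be illustrated with a small Python sketch (a hypothetical helper for exposition, not the DC3 implementation; it ranks the sampled suffixes by their leading three characters and checks whether recursion is needed):

```python
def sample_ranks(s):
    # Step one of the skew algorithm: keep suffixes at positions i mod 3 != 0.
    sample = [i for i in range(len(s)) if i % 3 != 0]      # ~2/3 of positions
    triples = {i: s[i:i + 3] for i in sample}              # first 3 characters
    order = sorted(sample, key=lambda i: triples[i])       # sort the sample
    ranks, r = {}, 0
    for k, i in enumerate(order):                          # rename by rank
        if k > 0 and triples[i] != triples[order[k - 1]]:
            r += 1
        ranks[i] = r
    # If every name is distinct the sample SA is done; otherwise the
    # skew algorithm recurses on the renamed (reduced) string.
    done = len(set(ranks.values())) == len(sample)
    return order, ranks, done

order, ranks, done = sample_ranks("banana$")
print(order, done)
```

For "banana$" every 3-character name is distinct, so no recursion would be needed on this toy input.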

J. Kärkkäinen et al.'s construction of two suffix arrays with asymmetric lengths provides the simplicity of step 3. Their implementation of the skew algorithm is succinct, and many researchers exploit the parallelism of the skew algorithm based on their scheme [4] [5] [6]. The skew algorithm is simple; however, it can only shrink the problem by one-third of the input size in each recursion.

3.3.2 Two-Stage Induction

Recent two-stage induction algorithms are variants of the SACA proposed by H. Itoh [14]. Since G. Nong et al. are dedicated to ameliorating the intrinsic sorting bottleneck of the S-distance-list method in the KA algorithm [7], we briefly discuss their algorithm in the following. SA-IS and SA-DS are twin algorithms built on the same framework. SA-IS contains more sequential structure, so we concentrate on the implementation of the SA-DS algorithm and thoroughly analyze its potential parallelism. SA-DS is based on the KA algorithm. In addition to classifying suffixes into two types, we further separate the leftmost S-type (LMS) suffixes among the type-S suffixes. These LMS characters are used to locate the intervals of LMS-substrings. As a result, the original S is replaced by a shorter string comprised only of LMS-substrings. The input problem is simpler than in the KA algorithm because runs of consecutive type-S suffixes are curtailed. Despite the reduced problem size, the substrings in it are variable-length. Nong et al. therefore propose a new approach based on radix sort and fixed-length substrings called d-critical substrings. A character is a d-critical character if and only if it is an LMS-character, or the character d positions after it is d-critical and no character between them is d-critical, where d >= 2. The suffix starting from such a character is called a d-critical suffix. SA-DS constructs the SA using a framework consisting of three steps.
First, the problem is reduced to an array containing pointers to all the d-critical characters. The distance between any two neighboring d-critical characters is proven to lie in [2, d+1]. Next, we perform a radix sort on the leading d+2 characters of each d-critical suffix, after these suffixes are sorted by their types. If all the resulting names are unique, step one is accomplished and we obtain the suffix array of the d-critical substrings; otherwise, the SA-DS algorithm is applied recursively. Secondly, all suffixes in S are bucketed by their first character, and a new array is initialized for storing the final SA. We assign the buckets in order to the SA array and record the starting and ending position of each bucket. The algorithm puts the sorted S-type suffixes into the correct entries of the final SA. Lastly, SA-DS incorporates one more induction process than the KA algorithm for positioning the other type-S suffixes and the remaining type-L suffixes. For inducing the positions of type-L suffixes, the procedure is described in the KA algorithm [7]. For the remaining type-S suffixes, for each entry SA[i] encountered during the scan, if the suffix S_{SA[i]-1} is S-type, it is moved to the recorded ending position of its bucket in SA; the LMS-suffix originally residing at the end of that bucket is swapped out. The SA-DS algorithm clearly performs better on large-scale strings owing to the high reduction rate of the recursive sorting step. Between the KA algorithm and SA-DS, SA-DS provides a framework that simplifies the sorting step and is suitable for parallelization on the GPU.

Fig. 1: Performance of the two portions of the SA-DS algorithm; the overall performance is clearly bounded by the parallelizable part.
Now we extract the parallelizable parts of the SA-DS algorithm with the intent of mapping them onto the GPU. Classification of suffix types is independent per suffix: the type of a suffix is determined by itself and its next suffix, so the comparison for all suffixes of the input string S can be done in parallel on the GPU. LMS-characters are likewise distinguishable in parallel, by simply checking the previous suffix's type for each type-S suffix. Since the blocks delimited by LMS-substrings are disjoint, the d-critical substrings within each LMS-substring can be located in parallel. Fast fixed-length radix sorting on the GPU already exists. Assigning LMS-suffixes is also parallelizable, since we only have to maintain the order within each bucket. Lastly, the induction step carries dependencies across iterations, so this part remains sequential and cannot be parallelized. Fig. 1 shows the performance of the different portions. Clearly, the overall performance of the SA-DS algorithm is bounded by the parallelizable part over the whole range of input sizes. From the evaluation, more than four-fifths of the execution time is parallelizable; according to Amdahl's law [15], it is therefore possible to improve performance by up to 5x.

4 Design and Implementation

In this section, we describe the method used to parallelize the original SA-DS algorithm. Table 1 presents pseudo code for the SA-DS algorithm, with the steps that are packed into GPU kernels marked.
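The Amdahl's-law bound cited above can be checked with a one-line calculation (a sketch of the standard formula, not a measurement from the paper):

```python
def max_speedup(parallel_fraction):
    # Amdahl's law with unlimited processors: speedup <= 1 / (1 - p),
    # where p is the parallelizable fraction of the execution time.
    return 1.0 / (1.0 - parallel_fraction)

# More than four-fifths of the execution time is parallelizable,
# so the attainable speedup is bounded near 5x.
print(round(max_speedup(0.8), 6))
```

This matches the 5x bound stated above, and the measured 3.7x average speedup reported later sits plausibly below it.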

Table 1: The SA-DS algorithm pseudo code

SA-DS(S, SA)                                              kernel
  // S is the input string
  // SA is the output suffix array for S
  1 Find d-critical substrings in S                        (1-3)
  2 Reduce the original problem into a shortened P1        (4-6)
  3 Radix sort the d-critical substrings in P1             (7-10)
  4 Name each d-critical substring by its rank to get S_DC (11-14)
  5 if (all unique names) SA_DC = S_DC
  6 else SA-DS(S_DC, SA_DC)   // recursion
  7 Induce SA from SA_DC, step 1                           (15-19)
  8 Induce SA from SA_DC, steps 2, 3   // on CPU
  9 end

4.1 Parallelizing SA-DS

Table 1 shows six primary sections culled from the sequential SA-DS algorithm. We do not describe the parallel radix sort process, since it is already explained in detail in D. G. Merrill et al.'s work [16].

4.1.1 Locating D-Critical Substrings

This stage includes three kernels responsible for classifying character types, identifying leftmost S-type characters, and assigning additional d-critical substrings between any two adjacent LMSs at the given distance d. To classify character types, each thread in the kernel is in charge of one particular character of the input string, appointed by its thread index and block index. Every thread compares its assigned character with the next character. The classification is done if the next character is lexicographically greater than the current character, which means the associated suffix is lexicographically smaller than the next suffix. If the character equals the next one, the comparison between the next character and the one after it is taken recursively. The kernel identifies leftmost S-type characters by fetching the type of each character and comparing it with the preceding character's type: if the fetched type is S-type and the preceding type is L-type, the character is a leftmost S-type character.
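The logic of the first two kernels (type classification and LMS identification) can be sketched as a sequential Python stand-in; a single right-to-left scan is the sequential equivalent of the per-thread forward comparisons described above (this is a didactic sketch, not the paper's CUDA code):

```python
def classify(s):
    # '$' is the terminal symbol and is, by definition, S-type.
    n = len(s)
    t = ["S"] * n
    # A character is L-type if it is greater than its successor, or equal
    # to it and the successor is L-type (the tie falls through to the
    # next comparison, mirroring the kernel's recursive lookahead).
    for i in range(n - 2, -1, -1):
        if s[i] > s[i + 1] or (s[i] == s[i + 1] and t[i + 1] == "L"):
            t[i] = "L"
    # A leftmost S-type (LMS) character is S-type with an L-type predecessor.
    lms = [i for i in range(1, n) if t[i] == "S" and t[i - 1] == "L"]
    return "".join(t), lms

print(classify("banana$"))  # -> ('LSLSLLS', [1, 3, 6])
```

Each position's type depends only on characters to its right, which is what makes the per-character comparison data-parallel on the GPU.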
Finally, the consecutive d-critical substrings inside the independent blocks bounded by neighboring leftmost S-type characters are marked in parallel using multiple threads. After these kernels, there is an array T storing the character types, and a boolean array C_boolean of the same size as the input string S storing 1 or 0; any entry containing 1 indicates the starting position of a d-critical substring.

4.1.2 Shrinking the Problem

The d-critical substrings can be assembled into a shortened array, called P1, storing the starting indexes of the d-critical substrings. From the last section we know that if an entry of C_boolean holds a 1, the corresponding index is the starting position of a d-critical substring. The kernels first examine each entry of C_boolean, exclude the entries containing 0, and aggregate the indexes of the entries containing 1 into the abbreviated array P1.

4.1.3 Naming and Constructing the SA

The sorted d-critical substrings are stored lexicographically in global memory. Initially, each thread in the kernel is responsible for one particular entry of the sorted P1. It loads the first d+2 characters at its assigned index and the first d+2 characters at the previous index, and determines whether these two sets of d+2 characters differ. If they differ, the thread stores a 1 in the corresponding entry of a temporary array N; otherwise, it stores 0. Next, we use the same three steps described in section 4.1.2: instead of distributing the keys, the third kernel distributes the scanned ranks as values to each entry of N. The following kernel scatters the rank of each d-critical substring, according to the index in P1, into the final suffix array.

4.1.4 Inducing the SA, Step 1

Among the sorted d-critical substrings, we only use the substrings starting with leftmost S-type characters. The dependency among these suffixes merely resides within an individual bucket.
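The naming step described above (adjacent comparison producing 1/0 flags, followed by a scan) can be sketched sequentially in Python; the prefix sum below is a stand-in for the parallel scan kernel, and the function names are illustrative, not the paper's:

```python
def name_by_rank(sorted_substrings):
    # Flag each sorted substring that differs from its predecessor;
    # equal substrings get flag 0 and thus share a name.
    n = len(sorted_substrings)
    flags = [1 if i > 0 and sorted_substrings[i] != sorted_substrings[i - 1]
             else 0
             for i in range(n)]
    # Inclusive prefix sum of the flags gives each substring's rank (name).
    ranks, acc = [], 0
    for f in flags:
        acc += f
        ranks.append(acc)
    # If every name is unique, no recursion is required.
    all_unique = ranks == list(range(n))
    return ranks, all_unique

print(name_by_rank(["ab", "ab", "ba", "na"]))  # duplicate names -> recurse
```

The two duplicate substrings receive the same name, which is exactly the condition that triggers the recursive call in line 6 of Table 1.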
To extract LMS-suffixes from the sorted d-critical suffixes, we use the same approach explained in section 4.1.1: a kernel marks the LMS-suffixes by inspecting the type of the target character and that of its preceding character. The following three kernels then remap the marked LMS-suffixes into a shortened array. The subsequent kernels collect the number of names/characters in each bucket, then scan these counts once to acquire the index offsets by accumulating the bucket sizes. The LMS-suffixes are placed at the tail of each bucket. After all the necessary variables are calculated, a kernel launches threads, seeded with their bucket offsets, to fill the LMS-suffixes into the correct entries of the suffix array for inducing.

4.2 Optimization

In the heterogeneous SA-DS, the sorting stage using radix sort on the GPU consumes most of the execution time. Rather than using the existing radix sort provided by the Thrust library, we rewrite the three steps of the radix sort, customized for our Tesla K20c GPU. We fix the digit decoded in one iteration to four bits and implement a simple dynamic terminal policy on the host that stops the radix sort kernels once the keys are sorted. The termination point is determined by how many four-bit sorting passes are required to cover the first non-zero bit from the MSB over all keys: we find the position of the first non-zero bit, calculate the length in bits from that position down to the LSB, and divide the length by four. This terminal policy prevents a plethora of useless kernel calls. The radix sort kernels in the Thrust library launch blocks and threads according to a static profile created by the library itself.
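The terminal policy can be illustrated with a sequential LSD radix sort sketch in Python (didactic only; the actual implementation runs upsweep/scan/downsweep CUDA kernels, and the bucket pass here stands in for the counting kernels):

```python
def radix_sort_4bit(keys):
    # LSD radix sort on 4-bit digits (radix 16), with the dynamic terminal
    # policy: run only enough passes to cover the highest set bit among
    # all keys, instead of a fixed pass count for the full key width.
    if not keys:
        return keys
    passes = (max(keys).bit_length() + 3) // 4   # ceil(bits / 4)
    for p in range(passes):
        shift = 4 * p
        buckets = [[] for _ in range(16)]
        for k in keys:                            # stable bucket pass
            buckets[(k >> shift) & 0xF].append(k)
        keys = [k for b in buckets for k in b]
    return keys

data = [170, 45, 75, 90, 802, 24, 2, 66]
assert radix_sort_4bit(data) == sorted(data)
```

With 10-bit keys such as these, only three 4-bit passes run instead of the eight a fixed 32-bit schedule would issue, which is the saving the terminal policy targets.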

Table 2: Hardware specifications
  CPU  Intel Xeon 2.4 GHz, 4C/8T          x2
  RAM  4 GB DDR3-1066 MHz                 x6
  GPU  Nvidia Tesla K20c                  x1
  HDD  WD 2 TB, 7200 RPM                  x2 (RAID 1)

Fig. 2: Performance and speedup of sorting integers.

Fig. 3: Performance of the heterogeneous SA-DS; the parallelized portion gains improvement.

However, the load balance within each block across different input array sizes should also be considered. In our radix sort, we configure the number of blocks according to the size of the current input array. With this strategy, our radix sort achieves better response times for variable-sized input arrays.

5 Evaluation

We investigate three implementations: the original sequential SA-DS algorithm, the skew algorithm implemented in the CUDPP library, and our heterogeneous SA-DS using CUDA. These implementations are run and compared on the same hardware platform, listed in Table 2. The evaluations focus on the execution time of a set of kernels, including memory transfers between host and device.

5.1 Radix Sort

We evaluate the execution time for sorting characters with the three kernels, including kernel configuration and execution time but excluding data-transfer time between host and device. The input keys are already transferred to the GPU's global memory, since the data-transfer overhead is the same for both implementations. We benchmark the kernels with the nvprof profiling tool; each result is the average of one hundred executions. The outcome shows that our radix sort is 1.31x, 1.37x, and 1.61x faster than the Thrust library's in the upsweep, top-level-scan, and downsweep kernels respectively when processing one million characters. We also test the scalability of the complete radix sort. Fig. 2 shows the performance of sorting integers with varying input-array sizes.

In the Thrust library, the radix sort encounters a performance drop for problem sizes of around 64K to 2M keys. Compared with the Thrust library, our customized radix sort achieves a 4.3x speedup at 64K keys and 2x on average with character keys, and 2.3x at 4K keys and 1.7x on average with integer keys.

5.2 Heterogeneous SA-DS

We first analyze the performance curves of the parallelized portion and the intact sequential portion. The parallelized portion accounts for any additional overheads of memory transfer between host and device on kernel launch. Fig. 3 shows the performance of each part of the heterogeneous SA-DS. As we can see, the parallelized portion gains a vast improvement over the parallelizable part depicted in Fig. 1. With the utilization of the GPU, the overall performance of SA-DS no longer suffers from the bound set by the sorting steps. The speedup over the original SA-DS averages 3.7x once the input problem is large enough.

We choose four datasets downloaded from the Internet with different properties. The content of a text file directly impacts the performance of SA construction. Although these datasets do not cover every condition of sorting suffixes, the four representative datasets provide comprehensive measurements for the normal usage of SA construction. The enwiki dataset is downloaded from the Wikipedia website [17]; it is dumped from the English Wikipedia. The Linux kernel tarball is the latest Linux-4.1 kernel and contains the source code of the Linux kernel; its content can be regarded as random characters. The enwiki abstract dataset differs from the first case: it contains the abstracts of the English Wikipedia as millions of lines. Lastly, we generate strings with sizes varying from 8K to 32M characters. Fig. 4 shows that the speedup of the hsa-ds rises from a slowdown at small inputs to a steady 3.7x at large inputs.
For random characters, our heterogeneous SA-DS benefits from the parallel sorting stage and outperforms the sequential SA-DS at most problem sizes. However, the inducing stage requires more iterations to construct the complete suffix array; consequently, the CPU overhead of constructing a long suffix array degrades performance for large input sequences.

Fig. 4: Speedup of the hsa-ds for the four datasets (enwiki, kernel, abstract, random).

Fig. 5: Comparison between the SA-DS, DC3 on the GPU, and the hsa-ds.

5.3 Comparison with the CUDPP Library

The CUDPP library, utilizing NVIDIA GPUs with the CUDA programming model, is considered the fastest implementation of the parallel DC3 algorithm. The result is shown in Fig. 5. The readme file of the CUDPP library states that its BWT cannot process strings larger than 10M characters, but we test only up to 2M characters to generate the curve, since we are interested in performance for strings smaller than 2M characters. The figure shows that our hsa-ds has the best performance for string sizes smaller than 2M characters.

6 Conclusion

Our hsa-ds improves the performance of the original SA-DS by parallelizing its slowest portion. A heterogeneous platform using both GPU and CPU is the best choice for our algorithm, since the sequential portion must be performed on a powerful CPU. The customized radix sort further optimizes the workload distributed to each processing element, and incorporates a dynamic terminal strategy for keys of different lengths. As a result, our customized radix sort on the GPU gains up to 2.3x and 4x speedups for integer keys and character keys respectively, compared to the Thrust library. The hsa-ds algorithm obviates the performance bound incurred by the sequential sorting overhead, gaining up to a 3.7x speedup over the sequential SA-DS and up to a 2x speedup over the parallel skew-algorithm-based BWT. The hsa-ds has the best performance for block sizes ranging from 1K to 2M characters.
7 Acknowledgement

The authors thank the support from MOST under grant E .

References

[1] M. Burrows and D. J. Wheeler, "A block-sorting lossless data compression algorithm," 1994.
[2] bzip2-1.0.6, 2015. [Online].
[3] U. Manber and G. Myers, "Suffix arrays: a new method for on-line string searches," SIAM Journal on Computing, vol. 22, no. 5, 1993.
[4] M. Deo and S. Keely, "Parallel suffix array and least common prefix for the GPU," in ACM SIGPLAN Notices, vol. 48, no. 8. ACM, 2013.
[5] F. Kulla and P. Sanders, "Scalable parallel suffix array construction," Parallel Computing, vol. 33, no. 9, 2007.
[6] CUDPP-2.2, 2014. [Online].
[7] P. Ko and S. Aluru, "Space efficient linear time construction of suffix arrays," in Combinatorial Pattern Matching. Springer, 2003.
[8] G. Nong, S. Zhang, and W. H. Chan, "Two efficient algorithms for linear time suffix array construction," IEEE Transactions on Computers, vol. 60, no. 10, 2011.
[9] J. Hoberock and N. Bell, "Thrust: A parallel template library," 2010. [Online].
[10] R. Patel, Y. Zhang, J. Mak, A. Davidson, J. D. Owens, et al., "Parallel lossless data compression on the GPU." IEEE, 2012.
[11] J. Seward, "On the performance of BWT sorting algorithms," in Data Compression Conference, 2000. Proceedings. DCC 2000. IEEE, 2000.
[12] O. Green, R. McColl, and D. A. Bader, "GPU merge path: a GPU merging algorithm," in Proceedings of the 26th ACM International Conference on Supercomputing. ACM, 2012.
[13] M. Farach, "Optimal suffix tree construction with large alphabets," in Foundations of Computer Science, 1997. Proceedings., 38th Annual Symposium on. IEEE, 1997.
[14] H. Itoh and H. Tanaka, "An efficient method for in memory construction of suffix arrays," in String Processing and Information Retrieval Symposium, 1999 and International Workshop on Groupware. IEEE, 1999.
[15] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. ACM, 1967.
[16] D. G. Merrill and A. S. Grimshaw, "Revisiting sorting for GPGPU stream architectures," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2010.
[17] Enwiki, 2015. [Online]. Available: enwiki/


Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

Accelerating Lossless Data Compression with GPUs

Accelerating Lossless Data Compression with GPUs Accelerating Lossless Data Compression with GPUs R.L. Cloud M.L. Curry H.L. Ward A. Skjellum P. Bangalore arxiv:1107.1525v1 [cs.it] 21 Jun 2011 October 22, 2018 Abstract Huffman compression is a statistical,

More information

arxiv: v1 [cs.dc] 24 Feb 2010

arxiv: v1 [cs.dc] 24 Feb 2010 Deterministic Sample Sort For GPUs arxiv:1002.4464v1 [cs.dc] 24 Feb 2010 Frank Dehne School of Computer Science Carleton University Ottawa, Canada K1S 5B6 frank@dehne.net http://www.dehne.net February

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Prefix Scan and Minimum Spanning Tree with OpenCL

Prefix Scan and Minimum Spanning Tree with OpenCL Prefix Scan and Minimum Spanning Tree with OpenCL U. VIRGINIA DEPT. OF COMP. SCI TECH. REPORT CS-2013-02 Yixin Sun and Kevin Skadron Dept. of Computer Science, University of Virginia ys3kz@virginia.edu,

More information

A Comprehensive Study on the Performance of Implicit LS-DYNA

A Comprehensive Study on the Performance of Implicit LS-DYNA 12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four

More information

Engineering a Lightweight External Memory Su x Array Construction Algorithm

Engineering a Lightweight External Memory Su x Array Construction Algorithm Engineering a Lightweight External Memory Su x Array Construction Algorithm Juha Kärkkäinen, Dominik Kempa Department of Computer Science, University of Helsinki, Finland {Juha.Karkkainen Dominik.Kempa}@cs.helsinki.fi

More information

Efficient Stream Reduction on the GPU

Efficient Stream Reduction on the GPU Efficient Stream Reduction on the GPU David Roger Grenoble University Email: droger@inrialpes.fr Ulf Assarsson Chalmers University of Technology Email: uffe@chalmers.se Nicolas Holzschuch Cornell University

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

GPU Implementation of a Multiobjective Search Algorithm

GPU Implementation of a Multiobjective Search Algorithm Department Informatik Technical Reports / ISSN 29-58 Steffen Limmer, Dietmar Fey, Johannes Jahn GPU Implementation of a Multiobjective Search Algorithm Technical Report CS-2-3 April 2 Please cite as: Steffen

More information

Massively Parallel Computations of the LZ-complexity of Strings

Massively Parallel Computations of the LZ-complexity of Strings Massively Parallel Computations of the LZ-complexity of Strings Alexander Belousov Electrical and Electronics Engineering Department Ariel University Ariel, Israel alex.blsv@gmail.com Joel Ratsaby Electrical

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Faster Average Case Low Memory Semi-External Construction of the Burrows-Wheeler Transform

Faster Average Case Low Memory Semi-External Construction of the Burrows-Wheeler Transform Faster Average Case Low Memory Semi-External Construction of the Burrows-Wheeler German Tischler The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus Hinxton, Cambridge, CB10 1SA, United Kingdom

More information

Suffix Array Construction

Suffix Array Construction Suffix Array Construction Suffix array construction means simply sorting the set of all suffixes. Using standard sorting or string sorting the time complexity is Ω(DP (T [0..n] )). Another possibility

More information

High-performance short sequence alignment with GPU acceleration

High-performance short sequence alignment with GPU acceleration Distrib Parallel Databases (2012) 30:385 399 DOI 10.1007/s10619-012-7099-x High-performance short sequence alignment with GPU acceleration Mian Lu Yuwei Tan Ge Bai Qiong Luo Published online: 10 August

More information

A Survey on Disk-based Genome. Sequence Indexing

A Survey on Disk-based Genome. Sequence Indexing Contemporary Engineering Sciences, Vol. 7, 2014, no. 15, 743-748 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.4684 A Survey on Disk-based Genome Sequence Indexing Woong-Kee Loh Department

More information

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * Hsin-Ta Chiao and Shyan-Ming Yuan Department of Computer and Information Science National Chiao Tung University

More information

Burrows Wheeler Transform

Burrows Wheeler Transform Burrows Wheeler Transform The Burrows Wheeler transform (BWT) is an important technique for text compression, text indexing, and their combination compressed text indexing. Let T [0..n] be the text with

More information

Accelerating Blockchain Search of Full Nodes Using GPUs

Accelerating Blockchain Search of Full Nodes Using GPUs Accelerating Blockchain Search of Full Nodes Using GPUs Shin Morishima Dept. of ICS, Keio University, 3-14-1 Hiyoshi, Kohoku, Yokohama, Japan Email: morisima@arc.ics.keio.ac.jp Abstract Blockchain is a

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

W O R S T- C A S E L I N E A R T I M E S U F F I X A R - R AY C O N S T R U C T I O N A L G O R I T H M S

W O R S T- C A S E L I N E A R T I M E S U F F I X A R - R AY C O N S T R U C T I O N A L G O R I T H M S W O R S T- C A S E L I N E A R T I M E S U F F I X A R - R AY C O N S T R U C T I O N A L G O R I T H M S mikkel ravnholt knudsen, 201303546 jens christian liingaard hansen, 201303561 master s thesis June

More information

Project Report: Needles in Gigastack

Project Report: Needles in Gigastack Project Report: Needles in Gigastack 1. Index 1.Index...1 2.Introduction...3 2.1 Abstract...3 2.2 Corpus...3 2.3 TF*IDF measure...4 3. Related Work...4 4. Architecture...6 4.1 Stages...6 4.1.1 Alpha...7

More information

Using Arithmetic Coding for Reduction of Resulting Simulation Data Size on Massively Parallel GPGPUs

Using Arithmetic Coding for Reduction of Resulting Simulation Data Size on Massively Parallel GPGPUs Using Arithmetic Coding for Reduction of Resulting Simulation Data Size on Massively Parallel GPGPUs Ana Balevic, Lars Rockstroh, Marek Wroblewski, and Sven Simon Institute for Parallel and Distributed

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Packet Classification Using Dynamically Generated Decision Trees

Packet Classification Using Dynamically Generated Decision Trees 1 Packet Classification Using Dynamically Generated Decision Trees Yu-Chieh Cheng, Pi-Chung Wang Abstract Binary Search on Levels (BSOL) is a decision-tree algorithm for packet classification with superior

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

An Efficient Algorithm for Computing Non-overlapping Inversion and Transposition Distance

An Efficient Algorithm for Computing Non-overlapping Inversion and Transposition Distance An Efficient Algorithm for Computing Non-overlapping Inversion and Transposition Distance Toan Thang Ta, Cheng-Yao Lin and Chin Lung Lu Department of Computer Science National Tsing Hua University, Hsinchu

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

Just Sort. Sathish Kumar Vijayakumar Chennai, India (1)

Just Sort. Sathish Kumar Vijayakumar Chennai, India (1) Just Sort Sathish Kumar Vijayakumar Chennai, India satthhishkumar@gmail.com Abstract Sorting is one of the most researched topics of Computer Science and it is one of the essential operations across computing

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding

Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding LETTER IEICE Electronics Express, Vol.14, No.21, 1 11 Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding Rongshan Wei a) and Xingang Zhang College of Physics

More information

Limits of Data-Level Parallelism

Limits of Data-Level Parallelism Limits of Data-Level Parallelism Sreepathi Pai, R. Govindarajan, M. J. Thazhuthaveetil Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India. Email: {sree@hpc.serc,govind@serc,mjt@serc}.iisc.ernet.in

More information

Cache-efficient string sorting for Burrows-Wheeler Transform. Advait D. Karande Sriram Saroop

Cache-efficient string sorting for Burrows-Wheeler Transform. Advait D. Karande Sriram Saroop Cache-efficient string sorting for Burrows-Wheeler Transform Advait D. Karande Sriram Saroop What is Burrows-Wheeler Transform? A pre-processing step for data compression Involves sorting of all rotations

More information

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( ) Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL

More information

INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA

INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA SEQUENCING AND MOORE S LAW Slide courtesy Illumina DRAM I/F

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

Parallel Variable-Length Encoding on GPGPUs

Parallel Variable-Length Encoding on GPGPUs Parallel Variable-Length Encoding on GPGPUs Ana Balevic University of Stuttgart ana.balevic@gmail.com Abstract. Variable-Length Encoding (VLE) is a process of reducing input data size by replacing fixed-length

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16

GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16 GPU Multisplit Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens S. Ashkiani (UC Davis) GPU Multisplit GTC 2016 1 / 16 Motivating Simple Example: Compaction Compaction (i.e., binary split) Traditional

More information

Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction

Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction Yongying Gao and Hayder Radha Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48823 email:

More information

Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces

Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces Evaluation report prepared under contract with LSI Corporation Introduction IT professionals see Solid State Disk (SSD) products as

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Accelerating MapReduce on a Coupled CPU-GPU Architecture

Accelerating MapReduce on a Coupled CPU-GPU Architecture Accelerating MapReduce on a Coupled CPU-GPU Architecture Linchuan Chen Xin Huo Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus, OH 43210 {chenlinc,huox,agrawal}@cse.ohio-state.edu

More information

Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations

Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Nikolai Zamarashkin and Dmitry Zheltkov INM RAS, Gubkina 8, Moscow, Russia {nikolai.zamarashkin,dmitry.zheltkov}@gmail.com

More information

CONTENT ADAPTIVE SCREEN IMAGE SCALING

CONTENT ADAPTIVE SCREEN IMAGE SCALING CONTENT ADAPTIVE SCREEN IMAGE SCALING Yao Zhai (*), Qifei Wang, Yan Lu, Shipeng Li University of Science and Technology of China, Hefei, Anhui, 37, China Microsoft Research, Beijing, 8, China ABSTRACT

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

Linux Software RAID Level 0 Technique for High Performance Computing by using PCI-Express based SSD

Linux Software RAID Level 0 Technique for High Performance Computing by using PCI-Express based SSD Linux Software RAID Level Technique for High Performance Computing by using PCI-Express based SSD Jae Gi Son, Taegyeong Kim, Kuk Jin Jang, *Hyedong Jung Department of Industrial Convergence, Korea Electronics

More information

A Linear-Time Burrows-Wheeler Transform Using Induced Sorting

A Linear-Time Burrows-Wheeler Transform Using Induced Sorting A Linear-Time Burrows-Wheeler Transform Using Induced Sorting Daisuke Okanohara 1 and Kunihiko Sadakane 2 1 Department of Computer Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0013,

More information

Block-Based Connected-Component Labeling Algorithm Using Binary Decision Trees

Block-Based Connected-Component Labeling Algorithm Using Binary Decision Trees Sensors 2015, 15, 23763-23787; doi:10.3390/s150923763 Article OPEN ACCESS sensors ISSN 1424-8220 www.mdpi.com/journal/sensors Block-Based Connected-Component Labeling Algorithm Using Binary Decision Trees

More information

Parallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. Fall Parallel Sorting

Parallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. Fall Parallel Sorting Parallel Systems Course: Chapter VIII Sorting Algorithms Kumar Chapter 9 Jan Lemeire ETRO Dept. Fall 2017 Overview 1. Parallel sort distributed memory 2. Parallel sort shared memory 3. Sorting Networks

More information

Enhancing the Efficiency of Radix Sort by Using Clustering Mechanism

Enhancing the Efficiency of Radix Sort by Using Clustering Mechanism Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Evaluation of Power Consumption of Modified Bubble, Quick and Radix Sort, Algorithm on the Dual Processor

Evaluation of Power Consumption of Modified Bubble, Quick and Radix Sort, Algorithm on the Dual Processor Evaluation of Power Consumption of Modified Bubble, Quick and, Algorithm on the Dual Processor Ahmed M. Aliyu *1 Dr. P. B. Zirra *2 1 Post Graduate Student *1,2, Computer Science Department, Adamawa State

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

A Massively Parallel Line Simplification Algorithm Implemented Using Chapel

A Massively Parallel Line Simplification Algorithm Implemented Using Chapel A Massively Parallel Line Simplification Algorithm Implemented Using Chapel Michael Scherger Department of Computer Science Texas Christian University Email: m.scherger@tcu.edu Huy Tran Department of Computing

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

Hardware-Supported Pointer Detection for common Garbage Collections

Hardware-Supported Pointer Detection for common Garbage Collections 2013 First International Symposium on Computing and Networking Hardware-Supported Pointer Detection for common Garbage Collections Kei IDEUE, Yuki SATOMI, Tomoaki TSUMURA and Hiroshi MATSUO Nagoya Institute

More information

Optimal Parallel Randomized Renaming

Optimal Parallel Randomized Renaming Optimal Parallel Randomized Renaming Martin Farach S. Muthukrishnan September 11, 1995 Abstract We consider the Renaming Problem, a basic processing step in string algorithms, for which we give a simultaneously

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

Reducing Computational Time using Radix-4 in 2 s Complement Rectangular Multipliers

Reducing Computational Time using Radix-4 in 2 s Complement Rectangular Multipliers Reducing Computational Time using Radix-4 in 2 s Complement Rectangular Multipliers Y. Latha Post Graduate Scholar, Indur institute of Engineering & Technology, Siddipet K.Padmavathi Associate. Professor,

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

Lecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)

Lecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

Predicated Software Pipelining Technique for Loops with Conditions

Predicated Software Pipelining Technique for Loops with Conditions Predicated Software Pipelining Technique for Loops with Conditions Dragan Milicev and Zoran Jovanovic University of Belgrade E-mail: emiliced@ubbg.etf.bg.ac.yu Abstract An effort to formalize the process

More information

Real-time processing for intelligent-surveillance applications

Real-time processing for intelligent-surveillance applications LETTER IEICE Electronics Express, Vol.14, No.8, 1 12 Real-time processing for intelligent-surveillance applications Sungju Lee, Heegon Kim, Jaewon Sa, Byungkwan Park, and Yongwha Chung a) Dept. of Computer

More information

LCP Array Construction

LCP Array Construction LCP Array Construction The LCP array is easy to compute in linear time using the suffix array SA and its inverse SA 1. The idea is to compute the lcp values by comparing the suffixes, but skip a prefix

More information

arxiv: v1 [cs.ds] 8 Dec 2016

arxiv: v1 [cs.ds] 8 Dec 2016 Sorting Data on Ultra-Large Scale with RADULS New Incarnation of Radix Sort Marek Kokot, Sebastian Deorowicz, and Agnieszka Debudaj-Grabysz Institute of Informatics, Silesian University of Technology,

More information

Accelerated Load Balancing of Unstructured Meshes

Accelerated Load Balancing of Unstructured Meshes Accelerated Load Balancing of Unstructured Meshes Gerrett Diamond, Lucas Davis, and Cameron W. Smith Abstract Unstructured mesh applications running on large, parallel, distributed memory systems require

More information

Merge Sort Roberto Hibbler Dept. of Computer Science Florida Institute of Technology Melbourne, FL

Merge Sort Roberto Hibbler Dept. of Computer Science Florida Institute of Technology Melbourne, FL Merge Sort Roberto Hibbler Dept. of Computer Science Florida Institute of Technology Melbourne, FL 32901 rhibbler@cs.fit.edu ABSTRACT Given an array of elements, we want to arrange those elements into

More information

An Efficient Implementation of LZW Compression in the FPGA

An Efficient Implementation of LZW Compression in the FPGA An Efficient Implementation of LZW Compression in the FPGA Xin Zhou, Yasuaki Ito and Koji Nakano Department of Information Engineering, Hiroshima University Kagamiyama 1-4-1, Higashi-Hiroshima, 739-8527

More information

Parallel LZ77 Decoding with a GPU. Emmanuel Morfiadakis Supervisor: Dr Eric McCreath College of Engineering and Computer Science, ANU

Parallel LZ77 Decoding with a GPU. Emmanuel Morfiadakis Supervisor: Dr Eric McCreath College of Engineering and Computer Science, ANU Parallel LZ77 Decoding with a GPU Emmanuel Morfiadakis Supervisor: Dr Eric McCreath College of Engineering and Computer Science, ANU Outline Background (What?) Problem definition and motivation (Why?)

More information

GPU Sparse Graph Traversal. Duane Merrill

GPU Sparse Graph Traversal. Duane Merrill GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 410 (2009) 3372 3390 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs An (18/11)n upper bound for sorting

More information

A Parallel Access Method for Spatial Data Using GPU

A Parallel Access Method for Spatial Data Using GPU A Parallel Access Method for Spatial Data Using GPU Byoung-Woo Oh Department of Computer Engineering Kumoh National Institute of Technology Gumi, Korea bwoh@kumoh.ac.kr Abstract Spatial access methods

More information

Keywords: Binary Sort, Sorting, Efficient Algorithm, Sorting Algorithm, Sort Data.

Keywords: Binary Sort, Sorting, Efficient Algorithm, Sorting Algorithm, Sort Data. Volume 4, Issue 6, June 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com An Efficient and

More information

Automatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm

Automatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm Automatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm San-Chih Lin, Chi-Kuang Chang, Nai-Wei Lin National Chung Cheng University Chiayi, Taiwan 621, R.O.C. {lsch94,changck,naiwei}@cs.ccu.edu.tw

More information

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS INFORMATION SYSTEMS IN MANAGEMENT Information Systems in Management (2017) Vol. 6 (3) 213 222 USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS PIOTR OŻDŻYŃSKI, DANUTA ZAKRZEWSKA Institute of Information

More information

Efficient Multiway Radix Search Trees

Efficient Multiway Radix Search Trees Appeared in Information Processing Letters 60, 3 (Nov. 11, 1996), 115-120. Efficient Multiway Radix Search Trees Úlfar Erlingsson a, Mukkai Krishnamoorthy a, T. V. Raman b a Rensselaer Polytechnic Institute,

More information

Linear Work Suffix Array Construction

Linear Work Suffix Array Construction Linear Work Suffix Array Construction Juha Karkkainen, Peter Sanders, Stefan Burkhardt Presented by Roshni Sahoo March 7, 2019 Presented by Roshni Sahoo Linear Work Suffix Array Construction March 7, 2019

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

PSEUDORANDOM numbers are very important in practice

PSEUDORANDOM numbers are very important in practice Proceedings of the 2013 Federated Conference on Computer Science and Information Systems pp. 515 519 Template Library for Multi- Pseudorandom Number Recursion-based Generars Dominik Szałkowski Institute

More information

Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic

Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment Contemporary Engineering Sciences, Vol. 7, 2014, no. 24, 1415-1423 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.49174 Improved Integral Histogram Algorithm for Big Sized Images in CUDA

More information

Improving Range Query Performance on Historic Web Page Data

Improving Range Query Performance on Historic Web Page Data Improving Range Query Performance on Historic Web Page Data Geng LI Lab of Computer Networks and Distributed Systems, Peking University Beijing, China ligeng@net.pku.edu.cn Bo Peng Lab of Computer Networks

More information