hsa-ds: A Heterogeneous Suffix Array Construction Using D-Critical Substrings for Burrows-Wheeler Transform


146 Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'16

hsa-ds: A Heterogeneous Suffix Array Construction Using D-Critical Substrings for Burrows-Wheeler Transform

Yu-Cheng Liao, Yarsun Hsu
Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan 30013, R.O.C.

Abstract

The Burrows-Wheeler Transform (BWT) is widely used in data compression and bioinformatics. Mathematically, the BWT can be derived from a constructed suffix array. In this work, we analyze current parallel implementations of suffix array construction algorithms (SACAs) and introduce the first heterogeneous implementation of the SA-DS algorithm on the GPU. To achieve better performance, we also optimize the radix sort on the GPU for our platform. As a result, the optimized radix sort on the GPU significantly decreases processing time compared with the latest Thrust library when sorting millions of keys. Our heterogeneous SA-DS demonstrates up to a 4x speedup over the sequential version of SA-DS and up to a 2x performance gain over the parallel BWT provided by the CUDPP library.

Keywords: GPGPU, CUDA, Burrows-Wheeler Transform, Compression, SA-DS

1 Introduction

1.1 Motivation

The Burrows-Wheeler Transform (BWT) [1] is an algorithm used in data compression tools such as bzip2 [2]. Mathematically, the transform can be obtained from a constructed suffix array [3] in linear time [1]. The many previous studies on optimizing suffix array construction algorithms (SACAs) in both time and space therefore also greatly improve the BWT. On the hardware side, the prevalence of flexible, programmable, and inexpensive general-purpose graphics processing units (GPGPUs) has opened a new era of SIMD programming, and heterogeneous architectures with GPGPUs have been widely adopted in high-performance computing. Reviewing recent works [4] [5] [6], we find that these parallel SACAs all adopt a well-known linear-time SACA called the skew algorithm.
The best-known linear-time sequential SACAs are the skew algorithm, the KA algorithm [7], and Ge Nong et al.'s SA-IS and SA-DS [8]. The skew algorithm has the worst time/space performance among these, so it is interesting to compare the parallel skew algorithm with a parallel version of Ge Nong et al.'s work. Furthermore, concerning appropriate BWT block sizes for efficient compression, the conventional bzip2 selects block sizes from 100K to 900K characters for the BWT [2] to obtain a high compression rate along with moderate transforming time. We are therefore motivated to find a better implementation for block sizes between 1K and 2M characters.

1.2 Goal and Contribution

The objective of this work is to present a heterogeneous version of the SA-DS algorithm accelerated by an NVIDIA GPU using the CUDA programming model. We batch the memory transactions between host and device to obviate the data-transfer overhead of the heterogeneous platform. The heterogeneous SA-DS is also appended with a kernel that computes the final encoded string of the Burrows-Wheeler Transform. An additional contribution is a custom radix sort on the GPU based on the Thrust library [9]. Since our heterogeneous platform is equipped with a Tesla K20c graphics card, rather than calling the existing Thrust primitives we simplify the redundant sorting procedures, optimize the kernels, and improve the load balance on the device to achieve higher throughput. After these optimizations, our radix sort kernels outperform the Thrust library's by 6%. Compared with the latest Thrust library on full character and integer sorting, our method shows up to a 77% decrease in time for small sequences of thousands of elements, and up to a 23% decrease for large sequences of millions of elements.
Our heterogeneous SA-DS demonstrates up to a 4x speedup over the C++ sequential version and is up to 2x faster than the BWT implementation in the CUDPP library [6] for block sizes ranging from 1K to 2M characters.

1.3 Organization

This paper is organized as follows. Section 2 discusses related work. Section 3 introduces the Burrows-Wheeler Transform and suffix array construction algorithms. Section 4 states the design and implementation. The performance evaluation is presented in Section 5. Finally, Section 6 concludes.

2 Related Work

As part of a greater ambition to research the feasibility of lossless data compression on the GPU, R. A. Patel et al. provide an approach to suffix array construction based on merge sort [10]. They first use a bitonic sort to sort eight suffixes

within a thread on the GPU. Each thread fetches four characters of each suffix per comparison. If two suffixes residing in one thread share the same prefix and cannot yet be ordered, the thread fetches the next four characters on the fly from global memory. Once all threads have sorted their suffixes, the threads in a block work cooperatively to merge the partitioned suffix arrays into one complete suffix array. They report severe performance degradation while merging large sequences, caused by branch divergence and frequent global memory accesses. Furthermore, the approach cannot exploit the relationships between suffixes. Their GPU implementation is reported to be 3x slower than the single-threaded CPU implementation by Seward [11].

In 2013, M. Deo et al. brought the parallel DC3 algorithm to the GPU. Their work is implemented on a discrete GPU and an APU respectively using OpenCL. It is inspired by pDC3 but is considered the first implementation on a modern GPU architecture. They resolve several issues encountered while adapting the original pDC3 from distributed systems to a heterogeneous platform, and they optimize the performance of pDC3 on the GPU. Their paper also briefly explains that one can safely set the BWT aside and discuss only the suffix array and its implementation, since the BWT can be derived trivially in one parallel pass alongside computing the SA.

In 2014, the CUDPP library added a new primitive to compute the suffix array of a string. Like M. Deo et al., it uses the recursive skew algorithm for suffix array construction on the GPU with CUDA. The primitive's sorting procedure is analogous to M. Deo et al.'s work, but in the final merging step it adopts a different merging technique, called Merge Path, presented by O. Green et al. [12].
According to the authors' notes, their parallel skew algorithm is 1.35x faster than the previously fastest GPU implementation. Their work is the latest implementation we compare against. In conclusion, to the best of our knowledge, no implementation of the other linear-time suffix array construction algorithms, such as the KA, SA-IS, and SA-DS algorithms, has yet utilized the computational power of GPGPUs.

3 Background

3.1 Burrows-Wheeler Transform

The Burrows-Wheeler Transform (BWT) was discovered by Wheeler in 1983. The BWT first produces a list, also called a block, of strings consisting of all cyclic rotations of the original string. The block is then sorted lexicographically, and the last character of each rotation forms the permuted string. $ is the terminal symbol denoting the end of the current string and is the lexicographically smallest character. The BWT aims at gathering identical characters together, and the transformed string must be reversible. For the serial implementation of the BWT, Burrows and Wheeler suggested performing a radix sort on the first and second characters of every rotation to obtain a preliminary order [1]. By their observation, most rotations are fully sorted within this preliminary order; it is then refined by a quick sort to distinguish rotations sharing the same prefix.

3.2 BWT and Suffix Array

We first review the content of a suffix array (SA). Consider a size-n string S = s_1 s_2 ... s_{n-1} $, where $ is the terminal symbol. Let S_i denote the suffix of S ranging from the i-th character to the ending character $. The suffix array stores the starting indexes i representing S_i for all suffixes in lexicographical order. That is, if the entry SA[j] is i, then S_i is the j-th smallest suffix of S, and hence for all k in [1, j]: S_{SA[k]} <= S_i. From a given SA, the BWT result can be obtained simply as the equation BWT[i] = S[SA[i] - 1]. The process is trivial, and the BWT can be derived in one pass, in parallel for each entry.
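The relation BWT[i] = S[SA[i] - 1] can be checked against the rotation-based definition with a minimal Python sketch (didactic only; the paper's implementation uses CUDA kernels, and the naive suffix sort here stands in for a linear-time SACA):

```python
def suffix_array(s):
    # Naive O(n^2 log n) construction; a linear-time SACA replaces this step.
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt_from_sa(s):
    sa = suffix_array(s)
    # BWT[i] = S[SA[i] - 1]; Python's index -1 wraps to the terminal '$'.
    return "".join(s[i - 1] for i in sa)

def bwt_from_rotations(s):
    # Reference definition: sort all cyclic rotations, take the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

s = "banana$"
assert bwt_from_sa(s) == bwt_from_rotations(s)
print(bwt_from_sa(s))  # -> annb$aa
```

Because the terminal $ is unique and smallest, sorting rotations and sorting suffixes give the same order, which is why the two derivations agree.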
This relationship allows us to set the BWT aside and focus on SA construction.

3.3 Suffix Array Construction Algorithms

Given the property above, it is safe to study only the implementations of SACAs and their parallel counterparts on GPGPUs. Linear-time SACAs easily outperform the original implementation on large-scale strings. The up-to-date, well-known linear-time SACAs follow two genres of framework: the skew algorithm and two-stage induction.

3.3.1 Skew Algorithm

The skew algorithm recursively constructs the SA in linear time. It follows the pattern proposed for suffix tree construction by Farach et al. in 1997 [13]. The following is a brief review of the skew algorithm, also called the DC3 algorithm, by J. Kärkkäinen et al. The skew algorithm consists of three steps. First, given a string S, the problem is reduced by excluding the suffixes starting at positions i mod 3 = 0, so the new problem size is 2/3 of the original input. To construct the SA of this sample, sorting is performed by scanning the first three characters of each suffix in the reduced problem and renaming the sorted suffixes with their ranks. If all the names are distinct, step one is finished and we obtain the sample SA; otherwise, the skew algorithm is applied recursively to the reduced array. Secondly, every remaining suffix is the concatenation of its starting character and the suffix starting at the next position, so the remaining SA is obtained by radix sorting on the first character followed by the entry in the constructed SA that represents the following suffix. The last step merges the two SAs: the skew algorithm compares the lexicographical order of each suffix in the two SAs and places them into the final complete suffix array.
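The sampling-and-renaming of step one can be illustrated with a small Python sketch (a hypothetical helper for exposition, not the DC3 implementation; it ranks the sampled suffixes by their leading three characters and checks whether recursion is needed):

```python
def sample_ranks(s):
    # Step one of the skew algorithm: keep suffixes at positions i mod 3 != 0.
    sample = [i for i in range(len(s)) if i % 3 != 0]      # ~2/3 of positions
    triples = {i: s[i:i + 3] for i in sample}              # first 3 characters
    order = sorted(sample, key=lambda i: triples[i])       # sort the sample
    ranks, r = {}, 0
    for k, i in enumerate(order):                          # rename by rank
        if k > 0 and triples[i] != triples[order[k - 1]]:
            r += 1
        ranks[i] = r
    # If every name is distinct the sample SA is done; otherwise the
    # skew algorithm recurses on the renamed (reduced) string.
    done = len(set(ranks.values())) == len(sample)
    return order, ranks, done

order, ranks, done = sample_ranks("banana$")
print(order, done)
```

For "banana$" every 3-character name is distinct, so no recursion would be needed on this toy input.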

J. Kärkkäinen et al.'s construction of two suffix arrays with asymmetric lengths provides the simplicity of step 3. Their implementation of the skew algorithm is succinct, and many researchers exploit the parallelism of the skew algorithm based on their scheme [4] [5] [6]. The skew algorithm is simple; however, it can only shrink the problem by one-third of the input size in each recursion.

3.3.2 Two-Stage Induction

Recent two-stage induction algorithms are variants of the SACA proposed by H. Itoh [14]. Since G. Nong et al. are dedicated to ameliorating the intrinsic sorting bottleneck of the S-distance-list method in the KA algorithm [7], we briefly discuss their algorithm in the following. SA-IS and SA-DS are twin algorithms built on the same framework. SA-IS contains more sequential structure, so we concentrate on the implementation of the SA-DS algorithm and thoroughly analyze its potential parallelism. SA-DS is based on the KA algorithm. In addition to classifying suffixes into two types, we further separate the leftmost S-type (LMS) suffixes among the type-S suffixes. These LMS characters are used to locate the intervals of LMS-substrings. As a result, the original S is replaced by a shorter string comprised only of LMS-substrings. The input problem is simpler than in the KA algorithm because runs of consecutive type-S suffixes are curtailed. Despite the reduced problem size, the substrings in it are variable-length. Nong et al. therefore propose a new approach based on radix sort and fixed-length substrings called d-critical substrings. A character is a d-critical character if and only if it is an LMS-character, or the character d positions after it is d-critical and no character between them is d-critical, where d >= 2. The suffix starting from such a character is called a d-critical suffix. SA-DS constructs the SA using a framework consisting of three steps.
First, the problem is reduced to an array containing pointers to all the d-critical characters. The distance between any two neighboring d-critical characters is proven to lie in [2, d+1]. Next, we perform a radix sort on the leading d+2 characters of each d-critical suffix, after these suffixes are sorted by their types. If all the resulting names are unique, step one is accomplished and we obtain the suffix array of the d-critical substrings; otherwise, the SA-DS algorithm is applied recursively. Secondly, all suffixes in S are bucketed by their first character, and a new array is initialized for storing the final SA. We assign the buckets in order to the SA array and record the starting and ending position of each bucket. The algorithm puts the sorted S-type suffixes into the correct entries of the final SA. Lastly, SA-DS incorporates one more induction process than the KA algorithm for positioning the other type-S suffixes and the remaining type-L suffixes. For inducing the positions of type-L suffixes, the procedure is described in the KA algorithm [7]. For the remaining type-S suffixes, for each entry SA[i] encountered during the scan, if the suffix S_{SA[i]-1} is S-type, it is moved to the recorded ending position of its bucket in SA; the LMS-suffix originally residing at the end of that bucket is swapped out. The SA-DS algorithm clearly performs better on large-scale strings owing to the high reduction rate of the recursive sorting step. Between the KA algorithm and SA-DS, SA-DS provides a framework that simplifies the sorting step and is suitable for parallelization on the GPU.

Fig. 1: Performance of the two portions of the SA-DS algorithm; the overall performance is clearly bounded by the parallelizable part.
Now we extract the parallelizable parts of the SA-DS algorithm with the intent of mapping them onto the GPU. Classification of suffix types is independent per suffix: the type of a suffix is determined by itself and its next suffix, so the comparison for all suffixes of the input string S can be done in parallel on the GPU. LMS-characters are likewise distinguishable in parallel, by simply checking the previous suffix's type for each type-S suffix. Since the blocks delimited by LMS-substrings are disjoint, the d-critical substrings within each LMS-substring can be located in parallel. Fast fixed-length radix sorting on the GPU already exists. Assigning LMS-suffixes is also parallelizable, since we only have to maintain the order within each bucket. Lastly, the induction step carries dependencies across iterations, so this part remains sequential and cannot be parallelized. Fig. 1 shows the performance of the different portions. Clearly, the overall performance of the SA-DS algorithm is bounded by the parallelizable part over the whole range of input sizes. From the evaluation, more than four-fifths of the execution time is parallelizable; according to Amdahl's law [15], it is therefore possible to improve performance by up to 5x.

4 Design and Implementation

In this section, we describe the method used to parallelize the original SA-DS algorithm. Table 1 presents pseudo code for the SA-DS algorithm, with the steps that are packed into GPU kernels marked.
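The Amdahl's-law bound cited above can be checked with a one-line calculation (a sketch of the standard formula, not a measurement from the paper):

```python
def max_speedup(parallel_fraction):
    # Amdahl's law with unlimited processors: speedup <= 1 / (1 - p),
    # where p is the parallelizable fraction of the execution time.
    return 1.0 / (1.0 - parallel_fraction)

# More than four-fifths of the execution time is parallelizable,
# so the attainable speedup is bounded near 5x.
print(round(max_speedup(0.8), 6))
```

This matches the 5x bound stated above, and the measured 3.7x average speedup reported later sits plausibly below it.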

Table 1: The SA-DS algorithm pseudo code

SA-DS(S, SA)                                              kernel
  // S is the input string
  // SA is the output suffix array for S
  1 Find d-critical substrings in S                        (1-3)
  2 Reduce the original problem into a shortened P1        (4-6)
  3 Radix sort the d-critical substrings in P1             (7-10)
  4 Name each d-critical substring by its rank to get S_DC (11-14)
  5 if (all unique names) SA_DC = S_DC
  6 else SA-DS(S_DC, SA_DC)   // recursion
  7 Induce SA from SA_DC, step 1                           (15-19)
  8 Induce SA from SA_DC, steps 2, 3   // on CPU
  9 end

4.1 Parallelizing SA-DS

Table 1 shows six primary sections culled from the sequential SA-DS algorithm. We do not describe the parallel radix sort process, since it is already explained in detail in D. G. Merrill et al.'s work [16].

4.1.1 Locating D-Critical Substrings

This stage includes three kernels responsible for classifying character types, identifying leftmost S-type characters, and assigning additional d-critical substrings between any two adjacent LMSs at the given distance d. To classify character types, each thread in the kernel is in charge of one particular character of the input string, appointed by its thread index and block index. Every thread compares its assigned character with the next character. The classification is done if the next character is lexicographically greater than the current character, which means the associated suffix is lexicographically smaller than the next suffix. If the character equals the next one, the comparison between the next character and the one after it is taken recursively. The kernel identifies leftmost S-type characters by fetching the type of each character and comparing it with the preceding character's type: if the fetched type is S-type and the preceding type is L-type, the character is a leftmost S-type character.
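The logic of the first two kernels (type classification and LMS identification) can be sketched as a sequential Python stand-in; a single right-to-left scan is the sequential equivalent of the per-thread forward comparisons described above (this is a didactic sketch, not the paper's CUDA code):

```python
def classify(s):
    # '$' is the terminal symbol and is, by definition, S-type.
    n = len(s)
    t = ["S"] * n
    # A character is L-type if it is greater than its successor, or equal
    # to it and the successor is L-type (the tie falls through to the
    # next comparison, mirroring the kernel's recursive lookahead).
    for i in range(n - 2, -1, -1):
        if s[i] > s[i + 1] or (s[i] == s[i + 1] and t[i + 1] == "L"):
            t[i] = "L"
    # A leftmost S-type (LMS) character is S-type with an L-type predecessor.
    lms = [i for i in range(1, n) if t[i] == "S" and t[i - 1] == "L"]
    return "".join(t), lms

print(classify("banana$"))  # -> ('LSLSLLS', [1, 3, 6])
```

Each position's type depends only on characters to its right, which is what makes the per-character comparison data-parallel on the GPU.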
Finally, the consecutive d-critical substrings inside the independent blocks bounded by neighboring leftmost S-type characters are marked in parallel using multiple threads. After these kernels, there is an array T storing the character types, and a boolean array C_boolean of the same size as the input string S storing 1 or 0; any entry containing 1 indicates the starting position of a d-critical substring.

4.1.2 Shrinking the Problem

The d-critical substrings can be assembled into a shortened array, called P1, storing the starting indexes of the d-critical substrings. From the last section we know that if an entry of C_boolean holds a 1, the corresponding index is the starting position of a d-critical substring. The kernels first examine each entry of C_boolean, exclude the entries containing 0, and aggregate the indexes of the entries containing 1 into the abbreviated array P1.

4.1.3 Naming and Constructing the SA

The sorted d-critical substrings are stored lexicographically in global memory. Initially, each thread in the kernel is responsible for one particular entry of the sorted P1. It loads the first d+2 characters at its assigned index and the first d+2 characters at the previous index, and determines whether these two sets of d+2 characters differ. If they differ, the thread stores a 1 in the corresponding entry of a temporary array N; otherwise, it stores 0. Next, we use the same three steps described in section 4.1.2: instead of distributing the keys, the third kernel distributes the scanned ranks as values to each entry of N. The following kernel scatters the rank of each d-critical substring, according to the index in P1, into the final suffix array.

4.1.4 Inducing the SA, Step 1

Among the sorted d-critical substrings, we only use the substrings starting with leftmost S-type characters. The dependency among these suffixes merely resides within an individual bucket.
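The naming step described above (adjacent comparison producing 1/0 flags, followed by a scan) can be sketched sequentially in Python; the prefix sum below is a stand-in for the parallel scan kernel, and the function names are illustrative, not the paper's:

```python
def name_by_rank(sorted_substrings):
    # Flag each sorted substring that differs from its predecessor;
    # equal substrings get flag 0 and thus share a name.
    n = len(sorted_substrings)
    flags = [1 if i > 0 and sorted_substrings[i] != sorted_substrings[i - 1]
             else 0
             for i in range(n)]
    # Inclusive prefix sum of the flags gives each substring's rank (name).
    ranks, acc = [], 0
    for f in flags:
        acc += f
        ranks.append(acc)
    # If every name is unique, no recursion is required.
    all_unique = ranks == list(range(n))
    return ranks, all_unique

print(name_by_rank(["ab", "ab", "ba", "na"]))  # duplicate names -> recurse
```

The two duplicate substrings receive the same name, which is exactly the condition that triggers the recursive call in line 6 of Table 1.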
To extract LMS-suffixes from the sorted d-critical suffixes, we use the same approach explained in section 4.1.1: a kernel marks the LMS-suffixes by inspecting the type of the target character and that of its preceding character. The following three kernels then remap the marked LMS-suffixes into a shortened array. The subsequent kernels collect the number of names/characters in each bucket, then scan these counts once to acquire the index offsets by accumulating the bucket sizes. The LMS-suffixes are placed at the tail of each bucket. After all the necessary variables are calculated, a kernel launches threads, seeded with their bucket offsets, to fill the LMS-suffixes into the correct entries of the suffix array for inducing.

4.2 Optimization

In the heterogeneous SA-DS, the sorting stage using radix sort on the GPU consumes most of the execution time. Rather than using the existing radix sort provided by the Thrust library, we rewrite the three steps of the radix sort, customized for our Tesla K20c GPU. We fix the digit decoded in one iteration to four bits and implement a simple dynamic terminal policy on the host that stops the radix sort kernels once the keys are sorted. The termination point is determined by how many four-bit sorting passes are required to cover the first non-zero bit from the MSB over all keys: we find the position of the first non-zero bit, calculate the length in bits from that position down to the LSB, and divide the length by four. This terminal policy prevents a plethora of useless kernel calls. The radix sort kernels in the Thrust library launch blocks and threads according to a static profile created by the library itself.
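The terminal policy can be illustrated with a sequential LSD radix sort sketch in Python (didactic only; the actual implementation runs upsweep/scan/downsweep CUDA kernels, and the bucket pass here stands in for the counting kernels):

```python
def radix_sort_4bit(keys):
    # LSD radix sort on 4-bit digits (radix 16), with the dynamic terminal
    # policy: run only enough passes to cover the highest set bit among
    # all keys, instead of a fixed pass count for the full key width.
    if not keys:
        return keys
    passes = (max(keys).bit_length() + 3) // 4   # ceil(bits / 4)
    for p in range(passes):
        shift = 4 * p
        buckets = [[] for _ in range(16)]
        for k in keys:                            # stable bucket pass
            buckets[(k >> shift) & 0xF].append(k)
        keys = [k for b in buckets for k in b]
    return keys

data = [170, 45, 75, 90, 802, 24, 2, 66]
assert radix_sort_4bit(data) == sorted(data)
```

With 10-bit keys such as these, only three 4-bit passes run instead of the eight a fixed 32-bit schedule would issue, which is the saving the terminal policy targets.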

Table 2: Hardware specifications
  CPU  Intel Xeon 2.4 GHz, 4C/8T          x2
  RAM  4 GB DDR3-1066 MHz                 x6
  GPU  Nvidia Tesla K20c                  x1
  HDD  WD 2 TB, 7200 RPM                  x2 (RAID 1)

Fig. 2: Performance and speedup of sorting integers.

Fig. 3: Performance of the heterogeneous SA-DS; the parallelized portion gains improvement.

However, the load balance within each block across different input array sizes should also be considered. In our radix sort, we configure the number of blocks according to the size of the current input array. With this strategy, our radix sort achieves better response times for variable-sized input arrays.

5 Evaluation

We investigate three implementations: the original sequential SA-DS algorithm, the skew algorithm implemented in the CUDPP library, and our heterogeneous SA-DS using CUDA. These implementations are run and compared on the same hardware platform, listed in Table 2. The evaluations focus on the execution time of a set of kernels, including memory transfers between host and device.

5.1 Radix Sort

We evaluate the execution time for sorting characters with the three kernels, including kernel configuration and execution time but excluding data-transfer time between host and device. The input keys are already transferred to the GPU's global memory, since the data-transfer overhead is the same for both implementations. We benchmark the kernels with the nvprof profiling tool; each result is the average of one hundred executions. The outcome shows that our radix sort is 1.31x, 1.37x, and 1.61x faster than the Thrust library's in the upsweep, top-level-scan, and downsweep kernels respectively when processing one million characters. We also test the scalability of the complete radix sort. Fig. 2 shows the performance of sorting integers with varying input-array sizes.

In the Thrust library, the radix sort encounters a performance drop for problem sizes of around 64K to 2M keys. Compared with the Thrust library, our customized radix sort achieves a 4.3x speedup at 64K keys and 2x on average with character keys, and 2.3x at 4K keys and 1.7x on average with integer keys.

5.2 Heterogeneous SA-DS

We first analyze the performance curves of the parallelized portion and the intact sequential portion. The parallelized portion accounts for any additional overheads of memory transfer between host and device on kernel launch. Fig. 3 shows the performance of each part of the heterogeneous SA-DS. As we can see, the parallelized portion gains a vast improvement over the parallelizable part depicted in Fig. 1. With the utilization of the GPU, the overall performance of SA-DS no longer suffers from the bound set by the sorting steps. The speedup over the original SA-DS averages 3.7x once the input problem is large enough.

We choose four datasets downloaded from the Internet with different properties. The content of a text file directly impacts the performance of SA construction. Although these datasets do not cover every condition of sorting suffixes, the four representative datasets provide comprehensive measurements for the normal usage of SA construction. The enwiki dataset is downloaded from the Wikipedia website [17]; it is dumped from the English Wikipedia. The Linux kernel tarball is the latest Linux-4.1 kernel and contains the source code of the Linux kernel; its content can be regarded as random characters. The enwiki abstract dataset differs from the first case: it contains the abstracts of the English Wikipedia as millions of lines. Lastly, we generate strings with sizes varying from 8K to 32M characters. Fig. 4 shows that the speedup of the hsa-ds rises from a slowdown at small inputs to a steady 3.7x at large inputs.
For random characters, our heterogeneous SA-DS benefits from the parallel sorting stage and outperforms the sequential SA-DS at most problem sizes. However, the inducing stage requires more iterations to construct the complete suffix array; consequently, the CPU overhead of constructing a long suffix array degrades performance for large input sequences.

Fig. 4: Speedup of the hsa-ds for the four datasets (enwiki, kernel, abstract, random).

Fig. 5: Comparison between the SA-DS, DC3 on the GPU, and the hsa-ds.

5.3 Comparison with the CUDPP Library

The CUDPP library, utilizing NVIDIA GPUs with the CUDA programming model, is considered the fastest implementation of the parallel DC3 algorithm. The result is shown in Fig. 5. The readme file of the CUDPP library states that its BWT cannot process strings larger than 10M characters, but we test only up to 2M characters to generate the curve, since we are interested in performance for strings smaller than 2M characters. The figure shows that our hsa-ds has the best performance for string sizes smaller than 2M characters.

6 Conclusion

Our hsa-ds improves the performance of the original SA-DS by parallelizing its slowest portion. A heterogeneous platform using both GPU and CPU is the best choice for our algorithm, since the sequential portion must be performed on a powerful CPU. The customized radix sort further optimizes the workload distributed to each processing element, and incorporates a dynamic terminal strategy for keys of different lengths. As a result, our customized radix sort on the GPU gains up to 2.3x and 4x speedups for integer keys and character keys respectively, compared to the Thrust library. The hsa-ds algorithm obviates the performance bound incurred by the sequential sorting overhead, gaining up to a 3.7x speedup over the sequential SA-DS and up to a 2x speedup over the parallel skew-algorithm-based BWT. The hsa-ds has the best performance for block sizes ranging from 1K to 2M characters.
7 Acknowledgement

The authors thank the support from MOST under grant E .

References

[1] M. Burrows and D. J. Wheeler, "A block-sorting lossless data compression algorithm," 1994.
[2] bzip2-1.0.6, 2015. [Online].
[3] U. Manber and G. Myers, "Suffix arrays: a new method for on-line string searches," SIAM Journal on Computing, vol. 22, no. 5, 1993.
[4] M. Deo and S. Keely, "Parallel suffix array and least common prefix for the GPU," in ACM SIGPLAN Notices, vol. 48, no. 8. ACM, 2013.
[5] F. Kulla and P. Sanders, "Scalable parallel suffix array construction," Parallel Computing, vol. 33, no. 9, 2007.
[6] CUDPP-2.2, 2014. [Online].
[7] P. Ko and S. Aluru, "Space efficient linear time construction of suffix arrays," in Combinatorial Pattern Matching. Springer, 2003.
[8] G. Nong, S. Zhang, and W. H. Chan, "Two efficient algorithms for linear time suffix array construction," IEEE Transactions on Computers, vol. 60, no. 10, 2011.
[9] J. Hoberock and N. Bell, "Thrust: A parallel template library," 2010. [Online].
[10] R. Patel, Y. Zhang, J. Mak, A. Davidson, J. D. Owens, et al., "Parallel lossless data compression on the GPU." IEEE, 2012.
[11] J. Seward, "On the performance of BWT sorting algorithms," in Data Compression Conference, 2000. Proceedings. DCC 2000. IEEE, 2000.
[12] O. Green, R. McColl, and D. A. Bader, "GPU merge path: a GPU merging algorithm," in Proceedings of the 26th ACM International Conference on Supercomputing. ACM, 2012.
[13] M. Farach, "Optimal suffix tree construction with large alphabets," in Foundations of Computer Science, 1997. Proceedings., 38th Annual Symposium on. IEEE, 1997.
[14] H. Itoh and H. Tanaka, "An efficient method for in memory construction of suffix arrays," in String Processing and Information Retrieval Symposium, 1999 and International Workshop on Groupware. IEEE, 1999.
[15] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. ACM, 1967.
[16] D. G. Merrill and A. S. Grimshaw, "Revisiting sorting for GPGPU stream architectures," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2010.
[17] Enwiki, 2015. [Online]. Available: enwiki/


Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

Accelerating Lossless Data Compression with GPUs

Accelerating Lossless Data Compression with GPUs Accelerating Lossless Data Compression with GPUs R.L. Cloud M.L. Curry H.L. Ward A. Skjellum P. Bangalore arxiv:1107.1525v1 [cs.it] 21 Jun 2011 October 22, 2018 Abstract Huffman compression is a statistical,

More information

arxiv: v1 [cs.dc] 24 Feb 2010

arxiv: v1 [cs.dc] 24 Feb 2010 Deterministic Sample Sort For GPUs arxiv:1002.4464v1 [cs.dc] 24 Feb 2010 Frank Dehne School of Computer Science Carleton University Ottawa, Canada K1S 5B6 frank@dehne.net http://www.dehne.net February

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Prefix Scan and Minimum Spanning Tree with OpenCL

Prefix Scan and Minimum Spanning Tree with OpenCL Prefix Scan and Minimum Spanning Tree with OpenCL U. VIRGINIA DEPT. OF COMP. SCI TECH. REPORT CS-2013-02 Yixin Sun and Kevin Skadron Dept. of Computer Science, University of Virginia ys3kz@virginia.edu,

More information

A Comprehensive Study on the Performance of Implicit LS-DYNA

A Comprehensive Study on the Performance of Implicit LS-DYNA 12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four

More information

Engineering a Lightweight External Memory Su x Array Construction Algorithm

Engineering a Lightweight External Memory Su x Array Construction Algorithm Engineering a Lightweight External Memory Su x Array Construction Algorithm Juha Kärkkäinen, Dominik Kempa Department of Computer Science, University of Helsinki, Finland {Juha.Karkkainen Dominik.Kempa}@cs.helsinki.fi

More information

Efficient Stream Reduction on the GPU

Efficient Stream Reduction on the GPU Efficient Stream Reduction on the GPU David Roger Grenoble University Email: droger@inrialpes.fr Ulf Assarsson Chalmers University of Technology Email: uffe@chalmers.se Nicolas Holzschuch Cornell University

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

GPU Implementation of a Multiobjective Search Algorithm

GPU Implementation of a Multiobjective Search Algorithm Department Informatik Technical Reports / ISSN 29-58 Steffen Limmer, Dietmar Fey, Johannes Jahn GPU Implementation of a Multiobjective Search Algorithm Technical Report CS-2-3 April 2 Please cite as: Steffen

More information

Massively Parallel Computations of the LZ-complexity of Strings

Massively Parallel Computations of the LZ-complexity of Strings Massively Parallel Computations of the LZ-complexity of Strings Alexander Belousov Electrical and Electronics Engineering Department Ariel University Ariel, Israel alex.blsv@gmail.com Joel Ratsaby Electrical

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Faster Average Case Low Memory Semi-External Construction of the Burrows-Wheeler Transform

Faster Average Case Low Memory Semi-External Construction of the Burrows-Wheeler Transform Faster Average Case Low Memory Semi-External Construction of the Burrows-Wheeler German Tischler The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus Hinxton, Cambridge, CB10 1SA, United Kingdom

More information

Suffix Array Construction

Suffix Array Construction Suffix Array Construction Suffix array construction means simply sorting the set of all suffixes. Using standard sorting or string sorting the time complexity is Ω(DP (T [0..n] )). Another possibility

More information

High-performance short sequence alignment with GPU acceleration

High-performance short sequence alignment with GPU acceleration Distrib Parallel Databases (2012) 30:385 399 DOI 10.1007/s10619-012-7099-x High-performance short sequence alignment with GPU acceleration Mian Lu Yuwei Tan Ge Bai Qiong Luo Published online: 10 August

More information

A Survey on Disk-based Genome. Sequence Indexing

A Survey on Disk-based Genome. Sequence Indexing Contemporary Engineering Sciences, Vol. 7, 2014, no. 15, 743-748 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.4684 A Survey on Disk-based Genome Sequence Indexing Woong-Kee Loh Department

More information

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * Hsin-Ta Chiao and Shyan-Ming Yuan Department of Computer and Information Science National Chiao Tung University

More information

Burrows Wheeler Transform

Burrows Wheeler Transform Burrows Wheeler Transform The Burrows Wheeler transform (BWT) is an important technique for text compression, text indexing, and their combination compressed text indexing. Let T [0..n] be the text with

More information

Accelerating Blockchain Search of Full Nodes Using GPUs

Accelerating Blockchain Search of Full Nodes Using GPUs Accelerating Blockchain Search of Full Nodes Using GPUs Shin Morishima Dept. of ICS, Keio University, 3-14-1 Hiyoshi, Kohoku, Yokohama, Japan Email: morisima@arc.ics.keio.ac.jp Abstract Blockchain is a

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

W O R S T- C A S E L I N E A R T I M E S U F F I X A R - R AY C O N S T R U C T I O N A L G O R I T H M S

W O R S T- C A S E L I N E A R T I M E S U F F I X A R - R AY C O N S T R U C T I O N A L G O R I T H M S W O R S T- C A S E L I N E A R T I M E S U F F I X A R - R AY C O N S T R U C T I O N A L G O R I T H M S mikkel ravnholt knudsen, 201303546 jens christian liingaard hansen, 201303561 master s thesis June

More information

Project Report: Needles in Gigastack

Project Report: Needles in Gigastack Project Report: Needles in Gigastack 1. Index 1.Index...1 2.Introduction...3 2.1 Abstract...3 2.2 Corpus...3 2.3 TF*IDF measure...4 3. Related Work...4 4. Architecture...6 4.1 Stages...6 4.1.1 Alpha...7

More information

Using Arithmetic Coding for Reduction of Resulting Simulation Data Size on Massively Parallel GPGPUs

Using Arithmetic Coding for Reduction of Resulting Simulation Data Size on Massively Parallel GPGPUs Using Arithmetic Coding for Reduction of Resulting Simulation Data Size on Massively Parallel GPGPUs Ana Balevic, Lars Rockstroh, Marek Wroblewski, and Sven Simon Institute for Parallel and Distributed

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Packet Classification Using Dynamically Generated Decision Trees

Packet Classification Using Dynamically Generated Decision Trees 1 Packet Classification Using Dynamically Generated Decision Trees Yu-Chieh Cheng, Pi-Chung Wang Abstract Binary Search on Levels (BSOL) is a decision-tree algorithm for packet classification with superior

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

An Efficient Algorithm for Computing Non-overlapping Inversion and Transposition Distance

An Efficient Algorithm for Computing Non-overlapping Inversion and Transposition Distance An Efficient Algorithm for Computing Non-overlapping Inversion and Transposition Distance Toan Thang Ta, Cheng-Yao Lin and Chin Lung Lu Department of Computer Science National Tsing Hua University, Hsinchu

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

Just Sort. Sathish Kumar Vijayakumar Chennai, India (1)

Just Sort. Sathish Kumar Vijayakumar Chennai, India (1) Just Sort Sathish Kumar Vijayakumar Chennai, India satthhishkumar@gmail.com Abstract Sorting is one of the most researched topics of Computer Science and it is one of the essential operations across computing

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding

Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding LETTER IEICE Electronics Express, Vol.14, No.21, 1 11 Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding Rongshan Wei a) and Xingang Zhang College of Physics

More information

Limits of Data-Level Parallelism

Limits of Data-Level Parallelism Limits of Data-Level Parallelism Sreepathi Pai, R. Govindarajan, M. J. Thazhuthaveetil Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India. Email: {sree@hpc.serc,govind@serc,mjt@serc}.iisc.ernet.in

More information

Cache-efficient string sorting for Burrows-Wheeler Transform. Advait D. Karande Sriram Saroop

Cache-efficient string sorting for Burrows-Wheeler Transform. Advait D. Karande Sriram Saroop Cache-efficient string sorting for Burrows-Wheeler Transform Advait D. Karande Sriram Saroop What is Burrows-Wheeler Transform? A pre-processing step for data compression Involves sorting of all rotations

More information

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( ) Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL

More information

INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA

INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA SEQUENCING AND MOORE S LAW Slide courtesy Illumina DRAM I/F

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

Parallel Variable-Length Encoding on GPGPUs

Parallel Variable-Length Encoding on GPGPUs Parallel Variable-Length Encoding on GPGPUs Ana Balevic University of Stuttgart ana.balevic@gmail.com Abstract. Variable-Length Encoding (VLE) is a process of reducing input data size by replacing fixed-length

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16

GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. S. Ashkiani (UC Davis) GPU Multisplit GTC / 16 GPU Multisplit Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens S. Ashkiani (UC Davis) GPU Multisplit GTC 2016 1 / 16 Motivating Simple Example: Compaction Compaction (i.e., binary split) Traditional

More information

Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction

Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction Yongying Gao and Hayder Radha Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48823 email:

More information

Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces

Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces Evaluation report prepared under contract with LSI Corporation Introduction IT professionals see Solid State Disk (SSD) products as

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Accelerating MapReduce on a Coupled CPU-GPU Architecture

Accelerating MapReduce on a Coupled CPU-GPU Architecture Accelerating MapReduce on a Coupled CPU-GPU Architecture Linchuan Chen Xin Huo Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus, OH 43210 {chenlinc,huox,agrawal}@cse.ohio-state.edu

More information

Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations

Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Nikolai Zamarashkin and Dmitry Zheltkov INM RAS, Gubkina 8, Moscow, Russia {nikolai.zamarashkin,dmitry.zheltkov}@gmail.com

More information

CONTENT ADAPTIVE SCREEN IMAGE SCALING

CONTENT ADAPTIVE SCREEN IMAGE SCALING CONTENT ADAPTIVE SCREEN IMAGE SCALING Yao Zhai (*), Qifei Wang, Yan Lu, Shipeng Li University of Science and Technology of China, Hefei, Anhui, 37, China Microsoft Research, Beijing, 8, China ABSTRACT

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

Linux Software RAID Level 0 Technique for High Performance Computing by using PCI-Express based SSD

Linux Software RAID Level 0 Technique for High Performance Computing by using PCI-Express based SSD Linux Software RAID Level Technique for High Performance Computing by using PCI-Express based SSD Jae Gi Son, Taegyeong Kim, Kuk Jin Jang, *Hyedong Jung Department of Industrial Convergence, Korea Electronics

More information

A Linear-Time Burrows-Wheeler Transform Using Induced Sorting

A Linear-Time Burrows-Wheeler Transform Using Induced Sorting A Linear-Time Burrows-Wheeler Transform Using Induced Sorting Daisuke Okanohara 1 and Kunihiko Sadakane 2 1 Department of Computer Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0013,

More information

Block-Based Connected-Component Labeling Algorithm Using Binary Decision Trees

Block-Based Connected-Component Labeling Algorithm Using Binary Decision Trees Sensors 2015, 15, 23763-23787; doi:10.3390/s150923763 Article OPEN ACCESS sensors ISSN 1424-8220 www.mdpi.com/journal/sensors Block-Based Connected-Component Labeling Algorithm Using Binary Decision Trees

More information

Parallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. Fall Parallel Sorting

Parallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. Fall Parallel Sorting Parallel Systems Course: Chapter VIII Sorting Algorithms Kumar Chapter 9 Jan Lemeire ETRO Dept. Fall 2017 Overview 1. Parallel sort distributed memory 2. Parallel sort shared memory 3. Sorting Networks

More information

Enhancing the Efficiency of Radix Sort by Using Clustering Mechanism

Enhancing the Efficiency of Radix Sort by Using Clustering Mechanism Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Evaluation of Power Consumption of Modified Bubble, Quick and Radix Sort, Algorithm on the Dual Processor

Evaluation of Power Consumption of Modified Bubble, Quick and Radix Sort, Algorithm on the Dual Processor Evaluation of Power Consumption of Modified Bubble, Quick and, Algorithm on the Dual Processor Ahmed M. Aliyu *1 Dr. P. B. Zirra *2 1 Post Graduate Student *1,2, Computer Science Department, Adamawa State

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

A Massively Parallel Line Simplification Algorithm Implemented Using Chapel

A Massively Parallel Line Simplification Algorithm Implemented Using Chapel A Massively Parallel Line Simplification Algorithm Implemented Using Chapel Michael Scherger Department of Computer Science Texas Christian University Email: m.scherger@tcu.edu Huy Tran Department of Computing

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

Hardware-Supported Pointer Detection for common Garbage Collections

Hardware-Supported Pointer Detection for common Garbage Collections 2013 First International Symposium on Computing and Networking Hardware-Supported Pointer Detection for common Garbage Collections Kei IDEUE, Yuki SATOMI, Tomoaki TSUMURA and Hiroshi MATSUO Nagoya Institute

More information

Optimal Parallel Randomized Renaming

Optimal Parallel Randomized Renaming Optimal Parallel Randomized Renaming Martin Farach S. Muthukrishnan September 11, 1995 Abstract We consider the Renaming Problem, a basic processing step in string algorithms, for which we give a simultaneously

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

Reducing Computational Time using Radix-4 in 2 s Complement Rectangular Multipliers

Reducing Computational Time using Radix-4 in 2 s Complement Rectangular Multipliers Reducing Computational Time using Radix-4 in 2 s Complement Rectangular Multipliers Y. Latha Post Graduate Scholar, Indur institute of Engineering & Technology, Siddipet K.Padmavathi Associate. Professor,

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

Lecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)

Lecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

Predicated Software Pipelining Technique for Loops with Conditions

Predicated Software Pipelining Technique for Loops with Conditions Predicated Software Pipelining Technique for Loops with Conditions Dragan Milicev and Zoran Jovanovic University of Belgrade E-mail: emiliced@ubbg.etf.bg.ac.yu Abstract An effort to formalize the process

More information

Real-time processing for intelligent-surveillance applications

Real-time processing for intelligent-surveillance applications LETTER IEICE Electronics Express, Vol.14, No.8, 1 12 Real-time processing for intelligent-surveillance applications Sungju Lee, Heegon Kim, Jaewon Sa, Byungkwan Park, and Yongwha Chung a) Dept. of Computer

More information

LCP Array Construction

LCP Array Construction LCP Array Construction The LCP array is easy to compute in linear time using the suffix array SA and its inverse SA 1. The idea is to compute the lcp values by comparing the suffixes, but skip a prefix

More information

arxiv: v1 [cs.ds] 8 Dec 2016

arxiv: v1 [cs.ds] 8 Dec 2016 Sorting Data on Ultra-Large Scale with RADULS New Incarnation of Radix Sort Marek Kokot, Sebastian Deorowicz, and Agnieszka Debudaj-Grabysz Institute of Informatics, Silesian University of Technology,

More information

Accelerated Load Balancing of Unstructured Meshes

Accelerated Load Balancing of Unstructured Meshes Accelerated Load Balancing of Unstructured Meshes Gerrett Diamond, Lucas Davis, and Cameron W. Smith Abstract Unstructured mesh applications running on large, parallel, distributed memory systems require

More information

Merge Sort Roberto Hibbler Dept. of Computer Science Florida Institute of Technology Melbourne, FL

Merge Sort Roberto Hibbler Dept. of Computer Science Florida Institute of Technology Melbourne, FL Merge Sort Roberto Hibbler Dept. of Computer Science Florida Institute of Technology Melbourne, FL 32901 rhibbler@cs.fit.edu ABSTRACT Given an array of elements, we want to arrange those elements into

More information

An Efficient Implementation of LZW Compression in the FPGA

An Efficient Implementation of LZW Compression in the FPGA An Efficient Implementation of LZW Compression in the FPGA Xin Zhou, Yasuaki Ito and Koji Nakano Department of Information Engineering, Hiroshima University Kagamiyama 1-4-1, Higashi-Hiroshima, 739-8527

More information

Parallel LZ77 Decoding with a GPU. Emmanuel Morfiadakis Supervisor: Dr Eric McCreath College of Engineering and Computer Science, ANU

Parallel LZ77 Decoding with a GPU. Emmanuel Morfiadakis Supervisor: Dr Eric McCreath College of Engineering and Computer Science, ANU Parallel LZ77 Decoding with a GPU Emmanuel Morfiadakis Supervisor: Dr Eric McCreath College of Engineering and Computer Science, ANU Outline Background (What?) Problem definition and motivation (Why?)

More information

GPU Sparse Graph Traversal. Duane Merrill

GPU Sparse Graph Traversal. Duane Merrill GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 410 (2009) 3372 3390 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs An (18/11)n upper bound for sorting

More information

A Parallel Access Method for Spatial Data Using GPU

A Parallel Access Method for Spatial Data Using GPU A Parallel Access Method for Spatial Data Using GPU Byoung-Woo Oh Department of Computer Engineering Kumoh National Institute of Technology Gumi, Korea bwoh@kumoh.ac.kr Abstract Spatial access methods

More information

Keywords: Binary Sort, Sorting, Efficient Algorithm, Sorting Algorithm, Sort Data.

Keywords: Binary Sort, Sorting, Efficient Algorithm, Sorting Algorithm, Sort Data. Volume 4, Issue 6, June 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com An Efficient and

More information

Automatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm

Automatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm Automatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm San-Chih Lin, Chi-Kuang Chang, Nai-Wei Lin National Chung Cheng University Chiayi, Taiwan 621, R.O.C. {lsch94,changck,naiwei}@cs.ccu.edu.tw

More information

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS INFORMATION SYSTEMS IN MANAGEMENT Information Systems in Management (2017) Vol. 6 (3) 213 222 USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS PIOTR OŻDŻYŃSKI, DANUTA ZAKRZEWSKA Institute of Information

More information

Efficient Multiway Radix Search Trees

Efficient Multiway Radix Search Trees Appeared in Information Processing Letters 60, 3 (Nov. 11, 1996), 115-120. Efficient Multiway Radix Search Trees Úlfar Erlingsson a, Mukkai Krishnamoorthy a, T. V. Raman b a Rensselaer Polytechnic Institute,

More information

Linear Work Suffix Array Construction

Linear Work Suffix Array Construction Linear Work Suffix Array Construction Juha Karkkainen, Peter Sanders, Stefan Burkhardt Presented by Roshni Sahoo March 7, 2019 Presented by Roshni Sahoo Linear Work Suffix Array Construction March 7, 2019

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

PSEUDORANDOM numbers are very important in practice

PSEUDORANDOM numbers are very important in practice Proceedings of the 2013 Federated Conference on Computer Science and Information Systems pp. 515 519 Template Library for Multi- Pseudorandom Number Recursion-based Generars Dominik Szałkowski Institute

More information

Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic

Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment

Improved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment Contemporary Engineering Sciences, Vol. 7, 2014, no. 24, 1415-1423 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.49174 Improved Integral Histogram Algorithm for Big Sized Images in CUDA

More information

Improving Range Query Performance on Historic Web Page Data

Improving Range Query Performance on Historic Web Page Data Improving Range Query Performance on Historic Web Page Data Geng LI Lab of Computer Networks and Distributed Systems, Peking University Beijing, China ligeng@net.pku.edu.cn Bo Peng Lab of Computer Networks

More information