Improved Parallel Rabin-Karp Algorithm Using Compute Unified Device Architecture


Parth Shah 1 and Rachana Oza 2
1 Chhotubhai Gopalbhai Patel Institute of Technology, Bardoli, India, parthpunita@yahoo.in
2 Sarvajanik College of Engineering and Technology, Surat, India, oza.rachana7@gmail.com

Abstract. String matching algorithms are among the most widely used algorithms in computer science. Traditional string matching algorithms are not sufficient for processing the recent growth of data, and increasing the efficiency of the underlying string matching algorithm greatly increases the efficiency of any application built on top of it. In recent years, graphics processing units (GPUs) have emerged as highly parallel processors that outperform the best central processing units in scientific computation. Combining recent advances in graphics processing units with string matching algorithms allows the matching process to be sped up considerably. In this paper we propose a modified parallel version of the Rabin-Karp algorithm using a graphics processing unit. Based on it, the results of the CPU and the parallel GPU implementations are compared to evaluate the effect of varying the number of threads, the number of cores, the file size and the pattern size.

Keywords: Rabin-Karp, GPU, String Matching, CUDA, Parallel Processing, Big Data, Pattern Matching

1 Introduction

String matching algorithms are an important part of the string algorithms. Their task is to find all occurrences of strings (also called patterns) within a larger string or text. String matching algorithms are widely used in computational biology, signal processing, text retrieval, computer security, text editors and many other applications [1]. In systems such as intrusion detection or search, string matching takes more than half of the total computation time [2]. String matching algorithms fall into two main subcategories: single string matching algorithms and multiple string matching algorithms. In this paper we focus mainly on multiple string matching algorithms. Among them, Boyer-Moore performs best if we only want to check whether a pattern occurs in the text, but if we want to find all occurrences of the pattern, Rabin-Karp provides the best results [3].

The final publication is available at Springer via https://doi.org/10.1007/978-3-319-63645-0_26

As information technology spreads, data is growing exponentially. Processing it requires scaling up processing power beyond what a CPU alone can provide. CPUs were designed for general-purpose work, so they have a high operating frequency and a small number of cores. Graphics processing units, on the other hand, were originally developed for highly parallel operations such as graphics rendering. To perform this large amount of computation, GPUs have a much larger number of processing elements (ALUs) and a lower operating frequency than typical CPUs. If we can use these highly parallel processing elements for the pattern matching task, we can greatly improve the performance of string matching algorithms.

In this paper, we implement a modified parallel Rabin-Karp string matching algorithm on the GPU, where most of the computation is performed on the GPU instead of the CPU. Previous work by Dayarathne et al. uses the RB-Matcher technique to compute the hash, whereas our approach uses Rabin hashing. The main contribution of this paper is an improved parallel version of the Rabin-Karp algorithm that exploits the inherent parallelism of Rabin-Karp using the GPU. We also evaluate the performance of the modified approach over different parameters such as pattern length, number of threads, file size and number of cores. Section 2 of this paper describes related work in detail. Algorithm and implementation, results and discussion, and the conclusion are discussed in Sections 3, 4 and 5 respectively.

2 Related work

Accelerating string matching is a major concern in areas where string matching algorithms are used heavily, and in the era of big data it greatly speeds up processing. With the introduction of general purpose graphics processing units (GPGPUs) we can achieve highly parallel processing capabilities [4], and this parallelism can be used to speed up string matching efficiently. Various string matching algorithms have been developed in the past, such as the brute-force algorithm, the Boyer-Moore algorithm, the Knuth-Morris-Pratt algorithm and the Rabin-Karp algorithm. Gongora-Blandon and Vargas-Lombardo [3] compared these algorithms and found that Boyer-Moore [5] performs best when no pattern is matched in the text. The Knuth-Morris-Pratt [6] algorithm depends heavily on the result of the previous phase's calculation, which makes it unsuitable for parallel implementation. That leads us to the Rabin-Karp [7] algorithm, which is inherently parallel in its design. In Rabin-Karp, the hash of each substring depends on nothing but the substring itself, so the hash computation can easily be parallelized on the GPU, which helps improve the performance of the algorithm.

For a parallel implementation of Rabin-Karp, Dayarathne et al. [8] proposed a version that uses RB-Matcher [9] for generating the hash value, but the modular arithmetic involved in RB-Matcher limits the achievable performance improvement.

In our work, we present a GPU version of the Rabin-Karp algorithm that uses Rabin hashing as the hash function. We not only propose the modified approach but also support it with experimental results on real data.

3 Algorithm and Implementation

In this section, we discuss the modified Rabin-Karp algorithm and the details of our parallel CUDA implementation.

3.1 Rabin-Karp algorithm

Rabin-Karp is a string searching algorithm that supports both single and multiple pattern matching. It is widely used in applications such as plagiarism detection and DNA sequence matching. In our proposed approach we use the Rabin hashing method to find pattern strings in a text. To compute the hash, we left-shift the running hash by one bit and add the ASCII value of the current character, and we repeat this for every character of the string. To find matching patterns we compare the hash of the pattern with the hash of each substring of the text; only when the two hashes are equal do we compare the strings character by character to confirm a match.

Pseudo code for sequential implementation

function Rabin_Karp_CPU(result, T, n, P, m)
    hx = 0;
    /* Hash of the pattern: shift left and add each character */
    for (i = 0; i < m; ++i)
        hx = (hx << 1) + P[i];
    /* Hash of every substring of length m in the text */
    for (x = 0; x <= n - m; ++x)
        hy = 0;
        for (i = 0; i < m; ++i)
            hy = (hy << 1) + T[i + x];
        /* Compare hashes, then verify the match character by character */
        if (hx == hy && compare(P, T + x, m) == 0)
            print("match at:", x);
            result[x] = true;
    return result;
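The shift-and-add hash described above can be illustrated with a small, self-contained example. The following sketch is not taken from the paper; the helper name rabin_hash and the sample text and pattern are chosen here purely for illustration.

#include <stdio.h>
#include <string.h>

/* Shift-and-add hash used in the pseudo code above: start at 0, then
 * for every character shift left by one bit and add its ASCII value.
 * (Illustrative sketch; the name rabin_hash is not from the paper.) */
static unsigned long rabin_hash(const char *s, int m) {
    unsigned long h = 0;
    for (int i = 0; i < m; ++i)
        h = (h << 1) + (unsigned char)s[i];
    return h;
}

int main(void) {
    const char *text = "GATTACAT";
    const char *pattern = "ACAT";
    int n = (int)strlen(text), m = (int)strlen(pattern);
    unsigned long hx = rabin_hash(pattern, m);

    for (int x = 0; x <= n - m; ++x) {
        unsigned long hy = rabin_hash(text + x, m);
        /* Only compare the characters when the hashes collide */
        if (hx == hy && memcmp(pattern, text + x, m) == 0)
            printf("match at: %d\n", x);
    }
    return 0;   /* prints "match at: 4" for this input */
}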

In the sequential implementation the matching function is called for each substring in turn. First we read the text T and the pattern P from file. The text T of length n and the pattern P of length m are passed as arguments to the matching function. The function first calculates the hash hx of the pattern and stores it in memory. After that, the hash hy is calculated for each substring of length m in the text T. Only if the hashes of the pattern and the substring match are the strings compared, and if the strings match, the index is stored in the result array. Finally, result is returned containing the matching positions.

Pseudo code for parallel GPU implementation

function Rabin_Karp_GPU(T, P, n, m, hx, result)
    /* Global substring offset from grid, block and thread indices */
    blockId = blockIdx.x + blockIdx.y * gridDim.x
              + gridDim.x * gridDim.y * blockIdx.z;
    x = blockId * blockDim.x + threadIdx.x;
    if (x <= n - m)
        hy = 0;
        for (i = 0; i < m; ++i)
            hy = (hy << 1) + T[i + x];
        /* Compare hashes, then verify the match character by character */
        if (hx == hy && compare(P, T + x, m) == 0)
            print("match at:", x);
            result[x] = true;

In the parallel version, we first load the text T and the pattern P from file into host memory and calculate the pattern hash hx, which is then passed as an argument to the matching kernel. At the same time, T and P are transferred to GPU memory so that they can be accessed by the GPU's processing elements. The matching function then executes for each substring in parallel on the GPU. The processing elements of the GPU are addressed through grids; each grid contains multiple blocks, and each block can execute up to 1024 threads in parallel. The offset of the substring is obtained by combining the thread, block and grid indices, and is stored as x in the pseudo code. Based on this offset, each GPU thread calculates the hash hy of its own substring. Only if the hashes of the substring and the pattern match is the string compared, and if it matches, the offset is stored in result, which is finally transferred back to CPU memory.
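The paper does not list the host-side driver code. The sketch below shows one way such a kernel could be set up and launched; the kernel body follows the pseudo code above, while the names rabin_karp_kernel and rabin_karp_gpu are assumptions made here for illustration, not the authors' implementation.

#include <cuda_runtime.h>

/* Kernel corresponding to the Rabin_Karp_GPU pseudo code (sketch). */
__global__ void rabin_karp_kernel(const char *T, const char *P,
                                  int n, int m, unsigned long hx,
                                  bool *result) {
    int blockId = blockIdx.x + blockIdx.y * gridDim.x
                + gridDim.x * gridDim.y * blockIdx.z;
    int x = blockId * blockDim.x + threadIdx.x;
    if (x <= n - m) {
        unsigned long hy = 0;
        for (int i = 0; i < m; ++i)
            hy = (hy << 1) + (unsigned char)T[x + i];
        if (hy == hx) {                      /* hashes collide: verify */
            bool equal = true;
            for (int i = 0; i < m && equal; ++i)
                equal = (T[x + i] == P[i]);
            result[x] = equal;
        }
    }
}

/* Host-side driver: copy inputs to the device, launch one thread per
 * substring, and copy the match flags back to the host. */
void rabin_karp_gpu(const char *T, int n, const char *P, int m,
                    bool *result) {
    unsigned long hx = 0;
    for (int i = 0; i < m; ++i)              /* pattern hash on the CPU */
        hx = (hx << 1) + (unsigned char)P[i];

    char *dT, *dP;
    bool *dRes;
    cudaMalloc((void **)&dT, n);
    cudaMalloc((void **)&dP, m);
    cudaMalloc((void **)&dRes, (n - m + 1) * sizeof(bool));
    cudaMemcpy(dT, T, n, cudaMemcpyHostToDevice);
    cudaMemcpy(dP, P, m, cudaMemcpyHostToDevice);
    cudaMemset(dRes, 0, (n - m + 1) * sizeof(bool));

    int threads = 1024;                      /* maximum threads per block */
    /* 1D grid for simplicity; very large texts may need to spread the
     * blocks over several grid dimensions, see Sect. 3.2 */
    int blocks = (n - m + 1 + threads - 1) / threads;
    rabin_karp_kernel<<<blocks, threads>>>(dT, dP, n, m, hx, dRes);
    cudaDeviceSynchronize();

    cudaMemcpy(result, dRes, (n - m + 1) * sizeof(bool),
               cudaMemcpyDeviceToHost);
    cudaFree(dT); cudaFree(dP); cudaFree(dRes);
}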

3.2 CUDA Architecture

Compute Unified Device Architecture (CUDA) is a parallel computing platform and application architecture developed by the NVIDIA corporation. It enables users to harness the computing capacity of GPUs and allows C/C++ programs to be sent directly to the GPU for execution. Tools for interacting with CUDA-enabled GPUs are available in the CUDA Toolkit released by NVIDIA. In this article, the CUDA Toolkit and CUDA-enabled GPUs are used to implement the parallel GPU version of the Rabin-Karp algorithm.

Fig. 1. Architecture of CUDA [10]

Some terminology used in the NVIDIA CUDA programming environment is as follows:

Host: the CPU is referred to as the host.
Device: an individual GPU is referred to as a device.
Kernel: a function that is launched from the host and executed on the device.
Thread: each thread is an execution of a given kernel with a specific index; a thread uses its index to access elements of an array.
Block: a group of threads that are executed together and form the unit of resource assignment.
Grid: a group of independently executable blocks.

Our main parallel algorithm is used as the kernel function. The CUDA architecture supports a maximum of 1024 threads per block. The block and grid sizes are selected such that the product of the number of threads, blocks and grids covers the total number of characters in the text, which helps to keep all cores of the GPU busy.
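The kernel pseudo code computes the substring offset from a multi-dimensional block index, which suggests the grid is spread over more than one dimension when the text is large. The sketch below shows one possible way to choose such a launch configuration; the helper name make_grid and the conservative 65535 per-dimension limit are assumptions made here, not details from the paper.

#include <cuda_runtime.h>

/* Assumed helper (not from the paper): choose a grid so that
 * blocks * threads_per_block covers all n - m + 1 substrings. The
 * grid spills into the y dimension when x alone is not enough,
 * matching the multi-dimensional blockId computation in the kernel. */
static dim3 make_grid(long long positions, int threads_per_block) {
    long long blocks = (positions + threads_per_block - 1) / threads_per_block;
    const long long max_dim = 65535;         /* conservative per-dimension limit */
    if (blocks <= max_dim)
        return dim3((unsigned)blocks, 1, 1);
    long long y = (blocks + max_dim - 1) / max_dim;
    return dim3((unsigned)max_dim, (unsigned)y, 1);
}

/* Example use: a 320 MB text and a 7-character pattern with 1024
 * threads per block, as in the experiments below.
 *
 *   long long positions = 320LL * 1024 * 1024 - 7 + 1;
 *   dim3 grid = make_grid(positions, 1024);
 *   rabin_karp_kernel<<<grid, 1024>>>(dT, dP, n, m, hx, dRes);
 */

Extra threads beyond the last substring position are simply filtered out by the kernel's range check on x.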

3.3 Hardware and Datasets

All experiments in this paper were carried out on an Intel Core i7 processor with 4 cores. For the GPU implementation we used two systems with different numbers of cores, since increasing the number of cores increases the total number of available processing elements: 1) an NVIDIA GeForce 920 with 384 CUDA cores and 4 GB of memory, and 2) an NVIDIA GeForce GT 630M with 96 CUDA cores and 2 GB of memory. For evaluating the algorithm we used a random DNA sequence generator to produce DNA sequence datasets of different sizes: 2 MB, 10 MB, 20 MB, 40 MB, 80 MB, 160 MB, 360 MB, 640 MB and 1 GB.

4 Result and Discussion

To evaluate the performance of the proposed parallel algorithm against the serial version we use the speed-up ratio as the performance measure. The speed-up ratio S is calculated as

S = T_CPU / T_GPU    (1)

where T_CPU and T_GPU are the execution times of the CPU and GPU implementations respectively.

First we compared the parallel algorithm with the sequential algorithm by executing the modified algorithm with different numbers of threads per block.

Table 1. Performance comparison for different numbers of threads per block

Threads per block   Sequential implementation (ms)   Parallel GPU implementation (ms)   Speed-up ratio
32                  6937                             4718.750                           1.470093
64                  6535                             2708.375                           2.412886
128                 6839                             1715.790                           3.985919
256                 6360                             1220.096                           5.212705
512                 6047                              971.2466                          6.226019
1024                6125                              847.1143                          7.230430

As Table 1 shows, the execution time of the parallel implementation decreases as the number of threads per block increases. To check how the algorithm scales with the number of cores, we ran the parallel algorithm on GPUs with 96 and 384 cores. The results on both GPUs, with a constant pattern size of 7 and a 320 MB file, are given in Table 2. As Fig. 2 shows, the speed-up grows with the number of cores of the GPU: as the number of cores increases, the total execution time decreases.

To check the effect of the pattern size on execution time, we executed both the sequential and the parallel version with different pattern sizes. As Table 3 shows, increasing the pattern size increases the speed-up ratio, because the GPU can handle the heavier computation with its much larger number of cores compared to the CPU.
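The execution times in the tables are reported in milliseconds, but the paper does not describe how they were measured. A common way to time a kernel launch, shown here purely as an assumed measurement setup and reusing the hypothetical rabin_karp_kernel from the sketch in Section 3, is with CUDA events:

#include <cuda_runtime.h>

/* Assumed helper (not from the paper): time one kernel launch and
 * return the elapsed time in milliseconds. */
float time_kernel_ms(const char *dT, const char *dP, int n, int m,
                     unsigned long hx, bool *dRes,
                     dim3 grid, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    rabin_karp_kernel<<<grid, threads>>>(dT, dP, n, m, hx, dRes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);             /* wait for the kernel to finish */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); /* elapsed time in milliseconds */

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}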

Table 2. Performance comparison for different numbers of cores

Threads per block   NVIDIA GeForce GT 630M (96 cores) (ms)   NVIDIA GeForce 920 (384 cores) (ms)
32                  10638.17                                 4718.75
64                   5931.667                                2708.375
128                  3600.603                                1715.79
256                  2434.065                                1220.096
512                  1787.352                                 971.2466
1024                 1495.54                                  847.1143

Fig. 2. Performance comparison for different numbers of cores (execution time vs. number of threads per block, file size = 320 MB, 96 cores vs. 384 cores)

Since our main goal is to process and match patterns in real-life situations where we encounter different file sizes, we also evaluated the performance of the GPU implementation for different file sizes, shown in Fig. 3. The GPU version of the parallel algorithm performs better as the file size grows, because a larger file dramatically increases the amount of computation, which the GPU handles easily.

Fig. 3. Performance comparison for different file sizes (execution time vs. file size, pattern length = 7, CPU vs. GPU)

Table 3. Performance comparison for different pattern lengths

Pattern length   Intel Core i7 (4 cores) (ms)   NVIDIA GeForce 920 (384 cores) (ms)   Speed-up ratio
25               140                              7.51616                             18.62653
50               328                             13.82605                             23.72334
100              578                             26.37414                             21.9154
200              1187                            51.32698                             23.12624
800              4578                           201.1771                              22.75607

5 Conclusion

Comparing the serial and CUDA implementations of the Rabin-Karp string matching algorithm, the CUDA version achieves up to a 23x speed-up over the serial version. Using Rabin hashing reduces the total computation required to generate the hash, which is directly reflected in the execution speed. From our experiments we conclude that the maximum speed-up is achieved when the file size is minimal, and that the speed-up decreases as the file size increases. Similarly, increasing the number of cores and execution threads on the GPU yields the maximum speed-up in the string matching task, since it increases the total number of tasks that can be executed concurrently.

References

1. Singla, N., Garg, D.: String matching algorithms and their applicability in various applications. International Journal of Soft Computing and Engineering 1(6) (January 2012)
2. Xu, D., Zhang, H., Fan, Y.: The GPU-based high-performance pattern-matching algorithm for intrusion detection. Journal of Computational Information Systems 9(10) (2013) 3791-3800
3. Gongora-Blandon, M., Vargas-Lombardo, M.: State of the art for string analysis and pattern search using CPU and GPU based programming. Journal of Information Security 3(4) (2012) 314
4. Ghorpade, J., Parande, J., Kulkarni, M., Bawaskar, A.: GPGPU processing in CUDA architecture. arXiv preprint arXiv:1202.4347 (2012)
5. Boyer, R.S., Moore, J.S.: A fast string searching algorithm. Communications of the ACM 20(10) (October 1977) 762-772
6. Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing 6(2) (1977) 323-350
7. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31(2) (March 1987) 249-260
8. Dayarathne, N., Ragel, R.: Accelerating Rabin-Karp on a graphics processing unit (GPU) using compute unified device architecture (CUDA). In: 7th International Conference on Information and Automation for Sustainability, IEEE (December 2014) 1-6
9. Chillar, R.S., Kochar, B.: RB-Matcher: String matching technique. World Academy of Science, Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering 2(6) (2008) 2282-2285
10. Ramey, W.: Introduction to CUDA platform. NVIDIA Corporation, http://developer.nvidia.com/compute/developertrainingmaterials/presentations/general/introduction_to_cuda_platform.pptx (2016) [Accessed 6 February 2016]