Exploiting GPUs to Accelerate Clustering Algorithms


Mahmoud Al-Ayyoub, Qussai Yaseen, Mohammed A. Shehab, Yaser Jararweh, Firas Albalas and Elhadj Benkhelifa
Jordan University of Science and Technology, Irbid, Jordan
Mobile Fusion Applied Research Centre, Staffordshire University, Stafford, UK

Abstract

Big data is a major problem for data mining methods. Fortunately, the rapid advances in affordable high performance computing platforms such as the Graphics Processing Unit (GPU) have helped researchers reduce the execution time of many algorithms, including data mining algorithms. This paper discusses the use of the parallelism capabilities of the GPU to improve the performance of two common clustering algorithms: K-Means (KM) and Fuzzy C-Means (FCM). Two main parallelism approaches are presented: pure and hybrid. The resulting versions are tested under different settings, including two different GPU-equipped machines (a laptop and a server). The results show excellent improvement gains for the hybrid implementations compared with the pure parallel and sequential ones. On the laptop, the best gains of the hybrid implementations over the sequential ones are 11.3X for KM and 10.9X for FCM. On the server, the best gains are 13.5X for KM and 16.3X for FCM. Moreover, the paper explores the use of a recent GPU memory management technique called Unified Memory (UM). With UM, the performance gain of the hybrid implementations decreases by 44% for KM and by 61% for FCM. On the other hand, UM does introduce a small advantage for the pure parallel implementation.

I. INTRODUCTION

Big data has become a main challenge for many information technology fields due to the large processing time it needs.
The clustering of big data, where data is separated into groups with similar features, is an example of those challenges in the data mining and machine learning fields. The applications of clustering are numerous and diverse, from text analysis to robotics. Image segmentation is another popular clustering application, where objects are segmented from natural images, or regions of interest are extracted from medical images to diagnose diseases such as brain tumors and breast cancer [1]-[3]. K-Means (KM) and Fuzzy C-Means (FCM), two very common methods for clustering data, face serious issues when they deal with big data [4]-[6]. The execution times of these techniques increase as the data size increases, which makes big data clustering a major issue. Furthermore, a larger number of dimensions may slow down the clustering operation [7].

To increase the efficiency of clustering algorithms on big data, parallel programming is used. To this end, Graphics Processing Units (GPUs) are gaining popularity for compute-intensive workloads compared with Central Processing Units (CPUs). The reason for this is very simple. While modern CPUs can run up to 32 threads at the same time, modern GPUs can run around 4999 threads [8], [9]. Obviously, GPUs can run many more threads than CPUs. Therefore, many researchers use this advantage to improve the performance of many algorithms [10]-[13]. This paper leverages the capabilities of GPUs and parallel techniques in big data clustering. This reduces the effect of increasing data size and number of dimensions, and increases the scalability of the KM and FCM clustering algorithms.

Both CPUs and GPUs are the focus of many researchers in academia as well as in industry. Thus, both are being rapidly improved and optimized in terms of speed, parallelism capability, memory management, etc.
Unfortunately, many previous studies fell into the pitfall of unfair performance comparisons between CPUs and GPUs, either from different settings (e.g., comparing a GPU built for heavily loaded servers with a CPU built for lightly loaded laptops) or from different time periods (e.g., comparing a GPU with a CPU that is five or ten years older). This paper aims to perform fair comparisons using modern CPUs and modern GPUs in both laptop and server settings.

The contributions of this paper are as follows.
1) It uses different parallel programming approaches (pure parallel and hybrid parallel) to test the KM and FCM clustering algorithms on big data.
2) It leverages the capabilities of GPUs to implement the aforementioned methods under a variable number of dimensions and scalable data size.
3) It tests the Unified Memory (UM) technology with the pure parallel and hybrid parallel implementation approaches, and shows that UM has a negative effect on the performance gain of the hybrid approach.

The paper is organized as follows. The next section discusses related work. Section III discusses the proposed methodology. Section IV presents and analyzes the experiments and results. Section V concludes the work and presents future work.

II. RELATED WORK

This section discusses some of the existing work related to the problem at hand. Specifically, we try to cover prior efforts to improve the performance of the KM or FCM algorithms using parallelism (especially if it is based on GPUs). A limitation of clustering algorithms lies in the processing time needed for clustering and labeling data, especially in the case of big data. However, many researchers have proposed new algorithms that can handle the clustering of big data.

Zechner et al. [14] accelerated the KM algorithm by utilizing the GPU. They used an Intel Core 2 CPU with 4GB of main memory and an NVIDIA GeForce 9600 GT GPU with 512MB of memory. The operating system was Windows XP and the algorithm was implemented using C and CUDA. The parallel implementation was 14X faster than the sequential version.

Another work on using GPUs to accelerate the KM algorithm is that of Farivar et al. [15]. The authors used an Intel Pentium D CPU and compared it with two GPUs: an NVIDIA GeForce 8600 GT and an NVIDIA 8800 Ultra GTX. They used a dataset of 1 million elements with one dimension and 4 clusters. The first GPU (NVIDIA 8600 GT) was around 13X faster than the sequential implementation, while the second GPU (NVIDIA 8800 Ultra GTX) was around 68X faster.

In [16], Soroushnia et al. discussed a parallel implementation of the KM algorithm using CUDA. In their experiments, the input data ranged from 1K elements to 1M elements. In each test case, the number of clusters ranged from 8 to 124. They compared an Intel Pentium D with 1GB of main memory to an NVIDIA 8800 GTX GPU with 782MB of memory. The improvement achieved was around 6X with 124 clusters.

Similarly, Shalom et al. [17] implemented a parallel version of the KM clustering algorithm. They compared the performance of a Pentium 4 CPU with that of an NVIDIA GeForce FX 5900 XT and an NVIDIA GeForce 8500 GT. The improvement they achieved was about 5X over the sequential version.
Fuzzy C-Means (FCM) is a clustering algorithm used to segment data into a number of clusters [5]. Compared with the KM algorithm, it is more complex and takes more time to separate the data into clusters [7]. Many researchers have dealt with the problem of enhancing FCM's performance using parallelism [18]-[25].

In [18], Shalom et al. implemented a parallel version of the FCM algorithm. They compared the performance of the parallel implementation on two GPUs (a GeForce 8500 GT and an 8800 GTX) with the performance of a sequential implementation on a Pentium 4 CPU. The two GPU models achieved improvements of around 73X and 14X, respectively. The dataset contained 1 million data points with 4 dimensions, and the number of clusters ranged from 3 to 64.

Similarly, Li et al. [19] implemented a parallel version of the FCM algorithm for image segmentation. The dataset consisted of images of natural scenes from which the authors tried to extract objects. In their work, all FCM functions were located on the GPU side. They used an Intel Core 2 Duo CPU with an NVIDIA GTX 260 GPU. The improvement achieved by their model was 1X faster than the sequential version.

More work on FCM was performed by Zhuge et al. [20], who used the algorithm to improve the segmentation of medical images. The authors divided the dataset into three image types (small, medium and large), and used a quad-core Intel Xeon CPU and a Tesla C1060 GPU in their experiments. They achieved improvements of about 24X, 18X and 1X for the small, medium and large images, respectively.

Onchis et al. [21] implemented the FCM algorithm with CUDA for image segmentation. The algorithm was used to extract a Region of Interest (ROI) from images. For this purpose, the authors compared the performance of an Intel Core i7 CPU with that of a Tesla M2070Q GPU. They achieved an improvement of about 3X.
Most of the papers discussed so far report higher improvement gains than what we obtain. One explanation is the unfairness point mentioned previously, where the sequential version is run on a simple or old CPU whereas the parallel version is run on a more capable or newer GPU. In our work, we try to be as fair as possible in our choices of CPU and GPU hardware for the two settings under consideration. Specifically, for the laptop setting, we use an Intel Core i7 CPU and an NVIDIA GeForce GT 740M. For the server setting, we use a two-socket Xeon Haswell 2.6GHz server with 16 cores in total and 2X NVIDIA K80 GPU cards (4 GPUs in total). Another advantage of this work compared with existing work is the inclusion of two different clustering techniques and the consideration of UM. To the best of our knowledge, no prior work provides such a rich set of experiments.

III. METHODOLOGY

This section discusses the proposed methodology. Two common clustering algorithms, namely K-Means (KM) and Fuzzy C-Means (FCM), are implemented using parallel programming on the GPU. Moreover, a new NVIDIA technology called Unified Memory (UM) is considered. In UM, a virtual memory space shared by the CPU and the GPU is created to reduce the effect of data transfer. The performance of the proposed parallel implementations is compared with the performance of the sequential version. The next subsections discuss the sequential versions followed by the parallel versions.

A. Sequential Implementation

1) K-Means (KM) Clustering Algorithm: KM is one of the simplest and most common clustering techniques. It is used to cluster data into K groups. The algorithm has three main functions: update centroids, calculate memberships and calculate the objective function [26]. The algorithm starts by initializing the cluster centroids, where the number of clusters is set by the user. Next, the algorithm generates random values as the initial centroids.
Then, it calculates the membership of each point of the input data using Equation (1) [4].

$$t_{i,j} = (X_i - C_j)^2 \tag{1}$$
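As a rough illustration of this membership step, the following minimal C sketch evaluates the squared distance of Equation (1) and assigns each point to the centroid with the smallest value. It is a sketch under stated assumptions, not the authors' code: the function names, the row-major data layout and the generalization to several dimensions (summing the squared differences over coordinates) are all assumptions made here for illustration.

```c
#include <stddef.h>

/* Squared distance between point x and centroid c, each with `dims`
   coordinates. With dims == 1 this is exactly Equation (1); summing over
   coordinates for dims > 1 is an assumption of this sketch. */
double squared_distance(const double *x, const double *c, size_t dims)
{
    double sum = 0.0;
    for (size_t d = 0; d < dims; ++d) {
        double diff = x[d] - c[d];
        sum += diff * diff;
    }
    return sum;
}

/* Assign each of the n points (row-major, n x dims) to its nearest
   centroid (row-major, k x dims, with k >= 1). labels[i] receives the
   index of the cluster that point i belongs to. */
void assign_memberships(const double *points, size_t n,
                        const double *centroids, size_t k,
                        size_t dims, size_t *labels)
{
    for (size_t i = 0; i < n; ++i) {
        const double *x = &points[i * dims];
        size_t best = 0;
        double best_dist = squared_distance(x, &centroids[0], dims);
        for (size_t j = 1; j < k; ++j) {
            double dist = squared_distance(x, &centroids[j * dims], dims);
            if (dist < best_dist) {
                best_dist = dist;
                best = j;
            }
        }
        labels[i] = best;
    }
}
```

Each iteration of the outer loop reads only its own point and writes only its own `labels[i]`, so no two points interact in this step.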

The centroids are updated at each iteration using Equation (2).

$$C_j = \frac{1}{K} \sum_{i}^{N} \mu_{i,j} X_i \tag{2}$$

where $\mu_{i,j}$ is the membership of point $i$ in cluster $j$, $N$ is the number of points and $K$ is the number of points related to the class. The process continues until the difference between the previous and current values of the objective function becomes less than or equal to a certain threshold. The objective function is calculated using Equation (3).

$$\theta = \sum_{j}^{C} \sum_{i}^{N} (X_i - C_j)^2 \tag{3}$$

2) Fuzzy C-Means (FCM) Clustering Algorithm: FCM is one of the popular soft clustering techniques. It combines two ideas: fuzzy sets and the C-Means algorithm [5], [27]. Therefore, the FCM algorithm is more complex than the KM algorithm. The algorithm is based on three main functions: calculate memberships, update cluster centroids and calculate the objective function. First, the algorithm takes the number of clusters and the degree of fuzziness as inputs and initializes random centroids in the first iteration. Then, in each iteration, it calculates the membership values using Equation (4) and updates the centroids using Equation (5).

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}} \tag{4}$$

$$c_j = \frac{\sum_{i=1}^{n} u_{ij}^{m} x_i}{\sum_{i=1}^{n} u_{ij}^{m}} \tag{5}$$

where $C$ is the number of clusters, $x_i$ is object point $i$, $m$ is the fuzziness factor, $n$ is the number of points and $c_j$ is the center of cluster $j$. Then, the objective function is calculated using Equation (6).

$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{C} u_{ij}^{m} \lVert x_i - c_j \rVert^2, \quad 1 \le m < \infty \tag{6}$$

Finally, it calculates the difference between the objective function values of the previous and current steps. If the difference is less than or equal to a certain preset threshold, the algorithm stops.

B. Parallel Implementations

Parallel programming separates the implementation code into sub-blocks that can run concurrently. This technique is used to reduce the execution time of applications in which some parts of the code can run individually, without any dependencies between them. The CPU is capable of running only a few parts of the code in parallel.
On the other hand, customized hardware, such as GPUs, offers more parallelization capabilities. However, this process is controlled by the CPU, which sends the work to the GPU. Intuitively, one can think of two types of parallel implementations on GPUs. The first type is called the Hybrid Implementation, which distributes the execution of code blocks between the CPU and the GPU. As a matter of fact, the GPU executes some functions more efficiently than a parallel implementation on the CPU. The reason for this is very simple. While modern CPUs can run up to 32 threads at the same time, modern GPUs can run around 4999 threads [8], [9]. The second type of parallel implementation is called the Pure Parallel Implementation. In this type, the code runs on the GPU, while the CPU just sends the job to the GPU and receives back the results.

It is not immediately clear why anyone would consider the hybrid implementation if the pure parallel implementation allows for more parallelism and avoids any interaction between the CPU and the GPU during the execution of the code. The pure parallel implementation would indeed be preferable if the code to be parallelized had limited dependencies between its sub-blocks. Obviously, this is not always the case, as it is common to come across code that has high dependencies, which means that it can run faster on the CPU than it does on the GPU. As we show later in this paper, the type of dependencies might allow a hybrid implementation to outperform both pure implementations (pure CPU and pure GPU). The following subsections discuss the implementation of the KM and FCM algorithms using both techniques.

1) KM Hybrid and Pure Parallel Implementations: We present two main implementations of the KM algorithm: the hybrid implementation and the pure parallel implementation. The paper tests each version with and without the UM technology. The purpose of these tests is to investigate how UM can help in reducing the effect of data transfer between the GPU and the CPU.
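To make the notion of dependencies between sub-blocks more concrete, here is a minimal sequential C sketch of the two remaining KM functions, the centroid update and the objective function. It is only an illustration under assumptions: the function names, the caller-provided scratch buffers and the choice to sum each point's distance to its own assigned centroid are hypothetical and are not taken from the authors' implementation.

```c
#include <stddef.h>

/* Recompute each centroid as the mean of the points assigned to it.
   `sums` (k x dims) and `counts` (k) are scratch buffers supplied by the
   caller. Every point of cluster j adds into the same slots of `sums`
   and `counts`; this shared accumulation is the kind of dependency that
   forces parallel threads to synchronize. */
void update_centroids(const double *points, size_t n, size_t dims,
                      const size_t *labels, size_t k,
                      double *centroids, double *sums, size_t *counts)
{
    for (size_t j = 0; j < k; ++j) {
        counts[j] = 0;
        for (size_t d = 0; d < dims; ++d)
            sums[j * dims + d] = 0.0;
    }
    for (size_t i = 0; i < n; ++i) {
        size_t j = labels[i];
        counts[j] += 1;
        for (size_t d = 0; d < dims; ++d)
            sums[j * dims + d] += points[i * dims + d];
    }
    for (size_t j = 0; j < k; ++j) {
        if (counts[j] == 0)
            continue; /* empty cluster: keep its previous centroid */
        for (size_t d = 0; d < dims; ++d)
            centroids[j * dims + d] = sums[j * dims + d] / (double)counts[j];
    }
}

/* Objective value: sum of squared distances between every point and the
   centroid it is assigned to. All points add into the single variable
   `total`, again a shared write location. */
double kmeans_objective(const double *points, size_t n, size_t dims,
                        const size_t *labels, const double *centroids)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double *c = &centroids[labels[i] * dims];
        for (size_t d = 0; d < dims; ++d) {
            double diff = points[i * dims + d] - c[d];
            total += diff * diff;
        }
    }
    return total;
}
```

In contrast to a per-point computation, both functions funnel many inputs into a few output locations, which is why they are natural candidates to stay on the CPU in a hybrid split.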
To implement the algorithm in a parallel setting, we must choose which functions run on the CPU side and which ones run on the GPU side. As discussed earlier, the KM algorithm has three main functions: calculate memberships, calculate centroids and calculate objective. To determine the best distribution of functions between the GPU and the CPU, the Microsoft Visual Studio 2013 profiling tool is used with the sequential version. The profiling tool shows that the calculate memberships function is the heaviest function to run on the CPU side. Therefore, we decide to run this function on the GPU side. This technique improves the performance by about 10X (on average) compared with the sequential version.

Similarly, we transfer the other functions to the GPU in order to test whether the pure parallel implementation is better than the hybrid implementation. However, after transferring them to the GPU side, we notice that the execution takes longer than that of the hybrid implementation (only around 3X improvement over the sequential implementation). To detect the source of this delay, we use the CUDA profiling tool. Using this tool, we discover that the calculate centroids and calculate objective functions are heavy on the GPU side. Therefore, the best implementation is to use the CPU to calculate the centroids and the objective function, and the GPU to calculate the membership values. This is because the summation operation is not well suited to parallel execution. As shown in Equations (2) and (3), the calculate centroids and calculate objective functions contain a summation. This type of operation creates a dependency that prevents parallel programming from being used efficiently, since every thread needs to write to the same memory location. That is, it requires synchronization between the worker threads.

UM is used for both the hybrid and pure parallel implementations to measure how useful it is in improving the performance of the clustering algorithms under consideration. To do so, the memory allocation functions on the CPU and GPU sides are replaced with UM functions. On average, the performance reached around 5X that of the sequential implementation for the hybrid implementation and 3.5X for the pure parallel implementation.

2) FCM Hybrid and Pure Parallel Versions: The same experiments discussed in the previous subsection for the KM algorithm are performed on the FCM algorithm. The sequential implementation of the algorithm is analyzed using the Microsoft Visual Studio 2013 profiling tool to detect the heaviest functions. The profiler report shows that the membership calculation function is the heaviest on the CPU side. Therefore, this function is selected to run on the GPU side in the hybrid implementation, while the update centroids and objective functions are selected to run on the CPU side. This implementation achieves an improvement of about 10X on average. In the pure parallel version, all FCM functions run on the GPU side; only the objective value is transferred to the CPU side after it is calculated on the GPU. This version achieves an improvement of about 5X on average. With the UM technology, however, the performance reached 7X for the pure parallel version and decreased to 5X for the hybrid version.

IV. EXPERIMENTS AND ANALYSIS

This section presents and analyzes the results of this work.
The following subsection describes the specifications of the hardware and software used in this paper, while Subsections IV-B and IV-C present the experimental setup and the results with their analysis, respectively.

A. Hardware Specifications

Two types of equipment are used in this work: a simple laptop and a more powerful server. The specifications of each type are listed below.

Simple Equipment: 2.2 GHz fourth-generation Intel Core i7 CPU with 6GB RAM. The GPU is an NVIDIA GT 740M with 2GB of memory. 64-bit Windows 10 operating system, CUDA 7.5 toolkit, CUDA drivers and Microsoft Visual Studio 2013.

Powerful Equipment: Two-socket Xeon Haswell 2.6GHz server with 16 cores and 128 GB RAM, equipped with 2X NVIDIA K80 GPU cards (4 GPUs in total). 64-bit Linux kernel OS, Red Hat-compatible distribution 6.6 (Scientific Linux), NVIDIA driver, CUDA SDK 7.0 and OpenMPI.

TABLE I
VERSIONS OF THE DATASET

Dataset name        Dataset size (records)
Transactions100k    120,428
Transactions300k    284,284
Transactions500k    475,649
Transactions700k    665,471
Transactions800k    855,367
Total               2,401,199

B. Experimental Setup

This part discusses the experimental setup. The C programming language is used to implement the sequential version, while the GPU side is implemented using CUDA. The dataset from [28] is used to test the two clustering algorithms, KM and FCM. The dataset is divided into several subsets in order to test the scalability of the proposed implementations. Table I shows the subsets of the dataset used in the experiments. As can be seen from the table, the datasets are rather large, containing hundreds of thousands of data points. Each data point has three dimensions (denoted X, Y and Z). Each algorithm is run on the five groups of data. Each group is tested using all dimensions (X, Y and Z), where dimensions are added one at a time.
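The order of the runs in this protocol can be sketched as a pair of nested loops: the number of dimensions in the outer loop and the dataset subsets in the inner loop. The C fragment below is only a hedged illustration; the `struct run` type and the `build_schedule` function are hypothetical names introduced here, and subsets are identified by index rather than by name.

```c
#include <stddef.h>

/* One experiment: how many dimensions are used (1 = X, 2 = X and Y,
   3 = X, Y and Z) and which dataset subset is loaded (0-based index). */
struct run {
    size_t dims;
    size_t subset;
};

/* Fill `out` with the runs in execution order and return how many were
   written. `out` must have room for n_subsets * max_dims entries. */
size_t build_schedule(size_t n_subsets, size_t max_dims, struct run *out)
{
    size_t count = 0;
    for (size_t dims = 1; dims <= max_dims; ++dims) {
        for (size_t s = 0; s < n_subsets; ++s) {
            out[count].dims = dims;
            out[count].subset = s;
            ++count;
        }
    }
    return count;
}
```

With five subsets and three dimensions this yields 15 runs per algorithm version: all subsets with one dimension first, then all subsets with two, then all with three.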
For example, testing of the KM algorithm starts with loading all elements of the first dataset (Transactions100k), consisting of 120,428 data points. Only one dimension, X, is used in this experiment. After the algorithm finishes segmenting the data points, the second dataset (Transactions300k), consisting of 284,284 data points, is loaded, again using one dimension only, and so on. When all groups have been tested using one dimension, the same process is repeated using two dimensions (X and Y), etc.

C. Results

As mentioned earlier, the experiments are conducted using two types of equipment: the simple equipment and the powerful equipment. Five versions of each of the KM and FCM algorithms are tested: a sequential implementation as well as four different parallel implementations (depending on whether the parallel implementation is pure or hybrid, and whether it uses UM or not). The following two subsections show the results using the simple equipment and the powerful equipment, respectively.

1) Effect of Scaling-Up the Dataset: Figure 1 shows the effects of increasing the dataset size and the number of dimensions on the simple laptop equipment. Clearly, such increases lead to an increase in the execution time on the CPU. However, the GPU implementations do not exhibit the same trend. Furthermore, the hybrid version is better than the pure parallel version. The improvement of the hybrid version reached 11X without UM and 6X with it. On the other hand, the pure parallel version achieved improvements of about 3X and 3.6X without and with UM, respectively. UM decreased the improvement gain of the hybrid version because of hardware synchronization. UM reduces the transfer time between the CPU and the GPU; however, the effect of the synchronization operation is greater than that of the transfer time. With UM, the data is locked for the CPU until the GPU completes its processing. During this time, the CPU runs another process, and it needs time to switch back when it gets a signal from the GPU to release the data. The same happens on the GPU side. In the hybrid version, the CPU and the GPU are ready to run the code because the transfer time is slower than the execution time. Furthermore, each side has its own data that can be accessed at any time. Therefore, using UM, the data that is locked is data that is not needed at execution time by either the CPU or the GPU.

[Figure 1 plots execution time (in seconds) against dataset size (in hundreds of thousands) for the CPU and for the GPU hybrid (Hyb) and pure (Pur) versions, without (-UM) and with (+UM) Unified Memory, each on 1D, 2D and 3D data. Panels: (a) KM, (b) FCM.]
Fig. 1. The effect of increasing the dataset size and the number of dimensions on the simple equipment.

The experiments on the FCM algorithm are similar to those on the KM algorithm. The execution time of the FCM algorithm is also affected by increasing the size of the input data and the number of dimensions. Furthermore, the hybrid implementation gives better results than the pure parallel version. However, using UM creates the same delay as in the KM algorithm. Running the FCM algorithm in the hybrid version without UM was 10.8X faster than the sequential version. With UM, however, the improvement decreased to 4.7X. Similarly, the pure parallel implementation was about 5.6X faster than the sequential version, while using UM with the pure parallel model increased the performance to 7X faster than the sequential implementation.
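The gains quoted in this section can be read as ratios of sequential to parallel execution time, and a change in gain (for example when UM is enabled) as a percentage of the gain without UM. The paper does not spell out these formulas, so the following C helpers are an assumption about how such figures are typically derived; the function names and the timings in the usage note are hypothetical.

```c
/* Speedup ("NX faster"): sequential time divided by parallel time.
   Both times must be positive and in the same unit. */
double speedup(double t_sequential, double t_parallel)
{
    return t_sequential / t_parallel;
}

/* Relative change of a gain, in percent of the original gain.
   A negative result is a reduction, a positive result an increase. */
double relative_change_percent(double gain_before, double gain_after)
{
    return (gain_after - gain_before) / gain_before * 100.0;
}
```

For instance, with hypothetical timings of 120 s sequentially and 10 s in parallel, `speedup(120.0, 10.0)` gives 12, and a gain that falls from 10X to 5X corresponds to `relative_change_percent(10.0, 5.0)`, that is, a 50% reduction.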
The best improvement we obtained is for the hybrid version without UM. With UM, the improvement was reduced by 44% for the hybrid version of KM and by 61% for FCM, whereas UM increased the performance of the pure parallel version by 14% for KM and by 18% for FCM. This effect is due to the hardware synchronization between the CPU and the GPU. Synchronization occurs more often in the hybrid version than in the pure parallel version, because in the hybrid version the CPU and the GPU have to collaborate to achieve the main goal, which is to divide the data into clusters. In the pure parallel version, by contrast, the CPU launches the GPU kernel code and the GPU then runs all of the algorithm's steps. In this case, synchronization is less likely than in the hybrid version.

We deduce from these results that UM does not reduce the effect of transferring data between the CPU and the GPU; rather, it helps the developer manage data transfers. If the implementation has different data that need to be transferred between the GPU and the CPU, the developer can use UM and focus on coding the functions without missing any data transfer.

2) Effect of Hardware: The experiments of the previous subsection were all conducted on the simple equipment. The purpose of this experiment is to compare the performance gains obtained on the two hardware settings under consideration. Figure 2 shows the results of testing the hybrid versions of the KM and FCM algorithms without UM. Figure 2(a) shows that the improvements for the KM algorithm on 1D, 2D and 3D data are about 9X, 13X and 17X, respectively. As for the FCM algorithm, Figure 2(b) shows that the improvements are about 12X, 13X and 15X for 1D, 2D and 3D, respectively. The performance gain that the clustering algorithms obtain from utilizing the GPU capabilities is shown in Figure 3. The Tesla K80 GPU has 4999 cores, while the GT 740M GPU has 366 cores.
It can be clearly seen from the figure that, for the KM algorithm, the Tesla K80 is 3X faster than the GT 740M, whereas for the FCM algorithm it is 6X faster. The complexity of FCM is higher than that of KM [7]. Nevertheless, FCM gains more from the Tesla K80 than from the GT 740M. This means that the more powerful GPU is less affected by the heavier algorithm.

[Figure 2 shows the performance gain on 1D, 2D and 3D data for the Tesla K80 and the GT 740M. Panels: (a) KM, (b) FCM.]
Fig. 2. Comparison of the performance gains obtained by the two hardware equipments under consideration.

[Figure 3 shows the performance gain for KM and FCM on the Tesla K80 and the GT 740M.]
Fig. 3. Comparison of the performance gains obtained on the clustering algorithms by the two hardware equipments under consideration.

V. CONCLUSION

This work has presented our effort to extensively study the performance improvements gained by utilizing GPUs to speed up the two common clustering algorithms, KM and FCM. The paper has provided four different parallel implementations of each algorithm (depending on whether the parallel implementation is pure or hybrid, and whether it uses UM or not). Moreover, extensive experiments have been conducted using two types of equipment: a simple laptop setting and a more powerful server setting. The experiments aimed to study the effect of increasing the data size and the number of dimensions on the performance gain of the different parallel implementations compared with the sequential ones. The paper has shown that the best improvement gain is obtained by the hybrid version without UM, which achieved about 11.3X for KM and about 10.9X for FCM. With UM, the performance gain of the hybrid version was about 6X for KM and about 5X for FCM. Moreover, the pure parallel implementation achieved an improvement of about 3X for KM and about 6X for FCM. With UM, the gain of the pure parallel version was about 4X for KM and about 7X for FCM.

ACKNOWLEDGMENT

This research was supported by the Deanship of Research at the Jordan University of Science and Technology (Grant no 21674). We are thankful to NVIDIA for giving us the chance to try the new Tesla K80 GPU in a server environment. We also thank Mr. Carlo Nardone, the technical developer from NVIDIA, for his cooperation.

REFERENCES

[1] M. Shehab et al., "Improving fcm and t2fcm algorithms performance using gpus for medical images segmentation," in ICICS. IEEE, 2015.
[2] M. Shehab et al., "Accelerating compute-intensive image segmentation algorithms using gpus," The Journal of Supercomputing, 2016, to appear.
[3] H. Cheng et al., "Automated breast cancer detection and classification using ultrasound images: A survey," Pattern Recognition, vol. 43, no. 1, 2010.
[4] A. Likas, N. Vlassis, and J. J. Verbeek, "The global k-means clustering algorithm," Pattern Recognition, vol. 36, no. 2, 2003.
[5] L. A. Zadeh, "Fuzzy sets as a basis for a theory of possibility," Fuzzy Sets and Systems, vol. 1, no. 1, pp. 3-28, 1978.
[6] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition, vol. 31, 2010.
[7] S. Ghosh and S. K. Dubey, "Comparative analysis of k-means and fuzzy c-means algorithms," IJACSA, vol. 4, 2013.
[8] S. Cook, CUDA programming: a developer's guide to parallel computing with GPUs. Newnes, 2012.
[9] NVIDIA Corporation, "Tesla k80 accelerator features and benefits," Online, Dec. 2016, [Accessed Jan-2016].
[10] M. Fakirah et al., "Accelerating needleman-wunsch global alignment algorithm with gpus," in AICCSA. IEEE, 2015.
[11] L. Wang, B. Yang, Y. Chen, Z. Chen, and H. Sun, "Accelerating fcm neural network classifier using graphics processing units with cuda," Applied Intelligence, vol. 40, no. 1, 2014.
[12] M. Alandoli et al., "Using dynamic parallelism to speed-up clustering-based community detection in social networks," in FiCloud. IEEE, 2016.
[13] M. Alandoli et al., "Using gpus to speed-up fcm-based community detection in social networks," in CSIT. IEEE, 2016.
[14] M. Zechner and M. Granitzer, "Accelerating k-means on the graphics processor via cuda," in INTENSIVE '09, 2009.
[15] R. Farivar, D. Rebolledo, E. Chan, and R. H. Campbell, "A parallel implementation of k-means clustering on gpus," in PDPTA, vol. 13, no. 2, 2008.
[16] S. Soroushnia et al., "Parallel implementation of fuzzified pattern matching algorithm on gpu," in PDP. IEEE, 2015.
[17] S. Shalom et al., "Efficient k-means clustering using accelerated graphics processors," in DaWaK. Springer, 2008.
[18] S. Shalom et al., "Graphics hardware based efficient and scalable fuzzy c-means clustering," in AusDM, 2008.
[19] H. Li et al., "An improved image segmentation algorithm based on gpu parallel computing," Journal of Software, vol. 9, no. 8, 2014.
[20] Y. Zhuge et al., "Parallel fuzzy connected image segmentation on gpu," Medical Physics, vol. 38, no. 7, 2011.
[21] D. M. Onchis et al., "Multi-phase identification in microstructures images using a gpu accelerated fuzzy c-means segmentation," in SYNASC. IEEE, 2014.
[22] M. Al-Ayyoub et al., "A gpu-based implementations of the fuzzy c-means algorithms for medical image segmentation," The Journal of Supercomputing, vol. 71, no. 8, 2015.
[23] M. Al-Ayyoub et al., "A gpu-based breast cancer detection system using fuzzy c-means clustering algorithm," in ICMCS. IEEE, 2016.
[24] S. AlZu'bi et al., "Parallel implementation of fcm-based volume segmentation of 3d images," in AICCSA. IEEE, 2016.
[25] M. Alsmirat et al., "Accelerating compute intensive medical imaging segmentation algorithms using gpus," Multimedia Tools and Applications (MTAP), 2016, to appear.
[26] T. Kanungo et al., "An efficient k-means clustering algorithm: Analysis and implementation," TPAMI, vol. 24, no. 7, 2002.
[27] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys (CSUR), vol. 31, no. 3, 1999.
[28] C.-H. Chen et al., "Genetic-fuzzy mining with taxonomy," IJUFKS, vol. 20, no. supp2, 2012.


More information

Deep Learning Based Real-time Object Recognition System with Image Web Crawler

Deep Learning Based Real-time Object Recognition System with Image Web Crawler , pp.103-110 http://dx.doi.org/10.14257/astl.2016.142.19 Deep Learning Based Real-time Object Recognition System with Image Web Crawler Myung-jae Lee 1, Hyeok-june Jeong 1, Young-guk Ha 2 1 Department

More information

Accelerating MapReduce on a Coupled CPU-GPU Architecture

Accelerating MapReduce on a Coupled CPU-GPU Architecture Accelerating MapReduce on a Coupled CPU-GPU Architecture Linchuan Chen Xin Huo Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus, OH 43210 {chenlinc,huox,agrawal}@cse.ohio-state.edu

More information

Modern GPUs (Graphics Processing Units)

Modern GPUs (Graphics Processing Units) Modern GPUs (Graphics Processing Units) Powerful data parallel computation platform. High computation density, high memory bandwidth. Relatively low cost. NVIDIA GTX 580 512 cores 1.6 Tera FLOPs 1.5 GB

More information

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University

More information

REPORT DOCUMENTATION PAGE

REPORT DOCUMENTATION PAGE REPORT DOCUMENTATION PAGE Form Approved OMB NO. 0704-0188 The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

New Approach for Graph Algorithms on GPU using CUDA

New Approach for Graph Algorithms on GPU using CUDA New Approach for Graph Algorithms on GPU using CUDA 1 Gunjan Singla, 2 Amrita Tiwari, 3 Dhirendra Pratap Singh Department of Computer Science and Engineering Maulana Azad National Institute of Technology

More information