Accelerating local search algorithms for the travelling salesman problem through the effective use of GPU


Available online at ScienceDirect

Transportation Research Procedia 22 (2017) 409–418

19th EURO Working Group on Transportation Meeting, EWGT2016, 5-7 September 2016, Istanbul, Turkey

Accelerating local search algorithms for the travelling salesman problem through the effective use of GPU

Gizem Ermiş*, Bülent Çatay

Sabanci University, Faculty of Engineering and Natural Sciences, Tuzla, Istanbul 34956, Turkey

Abstract

Graphics processor units (GPUs) are many-core processors that perform better than central processing units (CPUs) on data-parallel, throughput-oriented applications with intense arithmetic operations. Thus, they can considerably reduce the execution time of algorithms by performing a wide range of calculations in a parallel manner. On the other hand, imprecise usage of the GPU may cause a significant loss in performance. This study examines the impact of GPU resource allocations on GPU performance. Our aim is to provide insights about parallelization strategies in CUDA and to propose strategies for utilizing GPU resources effectively. We investigate the parallelization of the 2-opt and 3-opt local search heuristics for solving the travelling salesman problem. We perform an extensive experimental study on instances of various sizes and attempt to determine an effective setting which accelerates the computation the most. We also compare the performance of the GPU against that of the CPU. In addition, we revise the 3-opt implementation strategy presented in the literature for parallelization.

© 2017 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the Scientific Committee of EWGT2016.

Keywords: GPU computing; parallelization; optimization; GPU architecture; travelling salesperson problem.

1. Introduction

With their highly parallel structure, graphics processor units (GPUs) are many-core processors that are specifically designed to perform data-parallel computation.
* Corresponding author. E-mail address: ermisgizem@sabanciuniv.edu

Because of the architectural differences between the

central processing units (CPUs) and GPUs, the GPU performance is better on data-parallel, throughput-oriented applications with intense arithmetic operations. GPUs have the capability to accelerate algorithms that require high computational power. Thus, they can considerably reduce the execution time of such algorithms by performing a wide range of calculations in a parallel manner. Modern GPUs are many-core processors that are specifically designed to perform data-parallel computation. Data parallelism means that each processor performs the same task on different pieces of distributed data (Brodtkorb et al., 2013). Before the evolution of today's advanced GPUs, traditional single-core processors were exploited, and computationally hard tasks took a great deal of time when solved with their help. The computer industry developed faster single-core processors, but they were still insufficient for peak performance. Around the year 2000, by fitting more cores into the same chip, single-core processors evolved into multi-core processors (Fig. 1), which work together to process instructions and thus have a higher total theoretical performance (Brodtkorb et al., 2013).

Fig 1. A basic block diagram of a generic multi-core processor

Driven by the needs of the gaming industry, GPUs, which were already a standard component of common PCs, developed quickly in terms of computational performance. Multi-core GPU processors evolved into massive multi-core or many-core processors which work as massively parallel stream processing accelerators or data-parallel accelerators. Because of the rapid advancements in GPU technology, they became common as accelerators in general purpose programming.
Although both multi-core CPUs and GPUs can implement parallel algorithms, the architectural differences between them have created different usage areas depending upon the nature of the problem. While multi-core CPUs are designed for task-parallel implementations, many-core processors are specifically designed for data-parallel implementations. Efficiency is as critical a factor as solution quality when an algorithm is applied to an optimization problem such as the travelling salesman problem (TSP), a well-known NP-hard combinatorial optimization problem. Local search algorithms such as 2-opt or 3-opt are computationally demanding when implemented on the CPU: because these techniques evaluate all edge exchanges on the tour to determine the exchange that reduces the tour length the most, they require a large number of computations and comparisons. So, parallel implementation can accelerate these computations significantly. GPUs, which have a data-parallel structure, can perform these simple computations in parallel. On the other hand, the design of the parallel implementation plays a crucial role in achieving effective utilization of the GPU resources, and thus in optimizing system performance. CUDA is a parallel computing platform and programming model introduced by NVIDIA. It enables programmers to use GPUs for general purpose processing (Wikipedia, 2016). Van Luong et al. (2009) used the GPU as a coprocessor for extensive computations, where the solutions of the TSP from a given 2-exchange neighborhood are evaluated in parallel and the remaining computations are performed on the CPU. A local search has four main steps: neighborhood generation, evaluation, move selection, and solution update. The simplest method is to create the neighborhood on the CPU and transfer it to the GPU each time. Van Luong et al. (2013) applied this technique; however, it requires copying a lot of information from the CPU to the GPU.
To prevent this drawback, Rocki and Suda (2012) and Schulz (2013) utilized an explicit formula to explore the

neighborhood by assigning one or several moves to a thread. Since the neighborhood evaluation is the most computationally expensive task, it was generally performed on the GPU. O'Neil et al. (2011), Rocki and Suda (2012), and Schulz (2013) presented some implementation details for executing the kernel efficiently; yet, only Schulz (2013) presented a profiling analysis of the implementations. We refer the interested reader to Brodtkorb et al. (2013) for further details on the GPU technology and its applications. Dawson and Stewart (2014) presented the first parallel GPU implementation of an Ant Colony Optimization based edge detection algorithm. They mapped individual ants to warps and executed more ants in each iteration, which reduced the number of iterations needed to generate an edge map. O'Neil and Burtscher (2015) presented random-restart hill climbing with 2-opt local search for the TSP in parallel: they parallelized independent climbs between blocks and 2-opt evaluations between threads within a block. Genetic Algorithms were also successfully implemented on the GPU by Capodieci and Emilia (2015), Sinha et al. (2016), and Kang et al. (2016). Recently, Coelho et al. (2016) applied variable neighborhood search to the single vehicle routing problem with deliveries and selective pickups, where the local search was executed on the GPU. GPU implementations can be difficult because of the distinctive manner of work and the complicated memory structure of the GPU. Imprecise usage of the GPU may cause a significant loss in performance. In this study, we analyze the effect of GPU resource allocations on the GPU performance. Our aim is to provide insights about parallelization strategies in CUDA and to propose strategies for utilizing GPU resources effectively.
Following the approaches of Rocki and Suda (2012), we investigate the parallelization of the 2-opt and 3-opt local search heuristics by allocating the resources of an Nvidia Quadro K600 1 GB GPU device in different ways. We perform an extensive experimental study on TSP instances of various sizes and attempt to determine an effective setting which accelerates the computation the most. We also compare the performance of the GPU against a sequential implementation on an Intel Xeon E5 CPU with 3.30 GHz clock speed. The main contribution of the study is to improve the work of Rocki and Suda (2012) by determining the most effective allocation of GPU resources on large-size problems. In addition, we correct the parallelization formulation that Rocki and Suda (2012) proposed for the implementation of the 3-opt algorithm.

2. Parallelization

2.1. Methodology

The 2-opt algorithm with best improvement calculates the effect of each possible edge exchange on the total cost of the current tour. Among all these possible exchanges, it performs the one that yields the largest improvement, in other words the exchange that decreases the total cost the most. The algorithm is repeated until no further improving exchange exists.

Fig 2. A 2-opt move on the travelling salesman tour

Fig. 2 demonstrates a 2-opt exchange move. To calculate the effect of a 2-opt exchange, two edges are removed from the current tour and the two emerging sub-tours are reconnected at a different position while preserving the validity of the tour. For a TSP instance of n nodes, the number of possible edge exchanges in each iteration is n(n−1)/2.
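As a point of reference, the best-improvement 2-opt scheme described above can be sketched as a sequential routine. This is a minimal illustration only, not the paper's CUDA implementation; the instance, function names, and indexing (removed edges follow positions a and b rather than precede i and j) are our own choices:

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def tour_length(tour, pts):
    """Total length of a closed tour over the given points."""
    n = len(tour)
    return sum(dist(pts[tour[a]], pts[tour[(a + 1) % n]]) for a in range(n))

def two_opt_best_improvement(tour, pts):
    """Repeatedly apply the single best 2-opt exchange until no move improves the tour."""
    n = len(tour)
    improved = True
    while improved:
        improved = False
        best_delta, best_move = -1e-12, None
        for a in range(n - 1):
            for b in range(a + 1, n):
                # change in tour length when edges (t[a],t[a+1]) and (t[b],t[b+1])
                # are replaced by (t[a],t[b]) and (t[a+1],t[b+1])
                delta = (dist(pts[tour[a]], pts[tour[b]])
                         + dist(pts[tour[(a + 1) % n]], pts[tour[(b + 1) % n]])
                         - dist(pts[tour[a]], pts[tour[(a + 1) % n]])
                         - dist(pts[tour[b]], pts[tour[(b + 1) % n]]))
                if delta < best_delta:
                    best_delta, best_move = delta, (a, b)
        if best_move is not None:
            a, b = best_move
            # performing the exchange amounts to reversing the segment between the cuts
            tour[a + 1:b + 1] = tour[a + 1:b + 1][::-1]
            improved = True
    return tour
```

On the four corner points of a unit square, for example, the crossing tour [0, 1, 2, 3] is repaired to the square's perimeter of length 4 in a single best-improvement move.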

A thread on the GPU is a basic element of the data to be processed. Threads map to the stream processors of the GPU. The parallelism is managed by distributing the n(n−1)/2 possible edge exchanges among different threads equally so that each thread can calculate the effects of the relevant exchanges on the tour cost concurrently. Therefore, the following formula is applied to all node combinations in the current tour in parallel:

Δ = d(t[i], t[j]) + d(t[i−1], t[j−1]) − d(t[i−1], t[i]) − d(t[j−1], t[j]),

where d(·) refers to the distance. This formula calculates the change in the tour length by subtracting the distances of the removed edges from the distances of the added edges. i, j, i−1, and j−1 represent the nodes that connect these edges. The most critical part of the algorithm is to develop common formulas for all threads so that they can produce the correct i and j values which parallelize the formula. To obtain the i and j values, they should be related to the ids that derive from predefined variables in CUDA. Predefined variables are built-in variables that return the thread ID of the thread that is executed by its stream processor. An id represents each unique job in the problem, which calculates the result of an edge exchange. Rocki and Suda (2012) proposed the following formulae to produce the i and j values:

i = ⌊(3 + √(8·id + 1)) / 2⌋   (1)

j = 1 + id − (i − 2)(i − 1)/2   (2)

Table 1 provides example data including sample i and j values, the corresponding ids derived from the predefined CUDA variables, and the calculation of the change in the total tour cost.

Table 1.
i and j values related to GPU ids

id  i  j  Δ
0   2  1  d(t[2],t[1]) + d(t[1],t[0]) − d(t[1],t[2]) − d(t[0],t[1])
1   3  1  d(t[3],t[1]) + d(t[2],t[0]) − d(t[2],t[3]) − d(t[0],t[1])
2   3  2  d(t[3],t[2]) + d(t[2],t[1]) − d(t[2],t[3]) − d(t[1],t[2])
3   4  1  d(t[4],t[1]) + d(t[3],t[0]) − d(t[3],t[4]) − d(t[0],t[1])
4   4  2  d(t[4],t[2]) + d(t[3],t[1]) − d(t[3],t[4]) − d(t[1],t[2])
5   4  3  d(t[4],t[3]) + d(t[3],t[2]) − d(t[3],t[4]) − d(t[2],t[3])
6   5  1  d(t[5],t[1]) + d(t[4],t[0]) − d(t[4],t[5]) − d(t[0],t[1])

The only difference in the 3-opt algorithm is that three edges are cut and then reconnected at different places in the tour. In this case, the indices i, j, and k are used to determine the edges. Considering the n(n−1)(n−2)/6 possible edge exchanges, Rocki and Suda (2012) proposed the following formulations to determine the i, j, and k values:

i = ⌊∛(3·id + √(9·id² − 1/27)) + ∛(3·id − √(9·id² − 1/27)) + 1⌋   (3)

j = ⌊(3 + √(8·(id − i(i−1)(i−2)/6) + 1)) / 2⌋   (4)

k = id − i(i−1)(i−2)/6 − j(j−1)/2   (5)
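To illustrate, the id-to-index mapping of formulas (1)-(2) can be sketched in Python. This is illustrative only: a real CUDA kernel would derive id from the built-in block and thread variables, and the integer while-loops are our own safeguard against floating-point rounding at the group boundaries (they are not part of the published formulas):

```python
import math

def two_opt_id_to_ij(job_id):
    """Recover the 2-opt exchange indices (i, j), j < i, from a job id.

    Inverts id = (i-1)(i-2)/2 + (j-1): i comes from formula (1),
    then j from formula (2).  The while-loops pin i to the exact
    triangular-number boundary in case sqrt rounds the wrong way."""
    i = int((3.0 + math.sqrt(8.0 * job_id + 1.0)) / 2.0)
    while (i - 1) * (i - 2) // 2 > job_id:   # estimate overshot
        i -= 1
    while i * (i - 1) // 2 <= job_id:        # estimate undershot
        i += 1
    j = 1 + job_id - (i - 2) * (i - 1) // 2  # formula (2)
    return i, j
```

Running it over consecutive ids reproduces the pairs of Table 1, e.g. id 0 gives (2, 1) and id 6 gives (5, 1), and every id maps to a distinct pair with j < i.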

2.2. Proposed revision for the parallelization of the 3-opt algorithm

The formulations for 3-opt generate false i, j, and k values for certain id values. Table 2 provides examples of the problematic ids and the resulting false i values as well as the expected correct values. For example, for id values within the interval [364,454] i should take the value 14; however, (3) yields i=15 when id=454. Since j and k are calculated using i, their values are also false.

Table 2. Examples of problematic ids giving false i values

id range        Problematic ids           Incorrect i produced by (3)  Correct i obtained by (6)
[364,454]       …                         15                           14
[455,559]       …                         16                           15
[560,679]       …                         17                           16
[680,815]       …                         18                           17
[816,968]       …                         19                           18
[3276,3653]     …                         29                           28
[30856,32508]   …, 32506, 32507, 32508    59                           58

To fix the above problem, we replace (3) with the following formulation:

i = ⌊∛(3·id + √(9·id² − 1/27)) + 1⌋   (6)

The new formulation leads to a deficiency in the value of j: when j is calculated using (4) based on the i given by (6), i and j can take the same value, which does not correspond to an exchange. To overcome this shortcoming, we propose the following correction: if i=j, then set i=i+1 and recalculate j using (4); k is still calculated using (5). Table 3 shows a set of sample i, j, and k values calculated by (6), (4), and (5), respectively, and their final values after the correction. The underlined values highlight the problematic ids.

Table 3. Sample data showing the implementation of the revised formulation and correction

id | i, j, k calculated using (6), (4), (5) | i, j, k after correction

3. Effective usage of GPU resources

Imprecise usage of the GPU causes a considerable decrease in the performance of the algorithm. For this reason, we test different configurations in order to draw conclusions about the optimal usage of GPU resources. Due to the resource restrictions of our GPU device, an Nvidia Quadro K600, and the occupancy of its streaming multiprocessor (SM), our experimental results are specific to that device.
However, they also provide insights for strategies to use GPUs effectively.
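As a sketch of the revised 3-opt mapping discussed above, the decomposition id = i(i−1)(i−2)/6 + j(j−1)/2 + k can be inverted with a cube-root estimate followed by an explicit integer adjustment. This is our own variant: the adjustment loops play the role of the closed-form i = j correction, and the index convention 0 ≤ k < j < i is an assumption that may differ slightly from the paper's:

```python
import math

def three_opt_id_to_ijk(job_id):
    """Recover 3-opt indices (i, j, k) with 0 <= k < j < i from a job id.

    The cube root gives a first estimate of i (cf. formulas (3)/(6));
    the while-loops then pin i and j to the exact integer boundaries,
    taking the place of the closed-form i = j correction."""
    i = round((6.0 * job_id) ** (1.0 / 3.0)) + 1
    while i * (i - 1) * (i - 2) // 6 > job_id:      # estimate overshot
        i -= 1
    while (i + 1) * i * (i - 1) // 6 <= job_id:     # estimate undershot
        i += 1
    rem = job_id - i * (i - 1) * (i - 2) // 6
    j = int((1.0 + math.sqrt(8.0 * rem + 1.0)) / 2.0)
    while j * (j - 1) // 2 > rem:
        j -= 1
    while (j + 1) * j // 2 <= rem:
        j += 1
    k = rem - j * (j - 1) // 2                      # cf. formula (5)
    return i, j, k
```

This version reproduces the boundary cases of Table 2: id 454 yields i = 14, id 455 yields i = 15, and every id in a contiguous range maps to a distinct valid triple.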

Fig 3. Parallelization (a) Iteration level parallelism, (b) Thread level parallelism

During parallelization, it is important to combine thread and iteration level parallelism. In our case, one thread can perform several jobs, which exploits iteration level parallelism (Fig. 3a), and different threads can perform different jobs in parallel, which utilizes thread level parallelism (Fig. 3b).

3.1. Experimental design

The number of jobs that will be performed by a thread can be calculated by dividing the number of possible edge exchanges by the total number of threads in the GPU, which is denoted by t. The number of jobs for each thread is referred to as iterations, as it represents iteration level parallelism. So, iterations = n(n−1)/(2·t). This basically means that each thread will process iterations times to complete its assigned jobs. We analyze different combinations of thread and iteration level parallelism to investigate the effective allocation of resources. The GPU has a grid structure which contains several blocks, and each block contains several threads. In CUDA, each block is executed as warps, and a warp is a group of 32 threads.
- The occupancy is calculated by dividing the number of active (busy) warps in the SM by the number of warps supported by the SM.
- The Quadro K600 has one streaming multiprocessor. Its SM supports a maximum of 2048 resident threads and 16 resident blocks.
- As each warp consists of 32 threads, an SM has a maximum of 64 (2048/32) resident warps.

1 SM = 2048 active threads = 64 active warps:
- 2 blocks × 1024 threads (32 warps per block)
- 4 blocks × 512 threads (16 warps per block)
- 8 blocks × 256 threads (8 warps per block)
- 16 blocks × 128 threads (4 warps per block)

Fig 4. Configuration settings to utilize all warps in SM

Each row in Fig. 4 shows one of four different configuration settings in the SM.
In this figure, B, T, and W represent block, thread, and warp, respectively. Each row can be considered as a grid configuration for the SM. For example, in the first setting there are 2 blocks in the grid and 1024 threads, corresponding to 32 warps, in each block. In the second, there are 4 blocks in the grid and 512 threads, with 16 warps, in each block. In each setting all possible active warps (64 warps) in the SM are utilized, and the block dimensions are arranged as multiples of the warp size because each block is executed as warps. Since all the warps are used, the occupancy is 100% in these settings. Under normal circumstances, 100% occupancy is achieved through the settings in Fig. 4. However, other resource restrictions, such as the maximum number of registers and the shared memory limit per SM, may prevent all the warps from being utilized. As a result, 100% occupancy may not be achievable. On the Quadro K600, the total number of registers per SM is 65,536 and the shared memory per SM is 49,152 bytes.
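The division of jobs among threads can be illustrated with a small helper. This is a sketch only: a real CUDA kernel would compute its id from the built-in thread and block variables, and the stride pattern shown is one common way to realize iteration-level parallelism:

```python
def thread_jobs(thread_id, total_threads, total_jobs):
    """Job ids handled by one thread when the exchange evaluations are
    dealt out in strides of the total thread count: thread t processes
    jobs t, t + T, t + 2T, ... so the work is split almost equally."""
    return list(range(thread_id, total_jobs, total_threads))
```

With the 2048 resident threads of the SM and the 125,250 jobs of the 500-node instance discussed below, every job is covered exactly once and each thread performs at most 62 iterations.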

3.2. Analysis

In this section, we investigate the performance of different resource allocation settings on TSP instances of different sizes. Table 4 shows the results on a TSP instance consisting of 500 nodes. Block Dim and Grid Dim refer to the block and grid dimensions, respectively, and the column Time shows the time required for all possible edge exchange calculations in a tour, in milliseconds.

Table 4. Observed best kernel launches for an instance with 500 nodes

Block Dim | Grid Dim | Time (ms)

A total of 125,250 edge exchange calculation jobs are distributed among the threads in different ways. The block dimensions are arranged as multiples of the warp size 32. The configurations belonging to the same block dimension form a group, so we have four test groups with block dimensions of 1024, 512, 256, and 128. For each block dimension, we tested different [grid dimension, iterations] combinations (see the Appendix for details). Table 4 reports only the best performing combination. We observe that in the first three groups the best performances occur when all warps are active, which means that shared memory and registers do not restrict the system. However, in the last group, only 9 blocks out of 16 are used when the best performance is achieved.

Table 5. Restrictions of shared memory and registers for an instance with 500 nodes

Block Dim | Grid Dim (all warps used) | Max active blocks allowed by registers | Max active blocks allowed by shared memory | Grid Dim after restrictions | Occupancy (%)
1024 | 2  | 65536/20480 = 3 > 2   | 49152/5012 = 9 > 2  | 2 | 100
512  | 4  | 65536/10240 = 6 > 4   | 49152/5012 = 9 > 4  | 4 | 100
256  | 8  | 65536/5120 = 12 > 8   | 49152/5012 = 9 > 8  | 8 | 100
128  | 16 | 65536/2560 = 25 > 16  | 49152/5012 = 9 < 16 | 9 | 56

In the instance with 500 nodes, the shared memory usage per block is 5012 bytes. The maximum shared memory size per SM is 49,152 bytes. The shared memory thus allows 9 active blocks (49152/5012), which is greater than 2.
So, the shared memory capacity of the SM does not restrict the usage of all active warps. For the first three groups, the best performances are achieved when all warps are active, i.e. when the occupancy is 100%. However, the shared memory capacity prevents the usage of all warps in the last group: although the shared memory still allows 9 active blocks, 16 active blocks are required to exploit all warps in this case. In other words, the SM has enough memory for only 9 blocks. In this group, (128/32) × 9 = 36 warps out of 64 can be utilized by launching 9 blocks with 128 threads in each block. Hence, the occupancy is 56% and the run time performance is the worst. Nevertheless, among all combinations in this group, the best performance is achieved when all the warps allowed by the shared memory and registers are used. Table 6 summarizes the results for problems of different sizes. We observe that the run time performance declines with decreasing occupancy rate.
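The occupancy arithmetic above can be reproduced with a small helper. The Quadro K600's Kepler-class per-SM limits are hard-coded; the figure of 20 registers per thread used in the examples is an assumption for illustration, consistent with the per-block totals in Table 5:

```python
def sm_occupancy(block_threads, regs_per_thread, smem_per_block):
    """Resident blocks and occupancy (%) of a single Kepler-class SM
    (2048 resident threads, 16 resident blocks, 64 resident warps,
    65536 registers, 49152 bytes of shared memory).

    The number of resident blocks is the tightest of the four limits;
    occupancy = active warps / maximum resident warps."""
    MAX_THREADS, MAX_BLOCKS, MAX_WARPS = 2048, 16, 64
    MAX_REGS, MAX_SMEM = 65536, 49152
    blocks = min(MAX_BLOCKS,
                 MAX_THREADS // block_threads,
                 MAX_REGS // (regs_per_thread * block_threads),
                 MAX_SMEM // smem_per_block)
    warps = blocks * (block_threads // 32)
    return blocks, 100 * warps // MAX_WARPS
```

With 5012 bytes of shared memory per block, 256-thread blocks still reach full occupancy, while 128-thread blocks are capped at 9 resident blocks, i.e. 36 of 64 warps (56%), matching the last group of Table 5.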

Table 6. Performances for instances of different sizes

Nodes | Block Size | Occupancy (%) | Time (ms)

4. Computational results

We tested the best performing GPU configuration on TSP instances with different sizes. Table 7 compares the performance of the sequential 2-opt algorithm implemented on the CPU with that of the parallel implementation on the GPU. The initial solution is constructed naively using the sequential ordering of the nodes. The results show that the parallel 2-opt algorithm runs faster than the sequential one, and the difference in speed becomes significant with growing problem size. For example, the GPU performs 14 times faster on the instance with 4000 nodes. On the other hand, it does not necessarily yield the best solution.

Table 7. Comparison of sequential and parallel 2-opt implementations

Nodes | CPU Run Time (ms) | CPU Tour Length | GPU Run Time (ms) | GPU Tour Length

Applying the local search algorithm starting with a good initial solution decreases the number of iterations and shortens the total run time. So, we applied the nearest neighbor (NN) method to obtain an improved initial solution compared to the naive approach above. NN starts from an arbitrary node and builds the tour by repeatedly moving to the closest not-yet-visited node. In this case, we also tested larger instances with 6000 and 9000 nodes. The results are reported in Table 8. We observe that NN enables the local search to converge to better solutions in less time, in line with expectations. Note that NN speeds up the algorithm by 20.4% on the 9000-node instance whereas the speed-up is only 4.4% for the 500-node instance. This difference is due to the fact that NN is performed on the CPU, which dramatically increases the share of CPU time in the total run time for smaller problems.
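The nearest neighbor construction is straightforward to sketch. Ours breaks distance ties by node index, a detail the description above leaves open:

```python
import math

def nearest_neighbor_tour(pts, start=0):
    """Build a tour greedily: from the current node, always move to the
    closest not-yet-visited node (ties broken by smaller index)."""
    unvisited = set(range(len(pts)))
    tour = [start]
    unvisited.remove(start)
    while unvisited:
        cur = tour[-1]
        nxt = min(unvisited, key=lambda v: (math.dist(pts[cur], pts[v]), v))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour
```

On the unit-square example, starting from the origin the greedy tour already traces the perimeter, so the subsequent local search has less work to do, which is exactly the effect observed in Table 8.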

Table 8. Algorithm performance with naive and nearest neighbor initial solutions

Nodes | Naive: Tour Length | Naive: Run Time (ms) | NN: Tour Length | NN: Run Time (ms)

5. Conclusion

In this study, we investigated parallelization strategies for utilizing GPU resources effectively. Considering the 2-opt local search parallelization approach presented in the literature, we performed an extensive performance analysis to configure the kernel parameters. Our experiments showed that the best strategy for peak performance is to utilize as many resident warps as the shared memory and register limits of the device allow, and to perform the remaining jobs through iteration level parallelism. We should keep the warps in the device as busy as possible, which provides thread level parallelism; the remaining jobs should then be distributed equally among the launched threads. In other words, one thread will perform more than one job, which adds iteration level parallelism. Since the occupancy decreases in larger problems, the iteration level parallelism should be kept at its maximum. Future research on this topic may focus on parallelization strategies for other local search and metaheuristic methods. These methods are widely applied to various combinatorial optimization problems and usually require long computation times for good convergence, so problem-specific GPU implementations are needed to enhance their performance, where the parts of the algorithm requiring intense computation may be handled by GPU-specific functions.

Acknowledgment

This research was partially supported by the Scientific and Technological Research Council of Turkey through Grant #113M522 to the second author.

Appendix A.
Detailed experimental results of the 500-node TSP instance

We report our experimental results for the four test groups with block dimensions of 1024, 512, 256, and 128. Table A.1 gives the performances of the different resource allocation settings on the TSP instance with 500 nodes. The rows in bold indicate the best result within each group. Block Dimension and Grid Dimension refer to the number of threads in a block and the number of blocks in a grid, respectively. Iterations is the number of possible 2-opt edge exchange calculations performed by a thread. GPU Time is the time spent calculating all 125,250 edge exchanges. 2-opt shows the total number of 2-opt edge exchanges performed. CPU+GPU Time is the total time elapsed from the start of the algorithm until the last exchange has been performed.

Table A.1. Detailed results of the 2-opt experiments on the 500-node TSP instance

Block Dimension | Grid Dimension | Iterations | GPU Time (ms) | Tour Length | 2-opt | CPU+GPU Time (ms)

References

Brodtkorb, A. R., Hagen, T. R., Schulz, C., Hasle, G., 2013. GPU computing in discrete optimization. Part I: Introduction to the GPU. EURO Journal on Transportation and Logistics 2.
Capodieci, N., Emilia, R., Burgio, P., 2015. Efficient implementation of genetic algorithms on GP-GPU with scheduled persistent CUDA threads. In: Proceedings of the 7th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP).
Coelho, I. M., Ochi, L. S., Munhoz, P. L. A., Souza, M. J. F., Farias, R., Bentes, C., 2016. An integrated CPU–GPU heuristic inspired on variable neighborhood search for the single vehicle routing problem with deliveries and selective pickups. International Journal of Production Research 54.
Dawson, L., Stewart, I. A., 2014. Optimization-based edge detection on the GPU using CUDA. In: Proceedings of the IEEE Congress on Evolutionary Computation (CEC).
Janiak, A., Janiak, W. A., Lichtenstein, M., 2008. Tabu search on GPU. Journal of Universal Computer Science 14, 2416–2427.
Kang, S., Kim, S., Won, J., Kang, Y., 2016. GPU-based parallel genetic approach to large-scale travelling salesman problem. The Journal of Supercomputing, 1–16 (available online).
O'Neil, M. A., Burtscher, M., 2015. Rethinking the parallelization of random-restart hill climbing: a case study in optimizing a 2-opt TSP solver for GPU execution. In: Proceedings of the 8th Workshop on General Purpose Processing using GPUs.
O'Neil, M. A., Tamir, D., Burtscher, M., 2011. A parallel GPU version of the traveling salesman problem. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications.
Rocki, K., Suda, R., 2012. Accelerating 2-opt and 3-opt local search using GPU in the travelling salesman problem. In: Proceedings of the International Conference on High Performance Computing and Simulation (HPCS).
Schulz, C., 2013. Efficient local search on the GPU – investigations on the vehicle routing problem. Journal of Parallel and Distributed Computing 73.
Sinha, R. S., Singh, S., Singh, S., Banga, V. K., 2016. Accelerating genetic algorithm using general purpose GPU and CUDA. International Journal of Computer Graphics 7.
Van Luong, T., Melab, N., Talbi, E. G., 2009. Parallel local search on GPU. Research Report RR-6915, INRIA.
Van Luong, T., Melab, N., Talbi, E. G., 2013. GPU computing for parallel local search metaheuristic algorithms. IEEE Transactions on Computers 62.
Wikipedia, 2016. CUDA. (last accessed on May 5, 2016).

Solving Traveling Salesman Problem on High Performance Computing using Message Passing Interface

Solving Traveling Salesman Problem on High Performance Computing using Message Passing Interface Solving Traveling Salesman Problem on High Performance Computing using Message Passing Interface IZZATDIN A. AZIZ, NAZLEENI HARON, MAZLINA MEHAT, LOW TAN JUNG, AISYAH NABILAH Computer and Information Sciences

More information

ACCELERATING THE ANT COLONY OPTIMIZATION

ACCELERATING THE ANT COLONY OPTIMIZATION ACCELERATING THE ANT COLONY OPTIMIZATION BY SMART ANTS, USING GENETIC OPERATOR Hassan Ismkhan Department of Computer Engineering, University of Bonab, Bonab, East Azerbaijan, Iran H.Ismkhan@bonabu.ac.ir

More information

A Parallel GPU Version of the Traveling Salesman Problem

A Parallel GPU Version of the Traveling Salesman Problem A Parallel GPU Version of the Traveling Salesman Problem Molly A. O Neil, Dan Tamir, and Martin Burtscher Department of Computer Science, Texas State University, San Marcos, TX Abstract - This paper describes

More information

Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms

Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms Liang Men, Miaoqing Huang, John Gauch Department of Computer Science and Computer Engineering University of Arkansas {mliang,mqhuang,jgauch}@uark.edu

More information

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini Metaheuristic Development Methodology Fall 2009 Instructor: Dr. Masoud Yaghini Phases and Steps Phases and Steps Phase 1: Understanding Problem Step 1: State the Problem Step 2: Review of Existing Solution

More information

Hybrid Differential Evolution Algorithm for Traveling Salesman Problem

Hybrid Differential Evolution Algorithm for Traveling Salesman Problem Available online at www.sciencedirect.com Procedia Engineering 15 (2011) 2716 2720 Advanced in Control Engineeringand Information Science Hybrid Differential Evolution Algorithm for Traveling Salesman

More information

PARTICLE Swarm Optimization (PSO), an algorithm by

PARTICLE Swarm Optimization (PSO), an algorithm by , March 12-14, 2014, Hong Kong Cluster-based Particle Swarm Algorithm for Solving the Mastermind Problem Dan Partynski Abstract In this paper we present a metaheuristic algorithm that is inspired by Particle

More information

Modified Order Crossover (OX) Operator

Modified Order Crossover (OX) Operator Modified Order Crossover (OX) Operator Ms. Monica Sehrawat 1 N.C. College of Engineering, Israna Panipat, Haryana, INDIA. Mr. Sukhvir Singh 2 N.C. College of Engineering, Israna Panipat, Haryana, INDIA.

More information

Multiple Depot Vehicle Routing Problems on Clustering Algorithms

Multiple Depot Vehicle Routing Problems on Clustering Algorithms Thai Journal of Mathematics : 205 216 Special Issue: Annual Meeting in Mathematics 2017 http://thaijmath.in.cmu.ac.th ISSN 1686-0209 Multiple Depot Vehicle Routing Problems on Clustering Algorithms Kanokon

More information

A heuristic approach to find the global optimum of function

A heuristic approach to find the global optimum of function Journal of Computational and Applied Mathematics 209 (2007) 160 166 www.elsevier.com/locate/cam A heuristic approach to find the global optimum of function M. Duran Toksarı Engineering Faculty, Industrial

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte

More information

Center for Computational Science

Center for Computational Science Center for Computational Science Toward GPU-accelerated meshfree fluids simulation using the fast multipole method Lorena A Barba Boston University Department of Mechanical Engineering with: Felipe Cruz,

More information

Modeling the Component Pickup and Placement Sequencing Problem with Nozzle Assignment in a Chip Mounting Machine

Modeling the Component Pickup and Placement Sequencing Problem with Nozzle Assignment in a Chip Mounting Machine Modeling the Component Pickup and Placement Sequencing Problem with Nozzle Assignment in a Chip Mounting Machine Hiroaki Konishi, Hidenori Ohta and Mario Nakamori Department of Information and Computer

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Rajesh Bordawekar IBM T. J. Watson Research Center bordaw@us.ibm.com Pidad D Souza IBM Systems pidsouza@in.ibm.com 1 Outline

More information

Using Genetic Algorithm with Triple Crossover to Solve Travelling Salesman Problem

Using Genetic Algorithm with Triple Crossover to Solve Travelling Salesman Problem Proc. 1 st International Conference on Machine Learning and Data Engineering (icmlde2017) 20-22 Nov 2017, Sydney, Australia ISBN: 978-0-6480147-3-7 Using Genetic Algorithm with Triple Crossover to Solve

More information

Massively Parallel Approximation Algorithms for the Traveling Salesman Problem

Massively Parallel Approximation Algorithms for the Traveling Salesman Problem Massively Parallel Approximation Algorithms for the Traveling Salesman Problem Vaibhav Gandhi May 14, 2015 Abstract This paper introduces the reader to massively parallel approximation algorithms which

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis Abstract: Lower upper (LU) factorization for sparse matrices is the most important computing step for circuit simulation

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

ANANT COLONY SYSTEMFOR ROUTING IN PCB HOLES DRILLING PROCESS

ANANT COLONY SYSTEMFOR ROUTING IN PCB HOLES DRILLING PROCESS International Journal of Innovative Management, Information & Production ISME International c2013 ISSN 2185-5439 Volume 4, 1, June 2013 PP. 50-56 ANANT COLONY SYSTEMFOR ROUTING IN PCB HOLES DRILLING PROCESS

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Machine Learning for Software Engineering

Machine Learning for Software Engineering Machine Learning for Software Engineering Introduction and Motivation Prof. Dr.-Ing. Norbert Siegmund Intelligent Software Systems 1 2 Organizational Stuff Lectures: Tuesday 11:00 12:30 in room SR015 Cover

More information

Solving the Traveling Salesman Problem using Reinforced Ant Colony Optimization techniques

Solving the Traveling Salesman Problem using Reinforced Ant Colony Optimization techniques Solving the Traveling Salesman Problem using Reinforced Ant Colony Optimization techniques N.N.Poddar 1, D. Kaur 2 1 Electrical Engineering and Computer Science, University of Toledo, Toledo, OH, USA 2

More information

ACO and other (meta)heuristics for CO

ACO and other (meta)heuristics for CO ACO and other (meta)heuristics for CO 32 33 Outline Notes on combinatorial optimization and algorithmic complexity Construction and modification metaheuristics: two complementary ways of searching a solution

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Hybrid ant colony optimization algorithm for two echelon vehicle routing problem

Hybrid ant colony optimization algorithm for two echelon vehicle routing problem Available online at www.sciencedirect.com Procedia Engineering 15 (2011) 3361 3365 Advanced in Control Engineering and Information Science Hybrid ant colony optimization algorithm for two echelon vehicle

More information

Fuzzy Inspired Hybrid Genetic Approach to Optimize Travelling Salesman Problem

Fuzzy Inspired Hybrid Genetic Approach to Optimize Travelling Salesman Problem Fuzzy Inspired Hybrid Genetic Approach to Optimize Travelling Salesman Problem Bindu Student, JMIT Radaur binduaahuja@gmail.com Mrs. Pinki Tanwar Asstt. Prof, CSE, JMIT Radaur pinki.tanwar@gmail.com Abstract

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

GPU 101. Mike Bailey. Oregon State University. Oregon State University. Computer Graphics gpu101.pptx. mjb April 23, 2017

GPU 101. Mike Bailey. Oregon State University. Oregon State University. Computer Graphics gpu101.pptx. mjb April 23, 2017 1 GPU 101 Mike Bailey mjb@cs.oregonstate.edu gpu101.pptx Why do we care about GPU Programming? A History of GPU Performance vs. CPU Performance 2 Source: NVIDIA How Can You Gain Access to GPU Power? 3

More information

GPU 101. Mike Bailey. Oregon State University

GPU 101. Mike Bailey. Oregon State University 1 GPU 101 Mike Bailey mjb@cs.oregonstate.edu gpu101.pptx Why do we care about GPU Programming? A History of GPU Performance vs. CPU Performance 2 Source: NVIDIA 1 How Can You Gain Access to GPU Power?

More information

The study of comparisons of three crossover operators in genetic algorithm for solving single machine scheduling problem. Quan OuYang, Hongyun XU a*

The study of comparisons of three crossover operators in genetic algorithm for solving single machine scheduling problem. Quan OuYang, Hongyun XU a* International Conference on Manufacturing Science and Engineering (ICMSE 2015) The study of comparisons of three crossover operators in genetic algorithm for solving single machine scheduling problem Quan

More information

O(1) Delta Component Computation Technique for the Quadratic Assignment Problem

O(1) Delta Component Computation Technique for the Quadratic Assignment Problem O(1) Delta Component Computation Technique for the Quadratic Assignment Problem Sergey Podolsky, Yuri Zorin National Technical University of Ukraine Kyiv Polytechnic Institute Faculty of Applied Mathematics

More information

A Modified Inertial Method for Loop-free Decomposition of Acyclic Directed Graphs

A Modified Inertial Method for Loop-free Decomposition of Acyclic Directed Graphs MACRo 2015-5 th International Conference on Recent Achievements in Mechatronics, Automation, Computer Science and Robotics A Modified Inertial Method for Loop-free Decomposition of Acyclic Directed Graphs

More information

LECTURE 20: SWARM INTELLIGENCE 6 / ANT COLONY OPTIMIZATION 2

LECTURE 20: SWARM INTELLIGENCE 6 / ANT COLONY OPTIMIZATION 2 15-382 COLLECTIVE INTELLIGENCE - S18 LECTURE 20: SWARM INTELLIGENCE 6 / ANT COLONY OPTIMIZATION 2 INSTRUCTOR: GIANNI A. DI CARO ANT-ROUTING TABLE: COMBINING PHEROMONE AND HEURISTIC 2 STATE-TRANSITION:

More information

ARTIFICIAL INTELLIGENCE (CSCU9YE ) LECTURE 5: EVOLUTIONARY ALGORITHMS

ARTIFICIAL INTELLIGENCE (CSCU9YE ) LECTURE 5: EVOLUTIONARY ALGORITHMS ARTIFICIAL INTELLIGENCE (CSCU9YE ) LECTURE 5: EVOLUTIONARY ALGORITHMS Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Optimisation problems Optimisation & search Two Examples The knapsack problem

More information

Mathematical Methods in Fluid Dynamics and Simulation of Giant Oil and Gas Reservoirs. 3-5 September 2012 Swissotel The Bosphorus, Istanbul, Turkey

Mathematical Methods in Fluid Dynamics and Simulation of Giant Oil and Gas Reservoirs. 3-5 September 2012 Swissotel The Bosphorus, Istanbul, Turkey Mathematical Methods in Fluid Dynamics and Simulation of Giant Oil and Gas Reservoirs 3-5 September 2012 Swissotel The Bosphorus, Istanbul, Turkey Fast and robust solvers for pressure systems on the GPU

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Obstacle-Aware Longest-Path Routing with Parallel MILP Solvers

Obstacle-Aware Longest-Path Routing with Parallel MILP Solvers , October 20-22, 2010, San Francisco, USA Obstacle-Aware Longest-Path Routing with Parallel MILP Solvers I-Lun Tseng, Member, IAENG, Huan-Wen Chen, and Che-I Lee Abstract Longest-path routing problems,

More information

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics

More information

Navigation of Multiple Mobile Robots Using Swarm Intelligence

Navigation of Multiple Mobile Robots Using Swarm Intelligence Navigation of Multiple Mobile Robots Using Swarm Intelligence Dayal R. Parhi National Institute of Technology, Rourkela, India E-mail: dayalparhi@yahoo.com Jayanta Kumar Pothal National Institute of Technology,

More information

A NEW HEURISTIC ALGORITHM FOR MULTIPLE TRAVELING SALESMAN PROBLEM

A NEW HEURISTIC ALGORITHM FOR MULTIPLE TRAVELING SALESMAN PROBLEM TWMS J. App. Eng. Math. V.7, N.1, 2017, pp. 101-109 A NEW HEURISTIC ALGORITHM FOR MULTIPLE TRAVELING SALESMAN PROBLEM F. NURIYEVA 1, G. KIZILATES 2, Abstract. The Multiple Traveling Salesman Problem (mtsp)

More information

Exploring Lin Kernighan neighborhoods for the indexing problem

Exploring Lin Kernighan neighborhoods for the indexing problem INDIAN INSTITUTE OF MANAGEMENT AHMEDABAD INDIA Exploring Lin Kernighan neighborhoods for the indexing problem Diptesh Ghosh W.P. No. 2016-02-13 February 2016 The main objective of the Working Paper series

More information

GPU-based Multi-start Local Search Algorithms

GPU-based Multi-start Local Search Algorithms GPU-based Multi-start Local Search Algorithms Thé Van Luong, Nouredine Melab, El-Ghazali Talbi To cite this version: Thé Van Luong, Nouredine Melab, El-Ghazali Talbi. GPU-based Multi-start Local Search

More information

Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011

Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for

More information

Innovative Systems Design and Engineering ISSN (Paper) ISSN (Online) Vol.5, No.1, 2014

Innovative Systems Design and Engineering ISSN (Paper) ISSN (Online) Vol.5, No.1, 2014 Abstract Tool Path Optimization of Drilling Sequence in CNC Machine Using Genetic Algorithm Prof. Dr. Nabeel Kadim Abid Al-Sahib 1, Hasan Fahad Abdulrazzaq 2* 1. Thi-Qar University, Al-Jadriya, Baghdad,

More information

Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm. Santos and Mateus (2007)

Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm. Santos and Mateus (2007) In the name of God Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm Spring 2009 Instructor: Dr. Masoud Yaghini Outlines Problem Definition Modeling As A Set Partitioning

More information

Stefan WAGNER *, Michael AFFENZELLER * HEURISTICLAB GRID A FLEXIBLE AND EXTENSIBLE ENVIRONMENT 1. INTRODUCTION

Stefan WAGNER *, Michael AFFENZELLER * HEURISTICLAB GRID A FLEXIBLE AND EXTENSIBLE ENVIRONMENT 1. INTRODUCTION heuristic optimization, distributed computation, optimization frameworks Stefan WAGNER *, Michael AFFENZELLER * HEURISTICLAB GRID A FLEXIBLE AND EXTENSIBLE ENVIRONMENT FOR PARALLEL HEURISTIC OPTIMIZATION

More information

I. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS

I. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS Performance Analysis of Java NativeThread and NativePthread on Win32 Platform Bala Dhandayuthapani Veerasamy Research Scholar Manonmaniam Sundaranar University Tirunelveli, Tamilnadu, India dhanssoft@gmail.com

More information

XLVI Pesquisa Operacional na Gestão da Segurança Pública

XLVI Pesquisa Operacional na Gestão da Segurança Pública PARALLEL CONSTRUCTION FOR CONTINUOUS GRASP OPTIMIZATION ON GPUs Lisieux Marie Marinho dos Santos Andrade Centro de Informática Universidade Federal da Paraíba Campus I, Cidade Universitária 58059-900,

More information

SCALING UP VS. SCALING OUT IN A QLIKVIEW ENVIRONMENT

SCALING UP VS. SCALING OUT IN A QLIKVIEW ENVIRONMENT SCALING UP VS. SCALING OUT IN A QLIKVIEW ENVIRONMENT QlikView Technical Brief February 2012 qlikview.com Introduction When it comes to the enterprise Business Discovery environments, the ability of the

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation

More information

Enhanced ABC Algorithm for Optimization of Multiple Traveling Salesman Problem

Enhanced ABC Algorithm for Optimization of Multiple Traveling Salesman Problem I J C T A, 9(3), 2016, pp. 1647-1656 International Science Press Enhanced ABC Algorithm for Optimization of Multiple Traveling Salesman Problem P. Shunmugapriya 1, S. Kanmani 2, R. Hemalatha 3, D. Lahari

More information

An Ant Approach to the Flow Shop Problem

An Ant Approach to the Flow Shop Problem An Ant Approach to the Flow Shop Problem Thomas Stützle TU Darmstadt, Computer Science Department Alexanderstr. 10, 64283 Darmstadt Phone: +49-6151-166651, Fax +49-6151-165326 email: stuetzle@informatik.tu-darmstadt.de

More information

Simultaneous Solving of Linear Programming Problems in GPU

Simultaneous Solving of Linear Programming Problems in GPU Simultaneous Solving of Linear Programming Problems in GPU Amit Gurung* amitgurung@nitm.ac.in Binayak Das* binayak89cse@gmail.com Rajarshi Ray* raj.ray84@gmail.com * National Institute of Technology Meghalaya

More information

Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU

Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU Ryouhei Murooka, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University Kagamiyama 1-4-1,

More information

Dense matching GPU implementation

Dense matching GPU implementation Dense matching GPU implementation Author: Hailong Fu. Supervisor: Prof. Dr.-Ing. Norbert Haala, Dipl. -Ing. Mathias Rothermel. Universität Stuttgart 1. Introduction Correspondence problem is an important

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Rethinking the Parallelization of Random-Restart Hill Climbing

Rethinking the Parallelization of Random-Restart Hill Climbing Rethinking the Parallelization of Random-Restart Hill Climbing A Case Study in Optimizing a 2-Opt TSP Solver for GPU Execution Molly A. O Neil Department of Computer Science Texas State University San

More information

Hybrid Constraint Programming and Metaheuristic methods for Large Scale Optimization Problems

Hybrid Constraint Programming and Metaheuristic methods for Large Scale Optimization Problems Hybrid Constraint Programming and Metaheuristic methods for Large Scale Optimization Problems Fabio Parisini Tutor: Paola Mello Co-tutor: Michela Milano Final seminars of the XXIII cycle of the doctorate

More information

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational

More information

LOW AND HIGH LEVEL HYBRIDIZATION OF ANT COLONY SYSTEM AND GENETIC ALGORITHM FOR JOB SCHEDULING IN GRID COMPUTING

LOW AND HIGH LEVEL HYBRIDIZATION OF ANT COLONY SYSTEM AND GENETIC ALGORITHM FOR JOB SCHEDULING IN GRID COMPUTING LOW AND HIGH LEVEL HYBRIDIZATION OF ANT COLONY SYSTEM AND GENETIC ALGORITHM FOR JOB SCHEDULING IN GRID COMPUTING Mustafa Muwafak Alobaedy 1, and Ku Ruhana Ku-Mahamud 2 2 Universiti Utara Malaysia), Malaysia,

More information

Evaluation Of The Performance Of GPU Global Memory Coalescing

Evaluation Of The Performance Of GPU Global Memory Coalescing Evaluation Of The Performance Of GPU Global Memory Coalescing Dae-Hwan Kim Department of Computer and Information, Suwon Science College, 288 Seja-ro, Jeongnam-myun, Hwaseong-si, Gyeonggi-do, Rep. of Korea

More information

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs)

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Hasindu Gamaarachchi, Roshan Ragel Department of Computer Engineering University of Peradeniya Peradeniya, Sri Lanka hasindu8@gmailcom,

More information

Solving a combinatorial problem using a local optimization in ant based system

Solving a combinatorial problem using a local optimization in ant based system Solving a combinatorial problem using a local optimization in ant based system C-M.Pintea and D.Dumitrescu Babeş-Bolyai University of Cluj-Napoca, Department of Computer-Science Kogalniceanu 1, 400084

More information

Handling Multi Objectives of with Multi Objective Dynamic Particle Swarm Optimization

Handling Multi Objectives of with Multi Objective Dynamic Particle Swarm Optimization Handling Multi Objectives of with Multi Objective Dynamic Particle Swarm Optimization Richa Agnihotri #1, Dr. Shikha Agrawal #1, Dr. Rajeev Pandey #1 # Department of Computer Science Engineering, UIT,

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

A robust enhancement to the Clarke-Wright savings algorithm

A robust enhancement to the Clarke-Wright savings algorithm A robust enhancement to the Clarke-Wright savings algorithm Tamer Doyuran * and Bülent Çatay Sabanci University, Faculty of Engineering and Natural Sciences Tuzla, Istanbul, 34956, Turkey Abstract: We

More information

Parallel Evaluation of Hopfield Neural Networks

Parallel Evaluation of Hopfield Neural Networks Parallel Evaluation of Hopfield Neural Networks Antoine Eiche, Daniel Chillet, Sebastien Pillement and Olivier Sentieys University of Rennes I / IRISA / INRIA 6 rue de Kerampont, BP 818 2232 LANNION,FRANCE

More information

Metaheuristic Optimization with Evolver, Genocop and OptQuest

Metaheuristic Optimization with Evolver, Genocop and OptQuest Metaheuristic Optimization with Evolver, Genocop and OptQuest MANUEL LAGUNA Graduate School of Business Administration University of Colorado, Boulder, CO 80309-0419 Manuel.Laguna@Colorado.EDU Last revision:

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Ant Colony Optimization: The Traveling Salesman Problem

Ant Colony Optimization: The Traveling Salesman Problem Ant Colony Optimization: The Traveling Salesman Problem Section 2.3 from Swarm Intelligence: From Natural to Artificial Systems by Bonabeau, Dorigo, and Theraulaz Andrew Compton Ian Rogers 12/4/2006 Traveling

More information

Genetic Algorithms with Oracle for the Traveling Salesman Problem

Genetic Algorithms with Oracle for the Traveling Salesman Problem PROCEEDINGS OF WORLD ACADEMY OF SCIENCE, ENGINEERING AND TECHNOLOGY VOLUME 7 AUGUST 25 ISSN 17-884 Genetic Algorithms with Oracle for the Traveling Salesman Problem Robin Gremlich, Andreas Hamfelt, Héctor

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Algorithm Design (4) Metaheuristics

Algorithm Design (4) Metaheuristics Algorithm Design (4) Metaheuristics Takashi Chikayama School of Engineering The University of Tokyo Formalization of Constraint Optimization Minimize (or maximize) the objective function f(x 0,, x n )

More information

Accelerating Dynamic Binary Translation with GPUs

Accelerating Dynamic Binary Translation with GPUs Accelerating Dynamic Binary Translation with GPUs Chung Hwan Kim, Srikanth Manikarnike, Vaibhav Sharma, Eric Eide, Robert Ricci School of Computing, University of Utah {chunghwn,smanikar,vaibhavs,eeide,ricci}@utah.edu

More information