Accelerating local search algorithms for the travelling salesman problem through the effective use of GPU

Transportation Research Procedia 22 (2017) 409-418
www.elsevier.com/locate/procedia

19th EURO Working Group on Transportation Meeting, EWGT2016, 5-7 September 2016, Istanbul, Turkey

Gizem Ermiş*, Bülent Çatay

Sabanci University, Faculty of Engineering and Natural Sciences, Tuzla, Istanbul 34956, Turkey

* Corresponding author. Tel.: +90-537-981-3498. E-mail address: ermisgizem@sabanciuniv.edu

Abstract

Graphics processor units (GPUs) are many-core processors that perform better than central processing units (CPUs) on data-parallel, throughput-oriented applications with intense arithmetic operations. They can therefore considerably reduce the execution time of an algorithm by performing a wide range of calculations in parallel. On the other hand, imprecise usage of the GPU may cause a significant loss in performance. This study examines the impact of GPU resource allocation on GPU performance. Our aim is to provide insights into parallelization strategies in CUDA and to propose strategies for utilizing GPU resources effectively. We investigate the parallelization of the 2-opt and 3-opt local search heuristics for solving the travelling salesman problem. We perform an extensive experimental study on instances of various sizes and attempt to determine an effective setting that accelerates the computation the most. We also compare the performance of the GPU against that of the CPU. In addition, we revise the 3-opt implementation strategy presented in the literature for parallelization.

© 2017 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the Scientific Committee of EWGT2016.

Keywords: GPU computing; parallelization; optimization; GPU architecture; travelling salesperson problem.

1. Introduction

With their highly parallel structure, graphics processor units (GPUs) are many-core processors that are specifically designed to perform data-parallel computation. Because of the architectural differences between central processing units (CPUs) and GPUs, GPU performance is better on data-parallel, throughput-oriented applications with intense arithmetic operations. GPUs can accelerate algorithms that require high computational power and considerably reduce their execution time by performing a wide range of calculations in parallel. Data parallelism means that each processor performs the same task on different pieces of distributed data (Brodtkorb et al., 2013).

Before the evolution of today's advanced GPUs, traditional single-core processors were used, and computationally hard tasks took a great deal of time to solve on them. The computer industry developed faster single-core processors, but these were still insufficient for peak performance. Around the year 2000, by fitting more cores into the same chip, single-core processors evolved into multi-core processors (Fig. 1), which work together to process instructions and thus offer a higher total theoretical performance (Brodtkorb et al., 2013).

Fig 1. A basic block diagram of a generic multi-core processor

Driven by the needs of the gaming industry, GPUs, originally a standard component of common PCs, developed quickly in terms of computational performance. Multi-core GPUs evolved into massive multi-core, or many-core, processors which work as massively parallel stream processing accelerators or data-parallel accelerators. Because of these rapid advancements in GPU technology, GPUs became common as accelerators in general-purpose programming. Although both multi-core CPUs and GPUs can run parallel algorithms, the architectural differences between them created different usage areas depending on the nature of the problem: multi-core CPUs are designed for task-parallel implementations, whereas many-core processors are specifically designed for data-parallel implementations.

Efficiency is as critical a factor as solution quality when an algorithm is applied to an optimization problem such as the traveling salesman problem (TSP), a well-known NP-hard combinatorial optimization problem. Local search algorithms such as 2-opt and 3-opt are computationally demanding when implemented on the CPU: because they evaluate all edge exchanges on the tour to determine the exchange that reduces the tour length the most, they require a large number of computations and comparisons. Parallel implementation can therefore accelerate these computations significantly, and GPUs, with their data-parallel structure, can perform these simple computations in parallel. On the other hand, the design of the parallel implementation plays a crucial role in achieving effective utilization of the GPU resources and thus in optimizing system performance.

CUDA is a parallel computing platform and programming model introduced by NVIDIA. It enables programmers to use GPUs for general-purpose processing (Wikipedia, 2016). Van Luong et al. (2009) used the GPU as a coprocessor for the extensive computations, evaluating the solutions of the TSP from a given 2-exchange neighborhood in parallel while performing the remaining computations on the CPU. A local search has four main steps: neighborhood generation, evaluation, move selection, and solution update.
The simplest method is to create the neighborhood on the CPU and transfer it to the GPU each time. Van Luong et al. (2013) applied this technique; however, it requires copying a lot of information from the CPU to the GPU. To avoid this drawback, Rocki and Suda (2012) and Schulz (2013) used an explicit formula to explore the neighborhood, assigning one or several moves to each thread. Since the neighborhood evaluation is the most computationally expensive step, it was generally performed on the GPU. O'Neil et al. (2011), Rocki and Suda (2012), and Schulz (2013) presented implementation details for executing kernels efficiently, yet only Schulz (2013) presented a profiling analysis of the implementations. We refer the interested reader to Brodtkorb et al. (2013) for further details on GPU technology and its applications.

Dawson and Stewart (2014) presented the first parallel GPU implementation of an Ant Colony Optimization based edge detection algorithm. They mapped individual ants to warps and executed more ants in each iteration, which reduced the number of iterations needed to generate an edge map. O'Neil and Burtscher (2015) presented a parallel random-restart hill climbing with 2-opt local search for the TSP, parallelizing independent climbs across blocks and 2-opt evaluations across the threads within a block. Genetic algorithms have also been implemented successfully on the GPU by Capodieci and Burgio (2015), Sinha et al. (2016), and Kang et al. (2016). Recently, Coelho et al. (2016) applied variable neighborhood search to the single vehicle routing problem with deliveries and selective pickups, executing the local search on the GPU.

GPU implementation can be difficult because of the GPU's distinctive manner of work and complicated memory structure, and imprecise usage may cause a significant loss in performance. In this study, we analyze the effect of GPU resource allocation on GPU performance. Our aim is to provide insights into parallelization strategies in CUDA and to propose strategies for utilizing GPU resources effectively. Following the approach of Rocki and Suda (2012), we investigate the parallelization of the 2-opt and 3-opt local search heuristics by allocating the resources of an Nvidia Quadro K600 1 GB GPU device in different ways. We perform an extensive experimental study on TSP instances of various sizes and attempt to determine an effective setting that accelerates the computation the most. We also compare the performance of the GPU against a sequential implementation on an Intel Xeon E5 CPU running at 3.30 GHz. The main contribution of the study is to improve the work of Rocki and Suda (2012) by determining the most effective allocation of GPU resources on large problems. In addition, we correct the parallelization formulation that Rocki and Suda (2012) proposed for the implementation of the 3-opt algorithm.

2. Parallelization

2.1. Methodology

The 2-opt algorithm with best improvement calculates the effect of each possible edge exchange on the total cost of the current tour. Among all these possible exchanges, it performs the one that yields the largest improvement, in other words the exchange that decreases the total cost the most. The algorithm is repeated until no further improving exchange exists.

Fig 2. A 2-opt move on the travelling salesman tour

Fig. 2 demonstrates a 2-opt exchange move. To calculate the effect of a 2-opt exchange, two edges are removed from the current tour and the two resulting sub-tours are reconnected at a different position while preserving the validity of the tour. For a TSP instance of n nodes, the number of possible edge exchanges in each iteration is n(n-1)/2.
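To make the move itself concrete, here is a minimal host-side sketch (ours, not the authors' code) of applying a 2-opt exchange. With j < i, removing edges (t[j-1], t[j]) and (t[i-1], t[i]) and adding (t[j-1], t[i-1]) and (t[j], t[i]) is equivalent to reversing the tour segment between positions j and i-1; the function name apply2optMove and the std::vector tour representation are illustrative assumptions.

    #include <algorithm>
    #include <vector>

    // Apply a 2-opt move by reversing the segment tour[j..i-1], 1 <= j < i < n.
    // The reversal reconnects the two sub-tours while keeping the tour valid.
    void apply2optMove(std::vector<int>& tour, int i, int j) {
        std::reverse(tour.begin() + j, tour.begin() + i);
    }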

A thread is the basic element of the data processing on the GPU; threads map to the stream processors of the GPU. The parallelism is managed by distributing the n(n-1)/2 possible edge exchanges equally among the threads so that each thread calculates the effect of its assigned exchanges on the tour cost concurrently. Therefore, the following formula is applied to all node combinations in the current tour in parallel:

$\Delta = d(t[i], t[j]) + d(t[i-1], t[j-1]) - d(t[i-1], t[i]) - d(t[j-1], t[j])$,

where d(.) denotes the distance. This formula calculates the change in the tour length by subtracting the lengths of the removed edges from the lengths of the added edges; t[i], t[j], t[i-1], and t[j-1] are the nodes that these edges connect. The most critical part of the algorithm is to develop common formulas for all threads so that each thread produces the correct i and j values, which is what parallelizes the formula. To obtain the i and j values, they must be related to ids derived from predefined variables in CUDA, i.e. built-in variables that return the thread ID of the thread executed by its stream processor. An id represents one unique job in the problem, namely calculating the result of one edge exchange. Rocki and Suda (2012) proposed the following formulas to produce the i and j values:

$i = \left\lfloor \frac{3 + \sqrt{8\,id + 1}}{2} \right\rfloor$  (1)

$j = 1 + id - \frac{(i-2)(i-1)}{2}$  (2)

Table 1 provides example data including sample i and j values, the corresponding ids derived from the predefined CUDA variables, and the calculation of the change in the total tour cost.

Table 1. i and j values related to GPU ids

id   i   j   Change in tour cost
0    2   1   d(t[2],t[1]) + d(t[1],t[0]) - d(t[1],t[2]) - d(t[0],t[1])
1    3   1   d(t[3],t[1]) + d(t[2],t[0]) - d(t[2],t[3]) - d(t[0],t[1])
2    3   2   d(t[3],t[2]) + d(t[2],t[1]) - d(t[2],t[3]) - d(t[1],t[2])
3    4   1   d(t[4],t[1]) + d(t[3],t[0]) - d(t[3],t[4]) - d(t[0],t[1])
4    4   2   d(t[4],t[2]) + d(t[3],t[1]) - d(t[3],t[4]) - d(t[1],t[2])
5    4   3   d(t[4],t[3]) + d(t[3],t[2]) - d(t[3],t[4]) - d(t[2],t[3])
6    5   1   d(t[5],t[1]) + d(t[4],t[0]) - d(t[4],t[5]) - d(t[0],t[1])
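Putting the 2-opt mapping (1)-(2) together, the evaluation can be sketched as a CUDA kernel in which every thread recovers its (i, j) pair from a flat job id and evaluates the corresponding exchange. This is our own hedged illustration rather than the authors' code: the kernel name, the coordinate arrays, and the one-job-per-thread launch are assumptions, and for simplicity it enumerates the pairs 1 <= j < i <= n-1.

    __global__ void evaluate2opt(const float* x, const float* y,
                                 const int* tour, int n, float* delta) {
        long long id = blockIdx.x * (long long)blockDim.x + threadIdx.x;
        long long jobs = (long long)(n - 1) * (n - 2) / 2;  // pairs 1 <= j < i <= n-1
        if (id >= jobs) return;

        // Formulas (1) and (2): recover (i, j) from the flat job id.
        int i = (int)((3.0 + sqrt(8.0 * (double)id + 1.0)) / 2.0);
        int j = 1 + (int)(id - (long long)(i - 2) * (i - 1) / 2);

        int a  = tour[i],     b  = tour[j];
        int ap = tour[i - 1], bp = tour[j - 1];

        // Delta = added edge lengths minus removed edge lengths.
        delta[id] = hypotf(x[a] - x[b],   y[a] - y[b])
                  + hypotf(x[ap] - x[bp], y[ap] - y[bp])
                  - hypotf(x[ap] - x[a],  y[ap] - y[a])
                  - hypotf(x[bp] - x[b],  y[bp] - y[b]);
    }

A subsequent reduction (not shown) would select the most negative delta as the best-improvement move.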

The only difference in the 3-opt algorithm is that three edges are cut and then reconnected at different places in the tour. In this case, the indices i, j, and k are used to determine the edges. Considering the n(n-1)(n-2)/6 possible edge exchanges, Rocki and Suda (2012) proposed the following formulations to determine the i, j, and k values:

$i = \left\lfloor \sqrt[3]{3\,id + \sqrt{9\,id^2 - \tfrac{1}{9}}} + \sqrt[3]{3\,id - \sqrt{9\,id^2 - \tfrac{1}{9}}} + 1 \right\rfloor$  (3)

$j = \left\lfloor \frac{3 + \sqrt{8\left(id - \frac{i(i-1)(i-2)}{6}\right) + 1}}{2} \right\rfloor$  (4)

$k = id - \frac{i(i-1)(i-2)}{6} - \frac{j(j-1)}{2}$  (5)

2.2. Proposed revision for the parallelization of the 3-opt algorithm

The 3-opt formulations generate false i, j, and k values for certain ids. Table 2 provides examples of the problematic ids and the resulting false i values as well as the expected correct values. For example, for id values within the interval [364, 454], i should take the value 14; however, (3) yields i = 15 when id = 454. Since j and k are calculated from i, their values are also false.

Table 2. Examples of problematic ids giving false i values

id range          Problematic ids               Incorrect i produced by (3)   Correct i obtained by (6)
[364, 454]        454                           15                            14
[455, 559]        559                           16                            15
[560, 679]        679                           17                            16
[680, 815]        815                           18                            17
[816, 968]        968                           19                            18
[3276, 3653]      3652, 3653                    29                            28
[30856, 32508]    32505, 32506, 32507, 32508    59                            58

To fix this problem, we replace (3) with the following formulation:

$i = \left\lfloor \sqrt[3]{3\,id + \sqrt{9\,id^2 - \tfrac{1}{9}}} + 1 \right\rfloor$  (6)

The new formulation introduces a deficiency in the value of j: when j is calculated using (4) with the i given by (6), i and j can take the same value, which does not correspond to an exchange. To overcome this shortcoming, we propose the following correction: if i = j, then set i = i + 1 and recalculate j using (4); k is still calculated using (5). Table 3 shows a set of sample i, j, and k values calculated by (6), (4), and (5), respectively, and their final values after the correction. The problematic ids, for which i = j before the correction, are 364-366.

Table 3. Sample data showing the implementation of the revised formulation and correction

        Values calculated using (6), (4), (5)    Values after correction
id      i     j     k                            i     j     k
364     13    13    0                            14    1     0
365     13    13    1                            14    2     0
366     13    13    2                            14    2     1
367     14    3     0                            14    3     0
368     14    3     1                            14    3     1
369     14    3     2                            14    3     2
370     14    4     0                            14    4     0
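The revised mapping, including the i = j correction, can be written as a single device function. The following is a hedged sketch of (6), (4), and (5), not the authors' code: the function name and the double-precision cbrt/sqrt are assumptions, and we subtract 1 from the value given by (4) so that the sketch reproduces the sample values in Table 3, where j is counted from 1 and k from 0.

    __device__ void map3opt(long long id, int* oi, int* oj, int* ok) {
        // Revised formula (6): the second cube-root term of (3) is dropped.
        // fmax guards the radicand, which is negative for id = 0.
        double s = sqrt(fmax(9.0 * (double)id * (double)id - 1.0 / 9.0, 0.0));
        int i = (int)(cbrt(3.0 * (double)id + s) + 1.0);

        // Formula (4) on the remainder left after i's offset (shifted by 1
        // to match Table 3).
        long long rem = id - (long long)i * (i - 1) * (i - 2) / 6;
        int j = (int)((3.0 + sqrt(8.0 * (double)rem + 1.0)) / 2.0) - 1;

        // Correction: i = j does not correspond to an exchange, so advance i
        // and recompute j with (4).
        if (i == j) {
            ++i;
            rem = id - (long long)i * (i - 1) * (i - 2) / 6;
            j = (int)((3.0 + sqrt(8.0 * (double)rem + 1.0)) / 2.0) - 1;
        }

        // Formula (5).
        *oi = i;
        *oj = j;
        *ok = (int)(rem - (long long)j * (j - 1) / 2);
    }

For example, id = 364 first yields i = j = 13, and the correction then produces (i, j, k) = (14, 1, 0), matching Table 3.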

3. Effective usage of GPU resources

Imprecise usage of the GPU causes a considerable decrease in the performance of the algorithm. For this reason, we test different configurations in order to draw conclusions about the optimal usage of GPU resources. Due to the resource restrictions of our GPU device, an Nvidia Quadro K600, and the occupancy of its streaming multiprocessor (SM), our experimental results are specific to that device. However, they also provide insights into strategies for using GPUs effectively.

Fig 3. Parallelization: (a) iteration level parallelism; (b) thread level parallelism

During parallelization, it is important to combine thread and iteration level parallelism. In our case, one thread can perform several jobs, which exploits iteration level parallelism (Fig. 3a), and different threads can perform different jobs in parallel, which utilizes thread level parallelism (Fig. 3b).

3.1. Experimental design

The number of jobs performed by a thread is calculated by dividing the number of possible edge exchanges by the total number of threads on the GPU, denoted by t. The number of jobs per thread is referred to as iterations, as it represents iteration level parallelism. So, iterations = n(n-1)/(2t); each thread executes iterations times to complete its assigned jobs. We analyze different combinations of thread and iteration level parallelism to investigate the effective allocation of resources.

The GPU has a grid structure which contains several blocks, and each block contains several threads. In CUDA, each block is executed as warps, where a warp is a group of 32 threads.
- The occupancy is calculated by dividing the number of active (busy) warps in the SM by the number of warps supported by the SM.
- The Quadro K600 has one streaming multiprocessor. Its SM supports a maximum of 2048 resident threads and 16 resident blocks.
- As each warp consists of 32 threads, the SM supports a maximum of 64 (2048/32) resident warps.

Fig 4. Configuration settings to utilize all warps in the SM (1 SM = 2048 active threads = 64 active warps; the four rows show 2 blocks of 1024 threads / 32 warps each, 4 blocks of 512 threads / 16 warps each, 8 blocks of 256 threads / 8 warps each, and 16 blocks of 128 threads / 4 warps each)

Each row of Fig. 4 shows one of four configuration settings of the SM, where B, T, and W represent block, thread, and warp, respectively. Each row can be considered a grid on the SM. For example, in the first setting there are 2 blocks in the grid with 1024 threads, corresponding to 32 warps, in each block. In the second, there are 4 blocks in the grid with 512 threads, i.e. 16 warps, in each block. In each setting all possible active warps (64) of the SM are utilized, and the block dimensions are arranged as multiples of the warp size because each block is executed as warps. Since all the warps are used, the occupancy is 100% in these settings.

Under normal circumstances, 100% occupancy is achieved through the settings in Fig. 4. However, other resource restrictions, such as the maximum number of registers and the shared memory limit per SM, may prevent all the warps from being utilized; as a result, 100% occupancy may not be achievable. On the Quadro K600, the total number of registers per SM is 65536 and the shared memory per SM is 49152 bytes.
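One standard way to realize this combination in CUDA is a grid-stride loop: the launch fills the SM according to one of the Fig. 4 settings, and each thread then walks over the job ids in steps of the total thread count, performing its "iterations" share of the work. The paper distributes jobs equally among threads; a grid-stride walk is one common way of doing so, shown here as our own hedged sketch (kernel body elided; names are assumptions).

    __global__ void evaluate2optStrided(const float* x, const float* y,
                                        const int* tour, int n, float* delta,
                                        long long jobs) {
        long long t = (long long)gridDim.x * blockDim.x;  // total threads, "t"
        for (long long id = blockIdx.x * (long long)blockDim.x + threadIdx.x;
             id < jobs; id += t) {
            // Iteration level parallelism: each thread handles about
            // jobs / t ids (the paper's "iterations"), evaluating one 2-opt
            // exchange per id as in the evaluate2opt sketch above.
        }
    }

    // Host side: e.g. the second Fig. 4 setting, 4 blocks of 512 threads,
    // which makes all 64 warps of the Quadro K600's single SM resident:
    // evaluate2optStrided<<<4, 512>>>(d_x, d_y, d_tour, n, d_delta, jobs);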

3.2. Analysis

In this section, we investigate the performance of different resource allocation settings on TSP instances of different sizes. Table 4 shows the results on a TSP instance consisting of 500 nodes. Block Dim and Grid Dim refer to the block and grid dimensions, respectively, Iterations is the number of jobs per thread, and the column Time shows the time required for all possible edge exchange calculations in a tour, in milliseconds.

Table 4. Observed best kernel launches for an instance with 500 nodes

Block Dim   Grid Dim   Iterations   Time (ms)
1024        2          61           0.203
512         4          61           0.203
256         8          61           0.209
128         9          109          0.307

A total of 125,250 edge exchange calculation jobs are distributed among the threads in different ways. The block dimensions are arranged as multiples of the warp size, 32. The configurations with the same block dimension form a group, so we have four test groups with block dimensions of 1024, 512, 256, and 128. For each block dimension we tested different [grid dimension, iterations] combinations (see Appendix A for details); Table 4 reports only the best performing combination of each group. In the first three groups the best performances are observed when all warps are active, which means that shared memory and registers do not restrict the system. In the last group, however, only 9 blocks out of 16 are used when the best performance is achieved.

Table 5. Restrictions of shared memory and registers for an instance with 500 nodes (the kernel uses 20 registers per thread and 5012 bytes of shared memory per block)

Block Dim | Grid Dim, all warps used | Registers used in SM | Max active blocks allowed by registers | Max active blocks allowed by shared memory | Grid Dim after restrictions | Occupancy (%)
1024 | 2  | 1024 x 20 = 20480 | 65536/20480 = 3 > 2   | 49152/5012 = 9 > 2  | 2 | 100
512  | 4  | 512 x 20 = 10240  | 65536/10240 = 6 > 4   | 49152/5012 = 9 > 4  | 4 | 100
256  | 8  | 256 x 20 = 5120   | 65536/5120 = 12 > 8   | 49152/5012 = 9 > 8  | 8 | 100
128  | 16 | 128 x 20 = 2560   | 65536/2560 = 25 > 16  | 49152/5012 = 9 < 16 | 9 | 56

For the 500-node instance, the shared memory usage per block is 5012 bytes, while the maximum shared memory per SM is 49,152 bytes. Shared memory thus allows 9 active blocks (49152/5012), which is greater than 2, so the shared memory capacity of the SM does not restrict the usage of all active warps in the first group. For the first three groups, the best performances are achieved when all warps are active, i.e. when the occupancy is 100%. In the last group, however, the shared memory capacity prevents the usage of all warps: it still allows only 9 active blocks, whereas 16 active blocks would be required to exploit all warps. In other words, the SM has enough memory for only 9 blocks. In this group, (128/32) x 9 = 36 warps out of 64 can be utilized by launching 9 blocks with 128 threads each. Hence, the occupancy is 56% and the run time performance is the worst. Nevertheless, among all combinations in this group, the best performance is achieved when all the warps allowed by the shared memory and register limits are used.
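The block-count limits worked out by hand in Table 5 can also be queried programmatically. A hedged host-side sketch using the CUDA occupancy API follows; the kernel symbol evaluate2optStrided and the 5012-byte dynamic shared memory size are assumptions carried over from the sketches above.

    #include <cstdio>

    void reportOccupancy(int blockSize) {
        int maxBlocks = 0;
        // Resident-block limit per SM for this kernel at the given block
        // size, accounting for register usage and shared memory per block.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &maxBlocks, evaluate2optStrided, blockSize,
            /*dynamicSMemSize=*/5012);
        int residentWarps = maxBlocks * (blockSize / 32);
        // The Quadro K600's SM supports 64 resident warps.
        printf("block size %d: occupancy upper bound %d%%\n",
               blockSize, residentWarps * 100 / 64);
    }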

Table 6 summarizes the results for problems of different sizes. We observe that the run time performance declines with decreasing occupancy rate.

Table 6. Performances for instances of different sizes

         Occupancy (%) by block size     Time (ms) by block size
Nodes    1024    512    256    128       1024    512    256    128
500      100     100    50     25        0.2     0.2    0.2    0.3
1000     100     100    50     25        0.7     0.7    1      2
1500     100     75     38     19        1       2      3      7
2000     100     50     25     13        3       5      10     19

4. Computational results

We tested the best performing GPU configuration on TSP instances of different sizes. Table 7 compares the performance of the sequential 2-opt algorithm implemented on the CPU with that of the parallel implementation on the GPU; the 2-opt column reports the total number of edge exchanges performed. The initial solution is constructed naively using the sequential ordering of the nodes. The results show that the parallel 2-opt algorithm runs faster than the sequential one and that the difference in speed becomes more significant as the problem grows; for example, the GPU performs about 14 times faster on the instance with 4000 nodes. On the other hand, it does not necessarily yield the best solution.

Table 7. Comparison of sequential and parallel 2-opt implementations

         Sequential implementation on CPU       Parallel implementation on GPU
Nodes    Run Time (ms)   2-opt   Tour Length    Run Time (ms)   2-opt   Tour Length
500      253             541     16117          181             533     16278
1000     5512            1081    34448          1096            1094    35985
2000     79233           2219    61279          7099            2166    60926
3000     315715          3292    91778          50754           3268    91132
4000     1287758         4470    121050         93671           4450    121396

Applying the local search algorithm starting from a good initial solution decreases the number of iterations and shortens the total run time. We therefore applied the nearest neighbor (NN) heuristic to obtain a better initial solution than the naive approach above: NN starts from an arbitrary node and builds the tour by repeatedly moving to the closest not-yet-visited node. In this case, we also tested larger instances with 6000 and 9000 nodes. The results are reported in Table 8. We observe that NN enables the local search to converge to better solutions in less time, in line with expectations. Note that NN speeds up the algorithm by 20.4% on the 9000-node instance, whereas the speed-up is only 4.4% on the 500-node instance. This difference is due to the fact that NN is performed on the CPU, which dramatically increases the share of the CPU time in the total run time on smaller problems.
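For reference, a minimal host-side sketch of this construction (our illustration, not the paper's code; Euclidean coordinates, an O(n^2) scan, and a fixed starting node 0 are assumptions):

    #include <cmath>
    #include <vector>

    // Nearest neighbor tour construction: start from node 0 and repeatedly
    // append the closest unvisited node (O(n^2) overall).
    std::vector<int> nearestNeighborTour(const std::vector<float>& x,
                                         const std::vector<float>& y) {
        int n = (int)x.size();
        std::vector<int> tour;
        std::vector<bool> visited(n, false);
        int cur = 0;
        visited[cur] = true;
        tour.push_back(cur);
        for (int step = 1; step < n; ++step) {
            int best = -1;
            float bestDist = INFINITY;
            for (int v = 0; v < n; ++v) {
                if (visited[v]) continue;
                float d = std::hypot(x[cur] - x[v], y[cur] - y[v]);
                if (d < bestDist) { bestDist = d; best = v; }
            }
            visited[best] = true;
            tour.push_back(best);
            cur = best;
        }
        return tour;
    }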

Table 8. Algorithm performance with naive and nearest neighbor initial solutions

         With naive initial solution            With nearest neighbor initial solution
Nodes    Tour Length   2-opt   Run Time (ms)    Tour Length   2-opt   Run Time (ms)
500      16278         533     181              15707         417     173
1000     35985         1094    1096             39763         867     970
2000     60926         2166    7099             60521         1877    6276
2500     78383         2772    23011            77113         2273    19449
3000     91132         3268    50754            89700         2847    34611
3500     105356        3883    62493            103951        3396    55915
4000     121946        4462    93820            120219        3812    81460
6000     10480046      1554    95925            8559520       1212    74821
9000     15206013      2362    493356           12344314      1876    392864

5. Conclusion

In this study, we investigated parallelization strategies for utilizing GPU resources effectively. Building on the 2-opt local search parallelization approach presented in the literature, we performed an extensive performance analysis to configure the kernel parameters. Our experiments showed that the best strategy for peak performance is to utilize as many resident warps as the shared memory and register limits of the device allow and to perform the remaining jobs through iteration level parallelism. The warps on the device should be kept as busy as possible, which provides thread level parallelism; the remaining jobs should then be distributed equally among the launched threads. In other words, one thread performs more than one job, which adds iteration level parallelism. Since the occupancy decreases on larger problems, the iteration level parallelism should be kept at its maximum.

Future research on this topic may focus on parallelization strategies for other local search and metaheuristic methods. These methods are widely applied to various combinatorial optimization problems and usually require long computation times for good convergence. Problem-specific GPU implementations are therefore needed to enhance their performance, where the parts of the algorithm requiring intense computation may be handled by GPU-specific functions.

Acknowledgment

This research was partially supported by The Scientific and Technical Research Council of Turkey through Grant #113M522 to the second author.

Appendix A. Detailed experimental results for the 500-node TSP instance

We report the experimental results of the four test groups with block dimensions of 1024, 512, 256, and 128. Table A.1 gives the performance of different resource allocation settings on the TSP instance with 500 nodes; the best performing combination of each group is the one summarized in Table 4. Block Dimension and Grid Dimension refer to the number of threads in a block and the number of blocks in a grid, respectively. Iterations is the number of possible 2-opt edge exchange calculations performed by a thread. GPU Time is the time spent calculating all 125,250 edge exchanges. 2-opt shows the total number of 2-opt edge exchanges performed. CPU+GPU Time is the total time elapsed from the start of the algorithm until the last exchange has been performed.

Table A.1. Detailed results of the 2-opt experiments on the 500-node TSP instance

Block Dimension   Grid Dimension   Iterations   GPU Time (ms)   Tour Length   2-opt   CPU+GPU Time (ms)
1024              122              1            0.369           16298         530     428
1024              61               2            0.295           16304         529     354
1024              2                61           0.205           16278         533     181
1024              1                122          0.333           16148         543     318
512               244              1            0.348           16304         529     264
512               122              2            0.281           16304         529     222
512               4                61           0.212           16298         529     189
512               2                122          0.334           16304         529     249
512               1                244          0.635           16304         529     425
256               488              1            0.366           16304         529     371
256               244              2            0.280           16304         529     301
256               122              4            0.246           16304         529     337
256               8                61           0.209           16193         539     318
256               4                122          0.335           16304         529     267
256               1                488          1.246           16304         529     888
128               975              1            0.624           16304         529     422
128               488              2            0.458           16304         529     423
128               244              4            0.386           16304         529     342
128               122              8            0.353           16304         529     319
128               16               61           0.342           16196         538     313
128               8                122          0.337           16304         529     327
128               1                975          2.469           16304         529     1426

References

Brodtkorb, A. R., Hagen, T. R., Schulz, C., Hasle, G., 2013. GPU computing in discrete optimization. Part I: Introduction to the GPU. EURO Journal on Transportation and Logistics 2, 129-157.

Capodieci, N., Burgio, P., 2015. Efficient implementation of genetic algorithms on GP-GPU with scheduled persistent CUDA threads. In: Proceedings of the 7th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), 6-12.

Coelho, I. M., Ochi, L. S., Munhoz, P. L. A., Souza, M. J. F., Farias, R., Bentes, C., 2016. An integrated CPU-GPU heuristic inspired on variable neighborhood search for the single vehicle routing problem with deliveries and selective pickups. International Journal of Production Research 54, 945-962.

Dawson, L., Stewart, I. A., 2014. Accelerating ant colony optimization-based edge detection on the GPU using CUDA. In: Proceedings of the IEEE Congress on Evolutionary Computation (CEC), 1736-1743.

Janiak, A., Janiak, W. A., Lichtenstein, M., 2008. Tabu search on GPU. Journal of Universal Computer Science 14, 2416-2426.

Kang, S., Kim, S., Won, J., Kang, Y., 2016. GPU-based parallel genetic approach to large-scale travelling salesman problem. The Journal of Supercomputing, 1-16 (available online).

O'Neil, M. A., Burtscher, M., 2015. Rethinking the parallelization of random-restart hill climbing: a case study in optimizing a 2-opt TSP solver for GPU execution. In: Proceedings of the 8th Workshop on General Purpose Processing using GPUs, 99-108.

O'Neil, M. A., Tamir, D., Burtscher, M., 2011. A parallel GPU version of the traveling salesman problem. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 348-353.

Rocki, K., Suda, R., 2012. Accelerating 2-opt and 3-opt local search using GPU in the travelling salesman problem. In: Proceedings of the International Conference on High Performance Computing and Simulation (HPCS), 489-495.

Schulz, C., 2013. Efficient local search on the GPU - investigations on the vehicle routing problem. Journal of Parallel and Distributed Computing 73, 14-31.

Sinha, R. S., Singh, S., Singh, S., Banga, V. K., 2016. Accelerating genetic algorithm using general purpose GPU and CUDA. International Journal of Computer Graphics 7, 17-30.

Van Luong, T., Melab, N., Talbi, E. G., 2009. Parallel local search on GPU. Research Report RR-6915, INRIA.

Van Luong, T., Melab, N., Talbi, E. G., 2013. GPU computing for parallel local search metaheuristic algorithms. IEEE Transactions on Computers 62, 173-185.

Wikipedia, 2016. CUDA. https://en.wikipedia.org/wiki/cuda (last accessed on May 5, 2016).