Accelerating local search algorithms for the travelling salesman problem through the effective use of GPU


Available online at ScienceDirect

Transportation Research Procedia 22 (2017) 409–418

19th EURO Working Group on Transportation Meeting, EWGT2016, 5-7 September 2016, Istanbul, Turkey

Accelerating local search algorithms for the travelling salesman problem through the effective use of GPU

Gizem Ermiş*, Bülent Çatay

Sabanci University, Faculty of Engineering and Natural Sciences, Tuzla, Istanbul 34956, Turkey

Abstract

Graphics processor units (GPUs) are many-core processors that perform better than central processing units (CPUs) on data-parallel, throughput-oriented applications with intense arithmetic operations. Thus, they can considerably reduce the execution time of algorithms by performing a wide range of calculations in a parallel manner. On the other hand, imprecise usage of the GPU may cause a significant loss in performance. This study examines the impact of GPU resource allocations on GPU performance. Our aim is to provide insights about parallelization strategies in CUDA and to propose strategies for utilizing GPU resources effectively. We investigate the parallelization of the 2-opt and 3-opt local search heuristics for solving the travelling salesman problem. We perform an extensive experimental study on instances of various sizes and attempt to determine an effective setting which accelerates the computation the most. We also compare the performance of the GPU against that of the CPU. In addition, we revise the 3-opt implementation strategy presented in the literature for parallelization.

© 2017 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the Scientific Committee of EWGT2016.

Keywords: GPU computing; parallelization; optimization; GPU architecture; travelling salesperson problem.

1. Introduction

With their highly parallel structure, graphics processor units (GPUs) are many-core processors that are specifically designed to perform data-parallel computation.
* Corresponding author. E-mail address: ermisgizem@sabanciuniv.edu

Because of the architectural differences between the

central processing units (CPUs) and GPUs, the GPU performance is better on data-parallel, throughput-oriented applications with intense arithmetic operations. GPUs have the capability to accelerate algorithms that require high computational power. Thus, they can considerably reduce the execution time of such algorithms by performing a wide range of calculations in a parallel manner. Modern GPUs are many-core processors that are specifically designed to perform data-parallel computation. Data parallelism means that each processor performs the same task on different pieces of distributed data (Brodtkorb et al., 2013). Before the evolution of today's advanced GPUs, traditional single-core processors were exploited, and computationally hard tasks took a great deal of time when solved with their help. The computer industry developed faster single-core processors, but they were still insufficient for peak performance. Around the year 2000, by fitting more cores into the same chip, single-core processors evolved into multi-core processors (Fig. 1), which work together to process instructions and thus have a higher total theoretical performance (Brodtkorb et al., 2013).

Fig 1. A basic block diagram of a generic multi-core processor

Driven by the needs of the gaming industry, GPUs, which were already a standard component of common PCs, developed quickly in terms of computational performance. Multi-core GPU processors evolved into massive multi-core or many-core processors which work as massively parallel stream processing accelerators or data-parallel accelerators. Because of the rapid advancements in GPU technology, they became common as accelerators in general purpose programming.
Although both multi-core CPUs and GPUs can implement parallel algorithms, the architectural differences between them have created different usage areas depending upon the nature of the problem. While multi-core CPUs are designed for task-parallel implementations, many-core processors are specifically designed for data-parallel implementations. Efficiency is as critical a factor as solution quality when an algorithm is applied to an optimization problem such as the travelling salesman problem (TSP), a well-known NP-hard combinatorial optimization problem. Local search algorithms such as 2-opt or 3-opt are computationally demanding when implemented on the CPU: because these techniques evaluate all edge exchanges on the tour to determine the exchange that reduces the tour length the most, they require a large number of computations and comparisons. So, parallel implementation can accelerate these computations significantly. GPUs, which have a data-parallel structure, can perform these simple computations in parallel. On the other hand, the design of the parallel implementation plays a crucial role in achieving effective utilization of the GPU resources, and thus in optimizing system performance. CUDA is a parallel computing platform and programming model introduced by NVIDIA. It enables programmers to use GPUs for general purpose processing (Wikipedia, 2016). Van Luong et al. (2009) used the GPU as a coprocessor for extensive computations, where the solutions of the TSP from a given 2-exchange neighborhood are evaluated in parallel and the remaining computations are performed on the CPU. A local search has four main steps: neighborhood generation, evaluation, move selection, and solution update. The simplest method is to create the neighborhood on the CPU and transfer it to the GPU each time. Van Luong et al. (2013) applied this technique; however, it requires copying a lot of information from the CPU to the GPU.
To prevent this drawback, Rocki and Suda (2012) and Schulz (2013) utilized an explicit formula to explore the

neighborhood by assigning one or several moves to a thread. Since the neighborhood evaluation is the most computationally expensive task, it was generally performed on the GPU. O'Neil et al. (2011), Rocki and Suda (2012), and Schulz (2013) presented some implementation details for executing the kernel efficiently; yet, only Schulz (2013) presented a profiling analysis of the implementations. We refer the interested reader to Brodtkorb et al. (2013) for further details on the GPU technology and its applications. Dawson and Stewart (2014) presented the first parallel GPU implementation of an Ant Colony Optimization based edge detection algorithm. They mapped individual ants to warps and executed more ants in each iteration, which reduced the number of iterations needed to generate an edge map. O'Neil and Burtscher (2015) presented random-restart hill climbing with 2-opt local search for the TSP in parallel: they parallelized independent climbs between blocks and 2-opt evaluations between threads within a block. Genetic Algorithms were also successfully implemented on the GPU by Capodieci and Emilia (2015), Sinha et al. (2016), and Kang et al. (2016). Recently, Coelho et al. (2016) applied variable neighborhood search to the single vehicle routing problem with deliveries and selective pickups, where the local search was executed on the GPU. GPU implementations can be difficult because of the distinctive manner of work and the complicated memory structure of the GPU. Imprecise usage of the GPU may cause a significant loss in performance. In this study, we analyze the effect of GPU resource allocations on the GPU performance. Our aim is to provide insights about parallelization strategies in CUDA and to propose strategies for utilizing GPU resources effectively.
Following the approaches of Rocki and Suda (2012), we investigate the parallelization of the 2-opt and 3-opt local search heuristics by allocating the resources of an Nvidia Quadro K600 1 GB GPU device in different ways. We perform an extensive experimental study on TSP instances of various sizes and attempt to determine an effective setting which accelerates the computation the most. We also compare the performance of the GPU against a sequential implementation on an Intel Xeon E5 CPU with 3.30 GHz clock speed. The main contribution of the study is to improve the work of Rocki and Suda (2012) by determining the most effective allocation of GPU resources on large-size problems. In addition, we correct the parallelization formulation that Rocki and Suda (2012) proposed for the implementation of the 3-opt algorithm.

2. Parallelization

2.1. Methodology

The 2-opt algorithm with best improvement calculates the effect of each possible edge exchange on the total cost of the current tour. Among all these possible exchanges, it performs the one that yields the largest improvement, in other words the exchange that decreases the total cost the most. The algorithm is repeated until no further improving exchange exists.

Fig 2. A 2-opt move on the travelling salesman tour

Fig. 2 demonstrates a 2-opt exchange move. To calculate the effect of a 2-opt exchange, two edges are removed from the current tour and the two emerging sub-tours are reconnected at a different position while preserving the validity of the tour. For a TSP instance of n nodes, the number of possible edge exchanges in each iteration is n(n−1)/2.
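As a point of reference, the best-improvement 2-opt scheme described above can be sketched as a sequential routine. This is a minimal illustration only, not the paper's CUDA implementation; the instance, function names, and indexing (removed edges follow positions a and b rather than precede i and j) are our own choices:

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def tour_length(tour, pts):
    """Total length of a closed tour over the given points."""
    n = len(tour)
    return sum(dist(pts[tour[a]], pts[tour[(a + 1) % n]]) for a in range(n))

def two_opt_best_improvement(tour, pts):
    """Repeatedly apply the single best 2-opt exchange until no move improves the tour."""
    n = len(tour)
    improved = True
    while improved:
        improved = False
        best_delta, best_move = -1e-12, None
        for a in range(n - 1):
            for b in range(a + 1, n):
                # change in tour length when edges (t[a],t[a+1]) and (t[b],t[b+1])
                # are replaced by (t[a],t[b]) and (t[a+1],t[b+1])
                delta = (dist(pts[tour[a]], pts[tour[b]])
                         + dist(pts[tour[(a + 1) % n]], pts[tour[(b + 1) % n]])
                         - dist(pts[tour[a]], pts[tour[(a + 1) % n]])
                         - dist(pts[tour[b]], pts[tour[(b + 1) % n]]))
                if delta < best_delta:
                    best_delta, best_move = delta, (a, b)
        if best_move is not None:
            a, b = best_move
            # performing the exchange amounts to reversing the segment between the cuts
            tour[a + 1:b + 1] = tour[a + 1:b + 1][::-1]
            improved = True
    return tour
```

On the four corner points of a unit square, for example, the crossing tour [0, 1, 2, 3] is repaired to the square's perimeter of length 4 in a single best-improvement move.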

A thread on the GPU is a basic element of the data to be processed. Threads map to the stream processors of the GPU. The parallelism is managed by distributing the n(n−1)/2 possible edge exchanges among different threads equally so that each thread can calculate the effects of the relevant exchanges on the tour cost concurrently. Therefore, the following formula is applied to all node combinations in the current tour in parallel:

Δ = d(t[i], t[j]) + d(t[i−1], t[j−1]) − d(t[i−1], t[i]) − d(t[j−1], t[j]),

where d(·) refers to the distance. This formula calculates the change in the tour length by subtracting the distances of the removed edges from the distances of the added edges. i, j, i−1, and j−1 represent the nodes that connect these edges. The most critical part of the algorithm is to develop common formulas for all threads so that they can produce the correct i and j values which parallelize the formula. To obtain the i and j values, they should be related to the ids that derive from predefined variables in CUDA. Predefined variables are built-in variables that return the thread ID of the thread that is executed by its stream processor. An id represents each unique job in the problem, which calculates the result of an edge exchange. Rocki and Suda (2012) proposed the following formulae to produce the i and j values:

i = ⌊(3 + √(8·id + 1)) / 2⌋   (1)

j = 1 + id − (i − 2)(i − 1)/2   (2)

Table 1 provides example data including sample i and j values, the corresponding ids derived from the predefined CUDA variables, and the calculation of the change in the total tour cost.

Table 1.
i and j values related to GPU ids

id  i  j  Δ
0   2  1  d(t[2],t[1]) + d(t[1],t[0]) − d(t[1],t[2]) − d(t[0],t[1])
1   3  1  d(t[3],t[1]) + d(t[2],t[0]) − d(t[2],t[3]) − d(t[0],t[1])
2   3  2  d(t[3],t[2]) + d(t[2],t[1]) − d(t[2],t[3]) − d(t[1],t[2])
3   4  1  d(t[4],t[1]) + d(t[3],t[0]) − d(t[3],t[4]) − d(t[0],t[1])
4   4  2  d(t[4],t[2]) + d(t[3],t[1]) − d(t[3],t[4]) − d(t[1],t[2])
5   4  3  d(t[4],t[3]) + d(t[3],t[2]) − d(t[3],t[4]) − d(t[2],t[3])
6   5  1  d(t[5],t[1]) + d(t[4],t[0]) − d(t[4],t[5]) − d(t[0],t[1])

The only difference in the 3-opt algorithm is that three edges are cut and then reconnected at different places in the tour. In this case, the indices i, j, and k are used to determine the edges. Considering the n(n−1)(n−2)/6 possible edge exchanges, Rocki and Suda (2012) proposed the following formulations to determine the i, j, and k values:

i = ⌊∛(3·id + √(9·id² − 1/27)) + ∛(3·id − √(9·id² − 1/27)) + 1⌋   (3)

j = ⌊(3 + √(8·(id − i(i−1)(i−2)/6) + 1)) / 2⌋   (4)

k = id − i(i−1)(i−2)/6 − j(j−1)/2   (5)
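To illustrate, the id-to-index mapping of formulas (1)-(2) can be sketched in Python. This is illustrative only: a real CUDA kernel would derive id from the built-in block and thread variables, and the integer while-loops are our own safeguard against floating-point rounding at the group boundaries (they are not part of the published formulas):

```python
import math

def two_opt_id_to_ij(job_id):
    """Recover the 2-opt exchange indices (i, j), j < i, from a job id.

    Inverts id = (i-1)(i-2)/2 + (j-1): i comes from formula (1),
    then j from formula (2).  The while-loops pin i to the exact
    triangular-number boundary in case sqrt rounds the wrong way."""
    i = int((3.0 + math.sqrt(8.0 * job_id + 1.0)) / 2.0)
    while (i - 1) * (i - 2) // 2 > job_id:   # estimate overshot
        i -= 1
    while i * (i - 1) // 2 <= job_id:        # estimate undershot
        i += 1
    j = 1 + job_id - (i - 2) * (i - 1) // 2  # formula (2)
    return i, j
```

Running it over consecutive ids reproduces the pairs of Table 1, e.g. id 0 gives (2, 1) and id 6 gives (5, 1), and every id maps to a distinct pair with j < i.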

2.2. Proposed revision for the parallelization of the 3-opt algorithm

The formulations for 3-opt generate false i, j, and k values for certain id values. Table 2 provides examples of the problematic ids and the resulting false i values as well as the expected correct values. For example, for id values within the interval [364,454] i should take the value 14; however, (3) yields i=15 when id=454. Since j and k are calculated using i, their values are also false.

Table 2. Examples of problematic ids giving false i values

id range        Problematic ids           Incorrect i produced by (3)  Correct i obtained by (6)
[364,454]       …                         15                           14
[455,559]       …                         16                           15
[560,679]       …                         17                           16
[680,815]       …                         18                           17
[816,968]       …                         19                           18
[3276,3653]     …                         29                           28
[30856,32508]   …, 32506, 32507, 32508    59                           58

To fix the above problem, we replace (3) with the following formulation:

i = ⌊∛(3·id + √(9·id² − 1/27)) + 1⌋   (6)

The new formulation leads to a deficiency in the value of j: when j is calculated using (4) based on the i given by (6), i and j can take the same value, which does not correspond to an exchange. To overcome this shortcoming, we propose the following correction: if i=j, then set i=i+1 and recalculate j using (4); k is still calculated using (5). Table 3 shows a set of sample i, j, and k values calculated by (6), (4), and (5), respectively, and their final values after the correction. The underlined values highlight the problematic ids.

Table 3. Sample data showing the implementation of the revised formulation and correction

id | i, j, k calculated using (6), (4), (5) | i, j, k after correction

3. Effective usage of GPU resources

Imprecise usage of the GPU causes a considerable decrease in the performance of the algorithm. For this reason, we test different configurations in order to draw conclusions about the optimal usage of GPU resources. Due to the resource restrictions of our GPU device, an Nvidia Quadro K600, and the occupancy of its streaming multiprocessor (SM), our experimental results are specific to that device.
However, they also provide insights for strategies to use GPUs effectively.
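As a sketch of the revised 3-opt mapping discussed above, the decomposition id = i(i−1)(i−2)/6 + j(j−1)/2 + k can be inverted with a cube-root estimate followed by an explicit integer adjustment. This is our own variant: the adjustment loops play the role of the closed-form i = j correction, and the index convention 0 ≤ k < j < i is an assumption that may differ slightly from the paper's:

```python
import math

def three_opt_id_to_ijk(job_id):
    """Recover 3-opt indices (i, j, k) with 0 <= k < j < i from a job id.

    The cube root gives a first estimate of i (cf. formulas (3)/(6));
    the while-loops then pin i and j to the exact integer boundaries,
    taking the place of the closed-form i = j correction."""
    i = round((6.0 * job_id) ** (1.0 / 3.0)) + 1
    while i * (i - 1) * (i - 2) // 6 > job_id:      # estimate overshot
        i -= 1
    while (i + 1) * i * (i - 1) // 6 <= job_id:     # estimate undershot
        i += 1
    rem = job_id - i * (i - 1) * (i - 2) // 6
    j = int((1.0 + math.sqrt(8.0 * rem + 1.0)) / 2.0)
    while j * (j - 1) // 2 > rem:
        j -= 1
    while (j + 1) * j // 2 <= rem:
        j += 1
    k = rem - j * (j - 1) // 2                      # cf. formula (5)
    return i, j, k
```

This version reproduces the boundary cases of Table 2: id 454 yields i = 14, id 455 yields i = 15, and every id in a contiguous range maps to a distinct valid triple.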

Fig 3. Parallelization (a) Iteration level parallelism, (b) Thread level parallelism

During parallelization, it is important to combine thread and iteration level parallelism. In our case, one thread can perform several jobs, which exploits iteration level parallelism (Fig. 3a), and different threads can perform different jobs in parallel, which utilizes thread level parallelism (Fig. 3b).

3.1. Experimental design

The number of jobs that will be performed by a thread can be calculated by dividing the number of possible edge exchanges by the total number of threads in the GPU, which is denoted by t. The number of jobs for each thread is referred to as iterations, as it represents iteration level parallelism. So, iterations = n(n−1)/(2·t). This basically means that each thread will process iterations times to complete its assigned jobs. We analyze different combinations of thread and iteration level parallelism to investigate the effective allocation of resources. The GPU has a grid structure which contains several blocks, and each block contains several threads. In CUDA, each block is executed as warps, and a warp is a group of 32 threads.
- The occupancy is calculated by dividing the number of active (busy) warps in the SM by the number of warps supported by the SM.
- The Quadro K600 has one streaming multiprocessor. Its SM supports a maximum of 2048 resident threads and 16 resident blocks.
- As each warp consists of 32 threads, an SM has a maximum of 64 (2048/32) resident warps.

1 SM = 2048 active threads = 64 active warps:
- 2 blocks × 1024 threads (32 warps per block)
- 4 blocks × 512 threads (16 warps per block)
- 8 blocks × 256 threads (8 warps per block)
- 16 blocks × 128 threads (4 warps per block)

Fig 4. Configuration settings to utilize all warps in SM

Each row in Fig. 4 shows one of four different configuration settings in the SM.
In this figure, B, T, and W represent block, thread, and warp, respectively. Each row can be considered as a grid configuration for the SM. For example, in the first setting there are 2 blocks in the grid and 1024 threads, corresponding to 32 warps, in each block. In the second, there are 4 blocks in the grid and 512 threads, with 16 warps, in each block. In each setting all possible active warps (64 warps) in the SM are utilized, and the block dimensions are arranged as multiples of the warp size because each block is executed as warps. Since all the warps are used, the occupancy is 100% in these settings. Under normal circumstances, 100% occupancy is achieved through the settings in Fig. 4. However, other resource restrictions, such as the maximum number of registers and the shared memory limit per SM, may prevent all the warps from being utilized. As a result, 100% occupancy may not be achievable. On the Quadro K600, the total number of registers per SM is 65,536 and the shared memory per SM is 49,152 bytes.
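The division of jobs among threads can be illustrated with a small helper. This is a sketch only: a real CUDA kernel would compute its id from the built-in thread and block variables, and the stride pattern shown is one common way to realize iteration-level parallelism:

```python
def thread_jobs(thread_id, total_threads, total_jobs):
    """Job ids handled by one thread when the exchange evaluations are
    dealt out in strides of the total thread count: thread t processes
    jobs t, t + T, t + 2T, ... so the work is split almost equally."""
    return list(range(thread_id, total_jobs, total_threads))
```

With the 2048 resident threads of the SM and the 125,250 jobs of the 500-node instance discussed below, every job is covered exactly once and each thread performs at most 62 iterations.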

3.2. Analysis

In this section, we investigate the performance of different resource allocation settings on TSP instances of different sizes. Table 4 shows the results on a TSP instance consisting of 500 nodes. Block Dim and Grid Dim refer to the block and grid dimensions, respectively, and the column Time shows the time required for all possible edge exchange calculations in a tour, in milliseconds.

Table 4. Observed best kernel launches for an instance with 500 nodes

Block Dim | Grid Dim | Time (ms)

A total of 125,250 edge exchange calculation jobs are distributed among the threads in different ways. The block dimensions are arranged as multiples of the warp size 32. The configurations belonging to the same block dimension form a group, so we have four test groups with block dimensions of 1024, 512, 256, and 128. For each block dimension, we tested different [grid dimension, iterations] combinations (see the Appendix for details). Table 4 reports only the best performing combination. We observe that in the first three groups the best performances occur when all warps are active, which means that shared memory and registers do not restrict the system. However, in the last group, only 9 blocks out of 16 are used when the best performance is achieved.

Table 5. Restrictions of shared memory and registers for an instance with 500 nodes

Block Dim | Grid Dim (all warps used) | Max active blocks allowed by registers | Max active blocks allowed by shared memory | Grid Dim after restrictions | Occupancy (%)
1024 | 2  | 65536/20480 = 3 > 2   | 49152/5012 = 9 > 2  | 2 | 100
512  | 4  | 65536/10240 = 6 > 4   | 49152/5012 = 9 > 4  | 4 | 100
256  | 8  | 65536/5120 = 12 > 8   | 49152/5012 = 9 > 8  | 8 | 100
128  | 16 | 65536/2560 = 25 > 16  | 49152/5012 = 9 < 16 | 9 | 56

In the instance with 500 nodes, the shared memory usage per block is 5012 bytes. The maximum shared memory size per SM is 49,152 bytes. The shared memory thus allows 9 active blocks (49152/5012), which is greater than 2.
So, the shared memory capacity of the SM does not restrict the usage of all active warps. For the first three groups, the best performances are achieved when all warps are active, i.e. when the occupancy is 100%. However, the shared memory capacity prevents the usage of all warps in the last group: although the shared memory still allows 9 active blocks, 16 active blocks are required to exploit all warps in this case. In other words, the SM has enough memory for only 9 blocks. In this group, (128/32) × 9 = 36 warps out of 64 can be utilized by launching 9 blocks with 128 threads in each block. Hence, the occupancy is 56% and the run time performance is the worst. Nevertheless, among all combinations in this group, the best performance is achieved when all the warps allowed by the shared memory and registers are used. Table 6 summarizes the results for problems of different sizes. We observe that the run time performance declines with decreasing occupancy rate.
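The occupancy arithmetic above can be reproduced with a small helper. The Quadro K600's Kepler-class per-SM limits are hard-coded; the figure of 20 registers per thread used in the examples is an assumption for illustration, consistent with the per-block totals in Table 5:

```python
def sm_occupancy(block_threads, regs_per_thread, smem_per_block):
    """Resident blocks and occupancy (%) of a single Kepler-class SM
    (2048 resident threads, 16 resident blocks, 64 resident warps,
    65536 registers, 49152 bytes of shared memory).

    The number of resident blocks is the tightest of the four limits;
    occupancy = active warps / maximum resident warps."""
    MAX_THREADS, MAX_BLOCKS, MAX_WARPS = 2048, 16, 64
    MAX_REGS, MAX_SMEM = 65536, 49152
    blocks = min(MAX_BLOCKS,
                 MAX_THREADS // block_threads,
                 MAX_REGS // (regs_per_thread * block_threads),
                 MAX_SMEM // smem_per_block)
    warps = blocks * (block_threads // 32)
    return blocks, 100 * warps // MAX_WARPS
```

With 5012 bytes of shared memory per block, 256-thread blocks still reach full occupancy, while 128-thread blocks are capped at 9 resident blocks, i.e. 36 of 64 warps (56%), matching the last group of Table 5.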

Table 6. Performances for instances of different sizes

Nodes | Block Size | Occupancy (%) | Time (ms)

4. Computational results

We tested the best performing GPU configuration on TSP instances with different sizes. Table 7 compares the performance of the sequential 2-opt algorithm implemented on the CPU with that of the parallel implementation on the GPU. The initial solution is constructed naively using the sequential ordering of the nodes. The results show that the parallel 2-opt algorithm runs faster than the sequential one, and the difference in speed becomes significant with growing problem size. For example, the GPU performs 14 times faster on the instance with 4000 nodes. On the other hand, it does not necessarily yield the best solution.

Table 7. Comparison of sequential and parallel 2-opt implementations

Nodes | CPU Run Time (ms) | CPU Tour Length | GPU Run Time (ms) | GPU Tour Length

Applying the local search algorithm starting with a good initial solution decreases the number of iterations and shortens the total run time. So, we applied the nearest neighbor (NN) method to obtain an improved initial solution compared to the naive approach above. NN starts from an arbitrary node and builds the tour by repeatedly moving to the closest not-yet-visited node. In this case, we also tested larger instances with 6000 and 9000 nodes. The results are reported in Table 8. We observe that NN enables the local search to converge to better solutions in less time, in line with expectations. Note that NN speeds up the algorithm by 20.4% on the 9000-node instance whereas the speed-up is only 4.4% for the 500-node instance. This difference is due to the fact that NN is performed on the CPU, which dramatically increases the share of CPU time in the total run time for smaller problems.
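The nearest neighbor construction is straightforward to sketch. Ours breaks distance ties by node index, a detail the description above leaves open:

```python
import math

def nearest_neighbor_tour(pts, start=0):
    """Build a tour greedily: from the current node, always move to the
    closest not-yet-visited node (ties broken by smaller index)."""
    unvisited = set(range(len(pts)))
    tour = [start]
    unvisited.remove(start)
    while unvisited:
        cur = tour[-1]
        nxt = min(unvisited, key=lambda v: (math.dist(pts[cur], pts[v]), v))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour
```

On the unit-square example, starting from the origin the greedy tour already traces the perimeter, so the subsequent local search has less work to do, which is exactly the effect observed in Table 8.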

Table 8. Algorithm performance with naive and nearest neighbor initial solutions

Nodes | Naive: Tour Length | Naive: Run Time (ms) | NN: Tour Length | NN: Run Time (ms)

5. Conclusion

In this study, we investigated parallelization strategies for utilizing GPU resources effectively. Considering the 2-opt local search parallelization approach presented in the literature, we performed an extensive performance analysis to configure the kernel parameters. Our experiments showed that the best strategy for peak performance is to utilize as many resident warps as the shared memory and register limits of the device allow, and to perform the remaining jobs through iteration level parallelism. We should keep the warps in the device as busy as possible, which provides thread level parallelism; the remaining jobs should then be distributed equally among the launched threads. In other words, one thread will perform more than one job, which adds iteration level parallelism. Since the occupancy decreases in larger problems, the iteration level parallelism should be kept at its maximum. Future research on this topic may focus on parallelization strategies for other local search and metaheuristic methods. These methods are widely applied to various combinatorial optimization problems and usually require long computation times for good convergence, so problem-specific GPU implementations are needed to enhance their performance, where the parts of the algorithm requiring intense computation may be handled by GPU-specific functions.

Acknowledgment

This research was partially supported by the Scientific and Technological Research Council of Turkey through Grant #113M522 to the second author.

Appendix A.
Detailed experimental results of the 500-node TSP instance

We report our experimental results for the four test groups with block dimensions of 1024, 512, 256, and 128. Table A.1 gives the performances of the different resource allocation settings on the TSP instance with 500 nodes. The rows in bold indicate the best result within each group. Block Dimension and Grid Dimension refer to the number of threads in a block and the number of blocks in a grid, respectively. Iterations is the number of possible 2-opt edge exchange calculations performed by a thread. GPU Time is the time spent calculating all 125,250 edge exchanges. 2-opt shows the total number of 2-opt edge exchanges performed. CPU+GPU Time is the total time elapsed from the start of the algorithm until the last exchange has been performed.

Table A.1. Detailed results of the 2-opt experiments on the 500-node TSP instance

Block Dimension | Grid Dimension | Iterations | GPU Time (ms) | Tour Length | 2-opt | CPU+GPU Time (ms)

References

Brodtkorb, A. R., Hagen, T. R., Schulz, C., Hasle, G., 2013. GPU computing in discrete optimization. Part I: Introduction to the GPU. EURO Journal on Transportation and Logistics 2.
Capodieci, N., Emilia, R., Burgio, P., 2015. Efficient implementation of genetic algorithms on GP-GPU with scheduled persistent CUDA threads. In: Proceedings of the 7th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP).
Coelho, I. M., Ochi, L. S., Munhoz, P. L. A., Souza, M. J. F., Farias, R., Bentes, C., 2016. An integrated CPU–GPU heuristic inspired on variable neighborhood search for the single vehicle routing problem with deliveries and selective pickups. International Journal of Production Research 54.
Dawson, L., Stewart, I. A., 2014. Optimization-based edge detection on the GPU using CUDA. In: Proceedings of the IEEE Congress on Evolutionary Computation (CEC).
Janiak, A., Janiak, W. A., Lichtenstein, M., 2008. Tabu search on GPU. Journal of Universal Computer Science 14, 2416–2427.
Kang, S., Kim, S., Won, J., Kang, Y., 2016. GPU-based parallel genetic approach to large-scale travelling salesman problem. The Journal of Supercomputing, 1–16 (available online).
O'Neil, M. A., Burtscher, M., 2015. Rethinking the parallelization of random-restart hill climbing: a case study in optimizing a 2-opt TSP solver for GPU execution. In: Proceedings of the 8th Workshop on General Purpose Processing using GPUs.
O'Neil, M. A., Tamir, D., Burtscher, M., 2011. A parallel GPU version of the traveling salesman problem. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications.
Rocki, K., Suda, R., 2012. Accelerating 2-opt and 3-opt local search using GPU in the travelling salesman problem. In: Proceedings of the International Conference on High Performance Computing and Simulation (HPCS).
Schulz, C., 2013. Efficient local search on the GPU – investigations on the vehicle routing problem. Journal of Parallel and Distributed Computing 73.
Sinha, R. S., Singh, S., Singh, S., Banga, V. K., 2016. Accelerating genetic algorithm using general purpose GPU and CUDA. International Journal of Computer Graphics 7.
Van Luong, T., Melab, N., Talbi, E. G., 2009. Parallel local search on GPU. Research Report RR-6915, INRIA.
Van Luong, T., Melab, N., Talbi, E. G., 2013. GPU computing for parallel local search metaheuristic algorithms. IEEE Transactions on Computers 62.
Wikipedia, 2016. CUDA. (last accessed on May 5, 2016).

Solving Traveling Salesman Problem on High Performance Computing using Message Passing Interface

Solving Traveling Salesman Problem on High Performance Computing using Message Passing Interface Solving Traveling Salesman Problem on High Performance Computing using Message Passing Interface IZZATDIN A. AZIZ, NAZLEENI HARON, MAZLINA MEHAT, LOW TAN JUNG, AISYAH NABILAH Computer and Information Sciences

More information

ACCELERATING THE ANT COLONY OPTIMIZATION

ACCELERATING THE ANT COLONY OPTIMIZATION ACCELERATING THE ANT COLONY OPTIMIZATION BY SMART ANTS, USING GENETIC OPERATOR Hassan Ismkhan Department of Computer Engineering, University of Bonab, Bonab, East Azerbaijan, Iran H.Ismkhan@bonabu.ac.ir

More information

A Parallel GPU Version of the Traveling Salesman Problem

A Parallel GPU Version of the Traveling Salesman Problem A Parallel GPU Version of the Traveling Salesman Problem Molly A. O Neil, Dan Tamir, and Martin Burtscher Department of Computer Science, Texas State University, San Marcos, TX Abstract - This paper describes

More information

Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms

Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms Liang Men, Miaoqing Huang, John Gauch Department of Computer Science and Computer Engineering University of Arkansas {mliang,mqhuang,jgauch}@uark.edu

More information

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini Metaheuristic Development Methodology Fall 2009 Instructor: Dr. Masoud Yaghini Phases and Steps Phases and Steps Phase 1: Understanding Problem Step 1: State the Problem Step 2: Review of Existing Solution

More information

Hybrid Differential Evolution Algorithm for Traveling Salesman Problem

Hybrid Differential Evolution Algorithm for Traveling Salesman Problem Available online at www.sciencedirect.com Procedia Engineering 15 (2011) 2716 2720 Advanced in Control Engineeringand Information Science Hybrid Differential Evolution Algorithm for Traveling Salesman

More information

PARTICLE Swarm Optimization (PSO), an algorithm by

PARTICLE Swarm Optimization (PSO), an algorithm by , March 12-14, 2014, Hong Kong Cluster-based Particle Swarm Algorithm for Solving the Mastermind Problem Dan Partynski Abstract In this paper we present a metaheuristic algorithm that is inspired by Particle

More information

Modified Order Crossover (OX) Operator

Modified Order Crossover (OX) Operator Modified Order Crossover (OX) Operator Ms. Monica Sehrawat 1 N.C. College of Engineering, Israna Panipat, Haryana, INDIA. Mr. Sukhvir Singh 2 N.C. College of Engineering, Israna Panipat, Haryana, INDIA.

More information

Multiple Depot Vehicle Routing Problems on Clustering Algorithms

Multiple Depot Vehicle Routing Problems on Clustering Algorithms Thai Journal of Mathematics : 205 216 Special Issue: Annual Meeting in Mathematics 2017 http://thaijmath.in.cmu.ac.th ISSN 1686-0209 Multiple Depot Vehicle Routing Problems on Clustering Algorithms Kanokon

More information

A heuristic approach to find the global optimum of function

A heuristic approach to find the global optimum of function Journal of Computational and Applied Mathematics 209 (2007) 160 166 www.elsevier.com/locate/cam A heuristic approach to find the global optimum of function M. Duran Toksarı Engineering Faculty, Industrial

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte

More information

Center for Computational Science

Center for Computational Science Center for Computational Science Toward GPU-accelerated meshfree fluids simulation using the fast multipole method Lorena A Barba Boston University Department of Mechanical Engineering with: Felipe Cruz,

More information

Modeling the Component Pickup and Placement Sequencing Problem with Nozzle Assignment in a Chip Mounting Machine

Modeling the Component Pickup and Placement Sequencing Problem with Nozzle Assignment in a Chip Mounting Machine Modeling the Component Pickup and Placement Sequencing Problem with Nozzle Assignment in a Chip Mounting Machine Hiroaki Konishi, Hidenori Ohta and Mario Nakamori Department of Information and Computer

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Rajesh Bordawekar IBM T. J. Watson Research Center bordaw@us.ibm.com Pidad D Souza IBM Systems pidsouza@in.ibm.com 1 Outline

More information

Using Genetic Algorithm with Triple Crossover to Solve Travelling Salesman Problem

Using Genetic Algorithm with Triple Crossover to Solve Travelling Salesman Problem Proc. 1 st International Conference on Machine Learning and Data Engineering (icmlde2017) 20-22 Nov 2017, Sydney, Australia ISBN: 978-0-6480147-3-7 Using Genetic Algorithm with Triple Crossover to Solve

More information

Massively Parallel Approximation Algorithms for the Traveling Salesman Problem

Massively Parallel Approximation Algorithms for the Traveling Salesman Problem Massively Parallel Approximation Algorithms for the Traveling Salesman Problem Vaibhav Gandhi May 14, 2015 Abstract This paper introduces the reader to massively parallel approximation algorithms which

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis Abstract: Lower upper (LU) factorization for sparse matrices is the most important computing step for circuit simulation

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

ANANT COLONY SYSTEMFOR ROUTING IN PCB HOLES DRILLING PROCESS

ANANT COLONY SYSTEMFOR ROUTING IN PCB HOLES DRILLING PROCESS International Journal of Innovative Management, Information & Production ISME International c2013 ISSN 2185-5439 Volume 4, 1, June 2013 PP. 50-56 ANANT COLONY SYSTEMFOR ROUTING IN PCB HOLES DRILLING PROCESS

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Machine Learning for Software Engineering

Machine Learning for Software Engineering Machine Learning for Software Engineering Introduction and Motivation Prof. Dr.-Ing. Norbert Siegmund Intelligent Software Systems 1 2 Organizational Stuff Lectures: Tuesday 11:00 12:30 in room SR015 Cover

More information

Solving the Traveling Salesman Problem using Reinforced Ant Colony Optimization techniques

Solving the Traveling Salesman Problem using Reinforced Ant Colony Optimization techniques Solving the Traveling Salesman Problem using Reinforced Ant Colony Optimization techniques N.N.Poddar 1, D. Kaur 2 1 Electrical Engineering and Computer Science, University of Toledo, Toledo, OH, USA 2

More information

ACO and other (meta)heuristics for CO

ACO and other (meta)heuristics for CO ACO and other (meta)heuristics for CO 32 33 Outline Notes on combinatorial optimization and algorithmic complexity Construction and modification metaheuristics: two complementary ways of searching a solution

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Hybrid ant colony optimization algorithm for two echelon vehicle routing problem

Hybrid ant colony optimization algorithm for two echelon vehicle routing problem Available online at www.sciencedirect.com Procedia Engineering 15 (2011) 3361 3365 Advanced in Control Engineering and Information Science Hybrid ant colony optimization algorithm for two echelon vehicle

More information

Fuzzy Inspired Hybrid Genetic Approach to Optimize Travelling Salesman Problem

Fuzzy Inspired Hybrid Genetic Approach to Optimize Travelling Salesman Problem Fuzzy Inspired Hybrid Genetic Approach to Optimize Travelling Salesman Problem Bindu Student, JMIT Radaur binduaahuja@gmail.com Mrs. Pinki Tanwar Asstt. Prof, CSE, JMIT Radaur pinki.tanwar@gmail.com Abstract

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

GPU 101. Mike Bailey. Oregon State University. Oregon State University. Computer Graphics gpu101.pptx. mjb April 23, 2017

GPU 101. Mike Bailey. Oregon State University. Oregon State University. Computer Graphics gpu101.pptx. mjb April 23, 2017 1 GPU 101 Mike Bailey mjb@cs.oregonstate.edu gpu101.pptx Why do we care about GPU Programming? A History of GPU Performance vs. CPU Performance 2 Source: NVIDIA How Can You Gain Access to GPU Power? 3

More information

GPU 101. Mike Bailey. Oregon State University

GPU 101. Mike Bailey. Oregon State University 1 GPU 101 Mike Bailey mjb@cs.oregonstate.edu gpu101.pptx Why do we care about GPU Programming? A History of GPU Performance vs. CPU Performance 2 Source: NVIDIA 1 How Can You Gain Access to GPU Power?

More information

The study of comparisons of three crossover operators in genetic algorithm for solving single machine scheduling problem. Quan OuYang, Hongyun XU a*

The study of comparisons of three crossover operators in genetic algorithm for solving single machine scheduling problem. Quan OuYang, Hongyun XU a* International Conference on Manufacturing Science and Engineering (ICMSE 2015) The study of comparisons of three crossover operators in genetic algorithm for solving single machine scheduling problem Quan

More information

O(1) Delta Component Computation Technique for the Quadratic Assignment Problem

O(1) Delta Component Computation Technique for the Quadratic Assignment Problem O(1) Delta Component Computation Technique for the Quadratic Assignment Problem Sergey Podolsky, Yuri Zorin National Technical University of Ukraine Kyiv Polytechnic Institute Faculty of Applied Mathematics

More information

A Modified Inertial Method for Loop-free Decomposition of Acyclic Directed Graphs

A Modified Inertial Method for Loop-free Decomposition of Acyclic Directed Graphs MACRo 2015-5 th International Conference on Recent Achievements in Mechatronics, Automation, Computer Science and Robotics A Modified Inertial Method for Loop-free Decomposition of Acyclic Directed Graphs

More information

LECTURE 20: SWARM INTELLIGENCE 6 / ANT COLONY OPTIMIZATION 2

LECTURE 20: SWARM INTELLIGENCE 6 / ANT COLONY OPTIMIZATION 2 15-382 COLLECTIVE INTELLIGENCE - S18 LECTURE 20: SWARM INTELLIGENCE 6 / ANT COLONY OPTIMIZATION 2 INSTRUCTOR: GIANNI A. DI CARO ANT-ROUTING TABLE: COMBINING PHEROMONE AND HEURISTIC 2 STATE-TRANSITION:

More information

ARTIFICIAL INTELLIGENCE (CSCU9YE ) LECTURE 5: EVOLUTIONARY ALGORITHMS

ARTIFICIAL INTELLIGENCE (CSCU9YE ) LECTURE 5: EVOLUTIONARY ALGORITHMS ARTIFICIAL INTELLIGENCE (CSCU9YE ) LECTURE 5: EVOLUTIONARY ALGORITHMS Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Optimisation problems Optimisation & search Two Examples The knapsack problem

More information

Mathematical Methods in Fluid Dynamics and Simulation of Giant Oil and Gas Reservoirs. 3-5 September 2012 Swissotel The Bosphorus, Istanbul, Turkey

Mathematical Methods in Fluid Dynamics and Simulation of Giant Oil and Gas Reservoirs. 3-5 September 2012 Swissotel The Bosphorus, Istanbul, Turkey Mathematical Methods in Fluid Dynamics and Simulation of Giant Oil and Gas Reservoirs 3-5 September 2012 Swissotel The Bosphorus, Istanbul, Turkey Fast and robust solvers for pressure systems on the GPU

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Obstacle-Aware Longest-Path Routing with Parallel MILP Solvers

Obstacle-Aware Longest-Path Routing with Parallel MILP Solvers , October 20-22, 2010, San Francisco, USA Obstacle-Aware Longest-Path Routing with Parallel MILP Solvers I-Lun Tseng, Member, IAENG, Huan-Wen Chen, and Che-I Lee Abstract Longest-path routing problems,

More information

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics

More information

Navigation of Multiple Mobile Robots Using Swarm Intelligence

Navigation of Multiple Mobile Robots Using Swarm Intelligence Navigation of Multiple Mobile Robots Using Swarm Intelligence Dayal R. Parhi National Institute of Technology, Rourkela, India E-mail: dayalparhi@yahoo.com Jayanta Kumar Pothal National Institute of Technology,

More information

A NEW HEURISTIC ALGORITHM FOR MULTIPLE TRAVELING SALESMAN PROBLEM

A NEW HEURISTIC ALGORITHM FOR MULTIPLE TRAVELING SALESMAN PROBLEM TWMS J. App. Eng. Math. V.7, N.1, 2017, pp. 101-109 A NEW HEURISTIC ALGORITHM FOR MULTIPLE TRAVELING SALESMAN PROBLEM F. NURIYEVA 1, G. KIZILATES 2, Abstract. The Multiple Traveling Salesman Problem (mtsp)

More information

Exploring Lin Kernighan neighborhoods for the indexing problem

Exploring Lin Kernighan neighborhoods for the indexing problem INDIAN INSTITUTE OF MANAGEMENT AHMEDABAD INDIA Exploring Lin Kernighan neighborhoods for the indexing problem Diptesh Ghosh W.P. No. 2016-02-13 February 2016 The main objective of the Working Paper series

More information

GPU-based Multi-start Local Search Algorithms

GPU-based Multi-start Local Search Algorithms GPU-based Multi-start Local Search Algorithms Thé Van Luong, Nouredine Melab, El-Ghazali Talbi To cite this version: Thé Van Luong, Nouredine Melab, El-Ghazali Talbi. GPU-based Multi-start Local Search

More information

Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011

Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for

More information

Innovative Systems Design and Engineering ISSN (Paper) ISSN (Online) Vol.5, No.1, 2014

Innovative Systems Design and Engineering ISSN (Paper) ISSN (Online) Vol.5, No.1, 2014 Abstract Tool Path Optimization of Drilling Sequence in CNC Machine Using Genetic Algorithm Prof. Dr. Nabeel Kadim Abid Al-Sahib 1, Hasan Fahad Abdulrazzaq 2* 1. Thi-Qar University, Al-Jadriya, Baghdad,

More information

Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm. Santos and Mateus (2007)

Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm. Santos and Mateus (2007) In the name of God Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm Spring 2009 Instructor: Dr. Masoud Yaghini Outlines Problem Definition Modeling As A Set Partitioning

More information

Stefan WAGNER *, Michael AFFENZELLER * HEURISTICLAB GRID A FLEXIBLE AND EXTENSIBLE ENVIRONMENT 1. INTRODUCTION

Stefan WAGNER *, Michael AFFENZELLER * HEURISTICLAB GRID A FLEXIBLE AND EXTENSIBLE ENVIRONMENT 1. INTRODUCTION heuristic optimization, distributed computation, optimization frameworks Stefan WAGNER *, Michael AFFENZELLER * HEURISTICLAB GRID A FLEXIBLE AND EXTENSIBLE ENVIRONMENT FOR PARALLEL HEURISTIC OPTIMIZATION

More information

I. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS

I. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS Performance Analysis of Java NativeThread and NativePthread on Win32 Platform Bala Dhandayuthapani Veerasamy Research Scholar Manonmaniam Sundaranar University Tirunelveli, Tamilnadu, India dhanssoft@gmail.com

More information

XLVI Pesquisa Operacional na Gestão da Segurança Pública

XLVI Pesquisa Operacional na Gestão da Segurança Pública PARALLEL CONSTRUCTION FOR CONTINUOUS GRASP OPTIMIZATION ON GPUs Lisieux Marie Marinho dos Santos Andrade Centro de Informática Universidade Federal da Paraíba Campus I, Cidade Universitária 58059-900,

More information

SCALING UP VS. SCALING OUT IN A QLIKVIEW ENVIRONMENT

SCALING UP VS. SCALING OUT IN A QLIKVIEW ENVIRONMENT SCALING UP VS. SCALING OUT IN A QLIKVIEW ENVIRONMENT QlikView Technical Brief February 2012 qlikview.com Introduction When it comes to the enterprise Business Discovery environments, the ability of the

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation

More information

Enhanced ABC Algorithm for Optimization of Multiple Traveling Salesman Problem

Enhanced ABC Algorithm for Optimization of Multiple Traveling Salesman Problem I J C T A, 9(3), 2016, pp. 1647-1656 International Science Press Enhanced ABC Algorithm for Optimization of Multiple Traveling Salesman Problem P. Shunmugapriya 1, S. Kanmani 2, R. Hemalatha 3, D. Lahari

More information

An Ant Approach to the Flow Shop Problem

An Ant Approach to the Flow Shop Problem An Ant Approach to the Flow Shop Problem Thomas Stützle TU Darmstadt, Computer Science Department Alexanderstr. 10, 64283 Darmstadt Phone: +49-6151-166651, Fax +49-6151-165326 email: stuetzle@informatik.tu-darmstadt.de

More information

Simultaneous Solving of Linear Programming Problems in GPU

Simultaneous Solving of Linear Programming Problems in GPU Simultaneous Solving of Linear Programming Problems in GPU Amit Gurung* amitgurung@nitm.ac.in Binayak Das* binayak89cse@gmail.com Rajarshi Ray* raj.ray84@gmail.com * National Institute of Technology Meghalaya

More information

Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU

Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU Ryouhei Murooka, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University Kagamiyama 1-4-1,

More information

Dense matching GPU implementation

Dense matching GPU implementation Dense matching GPU implementation Author: Hailong Fu. Supervisor: Prof. Dr.-Ing. Norbert Haala, Dipl. -Ing. Mathias Rothermel. Universität Stuttgart 1. Introduction Correspondence problem is an important

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Rethinking the Parallelization of Random-Restart Hill Climbing

Rethinking the Parallelization of Random-Restart Hill Climbing Rethinking the Parallelization of Random-Restart Hill Climbing A Case Study in Optimizing a 2-Opt TSP Solver for GPU Execution Molly A. O Neil Department of Computer Science Texas State University San

More information

Hybrid Constraint Programming and Metaheuristic methods for Large Scale Optimization Problems

Hybrid Constraint Programming and Metaheuristic methods for Large Scale Optimization Problems Hybrid Constraint Programming and Metaheuristic methods for Large Scale Optimization Problems Fabio Parisini Tutor: Paola Mello Co-tutor: Michela Milano Final seminars of the XXIII cycle of the doctorate

More information

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational

More information

LOW AND HIGH LEVEL HYBRIDIZATION OF ANT COLONY SYSTEM AND GENETIC ALGORITHM FOR JOB SCHEDULING IN GRID COMPUTING

LOW AND HIGH LEVEL HYBRIDIZATION OF ANT COLONY SYSTEM AND GENETIC ALGORITHM FOR JOB SCHEDULING IN GRID COMPUTING LOW AND HIGH LEVEL HYBRIDIZATION OF ANT COLONY SYSTEM AND GENETIC ALGORITHM FOR JOB SCHEDULING IN GRID COMPUTING Mustafa Muwafak Alobaedy 1, and Ku Ruhana Ku-Mahamud 2 2 Universiti Utara Malaysia), Malaysia,

More information

Evaluation Of The Performance Of GPU Global Memory Coalescing

Evaluation Of The Performance Of GPU Global Memory Coalescing Evaluation Of The Performance Of GPU Global Memory Coalescing Dae-Hwan Kim Department of Computer and Information, Suwon Science College, 288 Seja-ro, Jeongnam-myun, Hwaseong-si, Gyeonggi-do, Rep. of Korea

More information

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs)

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs) Hasindu Gamaarachchi, Roshan Ragel Department of Computer Engineering University of Peradeniya Peradeniya, Sri Lanka hasindu8@gmailcom,

More information

Solving a combinatorial problem using a local optimization in ant based system

Solving a combinatorial problem using a local optimization in ant based system Solving a combinatorial problem using a local optimization in ant based system C-M.Pintea and D.Dumitrescu Babeş-Bolyai University of Cluj-Napoca, Department of Computer-Science Kogalniceanu 1, 400084

More information

Handling Multi Objectives of with Multi Objective Dynamic Particle Swarm Optimization

Handling Multi Objectives of with Multi Objective Dynamic Particle Swarm Optimization Handling Multi Objectives of with Multi Objective Dynamic Particle Swarm Optimization Richa Agnihotri #1, Dr. Shikha Agrawal #1, Dr. Rajeev Pandey #1 # Department of Computer Science Engineering, UIT,

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

A robust enhancement to the Clarke-Wright savings algorithm

A robust enhancement to the Clarke-Wright savings algorithm A robust enhancement to the Clarke-Wright savings algorithm Tamer Doyuran * and Bülent Çatay Sabanci University, Faculty of Engineering and Natural Sciences Tuzla, Istanbul, 34956, Turkey Abstract: We

More information

Parallel Evaluation of Hopfield Neural Networks

Parallel Evaluation of Hopfield Neural Networks Parallel Evaluation of Hopfield Neural Networks Antoine Eiche, Daniel Chillet, Sebastien Pillement and Olivier Sentieys University of Rennes I / IRISA / INRIA 6 rue de Kerampont, BP 818 2232 LANNION,FRANCE

More information

Metaheuristic Optimization with Evolver, Genocop and OptQuest

Metaheuristic Optimization with Evolver, Genocop and OptQuest Metaheuristic Optimization with Evolver, Genocop and OptQuest MANUEL LAGUNA Graduate School of Business Administration University of Colorado, Boulder, CO 80309-0419 Manuel.Laguna@Colorado.EDU Last revision:

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Ant Colony Optimization: The Traveling Salesman Problem

Ant Colony Optimization: The Traveling Salesman Problem Ant Colony Optimization: The Traveling Salesman Problem Section 2.3 from Swarm Intelligence: From Natural to Artificial Systems by Bonabeau, Dorigo, and Theraulaz Andrew Compton Ian Rogers 12/4/2006 Traveling

More information

Genetic Algorithms with Oracle for the Traveling Salesman Problem

Genetic Algorithms with Oracle for the Traveling Salesman Problem PROCEEDINGS OF WORLD ACADEMY OF SCIENCE, ENGINEERING AND TECHNOLOGY VOLUME 7 AUGUST 25 ISSN 17-884 Genetic Algorithms with Oracle for the Traveling Salesman Problem Robin Gremlich, Andreas Hamfelt, Héctor

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Algorithm Design (4) Metaheuristics

Algorithm Design (4) Metaheuristics Algorithm Design (4) Metaheuristics Takashi Chikayama School of Engineering The University of Tokyo Formalization of Constraint Optimization Minimize (or maximize) the objective function f(x 0,, x n )

More information

Accelerating Dynamic Binary Translation with GPUs

Accelerating Dynamic Binary Translation with GPUs Accelerating Dynamic Binary Translation with GPUs Chung Hwan Kim, Srikanth Manikarnike, Vaibhav Sharma, Eric Eide, Robert Ricci School of Computing, University of Utah {chunghwn,smanikar,vaibhavs,eeide,ricci}@utah.edu

More information