Parallel local search on GPU and CPU with OpenCL Language


Omar ABDELKAFI, Khalil CHEBIL, Mahdi KHEMAKHEM
LOGIQ, University of Sfax, Sfax, Tunisia

Abstract. Real-world optimization problems are complex and NP-hard. The modeling of such problems is in constant evolution in terms of constraints and objectives, and their resolution is expensive in computation time. With all these changes, even metaheuristics, well known for their efficiency, begin to be overtaken by the explosion of data. Recently, thanks to the publication of languages such as OpenCL and CUDA, the development of parallel metaheuristics on the GPU platform has attracted growing interest. In this paper, we propose a parallelization of a local search at the iteration level. The contribution of this work is a robust local search built on two popular neighborhood structures, applied to combinatorial problems and adapted to the GPU platform. Several techniques are proposed to accelerate memory access, control divergence and maximize parallelization. Multiple versions have been implemented with the OpenCL language to test the parallelization on both GPU and CPU. The computational performance of this parallel local search is reported and compared to the sequential version.

Keywords: GPU, OpenCL, Optimization, Parallel local search, Knapsack problem, Traveling salesman problem

1. INTRODUCTION

Many optimization problems in different areas, including data mining, communication, logistics and transport, are NP-hard. These problems have been solved successfully with optimization approaches such as metaheuristics (generic heuristics). Local search (LS) is one of the basic metaheuristics, well known for its efficiency. LS is based on a single-solution process: starting from a single solution, at each iteration the heuristic replaces the current solution by a neighbor that improves the objective function. The search stops when a local optimum is reached.
In the literature, we can find various LS-based heuristics such as hill climbing, simulated annealing, tabu search, iterated local search and variable neighborhood search. A state of the art of LS algorithms can be found in [1]. Many methods have been proposed to parallelize LS. Ref. [1] proposed three major parallel models to design parallel metaheuristics: the algorithmic level, the iteration level and the solution level. Our work focuses on the iteration level, whose main goal is to parallelize the neighborhood generation. During the last decade, the Graphics Processing Unit (GPU) has grown faster than the Central Processing Unit (CPU). This is due to the explosion of the video game industry and its greedy demand for graphic power. GPUs provide great computing power at a low cost, which is one of the main reasons for the success of this architecture. Since 2003, the semiconductor industry has settled on two main trajectories for designing microprocessors: the multi-core trajectory, like current CPUs, and the many-core trajectory, like current GPUs [7]. Nowadays, a many-core GPU such as the NVIDIA Tesla K20X has 2688 cores and can provide a theoretical performance of 3.95 Tflops. On the other hand, a multi-core CPU such as the AMD FX 8350 has 8 cores and a performance of up to 256 Gflops. The use of the GPU in non-graphic applications has caused a rapid increase of interest in the scientific community. Indeed, the GPU has become one of the most interesting platforms to implement parallel metaheuristics. As academic and industrial combinatorial optimization problems keep increasing in size and complexity, the field of parallel metaheuristics has to follow this evolution of high-performance computing. Currently, the three biggest vendors of GPUs are Intel, Nvidia and AMD [2]. This market sharing makes the Open Computing Language (OpenCL) a very interesting choice for GPU implementations and for dealing with the financial challenge.
Indeed, thanks to the Khronos Group, OpenCL has become an industrial standard and supports many GPU vendors. In addition, OpenCL allows us to run the same parallel code on both GPU and CPU to compare the efficiency of the two platforms. In this paper, we present a design of robust local search implementations on GPU using the OpenCL language, applied to the Knapsack Problem (KP) [10] and the Traveling Salesman Problem (TSP) [13]. Three contributions are proposed: the first one is the acceleration of memory access on the GPU, the second is the maximization of the data parallelized and the last one is the control of divergence. The rest of the paper is organized as follows. The second section introduces the OpenCL programming model on GPU and the different strategies to implement efficient metaheuristics. In section three, we present several techniques to design a parallel local search for the KP and the TSP. In section four, we present the results of the computational experiments. The last section concludes the manuscript.

2. OPEN COMPUTING LANGUAGE ON GPU

Programmable GPU hardware started with programmable units (pixel and vertex shaders). The Open Computing Language was created to help programmers use the GPU platform. The main advantage of this language is its portability and its capacity to run on the most popular GPU vendors such as AMD, NVIDIA and Intel [2]. To get good results on GPU, we need different strategies to design efficient metaheuristics with OpenCL. For this reason, we need to understand how the GPU executes a parallel program. Like the other GPU languages, OpenCL is based on cooperation between the host, which represents the CPU, and one or many devices, such as the GPU in our case. The parallel code is implemented in a kernel that is duplicated when the program is executed. This kernel is executed by several Work-Items (WI), and these WI are grouped into a set of Work-Groups (WG). WI in the same WG can be synchronized with a synchronization barrier. The most important concept to deal with on the GPU is the management of memories. With OpenCL we have access to four types of memory. The first one is the global memory, which is very slow but very large; we should use it sparingly because multiple accesses to this memory affect performance. The second is the constant memory, slow and large, used only when all the WI have to access the same memory address. The third is the local memory, fast and small; it is shared between the WI of the same WG. If it is well used, it can be one of the fastest memories on the GPU; however, overusing it can also hurt performance. The final type is the private memory (also called registers), very fast but very small; it must be used carefully because its abuse causes the failure of the execution. The GPU architecture is very well designed for the parallel model, and the CPU architecture is very powerful in the sequential model.
Indeed, the CPU has few arithmetic and logic units (ALU), but they are powerful at performing sequential work, while the GPU has many less powerful ALU to increase throughput and perform the parallel work. The aim is to use the CPU to execute each iteration sequentially and to use the GPU for the generation of the neighborhood in a parallel way. The GPU is composed of a set of compute units (CU); each CU is composed of a set of processing elements (PE), considered as the cores of the GPU architecture. Every PE has its own private memory, and every CU shares a local memory between its PE. All transfers between the host and the device are performed through the global and the constant memory (see Fig. 1).

Fig. 1. Architecture of the GPU in OpenCL
Fig. 2. Divergence of instructions with a SIMD system

To optimize execution and have an efficient design on GPU, several strategies need to be considered. We present four strategies that guide our work. The first strategy is to reduce transfers between the host and the device: access to the global memory is faster than a CPU-GPU transfer, so we need to learn how to reuse data and reduce transfers [3]. The second strategy is to use the local memory instead of the global memory whenever possible, to accelerate access to GPU memory. The third strategy is to maximize the number of WI, to take advantage of the parallel architecture of the GPU. The last strategy is the control of divergence: since the GPU uses a Single Instruction Multiple Data (SIMD) system, all data execute a single instruction at the same time, so we need to use as few divergent code fragments (If-Else or loops) as possible in the kernel (see Fig. 2) [3]. An investigation was conducted in [4]: for the decade between 2002 and 2012, they found only 8 works on LS on GPU, and none of them was implemented with OpenCL.
In 2011, [5] was the first group to propose results on a LS on GPU; they concentrated on binary problems and mapped one neighbor to one single thread. The thread in the CUDA language is the equivalent of the WI in the OpenCL language. They proposed a technique to transform two or three indices into one, and one index into two or three; the objective of such a technique is to reduce the transfers. In 2012, [8] used LS to accelerate the TSP with the 2-opt and 3-opt structures [12]. More recently, [6] proposed several techniques and tools to optimize the transfers between the host and the device in LS. To the best of our knowledge, this is the first work that implements LS using the OpenCL language and compares the results of the parallelization on CPU and GPU with the same code.

3. PARALLEL LOCAL SEARCH

As already said, the focus of this work is the iteration level [1]. One way to parallelize the neighborhood generation is to assign one neighbor to one WI [5]. This is the most efficient approach, but it still has a memory problem. Indeed, in the case of a large neighborhood, only recent GPUs have enough memory to support a large number of WI, because of the reduced number of registers. Since our focus is on portability and robustness, we choose another approach consisting in representing every WI as an item of the solution, which leads to a parallel execution of the neighbor generation. With this approach we can solve very large instances and produce a robust LS, but the evaluation of the neighbors is still performed sequentially in each WI. The sequential and parallel pseudo-code versions of the local search with the best-improving strategy are shown in Listings 1 and 2, respectively.

Procedure Sequential local search
Let T: current iteration, S(T): the current solution,
    N(T): the number of neighbors of S(T),
    V(i): the i-th neighbor of S(T),
    V*(T): the best neighbor of S(T);
Begin
  Build an initial solution S(0);
  Specific local search pre-treatment;
  T := 0;
  Repeat
    V*(T) := V(1);
    For i = 2 to N(T) Do
      Generate V(i);
      If V(i) is better than V*(T)
        V*(T) := V(i);
      Endif
    Endfor
    S(T+1) := V*(T);
    T := T + 1;
  Until S(T-1) better than S(T)
  Return S(T);
End
Listing 1: Sequential local search

Procedure Parallel local search
Let T: current iteration, S(T): the current solution,
    N(T): the number of neighbors of S(T), equal to the number of WI,
    V(i): the i-th neighbor of S(T), produced by WI(i),
    V*(T): the best neighbor of S(T);
Begin
  Build an initial solution S(0);
  Specific local search pre-treatment;
  T := 0;
  Allocate and copy problem data inputs on device memory;
  Repeat
    Allocate and copy the current solution on device memory;
    Allocate and copy additional information on device memory;
    Parallel execution of all WI(i) (i := 1 to N(T)) to generate V(i);
    V*(T) := V(1);
    For i = 2 to N(T) Do
      If V(i) is better than V*(T)
        V*(T) := V(i);
      Endif
    EndFor
    S(T+1) := V*(T);
    T := T + 1;
  Until S(T-1) better than S(T)
  Return S(T);
End
Listing 2: Parallel local search with OpenCL

Fig. 3. Traffic reduction of memory accesses

Our first contribution consists in loading some information from the global memory into the local memory to accelerate access time. Every WG has a local memory shared between its WI. Information used by all the WI at the same time is loaded into the local memory instead of the global memory. This technique reduces the access traffic, and memory access becomes faster (see Fig. 3). Listing 3 shows the loading of the information from the global to the local memory.

Procedure Reduce memory traffic
Let N: the size of the instance,
    GM[N]: information from the global memory,
    LM: a local memory variable
Begin
  For i := 0 to N Do
    LM := GM[i];
    Synchronize WI;
    Perform compute operations with LM instead of GM[i];
    Synchronize WI;
  Endfor
End
Listing 3: Traffic reduction of memory accesses

A. Local search for KP

The 0-1 knapsack problem is defined as follows: given a knapsack with capacity c and a set of n items, where each item j has a profit p_j and a weight w_j, the objective is to select a subset of the items so as to

maximize   sum_{j=1..n} p_j x_j
subject to sum_{j=1..n} w_j x_j <= c,
           x_j in {0, 1}, j = 1, ..., n,

where x_j = 1 if item j is selected and 0 otherwise.

KP is the most important knapsack problem and one of the most intensively studied. The literature on KP is vast, but it is covered very well in [9][10]. For this problem we use Drop and Add moves to generate the neighborhood: we drop each item that is selected in the knapsack and add another item, not selected yet, hoping to find a better combination (see Fig. 5). Both the weight and the profit of every item are loaded into the local memory as shown in Fig. 3 and Listing 3. Then, to maximize the number of WI launched, we create two kernels and count the number of potential Drop and Add moves. If the number of Drop moves is bigger than the number of Add moves, we launch every Drop move on a WI and evaluate the potential item to add (Kernel 1). Otherwise, we launch every Add move on a WI and evaluate the potential item to drop (Kernel 2). At the end of the parallel execution, we choose the best combination of Drop and Add moves. This contribution gives us two advantages: the first one is to maximize the number of WI launched, to exploit the parallel architecture of the GPU, and the second one is to reduce the sequential work in each WI.

B. Local search for TSP

The TSP can be defined on a complete undirected graph G = (V, E). The set V = {1, ..., n} is the vertex (city) set and E = {(i, j) : i, j in V} is the edge set. A cost c_ij is defined on E as the Euclidean distance between two vertices i and j. The TSP consists in finding a least-cost sequence in which to visit a set of cities, starting and ending at the same city, in such a way that each city is visited exactly once (the lowest-cost Hamiltonian circuit). Ref. [11] formulated this problem with the Miller-Tucker-Zemlin subtour elimination constraints:

minimize   sum_{i,j} c_ij x_ij
subject to sum_{j} x_ij = 1 for all i,
           sum_{i} x_ij = 1 for all j,
           u_i - u_j + n x_ij <= n - 1 for 2 <= i != j <= n,
           x_ij in {0, 1},

where x_ij = 1 if the tour goes directly from city i to city j, and the u_i are auxiliary ordering variables.

For this problem we use the Switch neighborhood structure: every city is switched with all the others, hoping to find a shorter circuit (see Fig. 4). To accelerate the access time using the local memory, the distances between each city to switch and its neighbors are loaded into the local memory as shown in Fig. 3 and Listing 3.
Fig. 5. Neighborhood for KP with the Drop-Add structure

To control the divergence, we create two kernels: Kernel 1 performs the adjacent permutations and Kernel 2 the rest of the permutations. The two kernels are executed in parallel. At the end of the parallel execution, we choose the best Switch move.

4. EXPERIMENTAL RESULTS

For the KP, we generate 50 strongly correlated, randomly generated instances from 10^3 to 10^4 items, and we make 5 tests for every instance. For the TSP, we use 10 well-known instances from the TSPLIB [15] to validate our work. To perform the experimentation, we use a CPU and a GPU configuration. Our objective is to see what we can gain from parallelization using the same machine, without spending money on other configurations. Table 1 shows the specifications of each configuration. Unlike the GPU, the number of CU is equal to the number of PE for the CPU. The frequency is the number of operations performed by the processor in one second; the turbo frequency is the frequency of the processor when only one core is working (sequential model). Finally, the theoretical performance is calculated in single precision (SP) for the two configurations to compare them. The execution depends on this performance and on the speed of memory access. For a fair comparison, in both KP and TSP the execution on the parallel CPU does not use the local memory (Listing 3), because it disadvantages the CPU execution.

A. Knapsack Problem

Table 2 summarizes the acceleration of the parallel CPU and GPU for solving the KP; every instance in Table 2 is the average of 5 tests. We can see that the acceleration of the GPU is better than that of the CPU. The KP is a problem that uses only two memory accesses for an Add move and two other accesses for a Drop move, and it needs computing power to calculate the new weight and profit. In this specific problem, we can see that the theoretical performance in single precision is very significant.
We can also see that the acceleration on the GPU appears, for the instance with items, from the 47th iteration. This is thanks to the capability of the GPU to adapt quickly to parallelization. For the CPU, the acceleration begins only on the instance with items, at the 77th iteration. The best result recorded is an acceleration of 6.70 times for the GPU against 3.69 times for the CPU, when the LS performs 152 iterations.

Fig. 4. Neighborhood for TSP with the Switch structure

Table 1. Specifications of the configurations

Specifications         | CU/PE | Frequency (Turbo) | Performance (SP)
Intel Core i7-2630QM   | 8/8   | 2 (2.9) GHz       | 128 GFLOPS
Nvidia GeForce GT 525M | 2/    | GHz               | GFLOPS

Table 2. Acceleration for the knapsack problem

Instances (number of items) | Avg. iterations | AccCPU | AccGPU

Fig. 6. Curve of acceleration for KP

Fig. 6 shows the evolution of the three models. We can see the clear advantage of the two parallel models over the sequential model. Indeed, the curve of the sequential model is exponential, while the two parallel models are more stable, with a constant advantage for the GPU. For the KP, we can conclude that the recorded performances are very interesting. Indeed, the proposition is robust because it depends on the instance and not on the size of the neighborhood (Listing 2). The acceleration grows when the GPU platform is more powerful and the LS performs more iterations. Few works have considered the KP on GPU; indeed, to the best of our knowledge, we found only two works on exact methods solving the KP with GPU [16][17].

B. Traveling Salesman Problem

The second experiment is on the TSP. Table 3 shows the acceleration of the parallel CPU and GPU. In this specific problem, the memory bound is very important. For this reason the CPU is more efficient, but with the GPU we record good accelerations considering the theoretical performance of the GPU used. We can see that we get acceleration with the GPU before the parallel CPU. The GPU stays more efficient up to 3496 cities; this efficiency is the consequence of the fast adaptation of the GPU to the parallel model. The CPU takes the lead on the rest of the instances thanks to its fast memory access and its efficiency on sequential work. The GPU continues to grow slowly, until reaching an acceleration of 3.04 times against 5.91 times for the parallel CPU when the LS performs 711 iterations on the last instance. Fig. 7 shows the curves of the three models of execution on the TSP.
We can see the exponential evolution of the sequential model.

Fig. 7. Curve of acceleration for TSP

Table 3. Acceleration for the traveling salesman problem

Instances (number of cities) | Iterations | AccCPU | AccGPU
dsj
fra
mu
nu
dlb
Xua
fnl
ca
eg
ym

The two parallel models for the TSP do not have the same stability as for the KP. The challenge with the TSP is the memory access; as we can see, the theoretical performances of the configurations are not very significant here. For this specific problem, we try to compensate for the memory accesses with the control of divergence. This works better for the parallel execution on CPU, but with this platform we still do not obtain performance on small instances with few iterations; the GPU platform is better in this case. The last experiment uses both the parallel CPU and the GPU: the idea is to run the first kernel, which handles the adjacent-city permutations, on the GPU, and the second kernel, which handles the other permutations, on the CPU. The objective is to exploit the best of the two platforms. The idea came from the work of AMD and Intel on a new fusion of CPU and GPU in the same chip. To perform this proposition, we create two separate devices. In our case, we need to do the transfers twice and create two separate OpenCL programs to execute the two devices. All this additional work could be avoided if the GPU and the CPU were in the same chip. Despite this, we obtain very interesting results with this approach. Table 4 summarizes the acceleration of this proposition. We can see that the acceleration begins earlier than with the parallel CPU model alone. Indeed, from 3496 cities we have acceleration, and, most interestingly, we have a very significant acceleration for 3694 cities, with an acceleration of 4.00 times when the LS performs 244 iterations. We did not obtain a better acceleration than the parallel CPU, but the results are very close and more stable. With this solution we can execute very large instances, and the LS becomes efficient more quickly than with the parallelization on CPU alone. The TSP is a very challenging problem: for the decade between 2002 and 2012, 16 publications worked on the parallelization of the TSP on GPU, using different population-based and single-solution-based metaheuristics [4]. In 2011, [14] proposed a design of a LS representing one neighbor on one single thread, as in their earlier work [5], but this time they concentrated on the management of memories and minimized the transfers between the CPU and the GPU; they studied four optimization problems, including the TSP with sizes between 101 and 5915 cities.
In 2012, [8] used LS to accelerate the TSP with the 2-opt and 3-opt structures. They used 13 TSPLIB instances [15] with sizes between 100 and 4461 cities, and chose to dedicate one thread to one 2-opt (or 3-opt) swap calculation. The most important contribution of that work is to recompute data from the coordinates of the points instead of accessing a pre-calculated distance matrix: they recompute the data using the high peak computational power of the GPU, and the coordinates of the points are stored in the local memory. The problem is that this approach is limited to approximately 4800 cities because of the small size of the local memory. On our side, we propose this approach of fusion between platforms, and we believe that it can be very efficient, especially for large-neighborhood instances when the GPU is not very powerful. It works around the problem of the small number of registers and gives good accelerations. More interestingly, it could be a very efficient solution if, in the future, we have a chip with both GPU and CPU.

Table 4. Acceleration for TSP with CPU-GPU

Instances (number of cities) | Iterations | AccCPU+GPU
dsj
fra
mu
nu
dlb
Xua
fnl
ca
eg
ym

5. CONCLUSION AND FUTURE WORKS

The aim of this paper was to propose a robust local search that can produce results for large-scale problems. We proposed several techniques to accelerate the execution on GPU, and we adapted these techniques to the KP and the TSP. The proposition was experimented on both CPU and GPU thanks to the portability of OpenCL. For the two problems, we can clearly see the stability of the parallel models against the exponential evolution of the sequential model. Through this work, we can conclude that the parallelization depends on two main factors. The first one is the characteristics of the platform used; in our case, the power of the GPU for parallel execution and the power of the CPU for sequential execution. The second factor is the nature of the problem.
Indeed, we can observe that when the problem is compute-bound, the theoretical performance of the platform is significant; on the other hand, the bandwidth is more important when the problem is memory-bound. For the KP, very good accelerations on GPU were recorded with respect to the number of iterations performed. For the TSP, better accelerations were recorded with the parallel CPU, but the fusion of GPU and CPU is a promising approach. Our next objective is to adapt our parallel LS to various other combinatorial problems. Each problem can be studied and allow us to find new techniques to apply. Many other metaheuristics, based on a single solution or on a population, can be explored. For other metaheuristics based on a single-solution process, like tabu search, it is easy to adapt our techniques to these methods: for example, the tabu list can be stored in the local memory, and the other techniques are perfectly applicable. Other levels of parallelization, like the algorithmic level and the solution level, can be used to accelerate the execution and to improve the results [1]. Also, constructive methods and classification methods can be parallelized on GPU; these methods are used to help the metaheuristics solve combinatorial problems. Several advanced techniques can be used to optimize the transfers between the host and the device, like the overlap between execution and transfer [6]. Another very interesting objective is the use of many GPUs (a GPU cluster), which can produce a very powerful system at a reasonable cost.

References

[1] E.G. Talbi, Metaheuristics: from Design to Implementation, John Wiley and Sons Inc.
[2] R.B. André, R.H. Trond, L.S. Martin, Graphics processing unit (GPU) programming strategies and trends in GPU computing, J. Parallel Distrib. Comput., 73: 4-13.
[3] R.B. André, R.H. Trond, S. Christian, H. Geir, GPU computing in discrete optimization. Part I: Introduction to the GPU, EURO Journal on Transportation and Logistics, 2.
[4] S. Christian, H. Geir, R.B. André, R.H. Trond, GPU computing in discrete optimization. Part II: Survey focused on routing problems, EURO Journal on Transportation and Logistics, 2.
[5] T.V. Luong, N. Melab, E.G. Talbi, Large neighborhood local search optimization on graphics processing units, Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on, 1-8.
[6] S. Christian, Efficient local search on the GPU: investigations on the vehicle routing problem, J. Parallel Distrib. Comput., 73: 14-31.
[7] D.B. Kirk, W.M. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann.
[8] K. Rocki, R. Suda, Accelerating 2-opt and 3-opt local search using GPU in the traveling salesman problem, High Performance Computing and Simulation (HPCS), 2012.
[9] S. Martello, P. Toth, Knapsack Problems: Algorithms and Computer Implementations, Series in Discrete Mathematics and Optimization, Wiley Interscience.
[10] H. Kellerer, U. Pferschy, D. Pisinger, Knapsack Problems, Springer.
[11] M. Desrochers, G. Laporte, Improvements and extensions to the Miller-Tucker-Zemlin subtour elimination constraints, Operations Research Letters, 10: 27-36.
[12] S. Lin, B.W. Kernighan, An effective heuristic algorithm for the traveling-salesman problem, Operations Research, 21.
[13] G. Gutin, A.P. Punnen, The Traveling Salesman Problem and Its Variations, Springer.
[14] T.V. Luong, N. Melab, E.-G. Talbi, GPU computing for parallel local search metaheuristic algorithms, IEEE Transactions on Computers, 63.
[15] G. Reinelt, TSPLIB: a traveling salesman problem library, ORSA Journal on Computing, Vol. 3, No. 4, 1991.
[16] M.E. Lalami, D. El-Baz, GPU implementation of the branch and bound method for knapsack problems, IEEE 26th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, 1-9, 2012.
[17] V. Boyer, D. El Baz, M. Elkihel, Solving knapsack problems on GPU, Computers and Operations Research, 39: 42-47, 2012.


A Steady-State Genetic Algorithm for Traveling Salesman Problem with Pickup and Delivery A Steady-State Genetic Algorithm for Traveling Salesman Problem with Pickup and Delivery Monika Sharma 1, Deepak Sharma 2 1 Research Scholar Department of Computer Science and Engineering, NNSS SGI Samalkha,

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Parallel Computing in Combinatorial Optimization

Parallel Computing in Combinatorial Optimization Parallel Computing in Combinatorial Optimization Bernard Gendron Université de Montréal gendron@iro.umontreal.ca Course Outline Objective: provide an overview of the current research on the design of parallel

More information

Parallel Implementation of the Max_Min Ant System for the Travelling Salesman Problem on GPU

Parallel Implementation of the Max_Min Ant System for the Travelling Salesman Problem on GPU Parallel Implementation of the Max_Min Ant System for the Travelling Salesman Problem on GPU Gaurav Bhardwaj Department of Computer Science and Engineering Maulana Azad National Institute of Technology

More information

Hardware-Software Codesign

Hardware-Software Codesign Hardware-Software Codesign 4. System Partitioning Lothar Thiele 4-1 System Design specification system synthesis estimation SW-compilation intellectual prop. code instruction set HW-synthesis intellectual

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

Parallel Metaheuristics on GPU

Parallel Metaheuristics on GPU Ph.D. Defense - Thé Van LUONG December 1st 2011 Parallel Metaheuristics on GPU Advisors: Nouredine MELAB and El-Ghazali TALBI Outline 2 I. Scientific Context 1. Parallel Metaheuristics 2. GPU Computing

More information

Two new variants of Christofides heuristic for the Static TSP and a computational study of a nearest neighbor approach for the Dynamic TSP

Two new variants of Christofides heuristic for the Static TSP and a computational study of a nearest neighbor approach for the Dynamic TSP Two new variants of Christofides heuristic for the Static TSP and a computational study of a nearest neighbor approach for the Dynamic TSP Orlis Christos Kartsiotis George Samaras Nikolaos Margaritis Konstantinos

More information

Optimization Techniques for Design Space Exploration

Optimization Techniques for Design Space Exploration 0-0-7 Optimization Techniques for Design Space Exploration Zebo Peng Embedded Systems Laboratory (ESLAB) Linköping University Outline Optimization problems in ERT system design Heuristic techniques Simulated

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Cost Optimal Parallel Algorithm for 0-1 Knapsack Problem

Cost Optimal Parallel Algorithm for 0-1 Knapsack Problem Cost Optimal Parallel Algorithm for 0-1 Knapsack Problem Project Report Sandeep Kumar Ragila Rochester Institute of Technology sr5626@rit.edu Santosh Vodela Rochester Institute of Technology pv8395@rit.edu

More information

A Parallel Architecture for the Generalized Traveling Salesman Problem

A Parallel Architecture for the Generalized Traveling Salesman Problem A Parallel Architecture for the Generalized Traveling Salesman Problem Max Scharrenbroich AMSC 663 Project Proposal Advisor: Dr. Bruce L. Golden R. H. Smith School of Business 1 Background and Introduction

More information

A COMPARATIVE STUDY OF FIVE PARALLEL GENETIC ALGORITHMS USING THE TRAVELING SALESMAN PROBLEM

A COMPARATIVE STUDY OF FIVE PARALLEL GENETIC ALGORITHMS USING THE TRAVELING SALESMAN PROBLEM A COMPARATIVE STUDY OF FIVE PARALLEL GENETIC ALGORITHMS USING THE TRAVELING SALESMAN PROBLEM Lee Wang, Anthony A. Maciejewski, Howard Jay Siegel, and Vwani P. Roychowdhury * Microsoft Corporation Parallel

More information

Hybrid Constraint Programming and Metaheuristic methods for Large Scale Optimization Problems

Hybrid Constraint Programming and Metaheuristic methods for Large Scale Optimization Problems Hybrid Constraint Programming and Metaheuristic methods for Large Scale Optimization Problems Fabio Parisini Tutor: Paola Mello Co-tutor: Michela Milano Final seminars of the XXIII cycle of the doctorate

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Complete Local Search with Memory

Complete Local Search with Memory Complete Local Search with Memory Diptesh Ghosh Gerard Sierksma SOM-theme A Primary Processes within Firms Abstract Neighborhood search heuristics like local search and its variants are some of the most

More information

Massively Parallel Approximation Algorithms for the Traveling Salesman Problem

Massively Parallel Approximation Algorithms for the Traveling Salesman Problem Massively Parallel Approximation Algorithms for the Traveling Salesman Problem Vaibhav Gandhi May 14, 2015 Abstract This paper introduces the reader to massively parallel approximation algorithms which

More information

Algorithm Design (4) Metaheuristics

Algorithm Design (4) Metaheuristics Algorithm Design (4) Metaheuristics Takashi Chikayama School of Engineering The University of Tokyo Formalization of Constraint Optimization Minimize (or maximize) the objective function f(x 0,, x n )

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

Massively Parallel Approximation Algorithms for the Knapsack Problem

Massively Parallel Approximation Algorithms for the Knapsack Problem Massively Parallel Approximation Algorithms for the Knapsack Problem Zhenkuang He Rochester Institute of Technology Department of Computer Science zxh3909@g.rit.edu Committee: Chair: Prof. Alan Kaminsky

More information

Optimal tour along pubs in the UK

Optimal tour along pubs in the UK 1 From Facebook Optimal tour along 24727 pubs in the UK Road distance (by google maps) see also http://www.math.uwaterloo.ca/tsp/pubs/index.html (part of TSP homepage http://www.math.uwaterloo.ca/tsp/

More information

The Traveling Salesman Problem: State of the Art

The Traveling Salesman Problem: State of the Art The Traveling Salesman Problem: State of the Art Thomas Stützle stuetzle@informatik.tu-darmstadt.de http://www.intellektik.informatik.tu-darmstadt.de/ tom. Darmstadt University of Technology Department

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Improvement heuristics for the Sparse Travelling Salesman Problem

Improvement heuristics for the Sparse Travelling Salesman Problem Improvement heuristics for the Sparse Travelling Salesman Problem FREDRICK MTENZI Computer Science Department Dublin Institute of Technology School of Computing, DIT Kevin Street, Dublin 8 IRELAND http://www.comp.dit.ie/fmtenzi

More information

Effective Tour Searching for Large TSP Instances. Gerold Jäger

Effective Tour Searching for Large TSP Instances. Gerold Jäger Effective Tour Searching for Large TSP Instances Gerold Jäger Martin-Luther-University Halle-Wittenberg (Germany) joint work with Changxing Dong, Paul Molitor, Dirk Richter German Research Foundation Grant

More information

Introduction to Combinatorial Algorithms

Introduction to Combinatorial Algorithms Fall 2009 Intro Introduction to the course What are : Combinatorial Structures? Combinatorial Algorithms? Combinatorial Problems? Combinatorial Structures Combinatorial Structures Combinatorial structures

More information

Travelling salesman problem using reduced algorithmic Branch and bound approach P. Ranjana Hindustan Institute of Technology and Science

Travelling salesman problem using reduced algorithmic Branch and bound approach P. Ranjana Hindustan Institute of Technology and Science Volume 118 No. 20 2018, 419-424 ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Travelling salesman problem using reduced algorithmic Branch and bound approach P. Ranjana Hindustan

More information

GPU Implementation of a Multiobjective Search Algorithm

GPU Implementation of a Multiobjective Search Algorithm Department Informatik Technical Reports / ISSN 29-58 Steffen Limmer, Dietmar Fey, Johannes Jahn GPU Implementation of a Multiobjective Search Algorithm Technical Report CS-2-3 April 2 Please cite as: Steffen

More information

Comparison Study of Multiple Traveling Salesmen Problem using Genetic Algorithm

Comparison Study of Multiple Traveling Salesmen Problem using Genetic Algorithm IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-661, p- ISSN: 2278-8727Volume 13, Issue 3 (Jul. - Aug. 213), PP 17-22 Comparison Study of Multiple Traveling Salesmen Problem using Genetic

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

Simultaneous Solving of Linear Programming Problems in GPU

Simultaneous Solving of Linear Programming Problems in GPU Simultaneous Solving of Linear Programming Problems in GPU Amit Gurung* amitgurung@nitm.ac.in Binayak Das* binayak89cse@gmail.com Rajarshi Ray* raj.ray84@gmail.com * National Institute of Technology Meghalaya

More information

Graphics Processor Acceleration and YOU

Graphics Processor Acceleration and YOU Graphics Processor Acceleration and YOU James Phillips Research/gpu/ Goals of Lecture After this talk the audience will: Understand how GPUs differ from CPUs Understand the limits of GPU acceleration Have

More information

Local Search Overview

Local Search Overview DM841 DISCRETE OPTIMIZATION Part 2 Heuristics Local Search Overview Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. 2. 3. Local Search 2 Outline

More information

Job Shop Scheduling Problem (JSSP) Genetic Algorithms Critical Block and DG distance Neighbourhood Search

Job Shop Scheduling Problem (JSSP) Genetic Algorithms Critical Block and DG distance Neighbourhood Search A JOB-SHOP SCHEDULING PROBLEM (JSSP) USING GENETIC ALGORITHM (GA) Mahanim Omar, Adam Baharum, Yahya Abu Hasan School of Mathematical Sciences, Universiti Sains Malaysia 11800 Penang, Malaysia Tel: (+)

More information

REPORT DOCUMENTATION PAGE

REPORT DOCUMENTATION PAGE REPORT DOCUMENTATION PAGE Form Approved OMB NO. 0704-0188 The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,

More information

Massively Parallel Computations of the LZ-complexity of Strings

Massively Parallel Computations of the LZ-complexity of Strings Massively Parallel Computations of the LZ-complexity of Strings Alexander Belousov Electrical and Electronics Engineering Department Ariel University Ariel, Israel alex.blsv@gmail.com Joel Ratsaby Electrical

More information

Journal of Universal Computer Science, vol. 14, no. 14 (2008), submitted: 30/9/07, accepted: 30/4/08, appeared: 28/7/08 J.

Journal of Universal Computer Science, vol. 14, no. 14 (2008), submitted: 30/9/07, accepted: 30/4/08, appeared: 28/7/08 J. Journal of Universal Computer Science, vol. 14, no. 14 (2008), 2416-2427 submitted: 30/9/07, accepted: 30/4/08, appeared: 28/7/08 J.UCS Tabu Search on GPU Adam Janiak (Institute of Computer Engineering

More information

A Development of Hybrid Cross Entropy-Tabu Search Algorithm for Travelling Repairman Problem

A Development of Hybrid Cross Entropy-Tabu Search Algorithm for Travelling Repairman Problem Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 A Development of Hybrid Cross Entropy-Tabu Search Algorithm for Travelling

More information

CSE 417 Branch & Bound (pt 4) Branch & Bound

CSE 417 Branch & Bound (pt 4) Branch & Bound CSE 417 Branch & Bound (pt 4) Branch & Bound Reminders > HW8 due today > HW9 will be posted tomorrow start early program will be slow, so debugging will be slow... Review of previous lectures > Complexity

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Khushboo Arora, Samiksha Agarwal, Rohit Tanwar

Khushboo Arora, Samiksha Agarwal, Rohit Tanwar International Journal of Scientific & Engineering Research, Volume 7, Issue 1, January-2016 1014 Solving TSP using Genetic Algorithm and Nearest Neighbour Algorithm and their Comparison Khushboo Arora,

More information

TABU search and Iterated Local Search classical OR methods

TABU search and Iterated Local Search classical OR methods TABU search and Iterated Local Search classical OR methods tks@imm.dtu.dk Informatics and Mathematical Modeling Technical University of Denmark 1 Outline TSP optimization problem Tabu Search (TS) (most

More information

Outline. TABU search and Iterated Local Search classical OR methods. Traveling Salesman Problem (TSP) 2-opt

Outline. TABU search and Iterated Local Search classical OR methods. Traveling Salesman Problem (TSP) 2-opt TABU search and Iterated Local Search classical OR methods Outline TSP optimization problem Tabu Search (TS) (most important) Iterated Local Search (ILS) tks@imm.dtu.dk Informatics and Mathematical Modeling

More information

Hybrid Differential Evolution Algorithm for Traveling Salesman Problem

Hybrid Differential Evolution Algorithm for Traveling Salesman Problem Available online at www.sciencedirect.com Procedia Engineering 15 (2011) 2716 2720 Advanced in Control Engineeringand Information Science Hybrid Differential Evolution Algorithm for Traveling Salesman

More information

Methods and Models for Combinatorial Optimization Heuristis for Combinatorial Optimization

Methods and Models for Combinatorial Optimization Heuristis for Combinatorial Optimization Methods and Models for Combinatorial Optimization Heuristis for Combinatorial Optimization L. De Giovanni 1 Introduction Solution methods for Combinatorial Optimization Problems (COPs) fall into two classes:

More information

When Network Embedding meets Reinforcement Learning?

When Network Embedding meets Reinforcement Learning? When Network Embedding meets Reinforcement Learning? ---Learning Combinatorial Optimization Problems over Graphs Changjun Fan 1 1. An Introduction to (Deep) Reinforcement Learning 2. How to combine NE

More information

A Parallel Access Method for Spatial Data Using GPU

A Parallel Access Method for Spatial Data Using GPU A Parallel Access Method for Spatial Data Using GPU Byoung-Woo Oh Department of Computer Engineering Kumoh National Institute of Technology Gumi, Korea bwoh@kumoh.ac.kr Abstract Spatial access methods

More information

ACO and other (meta)heuristics for CO

ACO and other (meta)heuristics for CO ACO and other (meta)heuristics for CO 32 33 Outline Notes on combinatorial optimization and algorithmic complexity Construction and modification metaheuristics: two complementary ways of searching a solution

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12

More information

Origins of Operations Research: World War II

Origins of Operations Research: World War II ESD.83 Historical Roots Assignment METHODOLOGICAL LINKS BETWEEN OPERATIONS RESEARCH AND STOCHASTIC OPTIMIZATION Chaiwoo Lee Jennifer Morris 11/10/2010 Origins of Operations Research: World War II Need

More information

Non-deterministic Search techniques. Emma Hart

Non-deterministic Search techniques. Emma Hart Non-deterministic Search techniques Emma Hart Why do local search? Many real problems are too hard to solve with exact (deterministic) techniques Modern, non-deterministic techniques offer ways of getting

More information

Machine Learning for Software Engineering

Machine Learning for Software Engineering Machine Learning for Software Engineering Introduction and Motivation Prof. Dr.-Ing. Norbert Siegmund Intelligent Software Systems 1 2 Organizational Stuff Lectures: Tuesday 11:00 12:30 in room SR015 Cover

More information

Accelerating K-Means Clustering with Parallel Implementations and GPU computing

Accelerating K-Means Clustering with Parallel Implementations and GPU computing Accelerating K-Means Clustering with Parallel Implementations and GPU computing Janki Bhimani Electrical and Computer Engineering Dept. Northeastern University Boston, MA Email: bhimani@ece.neu.edu Miriam

More information

A Parallel Simulated Annealing Algorithm for Weapon-Target Assignment Problem

A Parallel Simulated Annealing Algorithm for Weapon-Target Assignment Problem A Parallel Simulated Annealing Algorithm for Weapon-Target Assignment Problem Emrullah SONUC Department of Computer Engineering Karabuk University Karabuk, TURKEY Baha SEN Department of Computer Engineering

More information

ND-Tree: a Fast Online Algorithm for Updating the Pareto Archive

ND-Tree: a Fast Online Algorithm for Updating the Pareto Archive ND-Tree: a Fast Online Algorithm for Updating the Pareto Archive i.e. return to Algorithms and Data structures Andrzej Jaszkiewicz, Thibaut Lust Multiobjective optimization " minimize" z1 f... " minimize"

More information

Research Article A Novel Metaheuristic for Travelling Salesman Problem

Research Article A Novel Metaheuristic for Travelling Salesman Problem Industrial Volume 2013, Article ID 347825, 5 pages http://dx.doi.org/10.1155/2013/347825 Research Article A Novel Metaheuristic for Travelling Salesman Problem Vahid Zharfi and Abolfazl Mirzazadeh Industrial

More information

A memetic algorithm for symmetric traveling salesman problem

A memetic algorithm for symmetric traveling salesman problem ISSN 1750-9653, England, UK International Journal of Management Science and Engineering Management Vol. 3 (2008) No. 4, pp. 275-283 A memetic algorithm for symmetric traveling salesman problem Keivan Ghoseiri

More information

Measurement of real time information using GPU

Measurement of real time information using GPU Measurement of real time information using GPU Pooja Sharma M. Tech Scholar, Department of Electronics and Communication E-mail: poojachaturvedi1985@gmail.com Rajni Billa M. Tech Scholar, Department of

More information

Effective Tour Searching for Large TSP Instances. Gerold Jäger

Effective Tour Searching for Large TSP Instances. Gerold Jäger Effective Tour Searching for Large TSP Instances Gerold Jäger Martin-Luther-University Halle-Wittenberg joint work with Changxing Dong, Paul Molitor, Dirk Richter November 14, 2008 Overview 1 Introduction

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Effective Optimizer Development for Solving Combinatorial Optimization Problems *

Effective Optimizer Development for Solving Combinatorial Optimization Problems * Proceedings of the 11th WSEAS International Conference on SYSTEMS, Agios Nikolaos, Crete Island, Greece, July 23-25, 2007 311 Effective Optimizer Development for Solving Combinatorial Optimization s *

More information

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics

More information

Travelling Salesman Problem: Tabu Search

Travelling Salesman Problem: Tabu Search Travelling Salesman Problem: Tabu Search (Anonymized) April 2017 Abstract The Tabu Search algorithm is a heuristic method to find optimal solutions to the Travelling Salesman Problem (TSP). It is a local

More information

Clustering Strategy to Euclidean TSP

Clustering Strategy to Euclidean TSP 2010 Second International Conference on Computer Modeling and Simulation Clustering Strategy to Euclidean TSP Hamilton Path Role in Tour Construction Abdulah Fajar, Nur Azman Abu, Nanna Suryana Herman

More information

GPU Implementation of Implicit Runge-Kutta Methods

GPU Implementation of Implicit Runge-Kutta Methods GPU Implementation of Implicit Runge-Kutta Methods Navchetan Awasthi, Abhijith J Supercomputer Education and Research Centre Indian Institute of Science, Bangalore, India navchetanawasthi@gmail.com, abhijith31792@gmail.com

More information

A Memetic Algorithm for the Generalized Traveling Salesman Problem

A Memetic Algorithm for the Generalized Traveling Salesman Problem A Memetic Algorithm for the Generalized Traveling Salesman Problem Gregory Gutin Daniel Karapetyan Abstract The generalized traveling salesman problem (GTSP) is an extension of the well-known traveling

More information

A Discrete Fireworks Algorithm for Solving Large-Scale Travel Salesman Problem

A Discrete Fireworks Algorithm for Solving Large-Scale Travel Salesman Problem A Discrete Fireworks Algorithm for Solving Large-Scale Travel Salesman Problem Haoran Luo Peking University Beijing, China Email: luohaoran@pku.edu.cn Weidi Xu Key Laboratory of Machine Perception (MOE)

More information

Parallel Systems. Project topics

Parallel Systems. Project topics Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a

More information

Algorithms and Experimental Study for the Traveling Salesman Problem of Second Order. Gerold Jäger

Algorithms and Experimental Study for the Traveling Salesman Problem of Second Order. Gerold Jäger Algorithms and Experimental Study for the Traveling Salesman Problem of Second Order Gerold Jäger joint work with Paul Molitor University Halle-Wittenberg, Germany August 22, 2008 Overview 1 Introduction

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

Modified Order Crossover (OX) Operator

Modified Order Crossover (OX) Operator Modified Order Crossover (OX) Operator Ms. Monica Sehrawat 1 N.C. College of Engineering, Israna Panipat, Haryana, INDIA. Mr. Sukhvir Singh 2 N.C. College of Engineering, Israna Panipat, Haryana, INDIA.

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

SLS Methods: An Overview

SLS Methods: An Overview HEURSTC OPTMZATON SLS Methods: An Overview adapted from slides for SLS:FA, Chapter 2 Outline 1. Constructive Heuristics (Revisited) 2. terative mprovement (Revisited) 3. Simple SLS Methods 4. Hybrid SLS

More information

Solving the Large Scale Next Release Problem with a Backbone Based Multilevel Algorithm

Solving the Large Scale Next Release Problem with a Backbone Based Multilevel Algorithm IEEE TRANSACTIONS ON JOURNAL NAME, MANUSCRIPT ID 1 Solving the Large Scale Next Release Problem with a Backbone Based Multilevel Algorithm Jifeng Xuan, He Jiang, Member, IEEE, Zhilei Ren, and Zhongxuan

More information

Parallelization of Graph Isomorphism using OpenMP

Parallelization of Graph Isomorphism using OpenMP Parallelization of Graph Isomorphism using OpenMP Vijaya Balpande Research Scholar GHRCE, Nagpur Priyadarshini J L College of Engineering, Nagpur ABSTRACT Advancement in computer architecture leads to

More information

Dense matching GPU implementation

Dense matching GPU implementation Dense matching GPU implementation Author: Hailong Fu. Supervisor: Prof. Dr.-Ing. Norbert Haala, Dipl. -Ing. Mathias Rothermel. Universität Stuttgart 1. Introduction Correspondence problem is an important

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

SCIENCE & TECHNOLOGY

SCIENCE & TECHNOLOGY Pertanika J. Sci. & Technol. 25 (S): 199-210 (2017) SCIENCE & TECHNOLOGY Journal homepage: http://www.pertanika.upm.edu.my/ Water Flow-Like Algorithm Improvement Using K-Opt Local Search Wu Diyi, Zulaiha

More information

RESEARCH ARTICLE. Accelerating Ant Colony Optimization for the Traveling Salesman Problem on the GPU

RESEARCH ARTICLE. Accelerating Ant Colony Optimization for the Traveling Salesman Problem on the GPU The International Journal of Parallel, Emergent and Distributed Systems Vol. 00, No. 00, Month 2011, 1 21 RESEARCH ARTICLE Accelerating Ant Colony Optimization for the Traveling Salesman Problem on the

More information

56:272 Integer Programming & Network Flows Final Exam -- December 16, 1997

56:272 Integer Programming & Network Flows Final Exam -- December 16, 1997 56:272 Integer Programming & Network Flows Final Exam -- December 16, 1997 Answer #1 and any five of the remaining six problems! possible score 1. Multiple Choice 25 2. Traveling Salesman Problem 15 3.

More information