Cost Optimal Parallel Algorithm for 0-1 Knapsack Problem

Project Report

Sandeep Kumar Ragila, Rochester Institute of Technology
Santosh Vodela, Rochester Institute of Technology

ABSTRACT

NP-hard problems are at least as hard as the hardest problems in NP, and solving them exactly is computationally expensive. One such problem, which we examine in this paper, is the 0-1 knapsack problem. 0-1 knapsack problems arise in areas such as cargo loading, scoring of tests, and batch processing. We take a heuristic search approach to the problem using multicore parallel programming techniques [3].

1. COMPUTATIONAL PROBLEM

The 0-1 knapsack problem is best expressed as a question: given a set V of n items to be packed into a knapsack of capacity C, where each item in V has a weight and a profit associated with it, choose a subset of the items such that the total weight of the chosen items does not exceed the capacity C and their total profit is maximal. The 0-1 knapsack problem is a variant of the knapsack problem in which no fractional items are allowed.

2. RELATED WORK

2.1 Research Paper 1

The authors of paper [4], LI Ken-Li et al., present the parallel three-list algorithm and survey past research on solving the knapsack problem. The three-list algorithm has two stages, a generation stage and a search stage. In the generation stage, the input of size n is divided into three parts W1, W2, and W3 of sizes 7n/16, 7n/16, and 2n/16 respectively. For each part, all possible subsets are generated in descending order, giving three lists A, B, and C. In the search stage, for every possible value in C, a binary search is performed over A+B; if the total sum of a pair equals the desired solution, that solution is output. The authors also discuss how to parallelise this algorithm.

Building on the three-list algorithm, the authors propose an enhanced parallel algorithm that improves speedup and reduces memory consumption. In the proposed algorithm, each of the two generated lists A and B has size O(2^(7n/16)) but can be generated dynamically using only O(2^(13n/48)) shared memory units. The search stage runs after the parallel generation stage, so by that point lists A and B are available in descending order. Each processor then scans its pair of lists with the rules: if A[i] + B[j] = M - c[m], stop, a solution has been found; if A[i] + B[j] < M - c[m], set i = i + 1; otherwise set j = j + 1. The rules are applied until i or j runs past the end of its list.

For the generation stage, the authors discuss how to use 2k tables to generate the two sorted lists dynamically. They use six tables T1, T2, ..., T6, where T1 contains all subsets of W11 = (w_1, w_2, ..., w_(7n/48)) and T3 contains all subsets of W13 = (w_(14n/48+1), w_(14n/48+2), ..., w_(21n/48)); T4 through T6 are defined similarly. T1 is sorted in ascending order, and the authors use priority queues for fast retrieval of the minimum sum. Values from T1 and T2 are added in order and inserted into a queue Q1; every pair needed for the computation is deleted from Q1 and inserted into the successor queue Q2 after adding a value from T3. This is repeated until Q1 is empty and the complete list Q2 is available. The same procedure builds list B from T4, T5, and T6. The generated lists are then passed to the search stage to find a possible solution.
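To make the search-stage scan concrete, here is a minimal sketch in the classic two-list formulation, assuming A is sorted in ascending order and B in descending order (the paper's exact sort orders and index-advance rules differ slightly; the names are illustrative, not the authors' code):

class TwoListSearch {
    // Scan two sorted lists for a pair summing to target.
    // a is sorted ascending, b is sorted descending.
    static boolean searchPair(long[] a, long[] b, long target) {
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            long sum = a[i] + b[j];
            if (sum == target) return true; // solution found
            if (sum < target) i++;          // need a larger sum: advance in ascending a
            else j++;                       // need a smaller sum: advance in descending b
        }
        return false;
    }
}

Each comparison advances exactly one index, so once the lists are built the scan costs O(|A| + |B|).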
This algorithm is an efficient way to save space when computing subsets: as n grows we would otherwise quickly run out of space to store all the possible combinations.

2.2 Research Paper 2

In paper [1], V. Boyer et al. analyze techniques for parallelizing the dynamic programming method for the 0-1 knapsack problem on a GPU. The work focuses on memory optimization and processing time, using a data compression scheme with low memory occupancy so that large problem instances can be solved within a small processing time. The paper also provides computational results for various experiments. Starting from the familiar dynamic programming solution of the knapsack problem, the authors adapt the idea to the GPU architecture. Their goal is to reduce the amount of communication between the CPU and the GPU and to perform most of the computation on the GPU. Each GPU core computes the subproblem of finding the items to include for a capacity c_i, where 0 <= i <= C. The other technique is to store the inclusion results in a memory-efficient way. At each stage the knapsack computation must decide whether or not to include a particular item, which is a Boolean decision. Previously each such decision was stored in its own 32-bit integer, but the decisions for 32 items can be packed into a single integer variable, with each bit recording whether the corresponding item is included.

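A minimal sketch of this packing scheme, assuming a hypothetical boolean array of per-item decisions (illustrative names, not the authors' GPU code):

class DecisionPacking {
    // Pack 32 include/exclude decisions into each 32-bit word.
    static int[] pack(boolean[] include) {
        int[] words = new int[(include.length + 31) / 32];
        for (int i = 0; i < include.length; i++) {
            if (include[i]) {
                // adding 2^(i mod 32) sets the bit for item i in its word
                words[i / 32] |= 1 << (i % 32);
            }
        }
        return words;
    }
}

Each word carries the decisions for 32 items, cutting the storage for the decision table by a factor of 32.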
This packing drastically reduces the memory the system needs to store the results. When an item is included in the knapsack, 2^i is added to the output word, setting the bit at location i to indicate inclusion. A counter is incremented for every item; when it reaches 32, all 32 bits have been set, so the word is flushed into a result matrix to make room for the next items. Communication between the CPU and the GPU therefore happens only once every 32 iterations. Their analysis also shows that computational speed increases when the items are sorted by profit-per-weight ratio.

2.3 Research Paper 3

The authors of paper [2], M. E. Lalami et al., describe another approach to solving the knapsack problem, the branch and bound approach. They discuss how the algorithm reduces the number of computations, giving a better and more efficient solution, and they propose a technique for running the algorithm on a CPU-GPU system via CUDA.

Branch and Bound Algorithm: The branch and bound algorithm enumerates possible solutions while pruning a significant number of branches. Several methods follow a greedy approach of checking all possible branches from the root; branch and bound instead decides at each stage whether to traverse down a specific path or take a different one. In this algorithm items are called nodes, and branching and bounding operations are performed on these nodes. Each node is represented as a 5-tuple (w_e, p_e, X_e, U_e, L_e), where e is the current node, w_e is the weight of the current node, p_e its profit, X_e the solution subvector, U_e the upper bound, and L_e the lower bound. The upper bound can be calculated with any popular greedy knapsack algorithm, and the lower bound in the same way but with fractions disallowed. Once nodes are represented this way, the upper and lower bounds of all the current nodes (the nodes at a given level of the tree) are compared: we first compute the maximum lower bound over all the nodes, and if the upper bound of any node is less than this maximum lower bound, that node is pruned and no searching is done down its path. At each step the search goes either right or left in the tree, so the value on each edge is a 0 or a 1 indicating whether that node was included; to read off the included nodes, we traverse the final tree along the edges whose value is 1.

The authors use this algorithm to divide the computation between the CPU and the GPU to speed up processing. They suggest a threshold on the number of nodes: if the input is smaller than the threshold, the branching and bounding operations are computed on the CPU itself, because for a small input the CPU-GPU communication would outweigh the computation and actually decrease speed. If the input exceeds the threshold, the nodes are first transferred to the device and the branch and bound operations are launched on the GPU from the CPU. After each step the GPU returns the tuples to the CPU, which checks whether each node must be pruned or included.
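The bound computation described above can be sketched as follows, assuming the items are pre-sorted by decreasing profit/weight ratio: the greedy fill with a final fractional item gives the upper bound, and the same fill with fractions disallowed gives a (weaker) lower bound. This is a sketch under those assumptions, not the paper's implementation:

class GreedyBounds {
    // Returns { upperBound, lowerBound } for the given capacity.
    // Items must be sorted by decreasing profit/weight ratio.
    static double[] bounds(double[] w, double[] p, double capacity) {
        double upper = 0.0, lower = 0.0, remaining = capacity;
        for (int i = 0; i < w.length; i++) {
            if (w[i] <= remaining) {      // whole item fits: counts for both bounds
                upper += p[i];
                lower += p[i];
                remaining -= w[i];
            } else {                      // fractional part allowed only in the upper bound
                upper += p[i] * remaining / w[i];
                break;
            }
        }
        return new double[] { upper, lower };
    }
}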
The authors tested this approach with large data sets and found it efficient, both because the algorithm does not enumerate all possible solutions and because the operations are performed on the GPU.

3. IMPLEMENTATION

3.1 Sequential Program

Our sequential design follows the two-list algorithm discussed in the paper. The sequential algorithm has two stages.

3.1.1 Generation Stage

In the generation stage we take the input, a list V of n items, and divide it into two disjoint sets V1 and V2 of equal length, which gives 2^(n/2) subsets on each side. For each subset a_i of V1 we calculate a_i.w (the weight sum of the subset) and a_i.p (the profit sum of all items in subset a_i). Each subset is represented as a triplet (a_i, a_i.w, a_i.p), and all the triplets are arranged in increasing order of weight. We do the same for V2, except that its triplets are stored in non-increasing order of weight.

3.1.2 Search Stage

In the search stage the two generated lists are compared to check which combination of items yields the maximum profit. We used ArrayLists to store the lists and the bitset class in PJ2 to generate the subsets.

3.2 Parallel Program

Our parallel program follows the same idea as the sequential program. It uses the two-list algorithm, plus additional algorithms that minimize the number of subsets to be compared. The sequential approach compares all the enumerations of the two lists, whereas the parallel program prunes many subsets, and hence we gain a speedup. The parallel design has five stages.

3.2.1 Parallel Generation Stage

In this stage we divide the input into two lists and use the bitset class to generate a bitmap. We then loop over all the possible combinations in a parallel for loop, so each thread calculates one subset and all the threads together compute all possible subsets of n/2 elements (a sequential sketch of this stage appears below). This approach reduces the number of subsets we generate: enumerating subsets of all n items gives 2^n subsets, while splitting n into two halves gives only 2^(n/2) + 2^(n/2) subsets. We also use an optimal merge algorithm to merge the lists generated by all the threads into a single sorted list, and we repeat the whole procedure for the other half.

3.2.2 First Parallel Saving Max-value Stage

In this stage we take the lists generated in the previous phase (say A and B) and partition them into K blocks, where K is the number of processors used. We then send these K blocks to the K processors and compute the maximum profit value within each block.

3.2.3 The Parallel Pruning Stage

This stage is the most important one for our parallel program: using a few lemmas, we discard most of the subsets that cannot provide the optimal solution. By not considering these subsets we avoid comparing all the possible combinations we have computed, thus increasing computational speed.
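Before the pruning lemmas, the following is the sequential sketch of the generation stage promised in Section 3.2.1 (a simplified stand-in, not our actual PJ2 code): each bitmask enumerates one subset of a half-list, and the resulting (subset, weight sum, profit sum) triplets are sorted by weight.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class HalfListGeneration {
    // Enumerate all 2^half subsets of one half-list as bitmasks and
    // return (mask, weight sum, profit sum) triplets sorted by weight.
    static List<long[]> generate(int[] w, int[] p) {
        int half = w.length; // this half holds n/2 of the items
        List<long[]> triplets = new ArrayList<>();
        for (int mask = 0; mask < (1 << half); mask++) {
            long ws = 0, ps = 0;
            for (int i = 0; i < half; i++) {
                if ((mask & (1 << i)) != 0) { // item i is in this subset
                    ws += w[i];
                    ps += p[i];
                }
            }
            triplets.add(new long[] { mask, ws, ps });
        }
        triplets.sort(Comparator.comparingLong(t -> t[1])); // ascending weight
        return triplets;
    }
}

The parallel generation stage splits the mask range 0 to 2^(n/2) - 1 across threads in a parallel for loop and then merges the per-thread sorted lists.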

Below are the lemmas used to prune the subsets.

Lemma 1: For any block pair, if the sum of the first element in A and the last element in B is greater than c (the maximum weight), prune this block pair, as no solution is possible further in.

Lemma 2: For any block pair, if the sum of the last element of A and the first element in B is less than c (the maximum weight), save the maximum profit and examine this pair further.

We apply the lemmas in parallel across K threads, K being the number of processors, so the cores are fully utilized.

3.2.4 The Second Parallel Saving Max-value Stage

Before the pruning stage there were K^2 block pairs, but after the pruning stage the number of block pairs is significantly reduced, to at most 2K - 1. We send these 2K - 1 pairs to the K available cores to calculate the maximum profit of the pruned pairs, so each core gets at most two block pairs to execute.

3.2.5 The Parallel Search Stage

This is the final stage of our parallel program. All the cores calculate the maximum value for their pruned block pairs and combine the results into a single reduction variable, giving the best maximum value over all the subsets.

3.3 Developer's Manual

We used RIT CS department machines to test our programs, mainly nessie (16 cores) and kraken (80 cores). To run the program, we first need to create a jar containing all the class files. The steps are:

1. Set the classpath. Using the bash shell on nessie or kraken:

export CLASSPATH=.:/var/tmp/parajava/pj2/pj2.jar

Using the csh shell:

setenv CLASSPATH .:/var/tmp/parajava/pj2/pj2.jar

2. Compile the Java files:

javac *.java

3. Create the jar file from the class files:

jar cf myprogram.jar *.class

3.4 User's Manual

Use the jar file created with the commands above. To run the sequential version of the program:

java pj2 jar=<jar file> KnapsackSeq <input file>

where <jar file> is the name of the jar file created and <input file> is the name of the input file. To run the parallel version:

java pj2 cores=<K> jar=<jar file> KnapsackSmp <input file>

where K is the number of cores to be used. The input files are .txt files generated with a random number generation program. The first line of the input file is the knapsack's total capacity C, and n lines follow; each line contains three numbers describing an item: the item number, the weight of the item, and the profit of the item.
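For concreteness, a small input file in this format might look as follows (all values made up), with a capacity of 50 followed by four items, each line giving the item number, weight, and profit:

50
1 12 24
2 7 13
3 11 23
4 8 15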
4. PERFORMANCE

Figure 1: Strong scaling

4.1 Strong Scaling

Figure 1 shows the strong scaling performance for the data in Table 1. We tested the program with inputs ranging from 32 to 40 items. As the number of cores increases, the speedup gains diminish, because the sequential fraction of the program becomes significant and cannot be avoided. There is an initial jump because, with a two-list algorithm, generating the two lists on one core is not very efficient compared to generating them on two cores. We observed close to ideal strong scaling for the higher values of N.

4.2 Weak Scaling

Table 2 shows the weak scaling performance for input sizes 34 to 42. Each time the value of N increases by 2, the number of calculations increases by a factor of 2^2 = 4, and the number of cores is scaled up by the same factor.

Figure 2: Weak scaling
Figure 3: Time parallel vs. cores, data size 32
Figure 4: Speedup vs. cores, data size 34
Figure 5: Efficiency vs. cores, data size 34

Since we generate two lists, each list doubles in size. We obtained good efficiency for 1, 2, and 8 cores, but beyond 8 cores the sequential part of the program comes into play and efficiency decreases. Another factor behind the lower efficiency could be that the input data differs for each input size.

5. FUTURE WORK

The algorithm we implemented has five stages, and executing them on the CPU takes considerable time. The same algorithm could be extended to compute the results on a GPU: since we calculate all possible subsets of a set and the subsets are independent of one another, they can be executed on different cores, an operation that GPU cores can exploit to make the computation fast. We could also use the other algorithms discussed in the research papers to obtain an exact optimal solution rather than the heuristic approach we followed.

6. LESSONS LEARNED

We learned about the knapsack problem and different algorithms for solving it. The exciting part of the project was learning what heuristic search is and how to achieve a near-optimal solution. The emphasis on strong scaling and weak scaling gave us a good understanding of parallel program performance: if the input is too small and the number of cores is high, we do not achieve high efficiency, whereas for large data, increasing the number of cores yields a speedup. We also learned that the two-list algorithm we used significantly reduces the number of subsets to be evaluated, giving faster performance. Apart from all these lessons, we also looked into different aspects of the Parallel Java 2 library.

7. TEAM CONTRIBUTION

Finding the research papers and deciding on the topic were mutual tasks shared by both team members. Sandeep came up with the sequential design and implemented it; he also implemented two of the five algorithms in the parallel program. Santosh designed the parallel version of the project and implemented three of the five algorithms in the parallel program.

Figure 6: Cores vs. time
Figure 7: Cores vs. sizeup
Figure 8: Cores vs. efficiency

Measuring the strong scaling and weak scaling performance, writing the project report, documenting the code, and other work were mutual tasks shared by both team members.

8. REFERENCES

[1] V. Boyer, D. El Baz, and M. Elkihel. Solving knapsack problems on GPU. Computers & Operations Research, 39(1):42-47, 2012.

[2] M. E. Lalami and D. El-Baz. GPU implementation of the branch and bound method for knapsack problems. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International. IEEE, 2012.

[3] K. Li, J. Liu, L. Wan, S. Yin, and K. Li. A cost-optimal parallel algorithm for the 0-1 knapsack problem and its performance on multicore CPU and GPU implementations. Parallel Computing, 43:27-42, 2015.

[4] LI Ken-Li, LI Ren-Fa, ZHAO Huan, and LI Qing-Hua. A parallel time-memory-processor tradeoff O(2^(5n/6)) for knapsack-like NP-complete problems. 2007.
