Methods for Solving Subset Sum Problems

R.J.W. James, Department of Management, University of Canterbury, Private Bag 4800, Christchurch, New Zealand, ross.james@canterbury.ac.nz
R.H. Storer, Department of Industrial and Systems Engineering, Lehigh University, Mohler Laboratory, 200 West Packer Ave, Bethlehem, PA 18015, USA, rhs2@lehigh.edu

Abstract

The subset sum problem is a simple and fundamental NP-hard problem that arises in many real world applications. For the particular application motivating this paper (combination weighers), a solution to the subset sum problem is required that is within a small tolerance level and can be computed quickly. We propose five techniques for solving this problem. The first is an enumeration technique that is capable of solving small problems very efficiently. The next two techniques are based on an efficient number partitioning algorithm; these can solve small problems very efficiently when the solution uses approximately half the available elements (numbers), and outperform dynamic programming approaches proposed in the literature. The last two techniques take the direct approach of improving a solution. These techniques were found to perform very efficiently on large problems and to outperform heuristic techniques currently proposed in the literature.

1 Introduction

The subset sum problem is an important and fundamental combinatorial optimisation problem from which many real world problems are derived. The application we consider is a problem faced by automated packing machines, which are commonly used in the food industry, although the techniques proposed here have wider applications. In this application a bag of food, say crisps or pretzels, needs to have a minimum weight c of the product in it (i.e. c is the weight printed on the label). Automated packing machines, called combination weighers, are typically used in such applications. These machines consist of a number (often around 32) of containers or buckets. Each bucket has a built-in scale that can measure the weight of product it contains. Product is typically fed into each bucket by a vibratory feeder, which is set to dispense a specified target weight. For example, each bucket may have a target weight of 1/6 of the total package weight c, and we would normally expect, in this instance, 6 containers to be used to fill a bag of product. Because vibratory feeders are not particularly accurate, especially when dispensing items like pretzels or crisps, it is normal for the actual weight in these containers to vary significantly under or over the target weight. The reason for having buckets in the first place is that it allows us to select a subset of buckets whose contents sum closely to the total package weight. The question to be answered is: which set of containers should be used to fill a bag in order to achieve a packing weight no less than the stated package size, while minimizing the amount of overfilling? The problem can be stated as follows:

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{n} w_i x_i \\
\text{subject to} \quad & \sum_{i=1}^{n} w_i x_i \ge c, \\
& x_i \in \{0, 1\}, \quad i = 1, \ldots, n,
\end{aligned}
\]

where $w_i$ is the weight of product in bucket $i$ and $n$ is the number of buckets. As a further refinement to this formulation, we also make the assumption that in practice we can only measure the weights to a given tolerance level; hence if we can find a solution that is between $c$ and $c + \epsilon$ then this is an acceptable solution and no further searching is required. Following standard practice, we assume that each of the containers has the same target weight and that the actual weights in these containers are normally distributed around the target. However, we will also experiment with different distributions to see how they affect the solution times of the different algorithms we propose.

The subset sum problem is NP-hard (Garey and Johnson, 1979) and therefore heuristic methods are most commonly used to solve it; however, several papers have proposed exact methods. Martello and Toth (1984a) use a combination of dynamic programming and branch and bound to solve this problem. Their technique was able to solve problems of up to 28 items drawn from a uniform population in a range from 1 to 10^12. Pisinger (1999) proposed a restricted dynamic programming method based on balanced solutions. His technique was tested on problems of various sizes, with various ranges in the weights of the items, and was able to solve 3000-item problems drawn from a uniform population from 1 to 10^6. This approach was restricted to solving the problem for a single constraint value (c).

Some of the earlier heuristics for the subset sum problem are described and compared in Martello and Toth (1985). Their conclusion was that the algorithm of Martello and Toth (1984b) was the most efficient scheme. This was essentially a greedy scheme whereby items are added to the set in decreasing weight order until no more can be added due to the constraint. A successful generalisation of this scheme was to initially select the first element in the set and then run the greedy algorithm on the remaining elements. More recent research has examined the use of heuristics to solve this problem. Gens and Levner (1994) used a combination of a heuristic and dynamic programming, Fischetti (1990) used a local search based heuristic, while Ghosh and Chakravarti (1999) used a local search over a permutation search space together with a greedy heuristic. Keller et al (2000) developed two linear time construction heuristics which outperform those proposed by Martello and Toth (1984b), which are O(n^2). Przydatek (2002) developed a two-phase subset sum approximation algorithm: the first phase randomly generates a solution to the problem, while the second phase improves the solution. This is repeated a given number of times and the best result found is returned. Przydatek (2002) compared this algorithm to the Martello and Toth (1984b) algorithm and found it to be superior.

The problem proposed in this paper sits between the problems currently studied in the literature: it does not require an optimal solution, as the exact solution methods provide, only a solution within a given tolerance range. However, the heuristic methods currently proposed in the literature do not provide the level of accuracy required to give a solution within the tolerance level, or the minimum if no solution lies within the tolerance.
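To make the objective concrete, the following is a minimal brute-force sketch of the formulation above (illustrative only, not one of the techniques proposed in this paper; the bucket weights, c and eps values are hypothetical):

```python
# Reference sketch of the one-sided subset sum objective: find a subset of
# bucket weights summing to at least c with minimum overfill, accepting the
# first solution found within c + eps. Exponential; tiny instances only.
from itertools import combinations

def min_overfill(weights, c, eps):
    best_total, best_subset = float("inf"), None
    for r in range(1, len(weights) + 1):
        for subset in combinations(range(len(weights)), r):
            total = sum(weights[i] for i in subset)
            if c <= total < best_total:
                best_total, best_subset = total, subset
                if total <= c + eps:      # within tolerance: accept and stop
                    return best_total, best_subset
    return best_total, best_subset

# Hypothetical bucket weights (in grams) around a 100 g target
buckets = [107, 94, 112, 98, 101, 88, 105, 91]
print(min_overfill(buckets, c=400, eps=1))  # -> (401, (0, 1, 2, 5)): overfill 1
```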

To the authors' knowledge, none of the literature to date on solving subset sum problems has used normally distributed data in testing the performance of the solution techniques. Previous testing has assumed numbers to be uniformly distributed integers between zero and some large number (or, similarly, real numbers between 0 and 1 with fixed precision), or the data has been generated according to some scheme that is known to be difficult to solve by branch and bound and dynamic programming (Chvatal, 1980). Our application leads us to experiment with other distributions and, in particular, distributions with smaller coefficients of variation than those typically assumed. As will be seen in the results, the distribution of item weights turns out to have a significant effect on algorithm performance.

2 Linkages to Number Partitioning

The subset sum problem is similar in nature to another well-known combinatorial optimization problem, the number partitioning problem. The number partitioning problem aims to split a set of numbers in two so that the difference between the sums of the two sets is minimized. Thus number partitioning is a special case of subset sum where c is exactly half the total weight $T = \sum_i w_i$. The subset sum problem can be converted to number partitioning by adding a new dummy variable with weight equal to twice the required amount less the total of all the items, i.e. $2c - T$ (Korf, 1998). The dummy variable essentially forces one of the sets to be either larger or smaller in order to get (close to) the required amount in one of the sets.

An important difference is that number partitioning can be viewed as a two-sided problem, in that the goal is to get as close as possible to the target sum regardless of whether the proposed solution is larger or smaller than the target. The subset sum problem is typically formulated to be one-sided, in that we want to get as close as possible to the target sum without going under the target. When the target sum is exactly half the total sum (as in number partitioning), the one-sided problem is as easy as the two-sided problem: after partitioning, one set will be slightly below the target while the other will be slightly above, so one simply chooses the set meeting the desired requirement (i.e. under or over the target). When the target sum is not exactly half the total, and the problem is one-sided, additional complications arise. However, number partitioning based algorithms can be successfully adapted to the subset sum problem in certain cases, as we will see subsequently. Clearly the application motivating our study requires a one-sided approach, since failing to meet the printed package weight can result in substantial fines for a manufacturer.

Karmarkar and Karp (1982) proposed a very efficient, O(n log n), and effective differencing heuristic, which we will refer to as the KK heuristic, for solving number partitioning problems. Essentially the heuristic sorts the items from largest to smallest and places them on a list. It then removes the two largest items from the list, one going into one set and the other into the opposite set. The difference between the two items is then merged back into the list so as to maintain the decreasing value order. This continues until one item remains on the list, whose value is the difference in the partition.
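To make the differencing step concrete, here is a minimal sketch of the KK heuristic as described above (an assumed implementation; the paper itself presents no code):

```python
# Karmarkar-Karp differencing heuristic: repeatedly replace the two largest
# values with their difference (committing them to opposite sets); the last
# remaining value is the heuristic partition difference.
import heapq

def kk_difference(items):
    heap = [-x for x in items]        # max-heap via negation
    heapq.heapify(heap)
    while len(heap) > 1:
        largest = -heapq.heappop(heap)
        second = -heapq.heappop(heap)
        heapq.heappush(heap, -(largest - second))
    return -heap[0] if heap else 0

print(kk_difference([8, 7, 6, 5, 4]))  # prints 2; the optimal difference is 0
```

On this instance KK returns 2 although a perfect partition ({8, 7} versus {6, 5, 4}) exists, which illustrates why the heuristic is worth embedding in a complete search.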
Korf (1998) extended the concept of the KK heuristic by incorporating it into a branch and bound framework called the Complete Karmarkar-Karp (CKK) algorithm. At each branch there are two options: consecutive items on the list are either subtracted (placed in opposite sets) or added together (placed in the same set). If at any node the value of the first item is greater than the sum of the remaining elements, then putting the largest item on one side and all remaining elements on the other will result in the minimum difference between the two partitions. Thus no further branching is required from such a node, as the optimal solution among all its successors is at hand.
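A compact sketch of this branch and bound (again an assumed implementation of Korf's description, not the authors' code):

```python
# Complete Karmarkar-Karp for two-way partitioning: branch on the two
# largest values, trying their difference (opposite sets) before their sum
# (same set), and prune when the largest value dominates the rest.
def ckk(items):
    best = [float("inf")]

    def search(lst):
        if best[0] == 0:                 # perfect partition already found
            return
        if len(lst) == 1:
            best[0] = min(best[0], lst[0])
            return
        rest_sum = sum(lst[1:])
        if lst[0] >= rest_sum:           # dominance: no branching needed
            best[0] = min(best[0], lst[0] - rest_sum)
            return
        a, b, rest = lst[0], lst[1], lst[2:]
        search(sorted([a - b] + rest, reverse=True))  # difference branch first
        search(sorted([a + b] + rest, reverse=True))  # same-set branch

    search(sorted(items, reverse=True))
    return best[0]

print(ckk([8, 7, 6, 5, 4]))  # prints 0 (perfect partition)
```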

Korf (1998) tested this procedure using integers with 10 significant digits drawn from a uniform distribution from 0 to 10^10. It was found that the hardest problems to solve at this precision were those with around 35 to 36 items. What was unclear from Korf (1998) was the effect of the distribution and variance of the population from which the items were drawn. In particular, we are interested in how this affects the speed with which we can find a solution to the problem.

3 Seeded CKK

In order to overcome the problem of initially having many infeasible solutions when applying Korf's algorithm (CKK) to the subset sum problem, we propose a seeded CKK variant that we have termed SCKK. Essentially it starts the branch and bound search with a known feasible solution, which could be any solution to the problem. In our experiments we create this feasible starting solution from the KK solution by moving one or more elements between the two sets. This solution is then improved by a one-pass heuristic that tries to swap taken items (items in the subset) with non-taken items of smaller value, in order to reduce the total value of the solution while still maintaining feasibility.

Given a starting feasible solution, the next step is to determine the starting set of branch decisions (i.e. the path in the branch and bound tree) that will produce our predefined starting solution. This is done by traversing down the branch and bound tree in a recursive fashion. At each node in the tree, the starting-solution values of the next two items on the list are compared. If they are in the same set in our starting solution, then the same-side branch is taken; if they are on opposite sides, then a different-side branch is defined. When calculated items are involved, the value of the original parent item is used, i.e. the value of the largest original item involved in the addition or subtraction calculation. Once the path from the root node representing our seed solution is defined, we then start the tree search using depth first search with backtracking. The branch and bound simply switches from the original branching decision to the opposite one as we backtrack at each level along the depth-first path defined by the seed solution. When branching from previously unvisited nodes, we again follow the difference-first rule.
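A sketch of the one-pass improvement step applied to the seed solution (an assumed implementation; details such as the scan order are our guesses):

```python
# One-pass improvement: try to swap each taken item for a smaller non-taken
# item whenever the swap keeps the subset sum feasible (>= c), reducing
# the total overfill of the seed solution.
def improve_seed(weights, taken, c):
    total = sum(weights[i] for i in taken)
    for i in sorted(taken, key=lambda k: -weights[k]):
        candidates = sorted(set(range(len(weights))) - taken,
                            key=lambda k: weights[k])
        for j in candidates:
            if weights[j] < weights[i] and total - weights[i] + weights[j] >= c:
                taken.remove(i)
                taken.add(j)
                total += weights[j] - weights[i]
                break                     # at most one swap per taken item
    return total, taken

# Hypothetical data: seed {0, 2, 4} sums to 306; one pass reduces it to 300
print(improve_seed([102, 97, 103, 99, 101, 96], {0, 2, 4}, c=300))
```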

4 Enumeration

Enumeration can be an efficient method for solving small subset sum problems if we carefully choose the order in which the variables are evaluated and predefine the values of the variables so as to start the search in a promising area of the solution space. The approach we found to be the most effective is presented in Figure 1.

Generate a feasible solution and form two sets: taken = { i : x_i = 1 }, nottaken = { i : x_i = 0 }
Sort taken such that w_i <= w_{i+1} for all i in taken
Sort nottaken such that w_j <= w_{j+1} for all j in nottaken
B = sum of { w_j : j in taken }                          // best solution value found to date
Enumerate_Taken(taken, nottaken, sum of { w_j : j in taken })

Enumerate_Taken(taken, nottaken, T)
    if (taken is not empty)
        i = first element in taken
        Enumerate_Taken(taken - i, nottaken, T)                // test first with item i taken
        if (c <= T - w_i + sum of { w_j : j in nottaken })     // a feasible solution is still reachable
            Enumerate_Taken(taken - i, nottaken, T - w_i)      // test with item i not taken
        end if
    else
        Enumerate_NotTaken(nottaken, T)
    end if

Enumerate_NotTaken(nottaken, T)
    if (nottaken is empty) or (T >= c)
        if (T < B) then B = T
        if (T < c + tolerance) then terminate the enumeration
    else
        i = first element in nottaken
        if (T + w_i < B) and (c <= T + sum of { w_j : j in nottaken })
            Enumerate_NotTaken(nottaken - i, T + w_i)          // test first with item i taken
            Enumerate_NotTaken(nottaken - i, T)                // test with item i not taken
        end if
    end if

Figure 1. The Enumeration Algorithm

When the number of items required is small this procedure is very fast, due to the bound in Enumerate_NotTaken; however, as the problem size increases, the number of combinations increases exponentially and enumeration starts to take an excessive amount of time to solve the problem.
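The following runnable Python rendering of Figure 1 is a sketch (variable names and data are ours; ascending weight order is what makes the bound in the not-taken phase safe for both branches):

```python
# Sketch of the Figure 1 enumeration: first revisit the seed's taken items,
# then extend with not-taken items, pruning on the incumbent B and on
# whether c is still reachable.
def enumerate_subset(weights, taken, nottaken, c, tol):
    taken = sorted(taken, key=lambda i: weights[i])
    nottaken = sorted(nottaken, key=lambda i: weights[i])
    state = {"B": sum(weights[i] for i in taken), "done": False}

    def enum_nottaken(nt, T):
        if state["done"]:
            return
        if not nt or T >= c:
            if c <= T < state["B"]:
                state["B"] = T
            if c <= T < c + tol:
                state["done"] = True              # within tolerance: stop
            return
        i, rest = nt[0], nt[1:]
        if T + weights[i] < state["B"] and c <= T + sum(weights[j] for j in nt):
            enum_nottaken(rest, T + weights[i])   # take item i
            enum_nottaken(rest, T)                # leave item i

    def enum_taken(tk, T):
        if state["done"]:
            return
        if tk:
            i, rest = tk[0], tk[1:]
            enum_taken(rest, T)                   # keep item i taken
            if c <= T - weights[i] + sum(weights[j] for j in nottaken):
                enum_taken(rest, T - weights[i])  # drop item i
        else:
            enum_nottaken(nottaken, T)

    enum_taken(taken, state["B"])
    return state["B"]

w = [102, 97, 103, 99, 101, 96]                   # hypothetical bucket weights
print(enumerate_subset(w, [0, 1, 2], [3, 4, 5], c=300, tol=0.5))  # -> 300
```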

5 Direct Search

Direct search is a new method for solving subset sum problems. It is based on the premise that, for a large set of items, it is possible to fine-tune a solution in a systematic way that keeps the number of items taken constant and, in the process, eliminates large areas of the solution space, improving efficiency significantly. The performance of this search relies on the starting solution having the same number of items as a solution that is within the required tolerance. If the difference between the number of items in the starting solution and the number of items in a desired solution is large, then the computational time required to find the final solution can be large. Thus this algorithm is better suited to certain problem data sets, particularly those with smaller coefficients of variation.

The starting solution is the same as the one used for the seeded CKK procedure. If a solution is not found with the current number of items taken, then the number of items is systematically changed (one added, one subtracted, two added, two subtracted, etc.) in order to try every possible combination. The number of taken items is always kept within a valid range, determined from the number of items taken when the smallest items are used, giving the maximum number of items required for a valid solution, and the number of items taken when the largest items are used, giving the minimum number of items required for a valid solution. The direct search is summarised in Figure 2.

Use a heuristic to find a feasible solution, x, to the problem
Calculate the minimum and maximum number of items that could be taken
Starting with the current solution, x:
Repeat
    Increase the value of the solution by swapping taken items with non-taken items of higher value
    R = 1
    Repeat
        Try improving the solution by swapping R taken items with R non-taken items of lower value,
            then swap other taken items with items of higher value
        If at any time a solution within the tolerance range is found, terminate the search
        R = R + 1
    Loop until R > number of items taken
    Change the number of items taken in the original solution by adding or subtracting items
Loop until all numbers of items in the possible range, from the minimum to the maximum, have been tried

Figure 2. The Direct Search Algorithm

This process can take a long time if a solution within the tolerance level has a significantly different number of taken items than the starting solution, or if there is no solution within the tolerance for the problem being solved. The procedure outlined above is guaranteed to find the optimal solution, and for small problem sizes it is often necessary to adjust the number of items taken from that produced by the starting solution. However, when we have a large number of items to take there are usually many solutions that will satisfy our tolerance level, and it is more common for there to be a solution within the tolerance level with the initial number of items taken.

In order to speed up the process for larger problems we can define a heuristic search, based on the processes above, to find a solution. Rather than investigating all swaps involving 1, 2, 3, etc. items, the heuristic search only considers swaps of a single taken item with a lower-valued item. Once the best candidate move is identified, the move is made and the process repeats from the new solution. The heuristic terminates only when a solution is within the tolerance. The danger, of course, is that we may not find a solution within tolerance when one exists; for larger problems this seems unlikely.
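A sketch of this heuristic variant (HDS) as we read it; the "best candidate" rule of choosing the swap that most reduces the sum while staying feasible, and the move cap, are our assumptions:

```python
# Heuristic direct search (sketch): repeatedly make the single
# taken/non-taken swap that most reduces the subset sum while keeping it
# >= c; stop once the sum falls within [c, c + tol).
def heuristic_direct_search(weights, taken, c, tol, max_moves=100000):
    total = sum(weights[i] for i in taken)
    for _ in range(max_moves):            # guard: pure heuristic may not halt
        if c <= total < c + tol:
            return total, taken
        best_delta, best_swap = 0.0, None
        for i in taken:
            for j in range(len(weights)):
                if j in taken:
                    continue
                delta = weights[j] - weights[i]   # negative = less overfill
                if delta < best_delta and total + delta >= c:
                    best_delta, best_swap = delta, (i, j)
        if best_swap is None:
            return total, taken           # no improving feasible swap remains
        i, j = best_swap
        taken.remove(i)
        taken.add(j)
        total += best_delta
    return total, taken

# Hypothetical data: 302 -> 300 in one swap, within the 0.5 tolerance
print(heuristic_direct_search([102, 97, 103, 99, 101, 96], {0, 1, 2},
                              c=300, tol=0.5))  # -> (300, {0, 1, 4})
```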

6 Computational Experiments

The above solution algorithms and heuristics were coded in Microsoft C++ V6.0, and all of the following computational experiments were run on an AMD Athlon XP CPU running at 1.53 GHz with 1 GB RAM under the Windows XP operating system.

6.1 Small Problems

In order to test the various techniques discussed in this paper, we solved 50 randomly generated problems with 32 containers. The required amounts (subset totals) tested were c = 8, 10 and 16. To test whether different data distributions affect the results, we used both normal and uniform distributions. The normal distributions all had a mean of 1 with three different standard deviations: 0.1, 0.2 and 0.3. For the uniform distributions, the maximum and minimum were set to 1 ± 3 standard deviations. All problems were solved to a tolerance of 10^-8.

We then ran these problems using the algorithms described previously, with the exception of the direct search. We found that this technique took a long time to solve 32-item problems and was not at all competitive with the other techniques proposed. The reason for this was that the performance of the search depends on having a starting point with the same number of containers as a solution within the tolerance levels. This was not always possible for small problems; hence it took the search a great deal of time to find a solution that would meet the tolerance requirement. In some cases no solution was within the tolerance levels, and the direct search ended up effectively doing a complete enumeration of the solution space. The results for these computational experiments are outlined in Table 1.

                     |      c = 16          |      c = 10           |      c = 8
Distribution         | Enum   CKK    SCKK   | Enum   CKK     SCKK   | Enum   CKK    SCKK
Normal (1, 0.1)      | 7.44   5.71   14.74  | 1.97   21.77   22.82  | 0.38   6.43   6.29
Uniform [0.7, 1.3]   | 5.51   3.26   13.08  | 2.56   22.19   24.70  | 0.46   6.48   7.22
Normal (1, 0.2)      | 9.71   3.73   11.21  | 2.26   16.68   18.02  | 0.44   5.22   5.89
Uniform [0.4, 1.6]   | 6.07   3.75   7.54   | 3.22   11.02   15.66  | 0.86   4.64   6.14
Normal (1, 0.3)      | 7.16   4.58   8.20   | 2.42   13.08   12.49  | 0.61   4.31   5.22
Uniform [0.1, 1.9]   | 7.04   2.70   3.38   | 4.23   5.13    6.00   | 1.46   2.41   5.60
Average              | 7.10   3.60   8.68   | 2.94   13.62   15.37  | 0.76   4.61   6.01

Table 1. Average computational times (in seconds) for 32-container problems

From Table 1 we can draw the following conclusions:

- CKK is the fastest technique when the required amount is 16, while enumeration is faster when the required amounts are smaller (10 and 8). This can be explained by the fact that the subset sum problem is most similar to number partitioning when the requirement is 16, and CKK is number partitioning based. Enumeration becomes orders of magnitude faster as the number of items necessary to meet the requirement is reduced. On the other hand, CKK and SCKK were slowest on problems with a requirement of 10, as the structure of the problem becomes more difficult to solve with one larger dummy variable dominating the KK process. As the requirement reduces further, the bounds start to make the solution process easier for CKK and SCKK.
- The distribution of the data can have an impact on the performance of the search. Generally, enumeration appears to perform better on uniform data when the requirement is 16, but prefers normal data when the requirement is 8 or 10. There does not appear to be a simple explanation for the CKK and SCKK results. Generally, enumeration's computational times worsen as the variance in the data increases, whereas the performance of CKK and SCKK improves as the variance in the data increases.
- In most cases CKK outperforms SCKK. In many cases this is due to the extra overhead required to implement SCKK; in other cases, the KK solution may simply be closer to a good solution than the solution with which we seeded the search. As indicated previously, this conclusion is valid only for the starting point generated by our feasibility restoration heuristic; other starting points may produce better or worse results for any problem instance.

Further statistical analysis was carried out on the effect that the distribution had on the computational time of each technique.
From this analysis it was found that in some instances there were statistically significant differences in the computational times when the problem data was generated from different distributions. As would be expected, we found that significant differences between distributions became more likely as the variance increased. Of the 54 comparisons between the means of CKK and SCKK using normal and uniform data, 24 were found to be statistically significantly different. In all but two of these cases, CKK and SCKK found it more difficult to solve problems drawn from a normal distribution than from a uniform distribution; in both exceptions the standard deviation of the data was 0.1. Of the 27 comparisons between the means of the enumeration using normal and uniform data, 16 were found to be statistically significantly different. In all but two of these cases, the enumeration found it more difficult to solve problems drawn from a uniform distribution than from a normal distribution; once again, both exceptions occurred when the standard deviation of the data was 0.1. From this we can conclude that, in general, the Karmarkar-Karp-based algorithms perform best on data drawn from uniform distributions, whereas the enumeration algorithm performs best on data drawn from a normal distribution.

6.2 Larger Problems

Larger problems are in some ways easier to solve than small problems, as the number of combinations of items that meet the given tolerance range increases. However, the techniques used to solve smaller problems, which by nature have to be very thorough and complete, may not be as successful on larger problems with exponentially large search spaces. For these problems the search needs to be taken quickly to a promising solution, which is then manipulated in order to achieve a satisfactory result.

In order to test the scalability of the solution techniques, we randomly generated thirty problems of 200 items each, with required amounts of 100 and 50. Normal distributions with mean 1 and standard deviations of 0.1, 0.2 and 0.3 were used to generate the item data. All problems were solved to a tolerance of 10^-8. In pilot tests we attempted to solve these problems with enumeration and CKK, but the CPU times required to solve some of the problems were excessive; hence these techniques were abandoned for the larger problems. The SCKK algorithm also required excessive computation times for some instances with a standard deviation of 0.1, so no results are reported for SCKK at 0.1; at higher variances solutions were obtained.

Along with the techniques proposed in this paper, we also ran this experiment using a modified version of the Random Greedy Local Improvement (RGLI) algorithm proposed by Przydatek (2002). The two main changes made to this algorithm were that the objective was to minimise the weight while remaining over the constraint value, and that instead of executing a predefined number of trials, the algorithm was terminated only when it found a solution within the given tolerance level. The latter modification means that if the tolerance level is too small for the size of the problem the algorithm could potentially run indefinitely; with large problems, however, this is unlikely.
The results for these experiments are outlined in Table 2, where the HDS rows give the results for the heuristic-based direct search, DS the results for the direct search algorithm, SCKK the results for the seeded Complete Karmarkar-Karp algorithm, and RGLI the results for the algorithm proposed by Przydatek (2002), modified as indicated previously. A dash ( - ) in the table indicates that solutions could not be generated within an acceptable time period.

                      |    Required = 100        |    Required = 50
Data SD    Algorithm  |   Mean       Std Dev     |   Mean        Std Dev
0.1        HDS        |   0.045      0.053       |   0.039       0.055
0.1        DS         |   0.044      0.050       |   416.793     1725.345
0.1        SCKK       |   228.183    968.097     |   -           -
0.1        RGLI       |   1.174      1.158       |   0.780       0.688
0.2        HDS        |   0.025      0.053       |   0.043       0.047
0.2        DS         |   0.024      0.053       |   0.061       0.076
0.2        SCKK       |   11.767     23.186      |   2482.854    2758.007
0.2        RGLI       |   1.538      2.368       |   1.222       1.087
0.3        HDS        |   0.108      0.141       |   0.069       0.063
0.3        DS         |   0.110      0.141       |   0.068       0.061
0.3        SCKK       |   3.697      6.828       |   1908.878    2184.913
0.3        RGLI       |   2.890      3.871       |   2.380       2.682

Table 2. Averages and standard deviations of computational times (in seconds) for 200-container problems

From these results we can conclude:

- For large problems, the HDS, DS and RGLI algorithms are clearly superior to the SCKK algorithm.
- HDS is clearly superior to DS when the requirement is 50 and the data standard deviation is 0.1; when the data standard deviation is higher, the results of the two schemes are very similar.
- HDS outperforms RGLI by an order of magnitude, and DS outperforms RGLI by an order of magnitude except when the requirement is 50 and the data standard deviation is 0.1.
- SCKK performs best on problem data with high variance. We note that previous experiments in the literature, in which uniform integer data were generated between 0 and some large number, reflect a case with an extremely high variance.
- For HDS and DS, the higher the problem data variance, the higher the variance in the solution times. For RGLI, the higher the data variance, the longer the solution times, while the smaller the requirement, the shorter the solution times.

The distribution of the numbers, and particularly the coefficient of variation, has a significant effect on the performance of the various algorithms. The problem size has an even greater effect on relative algorithm performance. Indeed, it is quite interesting to note that the best techniques for solving large problems are the worst techniques for solving smaller problems, and vice versa. This last observation casts serious doubt on the tendency of some researchers, especially those in the heuristics area, to advocate techniques that have been proven on small problems as the best techniques for solving larger problems.

7 Conclusions

In this paper we have presented two new techniques for solving the subset sum problem. The first was an enumeration approach; the second, which we have called direct search, directly manipulates a solution into a better one. We have also proposed a modification of the Complete Karmarkar-Karp algorithm of Korf (1998) that uses a starting feasible solution to seed the branch and bound tree.

Computational experiments have shown that the best technique for solving the problem depends on the combination of the number of items in the problem, the requirement constraint, and the amount of variance in the data. The enumeration technique proposed performs very well on small problems with a small requirement. The direct search approach proposed in this paper performs extremely well on large problems, running an order of magnitude faster than the Random Greedy Local Improvement technique proposed by Przydatek (2002).

In terms of the application that inspired this research, the results indicate that it would be possible to design a packing machine with many more bins than is currently the norm, and that it would be practical to solve the subset sum problem to the required level of accuracy in real time. The advantage of these changes would be that the containers would hold smaller weights, and the net effect would be less overfilling of packages and therefore less wastage for the organisation.

References

Chvatal, V. 1980. Hard knapsack problems. Operations Research 28:1402-1411.
Fischetti, M. 1990. A new linear storage, polynomial-time approximation scheme for the subset-sum problem. Discrete Applied Mathematics 26:61-77.
Garey, M.R., D.S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco.
Gens, G., E. Levner. 1994. A fast approximation algorithm for the subset-sum problem. INFOR 32:143-148.
Ghosh, D., N. Chakravarti. 1999. A competitive local search heuristic for the subset sum problem. Computers and Operations Research 26:271-279.
Karmarkar, N., R.M. Karp. 1982. The differencing method of set partitioning. Technical Report UCB/CSD 82/113, Computer Science Division, University of California, Berkeley, CA.
Korf, R.E. 1998. A complete anytime algorithm for number partitioning. Artificial Intelligence 106:181-203.
Martello, S., P. Toth. 1984a. A mixture of dynamic programming and branch-and-bound for the subset-sum problem. Management Science 30:765-771.
Martello, S., P. Toth. 1984b. Worst-case analysis of greedy algorithms for the subset sum problem. Mathematical Programming 28:198-205.
Martello, S., P. Toth. 1985. Approximation schemes for the subset-sum problem: survey and experimental analysis. European Journal of Operational Research 22:56-69.
Pisinger, D. 1999. Linear time algorithms for knapsack problems with bounded weights. Journal of Algorithms 33:1-14.
Przydatek, B. 2002. A fast approximation algorithm for the subset-sum problem. International Transactions in Operational Research 9:437-459.