Solving Large Problems with Heuristic Search: General-Purpose Parallel External-Memory Search


Journal of Artificial Intelligence Research 62 (2018). Submitted 12/2017; published 6/2018.

Matthew Hatem, Ethan Burns, Wheeler Ruml
Department of Computer Science, University of New Hampshire, Durham, NH USA
mhatem at cs.unh.edu, eaburns at cs.unh.edu, ruml at cs.unh.edu

Abstract

Classic best-first heuristic search algorithms, like A*, record every unique state they encounter in RAM, making them infeasible for solving large problems. In this paper, we demonstrate how best-first search can be scaled to solve much larger problems by exploiting disk storage and parallel processing and, in some cases, slightly relaxing the strict best-first node expansion order. Some previous disk-based search algorithms abandon best-first search order in an attempt to increase efficiency. We present two case studies showing that A*, when augmented with Delayed Duplicate Detection, can actually be more efficient than these non-best-first search orders. First, we present a straightforward external variant of A*, called PEDAL, that slightly relaxes best-first order in order to be I/O efficient in both theory and practice, even on problems featuring real-valued node costs. Because it is easy to parallelize, PEDAL can be faster than in-memory IDA* even on domains with few duplicate states, such as the sliding-tile puzzle. Second, we present a variant of PEDAL, called PE2A*, that uses partial expansion to handle problems that have large branching factors. When tested on the problem of Multiple Sequence Alignment, PE2A* is the first algorithm capable of solving the entire Reference Set 1 of the standard BAliBASE benchmark using a biologically accurate cost function. This work shows that classic best-first algorithms like A* can be applied to large real-world problems. We also provide a detailed implementation guide with source code, both for generic parallel disk-based best-first search and for Multiple Sequence Alignment with a biologically accurate cost function. Given its effectiveness as a general-purpose problem-solving method, we hope that this makes parallel and disk-based search accessible to a wider audience.

1. Introduction

Best-first graph search algorithms such as A* (Hart, Nilsson, & Raphael, 1968) are widely used for solving problems in artificial intelligence. Graph search algorithms typically maintain an open list, containing nodes that have been generated but not yet expanded, and a closed list, containing all expanded nodes, 1 in order to prevent duplicated search effort when the same state is generated via multiple paths. As the size of problems increases, the memory required to maintain the open and closed lists makes algorithms like A* impractical. For example, an application of A* to random instances of the 15-puzzle using the Manhattan distance heuristic will exhaust 8 GB of RAM in approximately two minutes on a modern computer (Burns, Hatem, Leighton, & Ruml, 2012).

1. This data structure's name is indeed unfortunate, as it often holds more than just closed nodes in order to catch duplicates on the frontier and is rarely implemented as a list!

© 2018 AI Access Foundation. All rights reserved.

The A* algorithm, as it is described in most of the literature, cannot scale beyond what are considered easy problems today. This motivates linear-space variants of A* that are able to solve problems that A* cannot while using only a fraction of the memory. Iterative Deepening A* (IDA*, Korf, 1985) and Recursive Best-First Search (RBFS, Korf, 1993) achieve linear-space complexity by eliminating the open and closed lists. As a result, they are limited to a narrow class of problems: those that do not form highly connected spaces. Without a closed list, these algorithms cannot detect duplicate paths to the same state and are doomed to explore the same states repeatedly before finding a solution. For example, a depth-first search to depth three on a grid with four-way movement generates 52 states, while a breadth-first search that recognizes duplicates generates only 26. Furthermore, in the absence of an open list these methods use a depth-first search. For IDA*, this means a best-first search order is only possible if the heuristic is admissible. RBFS simulates a best-first search order even with an inadmissible heuristic but, like IDA*, it suffers from unbounded node regeneration overhead on problems that do not exhibit a narrow range of edge costs (however, see Hatem, Kiesel, & Ruml, 2015; Burns & Ruml, 2013; Russell, 1992).

IDA* and RBFS work well when there are few duplicates and a narrow range of edge costs. However, real-world problems such as Multiple Sequence Alignment form highly connected search spaces and require a wide range of edge costs to model biologically plausible results. In order to apply heuristic search to large problems that form highly connected spaces, we need scalable techniques for processing duplicates. In this paper we define scalable techniques as those that are capable of exploiting external memory and multiple CPUs to solve larger problems efficiently.

External memory search algorithms take advantage of cheap secondary storage, such as magnetic disks, to solve much larger problems than algorithms that only use main memory. A naïve implementation of disk-based A* would exhibit poor performance because it relies on random access to process duplicates. The closed list, normally stored as a hash table in RAM, provides quick random access to states that have already been explored by the search. While sequential access to disk can take upwards of two orders of magnitude longer than accessing RAM, random access to disk can take several orders of magnitude more time than sequential access. Storing the closed list as a hash table on disk is therefore impractical. To implement an efficient disk-based best-first search, great care must be taken to access data sequentially, minimizing seeks and exploiting caching. The same techniques used by external search can be used to distribute search effort across multiple CPUs.

This paper presents simple modifications to classic A* search and demonstrates that they result in a general scalable algorithm: one that can exploit external storage and additional CPUs to solve larger problems efficiently. In section 2, we discuss the technique of delayed duplicate detection in detail and present empirical results for an efficient external memory variant of A* (A*-HBDDD).
As far as we are aware, we are the first to present results for HBDDD using A* search, other than the anecdotal results mentioned briefly by Korf (2004). These results provide evidence that A*-HBDDD performs well on unit-cost domains and that efficient parallel external-memory search can surpass serial in-memory search. Although many regard disk-based search as slow and unwieldy, we hope this result encourages practitioners to take another look at these techniques.

In section 3, we show that previous approaches are unable to solve problems that exhibit a wide range of edge costs. To this end, we introduce Parallel External Dynamic A* Layering (PEDAL; Hatem, Burns, & Ruml, 2011), an extension of A*-HBDDD that is able to solve problems with arbitrary costs. In section 4, we introduce the problem of Multiple Sequence Alignment (MSA). Previous approaches do not scale to the hardest instances of a popular MSA benchmark. In this section, we introduce a second extension to A*-HBDDD that uses the technique of partial expansion to solve the entire benchmark set. This work demonstrates that parallel external-memory search does not need to completely abandon the best-first search principle. Instead, only a small relaxation is needed to significantly improve its efficiency. We hope that it demystifies scalable search and encourages wider use of these techniques, which match so well with modern multi-core commodity hardware.

2. External Memory Search

Delayed Duplicate Detection (DDD, Korf, 2003) is a simple way to make use of external storage: it places newly generated nodes in external memory and then processes them at a later time. The original description of DDD, also referred to as sorting-based DDD (SBDDD), divides the search process into two phases, an expand phase and a merge phase. The expand phase writes newly generated nodes directly to a file on disk. The merge phase performs a disk-based sort on the file so that duplicate nodes are brought together. Duplicate merging is accomplished by performing a linear scan of the sorted file, writing only unique nodes to a new file. This newly merged file becomes the search frontier for the next expand phase. The search continues, interleaving expand and merge phases, until a goal node is expanded. Files can in theory be made arbitrarily small, and only one file needs to be kept in memory at a time. Unfortunately, the time complexity of this technique is O(n log n), where n is the total number of nodes encountered during search. For large problems, this incurs more overhead than is desirable.

Structured Duplicate Detection (SDD, Zhou & Hansen, 2004) is an alternative to DDD that exploits connectivity in the state space to avoid writing duplicates to disk. SDD uses a projection function to localize memory references and performs duplicate merging immediately in main memory. Unlike DDD, SDD does not store duplicate states to disk and requires less external storage. However, this efficiency comes at the cost of increased time complexity, as SDD can read and write the same states to disk multiple times during duplicate processing (Zhou & Hansen, 2009). The benefits of SDD are limited by the amount of main memory available on a single machine, and it is not obvious how to deploy SDD in a distributed setting. In this paper we focus on DDD because it is simple to implement and has been shown to easily scale beyond a single machine.

External A* (Edelkamp, Jabbar, & Schrödl, 2004) combines A* with a variant of SBDDD whereby nodes with the same g and h values are grouped together in a bucket which maps to a file on external storage. The search proceeds by iteratively expanding layers of buckets for which the g and h values sum to the minimum f value among the search frontier.
Delayed duplicate detection is performed by appending nodes to their respective buckets and later sorting and scanning each bucket to eliminate duplicate nodes.
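This sort-then-scan duplicate elimination is common to SBDDD and External A*. The following Java sketch shows one expand/merge cycle, simplified to run in memory for clarity (the State type with its key() and expand() methods is a hypothetical placeholder, and a real implementation sorts files on disk rather than lists in RAM):

import java.util.*;

// One expand/merge cycle of sorting-based DDD, simplified to run in memory.
// key() is assumed to return a comparable canonical encoding of the state.
static List<State> expandAndMerge(List<State> frontier) {
    List<State> next = new ArrayList<>();
    for (State s : frontier)                        // expand phase: emit successors
        next.addAll(s.expand());
    next.sort(Comparator.comparing(State::key));    // sorting brings duplicates together
    List<State> unique = new ArrayList<>();
    for (State s : next)                            // merge phase: a linear scan
        if (unique.isEmpty()                        // keeps one copy of each state
                || !unique.get(unique.size() - 1).key().equals(s.key()))
            unique.add(s);
    return unique;                                  // becomes the next search frontier
}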

In External A*, the g and h buckets must be expanded in lowest-g-value-first order, which is equivalent to the A* search order with worst-case tie-breaking and can therefore result in many more node expansions than a regular A* search. Moreover, because of the way buckets are organized according to two values, it is not obvious how to dynamically relax the best-first search order.

To avoid the overhead of disk-based sorting, Korf (2004) presents an efficient form of DDD called Hash-Based Delayed Duplicate Detection (HBDDD). HBDDD uses two hash functions: one to assign nodes to buckets (which map to files on disk) and a second to identify duplicate states within a bucket. Because duplicate nodes hash to the same value, they are always assigned to the same file, so when removing duplicate nodes, only the nodes in a single file need to be in main memory. This technique increases the minimum memory requirement over SBDDD, since the largest bucket must fit in main memory. However, this is easily achieved by using a hash function with an appropriate range. HBDDD has been shown to perform better than SBDDD when a search is limited by time rather than by available storage (Korf, 2016).

Korf (2008a) described how HBDDD can be combined with A* search (A*-HBDDD). A*-HBDDD proceeds in two phases: an expansion phase and a merge phase. In the expansion phase, all nodes whose f value equals the minimum solution cost estimate f_min over all open nodes are expanded. Unlike External A*, nodes are not grouped according to their g and h values, so each bucket containing qualifying nodes must be scanned. The expanded nodes and the newly generated nodes are stored in their respective files. We define a recursive expansion to be an expansion applied immediately to a generated node, without any duplicate checking: if a generated node has an f value within f_min, it is recursively expanded. Once all nodes within f_min are expanded, the merge phase begins: each file is read into a hash table in main memory and duplicates are removed in linear time. During the expand phase, HBDDD requires only enough memory to read and expand a single node from the open file; successors can be stored to disk immediately. During the merge phase, it is possible to process a single file at a time to reduce main memory requirements.

HBDDD may also be used as a framework to parallelize search (Korf, 2008a). Because duplicate states are located in the same file, the merging of delayed duplicates can be done in parallel, with each file assigned to a different thread. Expansion may also be done in parallel: as nodes are generated, they are stored in the file specified by the hash function. Two threads might generate nodes that need to be placed in the same file, so a lock (often provided by the OS) must be placed around each file to give a thread exclusive access while writing. A carefully constructed hash function, one that bounds the number of buckets that need to be written to when expanding a node, can help minimize lock contention; see the literature on SDD, for example the work by Burns, Lemons, Ruml, and Zhou (2010), for discussion of abstraction-based hashing and the balance between locality and parallelism. For our experiments we verified that a lock was provided by examining the source code of the I/O modules; the source code of the Glibc standard library, for example, does contain such a lock.
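The two hash functions and the per-file locking described above might look like the following sketch (the State type, its abstractKey() and serialize() methods, and the file handling are assumptions for illustration, not the implementation evaluated below):

import java.io.*;

// Sketch of HBDDD bucket files. The first hash sends a state to a bucket
// file; within a bucket, equals()/hashCode() on states identify duplicates
// once the file is read into an in-memory hash table during merging.
class BucketFiles {
    private final BufferedWriter[] files;    // one append-only file per bucket

    BucketFiles(BufferedWriter[] files) { this.files = files; }

    int bucketOf(State s) {                  // first hash: state -> bucket/file
        return Math.floorMod(s.abstractKey(), files.length);
    }

    void emit(State s) throws IOException {
        BufferedWriter f = files[bucketOf(s)];
        synchronized (f) {                   // a thread takes exclusive access
            f.write(s.serialize());          // to the file while appending
            f.newLine();
        }
    }
}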
Because the main contributions of this paper build on the framework of A*-HBDDD, we discuss it in detail and present empirical results.

Search(initial)
1.  bound ← f(initial); bucket ← hash(initial)
2.  write(OpenFile(bucket), initial)
3.  while ∃ bucket ∈ Buckets : min f(bucket) ≤ bound
4.    for each bucket ∈ Buckets : min f(bucket) ≤ bound
5.      ThreadExpand(bucket)
6.    if incumbent then break
7.    for each bucket ∈ Buckets : NeedsMerge(bucket)
8.      ThreadMerge(bucket)
9.    bound ← min f(Buckets)

ThreadExpand(bucket)
10. for each state ∈ Read(OpenFile(bucket))
11.   if f(state) ≤ bound
12.     RecurExpand(state)
13.   else append(NextFile(bucket), state)

RecurExpand(n)
14. if IsGoal(n) then incumbent ← n; return
15. for each succ ∈ expand(n)
16.   if f(succ) ≤ bound
17.     RecurExpand(succ)
18.   else
19.     append(NextFile(hash(succ)), succ)
20. append(ClosedFile(hash(n)), n)

ThreadMerge(bucket)
21. Closed ← read(ClosedFile(bucket)); Open ← ∅
22. for each n ∈ NextFile(bucket)
23.   if n ∉ Closed ∪ Open or g(n) < g((Closed ∪ Open)[n])
24.     Open ← (Open − {Open[n]}) ∪ {n}
25. write(OpenFile(bucket), Open)
26. write(ClosedFile(bucket), Closed)

Figure 1: Pseudocode for A*-HBDDD.

2.1 A*-HBDDD in Detail

To understand the algorithm in more detail, we present pseudocode for A*-HBDDD in Figure 1. Search nodes are mapped to buckets using a hash function. Each bucket is backed by a set of three files on disk (with the exception of the init file, our files roughly correspond to those described by Korf, 2008b): 1) a file of frontier nodes that have yet to be expanded, 2) a file of newly generated nodes (and possibly duplicates) that have yet to be checked against the closed list, and 3) a file of closed nodes that have already been expanded.

A*-HBDDD begins by placing the initial node in its respective bucket based on the supplied hash function (lines 1-2). The cost bound for the first iteration is set to the f value of the initial state (line 1). All buckets that contain a state with f less than or equal to the minimum bound are divided among a pool of threads to be expanded (lines 4-20). Alternatively, references to these buckets can be stored in a work queue, guarded by a lock, with free threads acquiring exclusive access to the queue to obtain jobs.

Recall that each bucket is backed by three files: OpenFile, NextFile and ClosedFile. The OpenFile contains all open nodes for a bucket, and the set of OpenFiles among all buckets collectively represents the open list for the search. When processing an expansion job for a given bucket, a thread expands all of the frontier nodes in the bucket's OpenFile whose f values are within the current bound (lines 10-13). Nodes that are chosen for expansion are appended to the ClosedFile for the current bucket (line 20); the set of ClosedFiles among all buckets collectively represents the closed list for the search. Nodes that were not chosen for expansion and successor nodes that exceed the bound are appended to the NextFile for the current bucket (lines 13 & 19); the set of NextFiles collectively represents the search frontier and requires duplicate detection in the following merge phase. Finally, if a successor is generated with an f value that is within the current bound, then it is expanded immediately as a recursive expansion (lines 12 & 17). To improve efficiency, individual states are not written to disk immediately upon generation. Instead, each bucket has an internal buffer to hold states; when the buffer becomes full, the states are written to disk.

If an expansion thread generates a goal state (line 15) within the bound (lines 16 and 11), a reference to the incumbent solution is updated (line 14) and (assuming the heuristic is admissible) the search terminates (line 6). If the heuristic is admissible, then the incumbent is optimal because of the strict best-first search order on f. Solution recovery is performed by walking backward from the goal state, using an inversion operator to generate each parent state along the path to the initial state. This requires storing an inversion operator for each node. Each parent state generated during this solution recovery process needs to be mapped to and loaded from its respective bucket.
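Putting the expansion phase together, here is a compact sketch, with the buffered appends folded into hypothetical Bucket.appendNext and Bucket.appendClosed helpers that flush to disk when their buffers fill (the State type and bucketOf are likewise assumptions, not the paper's source code):

// Sketch of ThreadExpand and RecurExpand (Figure 1, lines 10-20).
void threadExpand(Bucket bucket, double bound) {
    for (State s : bucket.readOpenFile()) {
        if (s.f() <= bound) recurExpand(s, bound);
        else bucket.appendNext(s);            // over the bound: stays on the frontier
    }
}

void recurExpand(State n, double bound) {
    if (isGoal(n)) { incumbent = n; return; }
    for (State succ : n.expand()) {
        if (succ.f() <= bound)
            recurExpand(succ, bound);         // expand immediately, no duplicate check
        else
            bucketOf(succ).appendNext(succ);  // duplicate check deferred to the merge
    }
    bucketOf(n).appendClosed(n);              // n joins the closed list (line 20)
}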
If a solution has not been found, then all buckets that require merging are divided among a pool of threads to be merged in the next phase (lines 7-8). In order to process a merge job, each thread begins by reading the ClosedFile for its bucket into a hash table (line 21) called Closed. A*-HBDDD requires enough internal memory to store all closed nodes and all unique frontier nodes in the buckets currently being merged by active threads. The size of a bucket can easily be tuned by varying the granularity of the hash function. Next, all frontier nodes in the NextFile are streamed in and checked for duplicates against the closed list (lines 22-26).

The nodes that are not duplicates, or that have been reached via a better path and therefore have a lower g value, are written back out to the OpenFile so that they remain on the frontier for later phases of search (lines 23-25). The hash table is updated to contain these nodes as well; all other duplicate nodes are ignored. Finally, the open and closed nodes are flushed to disk (lines 25 and 26).

To save external storage, Korf (2008a) suggests that instead of proceeding in two phases, merge jobs may be interleaved with expansion jobs. With this optimization, a bucket may be merged as soon as all of the buckets containing its predecessor nodes have been expanded. An undocumented ramification of this optimization for HBDDD, however, is that it does not permit recursive expansions: with recursive expansions, one cannot determine the predecessor buckets, and therefore all buckets must be expanded before merges can begin. Our variant of A*-HBDDD implements recursive expansions and therefore does not interleave expansions and merges. One technique for detecting when predecessor nodes have been expanded is the projection function of Structured Duplicate Detection (SDD, Zhou & Hansen, 2004), described above.

2.2 Empirical Results

We evaluated the performance of A*-HBDDD on the sliding-tile puzzle. We compared A*-HBDDD with highly optimized implementations of internal A*, IDA* and Asynchronous Parallel IDA* (AIDA*, Reinefeld & Schnecke, 1994). AIDA* is a parallel version of IDA* that performs a breadth-first search to some specified depth; the resulting frontier is then divided evenly among all available threads, and each thread performs an IDA* search in parallel for each node in its queue. The upper bounds for all IDA* searches are synchronized across all threads so that a strict best-first search order is achieved given an admissible and consistent heuristic. AIDA* can be seen as a parallel approximation to Simplified Memory-Bounded A* (SMA*, Russell, 1992) with large f layers.

To verify that we had efficient implementations of these algorithms, we compared our implementations (in Java) to highly optimized versions of A* and IDA* written in C++ (Burns et al., 2012). The Java implementations use many of the same optimizations. In addition, we use the High Performance Primitive Collection (HPPC) in place of the Java Collections Framework (JCF) for many of our data structures; this improves both the time and memory performance of our implementations (Hatem, Burns, & Ruml, 2013).

We also compared A*-HBDDD to an alternative external algorithm, breadth-first heuristic search (BFHS, Zhou & Hansen, 2006) with delayed duplicate detection (BFHS-DDD). BFHS attempts to reduce the memory requirement of search, in part by removing the need for a closed list. BFHS proceeds in a breadth-first ordering, expanding all nodes within a given upper bound on f at one depth before proceeding to the next depth. To prevent duplicated search effort, Zhou and Hansen (2006) use a strategy first introduced by Korf (1999), which guarantees that, in an undirected graph, checking for duplicates against the previous depth layer and the frontier is sufficient to prevent the search from leaking back into previously visited portions of the space. While BFHS is able to do away with the closed list, for many problems it will still require a significant amount of memory to store the exponentially growing search frontier.
This motivates combining BFHS with HBDDD.
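The frontier-based duplicate detection that BFHS relies on can be sketched as follows (hypothetical State type; an illustration of the idea rather than the implementation evaluated below):

import java.util.*;

// One breadth-first layer of BFHS: expand the nodes within the f bound,
// pruning successors that appear in the previous layer or were already
// generated in this one. In an undirected graph this is enough to keep the
// search from leaking back into previously visited parts of the space.
static Map<Long, State> nextLayer(Collection<State> current, Set<Long> prevKeys, double bound) {
    Set<Long> currKeys = new HashSet<>();
    for (State s : current) currKeys.add(s.key());
    Map<Long, State> next = new HashMap<>();
    for (State s : current) {
        if (s.f() > bound) continue;             // pruned by the upper bound
        for (State succ : s.expand()) {
            if (prevKeys.contains(succ.key()) || currKeys.contains(succ.key()))
                continue;                        // duplicate of an earlier layer
            next.merge(succ.key(), succ,         // keep the cheaper of two paths
                       (a, b) -> a.g() <= b.g() ? a : b);
        }
    }
    return next;                                 // becomes `current` at the next depth
}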

                    Machine  Threads  Time    Expanded          Nodes/Sec
A* (Java)           A        1        925     1,557,459,344     1,683,739
A* (C++)            A        1        516     1,557,459,344     3,018,332
IDA* (Java)         B        1        1,104   18,433,671,328    16,697,166
IDA* (C++)          B        1        634     18,433,671,328    29,075,191
AIDA* (Java)        B        24       -       …,994,333,240     67,542,041
BFHS-DDD (Java)     B        24       3,355   10,978,208,032    3,272,193
A*-HBDDD (Java)     B        24       1,014   3,492,457,298     3,444,237
A*-HBDDD_tt (Java)  B        24       -       3,489,553,397     3,440,077

Table 1: Performance summary on the 100 random 15-puzzle instances from Korf (1985). Times are reported in wall-clock seconds for solving all instances.

Like IDA*, BFHS uses an upper bound on f values to prune nodes. If a bound is not available in advance, iterative deepening can be used. However, since BFHS does not store a closed list, the full path to each node from the root is not maintained, and it must use divide-and-conquer solution reconstruction (Korf, Zhang, Thayer, & Hohwald, 2005) to rebuild the solution path. Our implementation of BFHS-DDD does not perform solution reconstruction, and therefore the results presented give a lower bound on its actual solving times.

The 15-puzzle is a standard search benchmark. We used the 100 instances from Korf (1985) and the Manhattan distance heuristic. For the algorithms using HBDDD, we selected a hash function that maps states to buckets by ignoring everything except the positions of the blank, the one tile and the two tile. This hash function results in 16 × 15 × 14 = 3,360 buckets, and the number of buckets that need to be considered for writing newly generated nodes when expanding a node is bounded by the maximum number of actions applicable in any given state. A random hash function would probably provide even better load balancing among files. We use the minimum f value of any generated node greater than the current bound to update the cost bounds for both A*-HBDDD and BFHS-DDD.

The first set of rows in Table 1 summarizes the performance of internal A*, IDA* and AIDA*. The results for A* were generated on Machine-A, a dual quad-core (8 cores) machine with Intel Xeon processors and 48 GB of RAM; A* needs roughly 30 GB of RAM to solve all 100 instances. All other results were generated on Machine-B, a dual hexa-core (12 cores) machine with Xeon processors, 12 GB of RAM and 12 disks. In-memory A* is not able to solve all 100 instances on this machine due to memory constraints. Our version of AIDA* used 24 threads and generated a frontier of 24,000 nodes, using an A* search, to seed the parallel phase of the search. From these results, we see that the Java implementation of A* is just a factor of 1.7 slower than the most optimized C++ implementation known. These results provide confidence that our comparisons reflect the true ability of the algorithms rather than misleading aspects of implementation details.
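The bucket hash used in these experiments can be sketched as follows (assuming a representation cellOf[] giving the board cell of the blank and of each tile; a hypothetical illustration):

// Map a 15-puzzle state to a bucket using only the positions of the blank,
// tile 1 and tile 2. The three cells are distinct, so the encoding is
// injective over the 16 * 15 * 14 = 3,360 reachable triples (it packs them
// sparsely into 16^3 = 4,096 possible ids rather than densely into 3,360).
static int bucket(int[] cellOf) {
    return (cellOf[0] * 16 + cellOf[1]) * 16 + cellOf[2];   // cellOf[0] = blank
}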

The second set of rows in Table 1 summarizes the performance results for A*-HBDDD compared to in-memory search. We used 24 threads, and the states generated by the external algorithms were distributed across all 12 disks. A*-HBDDD outperforms BFHS-DDD because it expands fewer nodes; we discuss this in more detail in section 3. The results show that the base Java implementation of A*-HBDDD is just a factor of 1.7 slower than the C++ implementation of IDA*, but slightly faster than the Java implementation. Note that A*-HBDDD expanded almost 3.5 billion nodes while A* expanded fewer than 1.6 billion. We believe this is due to duplicate states generated during recursive expansion, when the closed list is not consulted. We can improve the performance of A*-HBDDD by exploiting available RAM with the simple technique of using transposition tables to avoid expanding duplicate states during recursive expansions (A*-HBDDD_tt). With this improvement, A*-HBDDD is a factor of 1.4 faster than the highly optimized C++ IDA* solver and a factor of 2.5 faster than the optimized Java IDA* solver.

A*-HBDDD_tt is within a factor of two of a highly optimized implementation of parallel AIDA*, which cannot cope with state spaces with many duplicate nodes. Moreover, it is possible for AIDA* to expand more nodes than serial IDA*, since it can expand parts of the tree that serial IDA* would not reach if serial IDA* finds a solution early, or fewer nodes than serial IDA* if the first solution that serial IDA* would find comes late in the search. AIDA* is able to outperform A*-HBDDD and A*-HBDDD_tt even when it expands 5 to 10 times as many nodes, because node expansion in the sliding-tile domain is cheap. For many practical problems node expansion is much more expensive, and A*-HBDDD and A*-HBDDD_tt may outperform AIDA*. While A*-HBDDD_tt running on 12 cores (last line of the table) has only a 2x speedup over serial A* running on 1 core (first line of the table), note that it is an external algorithm that trades slow access to disk for the ability to solve problems beyond the confines of RAM. Given that disk is several orders of magnitude slower than RAM, it is exciting to see that external-memory search can be faster than internal-memory search.

While these results show that A*-HBDDD performs well compared to IDA* on problems like the sliding-tile puzzle, the strictly best-first layered search does not work well in other domains, preventing it from serving as a general-purpose search method for large problems. In the next two sections we discuss two important limitations of A*-HBDDD that motivate the main contributions of this paper.
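Before moving on, the transposition-table refinement of A*-HBDDD_tt mentioned above can be sketched as follows (an in-RAM map, say one per expansion thread, from a state's packed key to the best g value reached so far in the current expansion phase; the types and helpers are assumptions):

import java.util.*;

// Recursive expansion with a transposition table: skip a state if this
// expansion phase already reached it at least as cheaply.
final Map<Long, Double> seen = new HashMap<>();

void recurExpand(State n, double bound) {
    Double g = seen.get(n.key());
    if (g != null && g <= n.g()) return;      // duplicate, reached no more cheaply
    seen.put(n.key(), n.g());
    if (isGoal(n)) { incumbent = n; return; }
    for (State succ : n.expand()) {
        if (succ.f() <= bound) recurExpand(succ, bound);
        else bucketOf(succ).appendNext(succ);
    }
    bucketOf(n).appendClosed(n);
}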
3. External Memory Search With Non-Uniform Edge Costs

A*-HBDDD achieves sequential I/O behavior by dividing the search into f layers, where each layer consists of the nodes sharing the same lower bound f on solution cost. At each iteration of search, nodes are read sequentially from external memory and expanded only if their f value is within the current lower bound on solution cost. Many real-world problems have real-valued costs, giving rise to a large number of f layers with few nodes in each, which substantially erodes performance: A*-HBDDD reads all open nodes from files on disk and expands only the nodes within the current f bound, so if there are only a few nodes in each f layer, the algorithm pays the cost of reading the entire frontier only to expand a few nodes. Then, in the merge phase, the entire closed list is read only to merge the same few nodes. Additionally, when there are many distinct f values, the successors of each node tend to exceed the current f bound, resulting in fewer I/O-efficient recursive expansions.

Korf (2004) speculated that the problem of many distinct f values could be remedied by somehow expanding more nodes than just those with the minimum f value. In this section we present an algorithm, Parallel External Dynamic A* Layering (PEDAL), that does exactly this. PEDAL improves on A*-HBDDD by relaxing the strictly best-first ordering of the search in order to perform a constant number of expansions per I/O operation. We begin by reviewing previous work in section 3.1. In section 3.2 we describe PEDAL in more detail and prove that it is I/O efficient. In an empirical evaluation in section 3.3, we compare PEDAL to IDA*, IDA*_CR (Sarkar, Chakrabarti, Ghose, & Sarkar, 1991), A*-HBDDD and BFHS-DDD using a variant of the sliding-tile puzzle with non-unit edge costs and a more realistic dockyard planning domain. The results show that PEDAL gives the best performance on the sliding-tile puzzle and is the only practical approach, among the algorithms tested in our experiments, for the real-valued problems. PEDAL demonstrates that relaxed best-first heuristic search can be effective for large problems with arbitrary costs.

3.1 Previous Work

In this section, we present relevant previous work that PEDAL builds on, as well as alternative techniques. IDA* and BFHS were introduced in a previous section, but we include descriptions here with further details.

3.1.1 Iterative Deepening A*

Iterative Deepening A* (IDA*, Korf, 1985) is an internal memory technique that requires memory only linear in the maximum depth of the search. This reduced memory complexity comes at the cost of repeated search effort. IDA* performs iterations of a bounded depth-first search where a path is pruned if f(n) becomes greater than the bound for the current iteration. After each unsuccessful iteration, the bound is increased to the minimum f value among the nodes that were generated but not expanded in the previous iteration.

Each iteration of IDA* expands a superset of the nodes in the previous iteration. If the number of nodes expanded in each iteration grows geometrically, then the total number of nodes expanded by IDA* is O(n), where n is the number of nodes that A* would expand (Sarkar et al., 1991). In domains with real-valued edge costs, there can be many unique f values, and the standard minimum-out-of-bound layering of IDA* may lead to only a few new nodes being expanded in each iteration. Because of this, the number of nodes expanded by IDA* can be O(n^2) in the worst case, when the number of new nodes expanded in each iteration is constant (Sarkar et al., 1991).

To alleviate this problem, Sarkar et al. introduce IDA*_CR. IDA*_CR tracks the distribution of f values of pruned nodes (the nodes that were generated but not expanded during an iteration of search). This distribution is used to find a good threshold for the next iteration, by selecting the bound that will cause the desired number of pruned nodes to be expanded in the next iteration. To guarantee efficiency, the desired number must follow a geometric progression (at least doubling). If the successors of these pruned nodes are not expanded in the next iteration, then this scheme is often able to accurately double the number of nodes between iterations. If the successors do fall within the bound on the next iteration, then more nodes may be expanded than desired. Since the threshold is increased liberally, nodes are not expanded in a strict best-first order, and therefore branch-and-bound must be used on the final iteration of search to ensure optimality: the search continues after finding a solution until all nodes whose lower bounds are less than the incumbent solution cost have been expanded. Any node whose lower bound is greater than or equal to the incumbent's cost can be pruned, as it cannot lead to a better solution. IDA*_CR is effective for problems that exhibit a wide range of f values, but it may still perform poorly on domains where the branching does not allow the number of expanded nodes to double from one iteration to the next.

IDA* and IDA*_CR suffer from an additional source of node regeneration overhead on search spaces that form highly connected graphs. Because they use depth-first search, they cannot detect duplicate search states except those that form cycles in the current search path. Even with cycle checking, the search will perform extremely poorly if there are many paths to each node in the search space. This motivates the use of a closed list in classic algorithms like A*.
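The bound-setting rule of IDA*_CR can be sketched as follows (a simplified illustration: the f values of pruned nodes are tallied in a sorted histogram during the iteration, and target at least doubles from one iteration to the next; it assumes at least one node was pruned, since otherwise the search space is exhausted):

import java.util.*;

// Pick the next IDA*_CR threshold: the smallest pruned f value such that at
// least `target` pruned nodes fall at or below it.
static double nextBound(TreeMap<Double, Long> prunedCounts, long target) {
    long admitted = 0;
    for (Map.Entry<Double, Long> e : prunedCounts.entrySet()) {
        admitted += e.getValue();
        if (admitted >= target) return e.getKey();
    }
    return prunedCounts.lastKey();   // fewer pruned nodes than desired: admit them all
}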

3.1.2 Breadth-First Heuristic Search

In this section we provide more details on BFHS, introduced in section 2.2. BFHS attempts to reduce the memory requirement of search by removing the need for a closed list. BFHS proceeds in a breadth-first ordering, expanding all nodes within a given upper bound on f at one depth before proceeding to the next depth. If a bound is not available in advance, iterative deepening can be used; however, as discussed earlier, iterative deepening fails on domains with many distinct f values. To provide a suitable comparison to PEDAL, we propose a novel variant of BFHS that uses the same technique as IDA*_CR for updating the upper bound at each iteration of search.

One side effect of the breadth-first search order is that BFHS is not able to break ties among nodes with the same f value. A* with optimal tie-breaking expands nodes with higher g values first (deeper nodes first in domains with uniform edge costs). BFHS needs to expand all nodes n with f(n) ≤ C, where C is the optimal solution cost, at all depth layers prior to the depth layer that contains the goal. The search order of BFHS is thus equivalent to the search order of A* with worst-case tie-breaking (expanding nodes with lower g first), and it can expand up to twice as many unique nodes as A* with optimal tie-breaking. When combined with iterative deepening, BFHS can expand up to four times as many nodes as A* (Zhou & Hansen, 2006). Furthermore, when combined with the bound-setting technique of IDA*_CR, it can expand many nodes with f values greater than the optimal solution cost which are not strictly necessary for optimal search. BFHS is also not able to benefit substantially from branch-and-bound in the final iteration, because goal states are generated in the deepest layers of the search and it must expand all nodes within the final inflated upper bound whose depths are less than the goal depth.

3.2 Parallel External Dynamic A* Layering

A*-HBDDD suffers from excessive I/O overhead when there are a small number of nodes in each f layer. PEDAL solves this problem by relaxing the best-first search order, allowing it to solve problems with arbitrary f cost distributions. PEDAL can be seen as a combination of A*-HBDDD and an estimation technique inspired by IDA*_CR to dynamically layer the search space.

Search(initial)
27. bound ← f(initial); bucket ← hash(initial)
28. write(OpenFile(bucket), initial)
29. while ∃ bucket ∈ Buckets : min f(bucket) ≤ bound
30.   for each bucket ∈ Buckets : min f(bucket) ≤ bound
31.     ThreadExpand(bucket)
32.   if incumbent then break
33.   for each bucket ∈ Buckets : NeedsMerge(bucket)
34.     ThreadMerge(bucket)
35.   bound ← NextBound(f_dist)

ThreadMerge(bucket)
36. Closed ← read(ClosedFile(bucket)); Open ← ∅
37. for each n ∈ NextFile(bucket)
38.   if n ∉ Closed ∪ Open or g(n) < g((Closed ∪ Open)[n])
39.     Open ← (Open − {Open[n]}) ∪ {n}
40.     add(f_dist, f(n))
41. write(OpenFile(bucket), Open)
42. write(ClosedFile(bucket), Closed)

Figure 2: Pseudocode for PEDAL.

Like A*-HBDDD, PEDAL proceeds in two phases: an expansion phase and a merge phase. However, during the merge phase, it tracks the distribution of the f values of the frontier nodes that were determined not to be duplicates. As we explain in detail below, this distribution is used to select the f bound for the next expansion phase so as to give a constant number of expansions per I/O operation. The pseudocode for PEDAL, given in Figure 2, is adapted from the pseudocode for A*-HBDDD given in Figure 1. The main difference is at lines 35 and 40, where PEDAL records the f values of all nodes that are added to the frontier and uses this distribution to select the bound for the following expansion phase. Another critical difference is that, since PEDAL relaxes the best-first search order, it must perform branch-and-bound after an incumbent solution is found.

3.2.1 Overhead

PEDAL maintains a layering such that the number of nodes expanded in each layer is at least a constant fraction of the amount of I/O (the number of nodes read from and written to external memory) at each iteration. It keeps a histogram of f values for all nodes on the open list and a count of the total number of nodes on the closed list. The cost bound for each layer is selected so that a constant fraction of the sum of the nodes on the open and closed lists will be expanded; we found that a value of 1/2 worked well in practice for the domains tested. Unlike IDA*_CR, which only provides a heuristic for the desired doubling behavior, the technique used by PEDAL is guaranteed to give only bounded I/O overhead.

Figure 3: PEDAL keeps a histogram of f values on the open list and uses it to update the threshold to allow a constant fraction of the number of nodes on open and closed to be expanded in each iteration (illustrated with |Closed| = 300, |Open| = 500 and 400 nodes expanded).

That is, the number of nodes expanded is at least a constant fraction of the number of nodes read from and written to disk. We assume a constant branching factor b and that the number of frontier nodes remaining after duplicate detection is always large enough to expand the desired number of nodes. We begin with a few useful lemmata. Let o be the number of nodes on the open list, c be the number of nodes on the closed list, e be the number of nodes expanded in an iteration, and r be the number of recursively expanded nodes in an iteration.

Lemma 1. The number of I/O operations during the expand phase is at most 2o + eb + rb + r.

Proof: During the expand phase we read o open nodes from disk. We write at most eb generated nodes plus the remaining o − e nodes that were not expanded. We also write at most rb recursively generated nodes and the e + r expanded nodes to disk. In total, o + eb + (o − e) + rb + (e + r) = 2o + eb + rb + r.

Lemma 2. The number of I/O operations during the subsequent merge phase is at most c + e + 2(r + eb + rb).

Proof: During the merge phase we read at most c + e + r closed nodes from disk and eb + rb newly generated nodes from disk. We write at most r recursively expanded nodes to the closed list and eb + rb new nodes to the open list. In total, (c + e + r) + (eb + rb) + r + (eb + rb) = c + e + 2(r + eb + rb).

Lemma 3. The total number of I/O operations is at most 2o + c + e(3b + 1) + r(3b + 3).

Proof: From Lemma 1 and Lemma 2,

total I/O = expand I/O + merge I/O
          = (2o + eb + rb + r) + (c + e + 2(r + eb + rb))
          = (2o + eb + rb + r) + (c + e + 2r + 2eb + 2rb)
          = 2o + c + (3eb + e) + (3rb + 3r)
          = 2o + c + e(3b + 1) + r(3b + 3)

            Threads   Time     Expanded          Nodes/Sec
IDA*_CR        1      14,009   80,219,537,668     5,726,285
AIDA*_CR      24       1,052   48,744,622,573    46,335,192
BFHS-DDD      24       3,147    7,532,248,808     2,393,469
PEDAL         24       1,066    6,585,305,718     6,177,585

Table 2: Performance summary for the 15-puzzle with square root costs. Times are reported in seconds for solving all instances.

Theorem 1. If the number of nodes expanded e is chosen to be k(o + c) for some constant 0 < k ≤ 1, and there is a sufficient number of frontier nodes, o ≥ e, then the number of nodes expanded is bounded from below by a constant fraction of the total number of I/O operations; that is, total I/O ≤ q · (total expanded) for some constant q.

Proof: Let z = 3b + 3 and take q = (zk + 2)/k, so that qk = zk + 2 and q ≥ z. Then

total I/O = e(3b + 1) + r(3b + 3) + 2o + c        by Lemma 3
          < e(3b + 3) + r(3b + 3) + 2o + c
          = ze + zr + 2o + c
          = zko + zkc + zr + 2o + c               since e = ko + kc
          = o(zk + 2) + c(zk + 1) + zr
          < o(zk + 2) + c(zk + 2) + zr
          ≤ qko + qkc + qr                        since qk = zk + 2 and q ≥ z
          = q(ko + kc + r)
          = q(e + r)                              since e = k(o + c)
          = q · (total expanded)

Because q = (zk + 2)/k = (3b + 3) + 2/k is constant, the theorem holds.
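In code, NextBound (line 35 of Figure 2) might look like the following sketch; it mirrors the IDA*_CR routine shown earlier but sizes the layer against the open and closed lists, with k = 1/2 as described above (the histogram argument and the helper itself are illustrative):

import java.util.*;

// Choose the next PEDAL bound: the smallest frontier f value that admits at
// least k * (|Open| + |Closed|) nodes. This keeps the number of expansions
// at least a constant fraction of the I/O performed in each iteration.
static double nextBound(TreeMap<Double, Long> fDist, long openSize, long closedSize) {
    long want = (openSize + closedSize) / 2;     // k = 1/2
    long admitted = 0;
    for (Map.Entry<Double, Long> e : fDist.entrySet()) {
        admitted += e.getValue();
        if (admitted >= want) return e.getKey();
    }
    return fDist.lastKey();                      // frontier smaller than desired
}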

3.3 Empirical Evaluation

We evaluated the performance of PEDAL on two domains with a wide range of edge costs: the square root sliding-tile puzzle and a dockyard robot planning domain. For these experiments, we implemented a novel variant of BFHS-DDD that uses the IDA*_CR technique for setting the upper bound at each iteration. As in the previous experiments, our implementation of BFHS-DDD does not perform solution reconstruction, and therefore the results presented give a lower bound on its actual solution times. All algorithms were written in Java as described in section 2.2 and were run on Machine-B.

3.3.1 The Square Root 15-Puzzle

The classic sliding-tile puzzle lacks an important feature that many real-world applications of heuristic search have: real-valued costs. In order to evaluate PEDAL on a domain with real-valued costs that is simple, reproducible and has well-understood connectivity, we use a variant of the puzzle proposed by Hatem et al. (2011), in which each move costs the square root of the number on the tile being moved. This gives rise to many distinct f values.

Figure 4: Comparison between PEDAL, IDA*_CR, AIDA*_CR, and BFHS-DDD on the square root 15-puzzle. The axes show log10 CPU time.

The plots in Figure 4 show a comparison between PEDAL, IDA*_CR, AIDA*_CR and BFHS-DDD on the square root version of the 100 tiles instances used by Korf (1985). The axes show log10 CPU time in seconds: points below the diagonal y = x line represent instances where PEDAL solved the problem faster than the respective algorithm. The left plot shows a comparison between PEDAL and IDA*_CR. We can see from this plot that IDA*_CR solved the easier instances faster because it does not have to go to disk; however, PEDAL greatly outperformed IDA*_CR on the more difficult problems, and its advantage grew quickly as the problems required more time. The center plot shows a comparison between PEDAL and AIDA*_CR, and Table 2 includes results for AIDA*_CR. The node expansion rate of AIDA*_CR is nearly 7.5 times that of PEDAL, but it has roughly the same solving time because it cannot detect duplicates. Both algorithms achieve a speedup of approximately 13x when run with 24 threads on this machine. The right plot compares PEDAL to BFHS-DDD. PEDAL was much faster on the easier instances and gave consistently superior performance throughout the range of problem difficulties. As discussed above, the search order of BFHS is equivalent to that of A* with worst-case tie-breaking, and when combined with iterative deepening it can expand up to four times as many nodes as A* (Zhou & Hansen, 2006). Moreover, since BFHS-DDD and PEDAL use a loose upper bound, they can expand many nodes with f values greater than the optimal solution cost which are not strictly necessary for optimal search. However, unlike PEDAL, BFHS is not able to effectively perform branch-and-bound in the final iteration and must expand all nodes within the final inflated upper bound that are shallower than the goal.
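For concreteness, the move costs of this variant, along with a Manhattan-distance heuristic weighted to remain admissible under square root costs (the weighting is our illustration, consistent with the domain but not quoted from the paper), can be sketched as:

// Each move of tile t costs sqrt(t), so f = g + h takes on many distinct
// real values rather than a few integer layers.
static double moveCost(int tile) { return Math.sqrt(tile); }

// Admissible heuristic: tile t must move at least its Manhattan distance,
// and each of those moves costs sqrt(t).
static double h(int[] cellOf, int[] goalCellOf, int width) {
    double sum = 0;
    for (int t = 1; t < cellOf.length; t++) {    // t = 0 is the blank
        int dx = Math.abs(cellOf[t] % width - goalCellOf[t] % width);
        int dy = Math.abs(cellOf[t] / width - goalCellOf[t] / width);
        sum += Math.sqrt(t) * (dx + dy);
    }
    return sum;
}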

Figure 5: Comparison between PEDAL and BFHS-DDD on the dockyard robot domain. The axes show log10 CPU time.

            Time    Expanded         Nodes/Sec
BFHS-DDD    4,993   4,695,394,…       …,395
PEDAL       1,765   1,983,155,888    1,123,601

Table 3: Performance summary for the dockyard robot domain. Times are reported in seconds for solving all instances using 12 cores and 24 threads.

3.3.2 Dockyard Robot Planning

The sliding-tile puzzle does not have many duplicate states, and it is, for some, perhaps not a practically compelling domain. We therefore implemented a planning domain inspired by the dockyard robot example used throughout the textbook by Ghallab, Nau, and Traverso (2004). In the dockyard robot domain, which is NP-hard, containers must be moved from their initial locations to their desired destinations via a robot that can carry only a single container at a time. The containers at each location form a stack, from which the robot can only access the top container by using a crane that resides at the given location; accessing a container that is not at the top of a stack therefore requires moving the upper containers to a stack at a different location. The available actions are: load a container from a crane onto the robot, unload a container from the robot into a crane, take the top container from a pile using a crane, put the container held by a crane onto the top of a pile, and move the robot between locations. There is a connection between all locations. The load and unload actions have a constant cost of 0.01, accessing a pile with a crane costs 0.05 times the height of the pile plus 1 (to ensure non-zero-cost actions), and movement between locations costs the distance between locations.
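This cost model can be sketched as follows (reading the crane cost as 0.05 × (height + 1), which is our interpretation of the "plus 1 to ensure non-zero-cost actions" wording; the method names are illustrative):

// Dockyard action costs as described above.
static double loadCost()   { return 0.01; }          // crane -> robot
static double unloadCost() { return 0.01; }          // robot -> crane
static double craneCost(int pileHeight) {            // take from or put onto a pile
    return 0.05 * (pileHeight + 1);                  // + 1 keeps every action non-zero
}
static double moveCost(double x1, double y1, double x2, double y2) {
    return Math.hypot(x2 - x1, y2 - y1);             // Euclidean distance, unit square
}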

For these experiments, the location graph was created by placing random points on a unit square; the length of each edge is the Euclidean distance between the locations. Every location is connected to every other location, so the location graph is fully connected, and all connections are undirected. The heuristic lower bound sums the distance of each container's current location from its goal location. We conducted these experiments on a configuration with 5 locations, cranes, and piles, and 8 containers. We used a total of 50 instances and selected a hash function that maps states to buckets by ignoring everything except the position of the robot and three of the containers. We used IEEE double-precision floating point to represent all costs.

The search space for dockyard robot planning forms a highly connected graph, and thus there are many ways to reach the same state. For example, moving the robot from location A to B to C and then back to A forms a cycle of length 3. Search algorithms that cannot remember duplicates perform extremely poorly here. Such is the case for IDA*_CR, which failed to solve any instance within the time limit, so we do not show results for it (in the sliding-tile puzzle domain, where IDA*_CR is merely slow rather than failing catastrophically, the shortest cycles are of length 12). PEDAL and BFHS-DDD were able to solve all instances.

Table 3 and Figure 5 show a performance comparison between PEDAL and BFHS-DDD. Again, points below the diagonal represent instances where PEDAL had the faster solution time. We can see from the plot that all of the points lie below the y = x line: PEDAL outperformed BFHS-DDD on every instance. These results provide evidence that a relaxed best-first search order is competitive in an external memory setting. It significantly reduces the number of nodes generated, which corresponds to many fewer expensive I/O operations. BFHS uses a breadth-first search strategy to reduce the space complexity by removing the closed list. However, the performance bottleneck for the search problems examined in this section is the rapidly growing search frontier, not the closed list, so the breadth-first search order of BFHS provides no advantage. In the next section, we show how PEDAL can be extended to outperform alternative approaches on large problems that have large branching factors, achieving a new state of the art for the problem of Multiple Sequence Alignment.

4. External Memory Search With Large Branching Factors

The branching factor of the sliding-tile puzzle is relatively small, since there are few actions that can be taken from any state: each time a node is expanded, the search generates at most 3 new nodes if the parent of a node is not generated as one of its children. In some domains there can be many possible actions at every state, resulting in a rapidly growing search frontier. In domains of practical relevance, these actions can take on a wide range of costs, and many of the newly generated nodes are never expanded by the search because their costs exceed the cost of an optimal solution. For external memory search, this results in a lot of wasted I/O overhead, as these never-expanded nodes are read from and written to disk at each iteration of the search. One real-world application of heuristic search with practical relevance (Korf, 2012) is Multiple Sequence Alignment (MSA).
MSA can be formulated as a shortest-path problem where each sequence represents one dimension in a multi-dimensional lattice and a solution is a least-cost path through the lattice. To achieve biologically plausible alignments, great care must be taken in selecting the most relevant cost function.

The scoring of gaps is of particular importance. Altschul (1989) recommends affine gap costs, described in more detail below, which increase the size of the state space by a factor of 2^k for k sequences. Whereas the dockyard problem has a fixed set of actions, MSA has a large branching factor of 2^k − 1, a value which increases rapidly as the number of sequences to be aligned grows. This means that the performance bottleneck for MSA is the memory required to store the frontier of the search.

Although dynamic programming is the classic technique for solving MSA (Needleman & Wunsch, 1970), heuristic search algorithms can achieve better performance than dynamic programming by pruning much of the search space, computing alignments faster and using less memory (Ikeda & Imai, 1999). Unfortunately, for challenging MSA problems, the memory required to store the open list makes A* impractical. Yoshizumi, Miura, and Ishida (2000) present a variant of A* called Partial Expansion A* (PEA*) that reduces the memory needed to store the open list by storing only the successor nodes that appear most promising. This technique can significantly reduce the size of the open list. However, like A*, PEA* is limited by the memory required to store the open and closed lists, and for challenging alignment problems PEA* can still exhaust memory.

One previously proposed alternative to PEA* is Iterative Deepening Dynamic Programming (IDDP, Schroedl, 2005), a form of bounded dynamic programming that relies on an uninformed search order to reduce the maximum number of nodes that need to be stored during search. The memory savings of IDDP come at the cost of repeated search effort and divide-and-conquer solution reconstruction. IDDP forgoes a best-first search order and, as a result, it is possible for IDDP to visit many more nodes than a version of A* with optimal tie-breaking. Moreover, because of the wide range of edge costs found in the MSA domain, IDDP must rely on the bound-setting technique of IDA*_CR (Sarkar et al., 1991). With this technique, it is possible for IDDP to visit four times as many nodes as A* (Schroedl, 2005). And even though IDDP reduces the size of the frontier, it is still limited by the amount of memory required to store the open nodes; for large MSA problems, this can exhaust main memory.

Rather than suffer the overhead of an uninformed search order and divide-and-conquer solution reconstruction, we propose solving large MSA problems using external memory search. In this section we present an extension of PEDAL, called Parallel External Partial Expansion A* (PE2A*), that combines the external memory search of PEDAL with the best-first partial expansion technique of PEA*. We compare PE2A* with in-memory A*, PEA* and IDDP for solving challenging instances of MSA. As in the previous section, the results show that parallel external-memory best-first search can outperform serial in-memory search and is capable of solving large problems that cannot fit in main memory. Contrary to the assumptions of previous work, we find that storing the open list is much more expensive than storing the closed list. We also demonstrate that PE2A* is capable of solving, for the first time, the entire Reference Set 1 of the BAliBASE benchmark for MSA (Thompson, Plewniak, & Poch, 1999) using a biologically plausible cost function that incorporates affine gap costs.
And just as with PEDAL, PE2A* shows that a relaxed external best-first search can effectively use heuristic information to surpass methods that rely on uninformed search orders.
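To make the partial expansion technique that PE2A* inherits from PEA* concrete, here is a minimal in-memory sketch of a single partial expansion (C is the cutoff above a node's stored F value; the Node type and a priority queue ordered on stored F are assumptions of the sketch, not the PE2A* implementation):

import java.util.*;

// Partial expansion: insert only successors whose f is within F(n) + C. If
// any successors are withheld, n stays open with F(n) raised to the least f
// among them, so they can be regenerated later if the search needs them.
static void partialExpand(Node n, double C, PriorityQueue<Node> open) {
    double nextF = Double.POSITIVE_INFINITY;
    for (Node succ : n.successors()) {
        if (succ.f() <= n.storedF() + C)
            open.add(succ);                     // promising successor
        else
            nextF = Math.min(nextF, succ.f());  // cheapest withheld f value
    }
    if (nextF < Double.POSITIVE_INFINITY) {
        n.setStoredF(nextF);
        open.add(n);                            // re-queue n; not yet exhausted
    }                                           // otherwise n can be closed for good
}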


1 Introduction. 2 Iterative-Deepening A* 3 Test domains

1 Introduction. 2 Iterative-Deepening A* 3 Test domains From: AAAI Technical Report SS-93-04. Compilation copyright 1993, AAAI (www.aaai.org). All rights reserved. Fast Information Distribution for Massively Parallel IDA* Search Diane J. Cook Department of

More information

Outline. Best-first search

Outline. Best-first search Outline Best-first search Greedy best-first search A* search Heuristics Admissible Heuristics Graph Search Consistent Heuristics Local search algorithms Hill-climbing search Beam search Simulated annealing

More information

Iterative-Deepening Search with On-line Tree Size Prediction

Iterative-Deepening Search with On-line Tree Size Prediction Iterative-Deepening Search with On-line Tree Size Prediction Ethan Burns and Wheeler Ruml University of New Hampshire Department of Computer Science eaburns at cs.unh.edu and ruml at cs.unh.edu Abstract.

More information

Parallel Algorithms. Single Agent Search COMP 3705/4705. Parallel Search - Simple approaches. Dovetailing

Parallel Algorithms. Single Agent Search COMP 3705/4705. Parallel Search - Simple approaches. Dovetailing Parallel Algorithms Single Agent Search COMP 3705/4705 Prof. Nathan Sturtevant Lecture 15 For many years it looked like our problems would be solved by increases in processor speed Moore s law predicts

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Search Marc Toussaint University of Stuttgart Winter 2015/16 (slides based on Stuart Russell s AI course) Outline Problem formulation & examples Basic search algorithms 2/100 Example:

More information

Informed Search A* Algorithm

Informed Search A* Algorithm Informed Search A* Algorithm CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2018 Soleymani Artificial Intelligence: A Modern Approach, Chapter 3 Most slides have

More information

Effective use of memory in linear space best first search

Effective use of memory in linear space best first search Effective use of memory in linear space best first search Marko Robnik-Šikonja University of Ljubljana, Faculty of Electrical Engineering and Computer Science, Tržaška 5, 00 Ljubljana, Slovenia e-mail:

More information

Search and Optimization

Search and Optimization Search and Optimization Search, Optimization and Game-Playing The goal is to find one or more optimal or sub-optimal solutions in a given search space. We can either be interested in finding any one solution

More information

Block-Parallel IDA* for GPUs

Block-Parallel IDA* for GPUs Proceedings of the Tenth International Symposium on Combinatorial Search (SoCS 2017) Block-Parallel IDA* for GPUs Satoru Horie, Alex Fukunaga Graduate School of Arts and Sciences The University of Tokyo

More information

Efficient memory-bounded search methods

Efficient memory-bounded search methods Efficient memory-bounded search methods Mikhail Simin Arjang Fahim CSCE 580: Artificial Intelligence Fall 2011 Dr. Marco Voltorta Outline of The Presentation Motivations and Objectives Background - BFS

More information

Anytime Heuristic Search

Anytime Heuristic Search Journal of Artificial Intelligence Research 28 (2007) 267-297 Submitted 05/06; published 03/07 Anytime Heuristic Search Eric A. Hansen Department of Computer Science and Engineering Mississippi State University

More information

Lecture 5 Heuristics. Last Time: A* Search

Lecture 5 Heuristics. Last Time: A* Search CSE 473 Lecture 5 Heuristics CSE AI Faculty Last Time: A* Search Use an evaluation function f(n) for node n. f(n) = estimated total cost of path thru n to goal f(n) = g(n) + h(n) g(n) = cost so far to

More information

Outline. Best-first search

Outline. Best-first search Outline Best-first search Greedy best-first search A* search Heuristics Local search algorithms Hill-climbing search Beam search Simulated annealing search Genetic algorithms Constraint Satisfaction Problems

More information

Heuristic Search and Advanced Methods

Heuristic Search and Advanced Methods Heuristic Search and Advanced Methods Computer Science cpsc322, Lecture 3 (Textbook Chpt 3.6 3.7) May, 15, 2012 CPSC 322, Lecture 3 Slide 1 Course Announcements Posted on WebCT Assignment1 (due on Thurs!)

More information

ITCS 6150 Intelligent Systems. Lecture 5 Informed Searches

ITCS 6150 Intelligent Systems. Lecture 5 Informed Searches ITCS 6150 Intelligent Systems Lecture 5 Informed Searches Informed Searches We are informed (in some way) about future states and future paths We use this information to make better decisions about which

More information

Bounded Suboptimal Search: A Direct Approach Using Inadmissible Estimates

Bounded Suboptimal Search: A Direct Approach Using Inadmissible Estimates Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Bounded Suboptimal Search: A Direct Approach Using Inadmissible Estimates Jordan T. Thayer and Wheeler Ruml Department

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Information Systems and Machine Learning Lab (ISMLL) Tomáš Horváth 10 rd November, 2010 Informed Search and Exploration Example (again) Informed strategy we use a problem-specific

More information

Domain-Independent Structured Duplicate Detection

Domain-Independent Structured Duplicate Detection 21st National Conference on Artificial Intelligence (AAAI-06) Boston, MA July 2006 Domain-Independent Structured Duplicate Detection Rong Zhou Palo Alto Research Center 3333 Coyote Hill Road Palo Alto,

More information

Route planning / Search Movement Group behavior Decision making

Route planning / Search Movement Group behavior Decision making Game AI Where is the AI Route planning / Search Movement Group behavior Decision making General Search Algorithm Design Keep a pair of set of states: One, the set of states to explore, called the open

More information

Parallel Programming. Parallel algorithms Combinatorial Search

Parallel Programming. Parallel algorithms Combinatorial Search Parallel Programming Parallel algorithms Combinatorial Search Some Combinatorial Search Methods Divide and conquer Backtrack search Branch and bound Game tree search (minimax, alpha-beta) 2010@FEUP Parallel

More information

Compressing Pattern Databases

Compressing Pattern Databases Compressing Pattern Databases Ariel Felner and Ram Meshulam Computer Science Department Bar-Ilan University Ramat-Gan, Israel 92500 Email: felner,meshulr1 @cs.biu.ac.il Robert C. Holte Computing Science

More information

Improving the Efficiency of Depth-First Search by Cycle Elimination

Improving the Efficiency of Depth-First Search by Cycle Elimination Improving the Efficiency of Depth-First Search by Cycle Elimination John F. Dillenburg and Peter C. Nelson * Department of Electrical Engineering and Computer Science (M/C 154) University of Illinois Chicago,

More information

CS 771 Artificial Intelligence. Informed Search

CS 771 Artificial Intelligence. Informed Search CS 771 Artificial Intelligence Informed Search Outline Review limitations of uninformed search methods Informed (or heuristic) search Uses problem-specific heuristics to improve efficiency Best-first,

More information

This lecture. Lecture 6: Search 5. Other Time and Space Variations of A* Victor R. Lesser. RBFS - Recursive Best-First Search Algorithm

This lecture. Lecture 6: Search 5. Other Time and Space Variations of A* Victor R. Lesser. RBFS - Recursive Best-First Search Algorithm Lecture 6: Search 5 Victor R. Lesser CMPSCI 683 Fall 2010 This lecture Other Time and Space Variations of A* Finish off RBFS SMA* Anytime A* RTA* (maybe if have time) RBFS - Recursive Best-First Search

More information

UNINFORMED SEARCH. Announcements Reading Section 3.4, especially 3.4.1, 3.4.2, 3.4.3, 3.4.5

UNINFORMED SEARCH. Announcements Reading Section 3.4, especially 3.4.1, 3.4.2, 3.4.3, 3.4.5 UNINFORMED SEARCH Announcements Reading Section 3.4, especially 3.4.1, 3.4.2, 3.4.3, 3.4.5 Robbie has no idea where room X is, and may have little choice but to try going down this corridor and that. On

More information

An Appropriate Search Algorithm for Finding Grid Resources

An Appropriate Search Algorithm for Finding Grid Resources An Appropriate Search Algorithm for Finding Grid Resources Olusegun O. A. 1, Babatunde A. N. 2, Omotehinwa T. O. 3,Aremu D. R. 4, Balogun B. F. 5 1,4 Department of Computer Science University of Ilorin,

More information

Informed Search Algorithms

Informed Search Algorithms Informed Search Algorithms CITS3001 Algorithms, Agents and Artificial Intelligence Tim French School of Computer Science and Software Engineering The University of Western Australia 2017, Semester 2 Introduction

More information

Iterative-Deepening Search with On-line Tree Size Prediction

Iterative-Deepening Search with On-line Tree Size Prediction Iterative-Deepening Search with On-line Tree Size Prediction Ethan Burns and Wheeler Ruml University of New Hampshire Department of Computer Science eaburns at cs.unh.edu and ruml at cs.unh.edu Abstract.

More information

Artificial Intelligence

Artificial Intelligence University of Cagliari M.Sc. degree in Electronic Engineering Artificial Intelligence Academic Year: 07/08 Instructor: Giorgio Fumera Exercises on search algorithms. A -litre and a -litre water jugs are

More information

mywbut.com Informed Search Strategies-I

mywbut.com Informed Search Strategies-I Informed Search Strategies-I 1 3.1 Introduction We have outlined the different types of search strategies. In the earlier chapter we have looked at different blind search strategies. Uninformed search

More information

Chapter 11 Search Algorithms for Discrete Optimization Problems

Chapter 11 Search Algorithms for Discrete Optimization Problems Chapter Search Algorithms for Discrete Optimization Problems (Selected slides) A. Grama, A. Gupta, G. Karypis, and V. Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003.

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Informed Search and Exploration Chapter 4 (4.1 4.2) A General Search algorithm: Chapter 3: Search Strategies Task : Find a sequence of actions leading from the initial state to

More information

A4B36ZUI - Introduction ARTIFICIAL INTELLIGENCE

A4B36ZUI - Introduction ARTIFICIAL INTELLIGENCE A4B36ZUI - Introduction to ARTIFICIAL INTELLIGENCE https://cw.fel.cvut.cz/wiki/courses/a4b33zui/start Michal Pechoucek, Branislav Bosansky, Jiri Klema & Olga Stepankova Department of Computer Science Czech

More information

Uninformed Search. Chapter 3

Uninformed Search. Chapter 3 Uninformed Search Chapter 3 (Based on slides by Stuart Russell, Subbarao Kambhampati, Dan Weld, Oren Etzioni, Henry Kautz, Richard Korf, and other UW-AI faculty) Agent s Knowledge Representation Type State

More information

Iterative-Expansion A*

Iterative-Expansion A* Iterative-Expansion A* Colin M. Potts and Kurt D. Krebsbach Department of Mathematics and Computer Science Lawrence University, Appleton, Wisconsin 54911 {colin.m.potts, kurt.krebsbach}@lawrence.edu Abstract

More information

Recent Progress in Heuristic Search: A Case Study of the Four-Peg Towers of Hanoi Problem

Recent Progress in Heuristic Search: A Case Study of the Four-Peg Towers of Hanoi Problem Recent Progress in Heuristic Search: A Case Study of the Four-Peg Towers of Hanoi Problem Richard E. Korf Computer Science Department University of California, Los Angeles Los Angeles, CA 90095 korf@cs.ucla.edu

More information

A Visual Treatment of the N-Puzzle

A Visual Treatment of the N-Puzzle A Visual Treatment of the N-Puzzle Matthew T. Hatem University of New Hampshire mtx23@cisunix.unh.edu Abstract Informed search strategies rely on information or heuristics to find the shortest path to

More information

Ar#ficial)Intelligence!!

Ar#ficial)Intelligence!! Introduc*on! Ar#ficial)Intelligence!! Roman Barták Department of Theoretical Computer Science and Mathematical Logic Uninformed (blind) search algorithms can find an (optimal) solution to the problem,

More information

A* optimality proof, cycle checking

A* optimality proof, cycle checking A* optimality proof, cycle checking CPSC 322 Search 5 Textbook 3.6 and 3.7.1 January 21, 2011 Taught by Mike Chiang Lecture Overview Recap Admissibility of A* Cycle checking and multiple path pruning Slide

More information

Informed search algorithms

Informed search algorithms Artificial Intelligence Topic 4 Informed search algorithms Best-first search Greedy search A search Admissible heuristics Memory-bounded search IDA SMA Reading: Russell and Norvig, Chapter 4, Sections

More information

1 Introduction and Examples

1 Introduction and Examples 1 Introduction and Examples Sequencing Problems Definition A sequencing problem is one that involves finding a sequence of steps that transforms an initial system state to a pre-defined goal state for

More information

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION DESIGN AND ANALYSIS OF ALGORITHMS Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION http://milanvachhani.blogspot.in EXAMPLES FROM THE SORTING WORLD Sorting provides a good set of examples for analyzing

More information

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs Integer Programming ISE 418 Lecture 7 Dr. Ted Ralphs ISE 418 Lecture 7 1 Reading for This Lecture Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Wolsey Chapter 7 CCZ Chapter 1 Constraint

More information

CAP 4630 Artificial Intelligence

CAP 4630 Artificial Intelligence CAP 4630 Artificial Intelligence Instructor: Sam Ganzfried sganzfri@cis.fiu.edu 1 http://www.ultimateaiclass.com/ https://moodle.cis.fiu.edu/ 2 Solving problems by search 3 8-puzzle 4 8-queens 5 Search

More information

DIT411/TIN175, Artificial Intelligence. Peter Ljunglöf. 23 January, 2018

DIT411/TIN175, Artificial Intelligence. Peter Ljunglöf. 23 January, 2018 DIT411/TIN175, Artificial Intelligence Chapters 3 4: More search algorithms CHAPTERS 3 4: MORE SEARCH ALGORITHMS DIT411/TIN175, Artificial Intelligence Peter Ljunglöf 23 January, 2018 1 TABLE OF CONTENTS

More information

Uninformed Search Methods

Uninformed Search Methods Uninformed Search Methods Search Algorithms Uninformed Blind search Breadth-first uniform first depth-first Iterative deepening depth-first Bidirectional Branch and Bound Informed Heuristic search Greedy

More information

Fringe Search: Beating A* at Pathfinding on Game Maps

Fringe Search: Beating A* at Pathfinding on Game Maps Fringe Search: Beating A* at Pathfinding on Game Maps Yngvi Björnsson Markus Enzenberger, Robert C. Holte and Jonathan Schaeffer School of Computer Science Department of Computing Science Reykjavik University

More information

Notes. Video Game AI: Lecture 5 Planning for Pathfinding. Lecture Overview. Knowledge vs Search. Jonathan Schaeffer this Friday

Notes. Video Game AI: Lecture 5 Planning for Pathfinding. Lecture Overview. Knowledge vs Search. Jonathan Schaeffer this Friday Notes Video Game AI: Lecture 5 Planning for Pathfinding Nathan Sturtevant COMP 3705 Jonathan Schaeffer this Friday Planning vs localization We cover planning today Localization is just mapping a real-valued

More information

Informed Search Methods

Informed Search Methods Informed Search Methods How can we improve searching strategy by using intelligence? Map example: Heuristic: Expand those nodes closest in as the crow flies distance to goal 8-puzzle: Heuristic: Expand

More information

Informed search algorithms. Chapter 3 (Based on Slides by Stuart Russell, Dan Klein, Richard Korf, Subbarao Kambhampati, and UW-AI faculty)

Informed search algorithms. Chapter 3 (Based on Slides by Stuart Russell, Dan Klein, Richard Korf, Subbarao Kambhampati, and UW-AI faculty) Informed search algorithms Chapter 3 (Based on Slides by Stuart Russell, Dan Klein, Richard Korf, Subbarao Kambhampati, and UW-AI faculty) Intuition, like the rays of the sun, acts only in an inflexibly

More information

Lecture 5: Search Algorithms for Discrete Optimization Problems

Lecture 5: Search Algorithms for Discrete Optimization Problems Lecture 5: Search Algorithms for Discrete Optimization Problems Definitions Discrete optimization problem (DOP): tuple (S, f), S finite set of feasible solutions, f : S R, cost function. Objective: find

More information

Mustafa Jarrar: Lecture Notes on Artificial Intelligence Birzeit University, Chapter 3 Informed Searching. Mustafa Jarrar. University of Birzeit

Mustafa Jarrar: Lecture Notes on Artificial Intelligence Birzeit University, Chapter 3 Informed Searching. Mustafa Jarrar. University of Birzeit Mustafa Jarrar: Lecture Notes on Artificial Intelligence Birzeit University, 2018 Chapter 3 Informed Searching Mustafa Jarrar University of Birzeit Jarrar 2018 1 Watch this lecture and download the slides

More information

CS 4700: Foundations of Artificial Intelligence. Bart Selman. Search Techniques R&N: Chapter 3

CS 4700: Foundations of Artificial Intelligence. Bart Selman. Search Techniques R&N: Chapter 3 CS 4700: Foundations of Artificial Intelligence Bart Selman Search Techniques R&N: Chapter 3 Outline Search: tree search and graph search Uninformed search: very briefly (covered before in other prerequisite

More information

Best-First Heuristic Search for Multi-Core Machines

Best-First Heuristic Search for Multi-Core Machines Best-First Heuristic Search for Multi-Core Machines Ethan Burns 1 and Seth Lemons 1 and Rong Zhou and Wheeler Ruml 1 1 Department of Computer Science Embedded Reasoning Area University of New Hampshire

More information

Beam-Stack Search: Integrating Backtracking with Beam Search

Beam-Stack Search: Integrating Backtracking with Beam Search 15th International Conference on Automated Planning and Scheduling Monterey, CA June 5-10, 2005 Beam-Stack Search: Integrating Backtracking with Beam Search Rong Zhou and Eric A. Hansen Department of Computer

More information

A.I.: Informed Search Algorithms. Chapter III: Part Deux

A.I.: Informed Search Algorithms. Chapter III: Part Deux A.I.: Informed Search Algorithms Chapter III: Part Deux Best-first search Greedy best-first search A * search Heuristics Outline Overview Informed Search: uses problem-specific knowledge. General approach:

More information

Multi-Way Number Partitioning

Multi-Way Number Partitioning Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09) Multi-Way Number Partitioning Richard E. Korf Computer Science Department University of California,

More information

EXTERNAL SORTING. Sorting

EXTERNAL SORTING. Sorting EXTERNAL SORTING 1 Sorting A classic problem in computer science! Data requested in sorted order (sorted output) e.g., find students in increasing grade point average (gpa) order SELECT A, B, C FROM R

More information

Search EECS 348 Intro to Artificial Intelligence

Search EECS 348 Intro to Artificial Intelligence Search EECS 348 Intro to Artificial Intelligence (slides from Oren Etzioni, based on Stuart Russell, Dan Weld, Henry Kautz, and others) What is Search? Search is a class of techniques for systematically

More information

Heuris'c Search. Reading note: Chapter 4 covers heuristic search.

Heuris'c Search. Reading note: Chapter 4 covers heuristic search. Heuris'c Search Reading note: Chapter 4 covers heuristic search. Credits: Slides in this deck are drawn from or inspired by a multitude of sources including: Shaul Markovitch Jurgen Strum Sheila McIlraith

More information

Chapter 3: Solving Problems by Searching

Chapter 3: Solving Problems by Searching Chapter 3: Solving Problems by Searching Prepared by: Dr. Ziad Kobti 1 Problem-Solving Agent Reflex agent -> base its actions on a direct mapping from states to actions. Cannot operate well in large environments

More information

Using Lookaheads with Optimal Best-First Search

Using Lookaheads with Optimal Best-First Search Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) Using Lookaheads with Optimal Best-First Search Roni Stern Tamar Kulberis Ariel Felner Information Systems Engineering

More information

CS 331: Artificial Intelligence Informed Search. Informed Search

CS 331: Artificial Intelligence Informed Search. Informed Search CS 331: Artificial Intelligence Informed Search 1 Informed Search How can we make search smarter? Use problem-specific knowledge beyond the definition of the problem itself Specifically, incorporate knowledge

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

INTRODUCTION TO HEURISTIC SEARCH

INTRODUCTION TO HEURISTIC SEARCH INTRODUCTION TO HEURISTIC SEARCH What is heuristic search? Given a problem in which we must make a series of decisions, determine the sequence of decisions which provably optimizes some criterion. What

More information

From: AAAI Technical Report SS Compilation copyright 1999, AAAI (www.aaai.org). All rights reserved.

From: AAAI Technical Report SS Compilation copyright 1999, AAAI (www.aaai.org). All rights reserved. From: AAAI Technical Report SS-99-07. Compilation copyright 1999, AAAI (www.aaai.org). All rights reserved. Divide-and-Conquer Bidirectional Search Richard E. Korf Computer Science Department University

More information

Fast and Loose in Bounded Suboptimal Heuristic Search

Fast and Loose in Bounded Suboptimal Heuristic Search Fast and Loose in Bounded Suboptimal Heuristic Search Jordan T. Thayer and Wheeler Ruml Department of Computer Science University of New Hampshire Durham, NH 84 USA jtd7, ruml at cs.unh.edu Ephrat Bitton

More information

Lecture 4: Search 3. Victor R. Lesser. CMPSCI 683 Fall 2010

Lecture 4: Search 3. Victor R. Lesser. CMPSCI 683 Fall 2010 Lecture 4: Search 3 Victor R. Lesser CMPSCI 683 Fall 2010 First Homework 1 st Programming Assignment 2 separate parts (homeworks) First part due on (9/27) at 5pm Second part due on 10/13 at 5pm Send homework

More information

CS510 \ Lecture Ariel Stolerman

CS510 \ Lecture Ariel Stolerman CS510 \ Lecture02 2012-10-03 1 Ariel Stolerman Midterm Evan will email about that after the lecture, at least 2 lectures from now. The exam will be given in a regular PDF (not an online form). We will

More information

Outline for today s lecture. Informed Search. Informed Search II. Review: Properties of greedy best-first search. Review: Greedy best-first search:

Outline for today s lecture. Informed Search. Informed Search II. Review: Properties of greedy best-first search. Review: Greedy best-first search: Outline for today s lecture Informed Search II Informed Search Optimal informed search: A* (AIMA 3.5.2) Creating good heuristic functions Hill Climbing 2 Review: Greedy best-first search: f(n): estimated

More information

Informed Search. CS 486/686 University of Waterloo May 10. cs486/686 Lecture Slides 2005 (c) K. Larson and P. Poupart

Informed Search. CS 486/686 University of Waterloo May 10. cs486/686 Lecture Slides 2005 (c) K. Larson and P. Poupart Informed Search CS 486/686 University of Waterloo May 0 Outline Using knowledge Heuristics Best-first search Greedy best-first search A* search Other variations of A* Back to heuristics 2 Recall from last

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Heuristic (Informed) Search

Heuristic (Informed) Search Heuristic (Informed) Search (Where we try to choose smartly) R&N: Chap. 4, Sect. 4.1 3 1 Recall that the ordering of FRINGE defines the search strategy Search Algorithm #2 SEARCH#2 1. INSERT(initial-node,FRINGE)

More information

mywbut.com Informed Search Strategies-II

mywbut.com Informed Search Strategies-II Informed Search Strategies-II 1 3.3 Iterative-Deepening A* 3.3.1 IDA* Algorithm Iterative deepening A* or IDA* is similar to iterative-deepening depth-first, but with the following modifications: The depth

More information

Efficient AND/OR Search Algorithms for Exact MAP Inference Task over Graphical Models

Efficient AND/OR Search Algorithms for Exact MAP Inference Task over Graphical Models Efficient AND/OR Search Algorithms for Exact MAP Inference Task over Graphical Models Akihiro Kishimoto IBM Research, Ireland Joint work with Radu Marinescu and Adi Botea Outline 1 Background 2 RBFAOO

More information

A Survey of Suboptimal Search Algorithms

A Survey of Suboptimal Search Algorithms A Survey of Search Algorithms Jordan T. Thayer and Wheeler Ruml jtd7, ruml at cs.unh.edu slides at: http://www.cs.unh.edu/ jtd7/papers/ Jordan Thayer and Wheeler Ruml (UNH) Search 1 / 28 This is Search

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Information Systems and Machine Learning Lab (ISMLL) Tomáš Horváth 16 rd November, 2011 Informed Search and Exploration Example (again) Informed strategy we use a problem-specific

More information

Informed search algorithms

Informed search algorithms Informed search algorithms This lecture topic Chapter 3.5-3.7 Next lecture topic Chapter 4.1-4.2 (Please read lecture topic material before and after each lecture on that topic) Outline Review limitations

More information

Suboptimal and Anytime Heuristic Search on Multi-Core Machines

Suboptimal and Anytime Heuristic Search on Multi-Core Machines uboptimal and Anytime Heuristic earch on Multi-Core Machines Ethan Burns and eth Lemons and heeler Ruml Department of Computer cience University of New Hampshire Durham, NH 03824 UA eaburns, seth.lemons,

More information

CMU-Q Lecture 2: Search problems Uninformed search. Teacher: Gianni A. Di Caro

CMU-Q Lecture 2: Search problems Uninformed search. Teacher: Gianni A. Di Caro CMU-Q 15-381 Lecture 2: Search problems Uninformed search Teacher: Gianni A. Di Caro RECAP: ACT RATIONALLY Think like people Think rationally Agent Sensors? Actuators Percepts Actions Environment Act like

More information