Parallelizing A Convergent Approximate Inference Method
Ming Su (1) and Elizabeth Thompson (2)
Departments of (1) Electrical Engineering and (2) Statistics, University of Washington
{mingsu, eathomp}@u.washington.edu

Abstract. The ability to efficiently perform probabilistic inference tasks is critical to large-scale applications in statistics and artificial intelligence. Dramatic speedup might be achieved by appropriately mapping current inference algorithms to a parallel framework. Parallel exact inference methods still suffer from exponential complexity in the worst case. Approximate inference methods have been parallelized with good speedup. In this paper, we focus on a variant of the Belief Propagation algorithm. This variant has better convergence properties and is provably convergent under certain conditions. We show that this method is amenable to coarse-grained parallelization and propose techniques to parallelize it optimally without sacrificing convergence. Experiments on a shared-memory system demonstrate that near-ideal speedup is achieved with reasonable scalability.

Keywords: Graphical Model, Approximate Inference, Parallel Algorithm

1 Introduction

The ability to efficiently perform probabilistic inference tasks is critical to large-scale applications in statistics and artificial intelligence. In particular, such problems arise in the analysis of genetic data on large and complex pedigrees [1] or data at large numbers of markers across the genome [2]. Ever-evolving parallel computing technology suggests that dramatic speedup might be achieved by appropriately mapping existing sequential inference algorithms to a parallel framework. Exact inference methods, such as variable elimination (VE) and the junction tree algorithm, have been parallelized and reasonable speedup achieved [3-7]. However, the complexity of exact inference methods for a graphical model is exponential in the tree-width of the graph.
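For context, this is the familiar complexity bound for junction-tree inference (notation here is ours, not the paper's): on a model with $N$ variables, each with domain size $d$, and a triangulation of width $w$, message passing costs on the order of

```latex
O\!\left( N \cdot d^{\,w+1} \right)
```

so the running time is exponential in the tree-width $w$, and no parallel decomposition of the exact algorithm removes that exponent.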
For graphs with large tree-width, approximate methods are necessary. While it has been demonstrated empirically that loopy and generalized BP work extremely well in many applications [8], Yedidia et al. [9] have shown that these methods are not guaranteed to converge on loopy graphs. Recently, a promising parallel approximate inference method was presented by Gonzalez et al. [10], in which loopy Belief Propagation (BP)
was optimally parallelized, but without a guarantee of convergence. The UPS algorithm [11] has gained popularity due to its reasonably good performance and ease of implementation [12, 13]. More importantly, the convex relaxation method, which includes UPS as a special case, is guaranteed to converge under mild conditions [14]. In this paper, we develop an effective parallel generalized inference method with special attention to the UPS algorithm. Even though the generalized inference method possesses a structural parallelism that is straightforward to extract, ineffective task partitioning and sequencing can result in imbalanced load and excessive communication overhead. We focus on solving these two problems and on demonstrating the performance of the efficiently parallelized algorithms on large-scale problems using a shared-memory system.

2 Convex Relaxation Method and Subproblem Construction

The convex relaxation method relies on the notion of region graphs to facilitate the Bethe approximation. In the Bethe approximation, one minimizes the Bethe free energy function and uses its solution to obtain an estimate of the partition function and of the true marginal distributions [14]. The Bethe free energy is a function of terms known as pseudo-marginals. Definitions and examples of the Bethe approximation, Bethe region graphs and pseudo-marginals can be found in [9, 15]. The UPS algorithm and the convex relaxation method are based on the fact that if the graphical model admits a tree-structured Bethe region graph, the associated Bethe approximation is exact [9, 15]; that is, minimization of the Bethe free energy is a convex optimization problem. We obtain a convex subproblem by fixing the pseudo-marginals associated with a selected subset of inner regions to a constant vector. The convex relaxation method works by first finding a sequence of such convex subproblems and then repeatedly solving them until convergence.
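For reference, the Bethe free energy over pseudo-marginals $b$ takes the standard form given in [9] (the notation below is illustrative, not copied from the paper):

```latex
F_{\mathrm{Bethe}}(b) \;=\; \sum_{a} \sum_{x_a} b_a(x_a)\,
    \ln \frac{b_a(x_a)}{f_a(x_a)}
\;-\; \sum_{i} (d_i - 1) \sum_{x_i} b_i(x_i)\, \ln b_i(x_i)
```

where $a$ ranges over outer regions with factors $f_a$, $i$ over inner (single-variable) regions, and $d_i$ is the number of outer regions containing $i$. The second sum contributes the concave terms that can destroy convexity; fixing the pseudo-marginals $b_i$ of a selected subset of inner regions turns those terms into constants, which is what makes each subproblem convex.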
Graphically, the subproblems are defined over a sequence of tree-structured subgraphs. Simple schemes for finding these subgraphs in grid graphs are proposed in [11]. However, these schemes are not optimal and cannot be extended to general graphs. We present a hypergraph spanning tree algorithm that is more effective and is applicable to general graphs. With the hypergraph representation, the problem of finding these subgraphs, which otherwise requires ad hoc treatment in bipartite region graphs, becomes well-defined. Definitions of hypergraphs, hyperedges, hypergraph spanning trees and hyperforests can be found in [16]. In the hypergraph representation, nodes and hyperedges correspond to outer regions and inner regions, respectively. Specifically, an inner region can be regarded as a set whose elements are the adjacent outer regions. In the Greedy Sequencing procedure developed in [14], all outer regions are included in each subproblem. The sequence of tree-structured subgraphs corresponds to a sequence of spanning hypertrees. In general, a spanning tree in a hypergraph may not exist, and even determining its existence is strongly NP-complete [16].

Fig. 1. (a) MapReduce flowchart for a sequence of size 2; (b) Coarsening by contracting edges 3, 4 and 5.

We develop a heuristic, hyperspan, by extending Kruskal's minimum spanning tree algorithm for ordinary graphs. We apply hyperspan repeatedly to obtain a sequence of spanning hyperforests. In this context, the convergence criterion of [14] translates to the condition that every hyperedge must appear in at least one spanning forest. The Greedy Sequencing procedure guarantees that, even in the worst case, the convergence criterion is satisfied. Interestingly, for a grid graph model of arbitrary size, the greedy sequencing procedure returns a sequence of size two, which is optimal.

3 Parallel and Distributed Inference

In the greedy sequencing procedure, if a subproblem is defined on a forest rather than on a tree, we can run Iterative Scaling (IS) on the disconnected components independently and consequently in parallel. This suggests a natural way of extracting coarse-grained parallelism uniformly across the sequence of subproblems. The basic idea is to partition the hypertree, or even the hyperforest, into a prescribed number t of components and assign the computation associated with each component to a separate processing unit. No communication cost is incurred among the independent computation tasks. This maps to a coarse-grained MapReduce framework [17], as shown in Figure 1(a). Note that synchronization, accomplished by software barriers, is still required at the end of each inner iteration. In this paper, we focus only on mapping the algorithm to a shared-memory system. Task partitioning is performed using the multilevel hypergraph partitioning program hmetis [18]. Compared to alternative programs, it has a much shorter solution time and, more importantly, it produces balanced partitions with significantly fewer cut edges.
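The paper does not give pseudocode for hyperspan or for the Greedy Sequencing procedure; the sketch below is one plausible Kruskal-style reading, in which a hyperedge is accepted only when the nodes (outer regions) it joins currently lie in pairwise-distinct components, and rounds are repeated, prioritizing not-yet-covered hyperedges, until every hyperedge has appeared in some spanning hyperforest. The union-find structure and the tie-breaking order are our assumptions.

```python
class DSU:
    """Union-find over node ids, with path halving."""
    def __init__(self, n):
        self.p = list(range(n))
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.p[ra] = rb

def hyperspan(num_nodes, hyperedges):
    """Kruskal-style heuristic: greedily grow a spanning hyperforest,
    accepting a hyperedge only if its nodes lie in distinct components.
    Returns indices (into `hyperedges`) of the accepted edges."""
    dsu, accepted = DSU(num_nodes), []
    for idx, edge in enumerate(hyperedges):
        roots = {dsu.find(v) for v in edge}
        if len(roots) == len(edge):        # no cycle introduced
            first, *rest = edge
            for v in rest:                 # merge all covered components
                dsu.union(first, v)
            accepted.append(idx)
    return accepted

def greedy_sequence(num_nodes, hyperedges):
    """Repeat hyperspan until every hyperedge appears in at least one
    spanning hyperforest (the convergence criterion of [14])."""
    remaining, seq = set(range(len(hyperedges))), []
    while remaining:
        # Try uncovered hyperedges first so each round covers at least one.
        order = sorted(range(len(hyperedges)), key=lambda i: i not in remaining)
        forest = hyperspan(num_nodes, [hyperedges[i] for i in order])
        chosen = [order[j] for j in forest]
        seq.append(chosen)
        remaining -= set(chosen)
    return seq
```

On a four-cycle, one round picks three of the four edges and a second round picks up the remaining one, so the criterion is met with a sequence of size two, consistent with the grid-graph observation above.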
The convergence criterion states that every hyperedge must appear in at least one spanning forest [14]. This means that no hyperedge may always be a cut edge. A simple technique, edge contraction, prevents a hyperedge from being a cut edge. When a hyperedge is contracted, it is replaced by a super node containing this edge and all nodes that are adjacent to this
edge. All other edges previously adjacent to any of these nodes become adjacent to the super node (Figure 1(b)). After partitioning once, we can contract a subset of the cut edges, producing a coarsened hypergraph; repartitioning the coarsened hypergraph will not place any cut on the contracted edges. Near-optimal speedup is achieved only with perfect load balancing. Knowing that IS solution time is proportional to the number of nodes, we perform weighted partitionings. The weight of a regular node is 1; the weight of a super node is the number of regular nodes it contains. Reasonable load balance is achieved through weighted partitioning when the average interaction between adjacent random variables is not too high. For high interaction, partitioning-based static load balancing (SLB) performs poorly. In Section 4, we show this effect and propose techniques to accommodate it. We adopted the common multithreading scheme in which n threads are created on an n-core system and each thread is assigned to a separate core. Thread synchronization ensures that all subproblems converge. We use non-blocking sends and blocking receives because they are more efficient for this implementation. For efficiency, pseudo-marginals are sent and received in one package rather than individually; sender and receiver use a predefined protocol to pack and unpack the aggregate into individual pseudo-marginal messages. Our experimental environment is a shared-memory 8-core system with 2 Intel Xeon Quad Core E GHz processors running Debian Linux. We implemented the algorithms in the Java programming language using MPJ Express, an open-source Java message passing interface (MPI) library that allows application developers to write and execute parallel applications on multicore processors and computer clusters/clouds.
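The edge-contraction step of Figure 1(b), together with the super-node weights used for weighted partitioning, can be sketched as follows. The representation (hyperedges as sets of integer node ids, weights as a dict, the smallest absorbed id reused as the super node's id) is our assumption, not the paper's data format.

```python
def contract_hyperedge(hyperedges, weights, target):
    """Contract hyperedge `hyperedges[target]`: its nodes collapse into a
    single super node whose weight is the sum of the absorbed node weights,
    and every other hyperedge adjacent to an absorbed node becomes adjacent
    to the super node instead. Returns (new_hyperedges, new_weights)."""
    absorbed = set(hyperedges[target])
    super_id = min(absorbed)                       # reuse smallest id
    new_weights = {v: w for v, w in weights.items() if v not in absorbed}
    new_weights[super_id] = sum(weights[v] for v in absorbed)
    new_edges = []
    for i, edge in enumerate(hyperedges):
        if i == target:
            continue                               # contracted edge vanishes
        remapped = {super_id if v in absorbed else v for v in edge}
        if len(remapped) > 1:                      # drop fully swallowed edges
            new_edges.append(remapped)
    return new_edges, new_weights
```

A repartitioning of the coarsened hypergraph then cannot cut the contracted edge, since all of its former endpoints now sit inside one (heavier) vertex, which is exactly what the weighted-partitioning step accounts for.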
4 Experiments and Results

The selected class of test problems are Ising models, with joint distribution

P(x) \propto \exp\Bigl( \sum_{i \in V} \alpha_i x_i \;+\; \sum_{(i,j) \in E} \beta_{ij} x_i x_j \Bigr),

where V and E are the nodes and edges of the graph. The α_i are drawn uniformly from [-1, 1] and the β_ij uniformly from [-β, β]. When β > 1, loopy BP fails to converge even for small graphs. Due to synchronization, the slowest task determines the overall performance. The SLB introduced in Section 3 performs worse as β increases. In practice, we apply two runtime techniques to mitigate the problem. First, a dynamic load balancing (DLB) scheme is developed. Instead of partitioning the graph into n components and distributing them to n threads, we partition the graph into more components than threads and put them into a task pool. At runtime, each thread fetches a task from the pool once it finishes its current task. The use of each core is maximized and the length of the bottleneck task is shortened. The second technique is bottleneck task early termination (ET). A thread is terminated when all other threads have become idle and no task is left in the pool. However, terminating a task prematurely has two undesirable effects. First, it breaks the convergence requirement. Second, it may change the convergence rate. In order to ensure
convergence, we can occasionally switch back to non-ET mode, especially when oscillation of messages is detected.

Fig. 2. (a) Load balance: DLB & ET vs. SLB. Normalized load (relative to the largest) shown for each core, for three cases: 2 cores (upper left), 4 cores (upper right) and 8 cores (bottom). (b) Speedup: DLB & ET vs. SLB.

With β = 1.1, we randomly generated 100 problems. The number of cores ranges from 2 up to 8 to demonstrate both raw speedup and scalability. Speedup is defined as the ratio of sequential to parallel elapsed time. At this interaction level, the sequential run time exceeds 1 minute, motivating parallelization, and SLB starts performing poorly. Figure 2(a) shows that with SLB, poor balance results irrespective of the number of cores used. This is dramatically mitigated by DLB and ET. Notice that almost perfect balance is achieved for a small number of cores (2, 4), but with 8 cores the load is less balanced. The average speedup over the 100 problems is shown in Figure 2(b), both for SLB and for DLB with ET. DLB and ET universally improved the speedup, and the improvement became more prominent as the number of cores increased. With DLB and ET, the speedup approaches the ideal case until the number of cores reaches 6. We attribute this drop in speedup to two factors. First, as shown in Figure 2(a), even with DLB and ET, load becomes less balanced as the number of cores increases. Second, there is an increased level of resource contention for memory bandwidth: the BP algorithm accesses memory frequently, and as more tasks run in parallel, the number of concurrent memory accesses also increases.

5 Discussion

In this paper, we proposed a heuristic for subproblem construction. This heuristic has been shown to be effective and is provably optimal for grid graphs.
Thorough testing on a complete set of benchmark networks will be important in evaluating the performance of the heuristic. Our parallel implementation is at the algorithmic level, which means it can be combined with lower-level parallelization techniques proposed by other researchers. Experiments on a shared-memory system exhibit near-ideal speedup with reasonable scalability.
Further exploration is necessary to demonstrate that the speedup scales up in practice on large distributed-memory systems, such as clusters.

Acknowledgments. This work is supported by NIH grant HG

References

1. Cannings C, Thompson EA, and Skolnick MH. (1978) Probability functions on complex pedigrees. Advances in Applied Probability 10.
2. Abecasis GR, Cherny SS, Cookson WO, and Cardon LR. (2002) Merlin: rapid analysis of dense genetic maps using sparse gene flow trees. Nature Genetics 30.
3. Shachter RD, and Andersen SK. (1994) Global Conditioning for Probabilistic Inference in Belief Networks. UAI.
4. Pennock D. (1998) Logarithmic Time Parallel Bayesian Inference. UAI.
5. Kozlov A, and Singh J. (1994) A Parallel Lauritzen-Spiegelhalter Algorithm for Probabilistic Inference. In Proceedings of the 1994 Conference on Supercomputing.
6. Namasivayam VK, Pathak A, and Prasanna VK. (2006) Scalable Parallel Implementation of Bayesian Network to Junction Tree Conversion for Exact Inference. In 18th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 06).
7. Xia Y, and Prasanna VK. (2008) Parallel exact inference on the Cell Broadband Engine processor. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing.
8. Botetz B. (2007) Efficient belief propagation for vision using linear constraint nodes. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
9. Yedidia JS, Freeman WT, and Weiss Y. (2000) Generalized belief propagation. In NIPS. MIT Press.
10. Gonzalez J, Low Y, Guestrin C, and O'Hallaron D. (2009) Distributed Parallel Inference on Large Factor Graphs. UAI.
11. Teh YW, and Welling M. (2001) The unified propagation and scaling algorithm. In NIPS.
12. Carbonetto P, de Freitas N, and Barnard K. (2004) A statistical model for general contextual object recognition. In ECCV.
13. Xie Z, Gao J, and Wu X. (2009) Regional category parsing in undirected graphical models. Pattern Recognition Letters, 30(14).
14. Su M. (2010) On the Convergence of Convex Relaxation Method and Distributed Optimization of Bethe Free Energy. In Proceedings of the 11th International Symposium on Artificial Intelligence and Mathematics (ISAIM), Fort Lauderdale, Florida.
15. Heskes T. (2002) Stable fixed points of loopy belief propagation are local minima of the Bethe free energy. In NIPS.
16. Tomescu I, and Zimand M. (1994) Minimum spanning hypertrees. Discrete Applied Mathematics, 54.
17. Dean J, and Ghemawat S. (2004) MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation, San Francisco, CA.
18. Karypis G, and Kumar V. (1998) hmetis: A Hypergraph Partitioning Package.
Summary s Cluster Tree Elimination Importance of s Solving Topological structure dene key features for a wide class of problems CSP: Inference in acyclic network is extremely ecient (polynomial) Idea:
More information6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS
Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long
More informationParallel Exact Inference on Multicore Using MapReduce
Parallel Exact Inference on Multicore Using MapReduce Nam Ma Computer Science Department University of Southern California Los Angeles, CA 9009 Email: namma@usc.edu Yinglong Xia IBM T.J. Watson Research
More informationImproving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets
Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Md Nasim Adnan and Md Zahidul Islam Centre for Research in Complex Systems (CRiCS)
More informationMultithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa
CS4460 Advanced d Algorithms Batch 08, L4S2 Lecture 11 Multithreaded Algorithms Part 1 N. H. N. D. de Silva Dept. of Computer Science & Eng University of Moratuwa Announcements Last topic discussed is
More informationDynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle
Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation
More informationGraph Partitioning for Scalable Distributed Graph Computations
Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering
More informationMultigrid Pattern. I. Problem. II. Driving Forces. III. Solution
Multigrid Pattern I. Problem Problem domain is decomposed into a set of geometric grids, where each element participates in a local computation followed by data exchanges with adjacent neighbors. The grids
More informationEscola Politécnica, University of São Paulo Av. Prof. Mello Moraes, 2231, , São Paulo, SP - Brazil
Generalizing Variable Elimination in Bayesian Networks FABIO GAGLIARDI COZMAN Escola Politécnica, University of São Paulo Av. Prof. Mello Moraes, 2231, 05508-900, São Paulo, SP - Brazil fgcozman@usp.br
More informationSupplementary Material: The Emergence of. Organizing Structure in Conceptual Representation
Supplementary Material: The Emergence of Organizing Structure in Conceptual Representation Brenden M. Lake, 1,2 Neil D. Lawrence, 3 Joshua B. Tenenbaum, 4,5 1 Center for Data Science, New York University
More informationClustering: Classic Methods and Modern Views
Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering
More informationECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning
ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning Topics Markov Random Fields: Inference Exact: VE Exact+Approximate: BP Readings: Barber 5 Dhruv Batra
More informationUsing Combinatorial Optimization within Max-Product Belief Propagation
Using Combinatorial Optimization within Max-Product Belief Propagation John Duchi Daniel Tarlow Gal Elidan Daphne Koller Department of Computer Science Stanford University Stanford, CA 94305-9010 {jduchi,dtarlow,galel,koller}@cs.stanford.edu
More informationPairwise Clustering and Graphical Models
Pairwise Clustering and Graphical Models Noam Shental Computer Science & Eng. Center for Neural Computation Hebrew University of Jerusalem Jerusalem, Israel 994 fenoam@cs.huji.ac.il Tomer Hertz Computer
More informationDecision Problems. Observation: Many polynomial algorithms. Questions: Can we solve all problems in polynomial time? Answer: No, absolutely not.
Decision Problems Observation: Many polynomial algorithms. Questions: Can we solve all problems in polynomial time? Answer: No, absolutely not. Definition: The class of problems that can be solved by polynomial-time
More informationPart II. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS
Part II C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Converting Directed to Undirected Graphs (1) Converting Directed to Undirected Graphs (2) Add extra links between
More informationA Comparative Study for Efficient Synchronization of Parallel ACO on Multi-core Processors in Solving QAPs
2 IEEE Symposium Series on Computational Intelligence A Comparative Study for Efficient Synchronization of Parallel ACO on Multi-core Processors in Solving Qs Shigeyoshi Tsutsui Management Information
More informationD-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C.
D-Separation Say: A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked by C if it contains a node such that either a) the arrows on the path meet either
More informationA Parallel Genetic Algorithm for Maximum Flow Problem
A Parallel Genetic Algorithm for Maximum Flow Problem Ola M. Surakhi Computer Science Department University of Jordan Amman-Jordan Mohammad Qatawneh Computer Science Department University of Jordan Amman-Jordan
More informationAdaptive-Mesh-Refinement Pattern
Adaptive-Mesh-Refinement Pattern I. Problem Data-parallelism is exposed on a geometric mesh structure (either irregular or regular), where each point iteratively communicates with nearby neighboring points
More informationDynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers
Dynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers A. Srivastava E. Han V. Kumar V. Singh Information Technology Lab Dept. of Computer Science Information Technology Lab Hitachi
More informationParallel Methods for Convex Optimization. A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney
Parallel Methods for Convex Optimization A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney Problems minimize g(x)+f(x; A, b) Sparse regression g(x) =kxk 1 f(x) =kax bk 2 2 mx Sparse SVM g(x) =kxk
More informationMaximum Clique Conformance Measure for Graph Coloring Algorithms
Maximum Clique Conformance Measure for Graph Algorithms Abdulmutaleb Alzubi Jadarah University Dept. of Computer Science Irbid, Jordan alzoubi3@yahoo.com Mohammad Al-Haj Hassan Zarqa University Dept. of
More informationPerformance impact of dynamic parallelism on different clustering algorithms
Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu
More informationMore details on Loopy BP
Readings: K&F: 11.3, 11.5 Yedidia et al. paper from the class website Chapter 9 - Jordan Loopy Belief Propagation Generalized Belief Propagation Unifying Variational and GBP Learning Parameters of MNs
More informationParallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs
Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung
More informationDESIGN AND ANALYSIS OF ALGORITHMS GREEDY METHOD
1 DESIGN AND ANALYSIS OF ALGORITHMS UNIT II Objectives GREEDY METHOD Explain and detail about greedy method Explain the concept of knapsack problem and solve the problems in knapsack Discuss the applications
More informationSLS Methods: An Overview
HEURSTC OPTMZATON SLS Methods: An Overview adapted from slides for SLS:FA, Chapter 2 Outline 1. Constructive Heuristics (Revisited) 2. terative mprovement (Revisited) 3. Simple SLS Methods 4. Hybrid SLS
More informationResidual Splash for Optimally Parallelizing Belief Propagation
Joseph E. Gonzalez Carnegie Mellon University Yucheng Low Carnegie Mellon University Carlos Guestrin Carnegie Mellon University Abstract As computer architectures move towards multicore we must build a
More informationReview of the Robust K-means Algorithm and Comparison with Other Clustering Methods
Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods Ben Karsin University of Hawaii at Manoa Information and Computer Science ICS 63 Machine Learning Fall 8 Introduction
More informationQuasi-Dynamic Network Model Partition Method for Accelerating Parallel Network Simulation
THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS TECHNICAL REPORT OF IEICE. 565-871 1-5 E-mail: {o-gomez,oosaki,imase}@ist.osaka-u.ac.jp QD-PART (Quasi-Dynamic network model PARTition
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More information6 : Factor Graphs, Message Passing and Junction Trees
10-708: Probabilistic Graphical Models 10-708, Spring 2018 6 : Factor Graphs, Message Passing and Junction Trees Lecturer: Kayhan Batmanghelich Scribes: Sarthak Garg 1 Factor Graphs Factor Graphs are graphical
More informationA Parallel Algorithm for Exact Structure Learning of Bayesian Networks
A Parallel Algorithm for Exact Structure Learning of Bayesian Networks Olga Nikolova, Jaroslaw Zola, and Srinivas Aluru Department of Computer Engineering Iowa State University Ames, IA 0010 {olia,zola,aluru}@iastate.edu
More informationGeometric Registration for Deformable Shapes 3.3 Advanced Global Matching
Geometric Registration for Deformable Shapes 3.3 Advanced Global Matching Correlated Correspondences [ASP*04] A Complete Registration System [HAW*08] In this session Advanced Global Matching Some practical
More informationHEURISTIC ALGORITHMS FOR THE GENERALIZED MINIMUM SPANNING TREE PROBLEM
Proceedings of the International Conference on Theory and Applications of Mathematics and Informatics - ICTAMI 24, Thessaloniki, Greece HEURISTIC ALGORITHMS FOR THE GENERALIZED MINIMUM SPANNING TREE PROBLEM
More informationScalable Inference in Hierarchical Generative Models
Scalable Inference in Hierarchical Generative Models Thomas Dean Department of Computer Science Brown University, Providence, RI 02912 Abstract Borrowing insights from computational neuroscience, we present
More informationA Parallel Evolutionary Algorithm for Discovery of Decision Rules
A Parallel Evolutionary Algorithm for Discovery of Decision Rules Wojciech Kwedlo Faculty of Computer Science Technical University of Bia lystok Wiejska 45a, 15-351 Bia lystok, Poland wkwedlo@ii.pb.bialystok.pl
More information