A Parallel Algorithm for Exact Structure Learning of Bayesian Networks


A Parallel Algorithm for Exact Structure Learning of Bayesian Networks

Olga Nikolova, Jaroslaw Zola, and Srinivas Aluru
Department of Computer Engineering, Iowa State University, Ames, IA
{olia,zola,aluru}@iastate.edu

Abstract

Given n random variables and a set of m observations of each of the n variables, the Bayesian network (BN) structure learning problem is to infer a directed acyclic graph (DAG) on the n variables such that the implied joint probability distribution best explains the set of observations. In this paper, we present a parallel algorithm for exact BN structure learning that is work-optimal and communication-efficient [1]. We demonstrate the applicability of our method with an implementation on the IBM Blue Gene/L and on an AMD Opteron cluster, and report extended experimental results that exhibit near-perfect scaling.

1 Exact Estimation of a Bayesian Network Structure

Bayesian networks (BNs) are probabilistic graphical models that allow a compact representation of the joint probability distribution (JPD) of a set of interacting variables of a given domain. The pair (N, P) of a directed acyclic graph (DAG) N = (X, E) and a JPD P over X defines a BN if each variable X_i ∈ X is independent of its non-descendants, given its parents. To evaluate the posterior probability of a candidate graph given a set of observations, the Bayesian approach utilizes a statistically motivated scoring function. Decomposable scoring functions, which are widely used, can be written as the sum of the individual score contributions s(X_i, Pa(X_i)) of each variable X_i ∈ X given its parents, i.e.,

    Score(N) = Σ_{X_i ∈ X} s(X_i, Pa(X_i)).

Optimizing such a scoring criterion facilitates the discovery of a structure that best represents the observed data. The problem of exact structure learning quickly becomes intractable due to its combinatorial search space, and has been shown to be NP-hard even for the case of bounded node in-degree.
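As a small illustration of decomposability, the sketch below evaluates Score(N) as a sum of per-variable contributions. The network encoding and the toy local score are our own illustrative assumptions, not part of the paper:

```python
def network_score(parent_sets, s):
    """Score(N) = sum over X_i of s(X_i, Pa(X_i)) for a decomposable score.

    parent_sets maps each variable to its parent set in the DAG N;
    s is any local scoring function (variable, parents) -> float.
    """
    return sum(s(i, parents) for i, parents in parent_sets.items())

# Toy local score (illustrative only): reward the edge 0 -> 1, penalize
# parent-set size.
def toy_s(i, parents):
    return (2.0 if (i == 1 and 0 in parents) else 0.0) - 0.5 * len(parents)

# Chain 0 -> 1 -> 2 scores 0 + (2 - 0.5) + (-0.5) = 1.0
score = network_score({0: set(), 1: {0}, 2: {1}}, toy_s)
```

Because the score is a sum of per-variable terms, the parents of each variable can be optimized largely independently, which is what the dynamic program below exploits.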
Parallel algorithms have been developed for heuristics-based BN learning, particularly using meta-heuristics and sampling techniques [2]. Such methods trade off optimality for the ability to learn larger networks. Constructing exact BNs without additional assumptions nevertheless remains valuable. In this paper, we present a parallel algorithm for exact BN learning with nearly perfect load balancing and optimal parallel efficiency. To our knowledge, this is the first such algorithm for exact structure learning. The presented work in exact Bayesian learning is intended to push the scale of networks for which optimal structures can be estimated.

Given a set of observations D_{n×m} and a decomposable scoring function, the BN structure learning problem is to find a DAG N on the n random variables that optimizes Score(N), or equivalently, to find the parents of each node in the DAG. Adopting a canonical representation of the DAGs in conjunction with a decomposable scoring function permits a dynamic programming

[1] This algorithm was originally presented at HiPC (see footnote [4]); here, novel experimental results are reported.
[2] O. Nikolova and S. Aluru. Parallel Discovery of Direct Causal Relations and Markov Blankets. Under review.

(DP) approach, first explored by Ott et al. [3]. Our parallel algorithm is based on the sequential method of Ott et al., which we briefly describe here.

Let A ⊆ X and X_i ∈ X \ A. Using D_{n×m}, a scoring function s(X_i, A) determines the score of choosing A to be the set of parents of X_i given D_{n×m}. Let F(X_i, A) denote the highest possible score that can be obtained by choosing parents of X_i from A, i.e.,

    F(X_i, A) = max_{B ⊆ A} s(X_i, B).

A set B that maximizes the preceding equation is an optimal parents set for X_i from the candidate parents set A. While F(X_i, A) can be computed by directly evaluating all 2^{|A|} subsets of A, it is advantageous to compute it by the following recursive formulation. For any non-empty set A:

    F(X_i, A) = max{ s(X_i, A), max_{X_j ∈ A} F(X_i, A \ {X_j}) }.

An ordering of a set A ⊆ X is a permutation π(A) of the elements of A. A network on A is said to be consistent with a given order π(A) if and only if, for every X_i ∈ A, the parents Pa(X_i) precede X_i in π(A). An optimal order π*(A) is an order with which an optimal network on A is consistent. Let X_i be the last element of π*(A). Then the permutation π*(A \ {X_i}), obtained by leaving out the last element X_i, is consistent with an optimal network on A \ {X_i}. We write this as π*(A) = π*(A \ {X_i}) X_i. Given A ⊆ X and an order π(A), let Q(A, π(A)) denote the optimal score of a network on A that is consistent with π(A):

    Q(A, π(A)) = Σ_{X_i ∈ A} F(X_i, {X_j : X_j precedes X_i in π(A)}).

The optimal score of a network on A ⊆ X is then given by Q*(A) = Q(A, π*(A)). Our goal is to find π*(X) and the optimal network score Q*(X), and to estimate the corresponding DAG. Like the F function, the functions π* and Q* can be computed by a recursive formulation. Consider a subset A. To find π*(A), we consider all possible choices for its last element:

    X_i* = argmax_{X_i ∈ A} Q(A, π*(A \ {X_i}) X_i) = argmax_{X_i ∈ A} ( F(X_i, A \ {X_i}) + Q*(A \ {X_i}) ).

Then:

    Q*(A) = F(X_i*, A \ {X_i*}) + Q*(A \ {X_i*}),   and   π*(A) = π*(A \ {X_i*}) X_i*.
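The recursions for F, Q*, and π* translate directly into a dynamic program over subsets. The following is an illustrative serial sketch in Python (our own encoding, not the paper's C++/MPI implementation), taking an arbitrary user-supplied decomposable local score s:

```python
from itertools import combinations

def exact_bn_structure(n, s):
    """Dynamic program over all subsets of X = {0, ..., n-1}.

    s(i, A) is any decomposable local score (variable index, frozenset of
    candidate parents) -> float; higher is better.  Returns the optimal
    order pi*(X), the optimal score Q*(X), and the optimal parent sets.
    """
    F = {}            # F[(i, A)]: best score for X_i with parents chosen from A
    best = {}         # best[(i, A)]: a maximizing parent set B subset of A
    Q = {frozenset(): 0.0}   # Q*(A)
    pi = {frozenset(): ()}   # pi*(A)
    V = range(n)
    for size in range(n + 1):   # subsets in order of increasing size
        for A in map(frozenset, combinations(V, size)):
            # F(X_i, A) = max{ s(X_i, A), max_{X_j in A} F(X_i, A \ {X_j}) }
            for i in V:
                if i in A:
                    continue
                F[(i, A)], best[(i, A)] = s(i, A), A
                for j in A:
                    if F[(i, A - {j})] > F[(i, A)]:
                        F[(i, A)], best[(i, A)] = F[(i, A - {j})], best[(i, A - {j})]
            if A:
                # Last element of pi*(A):
                # X* = argmax_{X_i in A} F(X_i, A \ {X_i}) + Q*(A \ {X_i})
                x = max(A, key=lambda i: F[(i, A - {i})] + Q[A - {i}])
                Q[A] = F[(x, A - {x})] + Q[A - {x}]
                pi[A] = pi[A - {x}] + (x,)
    X = frozenset(V)
    # Reconstruct: the parents of the k-th variable in pi*(X) are its best
    # choice from the k variables preceding it in the order.
    parents = {i: best[(i, frozenset(pi[X][:k]))] for k, i in enumerate(pi[X])}
    return pi[X], Q[X], parents
```

With a toy score rewarding the edges 0 → 1 and 1 → 2 and penalizing parent-set size, the sketch recovers the chain 0 → 1 → 2; its exponential 2^n-subset sweep is exactly the lattice traversal parallelized below.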
By keeping track of the optimal parents set of each variable in X, an optimal network is easily reconstructed. Using these recursive formulations, a DP algorithm to compute an optimal BN can be easily derived. The algorithm considers all subsets of X in order of increasing size, starting from the empty subset. When considering A, the goal is to compute F(X_i, A) for each X_i ∉ A, and Q(A, π*(A \ {X_i}) X_i) for each X_i ∈ A. All the F and Q values required for these computations have already been computed when considering the subsets A \ {X_i}, X_i ∈ A.

2 Parallel Algorithm

Figure 1: A lattice for 3 variables and its partitioning into 4 levels; the correspondence with a 3-dimensional hypercube is also shown (node ids: {} = 000, {1} = 001, {2} = 010, {1,2} = 011, {3} = 100, {1,3} = 101, {2,3} = 110, {1,2,3} = 111).

To develop the parallel algorithm, it is helpful to visualize the DP algorithm as operating on the lattice L formed by the partial order "set inclusion" on the power set of X. The lattice L is a directed graph (V, E), where V = 2^X and (B, A) ∈ E if B ⊂ A and |A| = |B| + 1. It is naturally partitioned into levels, where level l ∈ [0, n] contains all subsets of size l (Fig. 1). A parallel algorithm can be derived by mapping the nodes to processors and letting edges represent

[3] S. Ott, S. Imoto, and S. Miyano. Finding optimal models for small gene networks. In Pacific Symposium on Biocomputing.

communication if the incident nodes are assigned to different processors. A node A at level l has l incoming edges from the nodes A \ {X_i} for each X_i ∈ A, and n − l outgoing edges to the nodes A ∪ {X_i} for each X_i ∉ A. At node A, (n − l) F values are computed, along with Q*(A) and π*(A). All of these values need to be sent along each of the outgoing edges. On the outgoing edge to node A ∪ {X_i}, the value F(X_i, A) is used in computing Q*(A ∪ {X_i}), and the remaining values F(X_j, A) (X_j ∉ A and X_j ≠ X_i) are used in computing the values F(X_j, A ∪ {X_i}) at node A ∪ {X_i}. Note that each of the (n − l) F values at A is used in computing the Q value at one of the n − l nodes connected to A by outgoing edges. Each level of the lattice can be computed concurrently, with data flowing from one level to the next.

Mapping to an n-dimensional Hypercube. Observe that the undirected version of L is equivalent to an n-dimensional (n-d) hypercube. Let (X_1, X_2, ..., X_n) be an arbitrary ordering of the nodes in the BN. A subset A can be represented by an n-bit string ω, where ω[i] = 1 if X_i ∈ A, and ω[i] = 0 otherwise. As lattice edges connect pairs of nodes that differ in exactly one element, they naturally correspond to hypercube edges (Fig. 1). This suggests an obvious parallelization on an n-d hypercube. While we expect the number of processors p ≪ 2^n, we describe this parallelization in slightly greater detail because it is used as a module in the development of our parallel algorithm.

The n-d hypercube algorithm runs in n + 1 steps. Let ω denote the id of a processor and let µ(ω) denote the number of 1s in ω. Each processor is active in only one time step: processor ω is active in time step µ(ω). It receives (n − µ(ω) + 1) F values and one Q value from each of the µ(ω) neighbors obtained by inverting one of its 1 bits to 0. It then computes its own F and Q values, and sends them to each of the n − µ(ω) neighbors obtained by inverting one of its 0 bits to 1.
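Representing a subset as an integer bitmask, this communication pattern can be sketched as follows (a serial illustration of the schedule, with our own function names, not the MPI implementation):

```python
def popcount(w):
    """mu(w): number of 1 bits, i.e. the lattice level (and time step) of node w."""
    return bin(w).count("1")

def in_neighbors(w, n):
    """Nodes A \\ {X_i}: invert each 1 bit of w to 0 (the senders to w)."""
    return [w ^ (1 << i) for i in range(n) if (w >> i) & 1]

def out_neighbors(w, n):
    """Nodes A with one X_i added: invert each 0 bit of w to 1 (the receivers from w)."""
    return [w | (1 << i) for i in range(n) if not (w >> i) & 1]
```

Every in-neighbor of a node sits exactly one level lower, so when each node ω is processed at time step µ(ω), all of its inputs were produced exactly one step earlier.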
Partitioning into k-dimensional Hypercubes. Let p = 2^k be the number of processors, where k < n. We assume that the processors can communicate as in a hypercube. This is true of parallel computers connected as a hypercube, of hypercubic networks such as the butterfly, of multistage interconnection networks such as the Omega network, and of point-to-point communication models such as the permutation network model or the MPI programming model. Our strategy is to decompose the n-d hypercube into 2^{n−k} k-d hypercubes and map each k-d hypercube onto the p = 2^k processors as described previously. Although the efficiency of such a mapping by itself is suboptimal (Θ(1/k)), note that each processor is active in only one time step and p ≪ 2^n. Therefore, we pipeline the execution of the k-d hypercubes to complete the parallel execution in 2^{n−k} + k time steps, such that all processors are active except during the first k and last k time steps, while the pipeline builds up and drains.

For convenience of presentation, we use the lattice L and the n-d hypercube interchangeably: we use A to denote the lattice node for subset A, and ω_A to denote the binary string identifying the corresponding hypercube node. We number the positions of a binary string 1 ... n, and use ω_A[i, j] to denote the substring of ω_A between and including positions i and j. We partition the n-d hypercube into 2^{n−k} k-d hypercubes based on the first n − k bits of the node ids. For a lattice node ω_A, ω_A[1, n−k] specifies the k-d hypercube it is part of, and ω_A[n−k+1, n] specifies the processor it is assigned to. Conversely, a processor with id r computes for all lattice nodes ω_A such that r = ω_A[n−k+1, n].

Pipelining Hypercubes. Each k-d hypercube is specified by an (n−k)-bit string, which is the common prefix of the 2^k lattice/k-d hypercube nodes that are part of this k-d sub-hypercube.
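As an illustrative sketch (our own encoding, not the paper's C++/MPI code), the following shows the prefix/suffix decomposition of node ids together with the pipelined time-step assignment described next, in which sub-hypercubes are initiated one per step in order of increasing number of 1s and then lexicographically:

```python
from itertools import product

def map_node(omega, n, k):
    """Split an n-bit lattice-node id: the first n-k bits name the k-d
    sub-hypercube, the last k bits name the processor within it."""
    assert len(omega) == n
    return omega[:n - k], omega[n - k:]

def pipeline_schedule(n, k):
    """Time step at which each lattice node is computed.

    Sub-hypercube prefixes are initiated one per step, ordered by number
    of 1s and then lexicographically; within a sub-hypercube, the node
    on processor suffix w computes at offset popcount(w) from the
    initiation step."""
    prefixes = sorted(("".join(b) for b in product("01", repeat=n - k)),
                      key=lambda s: (s.count("1"), s))
    start = {pre: t for t, pre in enumerate(prefixes)}
    suffixes = ["".join(b) for b in product("01", repeat=k)]
    return {pre + suf: start[pre] + suf.count("1")
            for pre in prefixes for suf in suffixes}
```

A predecessor B = A \ {X_i} either lies in the same sub-hypercube (one fewer 1 in the suffix) or in a sub-hypercube initiated strictly earlier (one fewer 1 in the prefix), so every node's inputs are available before its assigned step; the last step is (2^{n−k} − 1) + k, matching the 2^{n−k} + k total.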
The k-d hypercubes are processed in increasing order of the number of 1s in their bit-string specifications, and in lexicographic order within each group of hypercubes with the same number of 1s. This total order is used to initiate the 2^{n−k} k-d hypercube evaluations, starting from time step 0. If T denotes the time step in which H_i is initiated and A is a lattice node mapped to H_i, then the processor with id ω_A[n−k+1, n] computes for the lattice node A in time step T + µ(ω_A[n−k+1, n]). This algorithm is correct, work-optimal, and space-efficient (for proofs and details, see Nikolova et al. [4]).

3 Experimental Validation

To assess the performance of the presented algorithm, we developed a C++ and MPI implementation. Experiments were performed on a 1,024 dual-core CPU Blue Gene/L (BG/L) and on up to 26 8-core

[4] O. Nikolova, J. Zola, and S. Aluru. A parallel algorithm for exact Bayesian network inference. In HiPC, pages 342–349, 2009.

nodes of an Infiniband AMD Opteron cluster (AMDO). Our focus is exclusively on the performance obtained by the parallel algorithm and the resulting ability to learn larger networks fast. We evaluated a gene regulatory network learning application on synthetic experimental data generated with the SynTReN package, using our implementation of the MDL scoring criterion. We examine scalability with respect to (i) a varying number of observations m for a fixed number of variables n = 24, and (ii) a varying number of variables n for a fixed number of observations m = 1000, on both platforms. Results are summarized in Tab. 1, and the corresponding relative speedup graphs for BG/L are shown in Fig. 2. The number of observations required to derive a meaningful network grows exponentially in the number of variables; therefore, the ability to compute for datasets with a large number of observations is important. We test our implementation on both platforms for m = 200 and m = 1000.

Table 1: Run-time results (in seconds) for (i) varying number of variables n and fixed number of observations m = 1000, and (ii) fixed number of variables n = 24 and varying number of observations m. Experimental results are shown for both Blue Gene/L and the AMD Opteron cluster.

Figure 2: Relative speedup on Blue Gene/L for n = 24 variables and varying number of observations m (a), and for varying number of variables n and fixed number of observations m = 1000 (b).
The experimental results show that, with an increasing number of observations, execution time increases linearly on BG/L and approximately linearly on AMDO, while linear scalability is maintained (Fig. 2(a)). This indicates efficient communication: speedup is maintained even for small numbers of observations, where the computational complexity decreases while the communication complexity remains unaffected. For a varying number of variables, we tested our implementation for n = 16 and n = 24 while fixing m = 1000. The resulting run-times reflect the exponential complexity in the number of variables. As can be expected, better speedup is observed for the larger dataset with n = 24 (Fig. 2(b)). Note that memory requirements also grow exponentially in the number of variables, which indicates that going parallel is advantageous both for run-time and for the ability to solve larger problems. Our implementation was capable of learning a network of 33 variables in 1 hour and 14 minutes; by comparison, Ott et al. reported that learning a network of 24 variables sequentially took multiple days (see footnote [3]).

T. Van den Bulcke, K. Van Leemput, B. Naudts, et al. SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics, 7:43, 2006.

4 Conclusions

Heuristics-based BN learning methods trade off optimality for the ability to learn larger networks. This work is intended to push the scale at which BN structures can be learned exactly, without additional assumptions. In this paper, we presented a parallel algorithm for exact structure learning that is work-optimal and scalable, and that achieves near-linear speedup in practice.