IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 9, SEPTEMBER 2011

Probabilistic Image Modeling With an Extended Chain Graph for Human Activity Recognition and Image Segmentation

Lei Zhang, Member, IEEE, Zhi Zeng, and Qiang Ji, Senior Member, IEEE

Abstract: Chain graph (CG) is a hybrid probabilistic graphical model (PGM) capable of modeling heterogeneous relationships among random variables. So far, however, its application in image and video analysis is very limited due to the lack of principled learning and inference methods for a CG of general topology. To overcome this limitation, we introduce methods to extend the conventional chain-like CG model to a CG model with more general topology, along with the associated methods for learning and inference in such a general CG model. Specifically, we propose techniques to systematically construct a generally structured CG, to parameterize this model, to derive its joint probability distribution, to perform joint parameter learning, and to perform probabilistic inference in this model. To demonstrate the utility of such an extended CG, we apply it to two challenging image and video analysis problems: human activity recognition and image segmentation. The experimental results show improved performance of the extended CG model over the conventional directed or undirected PGMs. This study demonstrates the promise of the extended CG for effective modeling and inference of complex real-world problems.

Index Terms: Activity recognition, Bayesian networks (BNs), chain graph (CG), factor graph (FG), graphical model learning and inference, image segmentation, Markov random fields (MRFs).

I. INTRODUCTION

Probabilistic graphical models (PGMs) have been developed as a powerful modeling tool. They provide a systematic way to capture various probabilistic relationships among random variables and offer principled methods for learning and inference. PGMs can be divided into two classes: undirected PGMs and directed acyclic PGMs. Examples of undirected PGMs include Markov random fields (MRFs) [1], [2] and conditional random fields (CRFs) [3], and they mainly capture mutually dependent relationships such as the spatial correlations among random variables. On the other hand, PGMs such as Bayesian networks (BNs) [4], [5] and hidden Markov models (HMMs) [6] are directed acyclic PGMs, and they typically model the causal relationships among random variables. Both types of PGMs have been exploited to solve image and video analysis problems. In fact, MRFs have become a de facto modeling framework for image segmentation, while HMMs have become standard tools for motion analysis and activity modeling. Despite their widespread use in image and video analysis, both undirected PGMs and directed PGMs have certain limitations regarding their modeling capability. Neither of them can effectively capture heterogeneous relationships.
For example, undirected PGMs usually capture mutual interactions among random variables, while directed PGMs typically model cause-and-effect relationships (i.e., causality). However, for many image modeling problems, the relationships among random variables are often heterogeneous. For example, for multiscale image segmentation, the relationships among related image entities in different layers (corresponding to different scales) may be best modeled by directed links, while the relationships among the entities in the same layer can be best modeled by undirected links. Considering these limitations and the fact that there are typically complex and heterogeneous relationships among the many entities involved in image modeling, there is a need for a single unified framework that can simultaneously capture all of these relationships and exploit them to solve problems in a systematic and principled manner. Chain graph (CG) [7] is a natural solution to the aforementioned problems. It is a hybrid PGM that consists of both directed and undirected links. CG therefore subsumes both directed and undirected PGMs. Its representation is powerful enough to capture heterogeneous relationships [8]. While a CG can theoretically assume any graphical topology, the conventional CG typically assumes a chain-like structure, which tends to limit its application scope. Thus far, CG applications in real-world problems are still very limited. Among the CG models that have been used, most have simplified structures and cannot fully exploit the modeling potential of CG. Moreover, the lack of principled methods for parameter learning and inference in a complex CG model further limits its practical utility. To overcome these limitations, this research extends the conventional chain-like CG to a CG of more general topology. In addition, we introduce principled methods for learning and inference in such an extended CG. We demonstrate the effectiveness of this extended CG model in two different image and video analysis applications.

II. RELATED WORKS

Despite CG's powerful representation capability and its generalization over directed and undirected PGMs, prior CG applications in real-world problems are very limited. Based on their

topology, existing hybrid graphical models can be divided into two main categories. The first category consists of a directed graphical model and an undirected graphical model, typically connected in series or stacked on top of each other through some common nodes [9], [10]. The learning and inference for such models are typically done separately for the directed and undirected models. In [9], Liu et al. combine a BN with a MRF for image segmentation. A naïve BN is used to transform the image features into a probability map in the image domain. The MRF enforces the a priori spatial relationships and the local homogeneity among image labels. In our previous work [10], we introduced a hybrid model by stacking a CRF on top of a BN for image segmentation, where the BN is used to capture the causal relationships and the CRF is used to capture the correlation among image entities. In both [9] and [10], the model parameters are learnt separately for the BN and MRF (or CRF) parts. Moreover, the inference in [9] is performed sequentially, with inference results from the BN part fed into the MRF part to perform further inference. The separate and sequential inference is not theoretically justified, and represents only an approximation to the simultaneous inference.

The second category of existing models are chain-like models, typically involving a chain of undirected models connected via directed links. Murino et al. [11] formulate a chain of MRF models connected by a BN for image processing. The BN is used to capture the a priori constraints between different abstraction levels. A coupled MRF is used to solve the combined restoration and segmentation problem at each level. They do not address the parameter learning issue and empirically specify the potential functions. For inference, they combine belief propagation with simulated annealing through sampling in a sequential manner from layer to layer. Such a sequential inference process represents an approximation to simultaneous inference in the whole model. Chardin and Pérez proposed a similar model in [12] and used a directed quadtree to connect MRF models on lattices to form a hybrid hierarchical model. They developed an EM-based algorithm to learn the model parameters, where they leverage its specific tree structure to develop the learning algorithm and use Gibbs sampling to approximately update some parameters for modeling the spatial relationships. For inference, they employ Gibbs sampling to maximize the posterior marginal probability. Hinton et al. developed a deep belief net (DBFN) in several works (e.g., [13]); we use the abbreviation DBFN instead of the traditional DBN in order to avoid confusion with another DBN (i.e., dynamic Bayesian network). Their DBFN uses the restricted Boltzmann machine (RBM) as the building block for each layer. Multiple layers are sequentially connected through directed links, where the lowest layer is the layer of observable variables and the other layers are hidden layers. Typically, there are no connections within a layer. However, some specific DBFN models allow the top two layers to be connected by undirected links [13]. In [14], they further combine the idea of DBFN with MRFs to formulate a deep network with causally linked MRFs, which allows each layer to be a MRF. In these specific cases, the deep network becomes a hybrid model. In [13] and [15], they studied the learning issue for such deep networks.

Hinton's model differs from the CG model introduced in this paper in several aspects. First, the DBFN model is constructed by connecting several RBMs or MRFs at different layers using directed links, while in our CG model, both the directed and undirected links can be within the same layer or between different layers, and they do not have to be linked like a hierarchical chain. Second, the DBFN approximates the true posterior distribution of the hidden nodes by implicitly assuming that the posterior distribution of each node is independent of the others, due to the lack of lateral links, and exploits this independence to learn the weights in the RBMs. In contrast, we do not make this assumption. We derive the joint probability distribution based on the CG structure using the general global Markov property [7]. Third, the DBFN is usually learned by greedy layer-by-layer learning [13], [15]. The learning starts from the bottom layer since the bottom layer comprises all observed variables. The learned hidden states of one RBM are then used as the observed variables for learning the next layer, and this process iterates until all layers are learned. In contrast, our approach learns all model parameters together. Fourth, a variational approach is usually used to approximately perform inference in the DBFN. In contrast, we convert our CG model into a factor graph (FG) [16] representation so that we can apply various principled methods, either exact or approximate, to perform inference.

In summary, the current CG models for real applications are limited. Their topology either consists of simply stacked directed and undirected models or a chain structure that connects several layers of undirected graphs. More importantly, the parameter learning for these models is typically performed separately for the undirected part and the directed part of the model. It often ignores the fact that the global partition function in their formulations couples all of the model parameters. For inference, the existing methods tend to perform either separate or sequential inference instead of simultaneous inference. Besides, some works simply apply the belief propagation theories developed for either directed or undirected PGMs to CGs without theoretical justification. In this paper, we intend to overcome these limitations.

III. EXTENDED CG MODEL

Here, we formally introduce the extended CG model as well as the associated methods for learning and inference in such a hybrid graphical model.

A. Extended CG Modeling

We first illustrate the construction and parametrization of the extended CG model with a relatively simple example and then provide the general formulation.

1) Model Construction: There are two principal strategies to construct a graphical model. One is to automatically learn the model structure (i.e., structure learning) based on certain criteria. The other is to manually construct the model structure based on human prior knowledge about the specific problem at hand. While automatic structure learning might find a better model structure and improve the performance, it is generally a very difficult problem even for a simple type of graphical model (e.g., a Bayesian network). In this work, we focus on manual construction of the CG model. To manually construct the CG model, we choose either directed links or undirected links to capture the relationships

between random variables based on the nature of these relationships and their semantic meanings. For example, if the relationships can be characterized as causal or one-way, directed links can be used to capture such relationships. On the other hand, if the relationships are mutual or two-way, undirected links can be chosen to capture such relationships. In cases where both directed and undirected links are applicable, we choose the one that can simplify the overall model structure. Other types of relationships can be selectively modeled according to the complexity of the constructed model and the consideration of parameter learning difficulty.

We use a multiscale image segmentation problem to explain the construction of a CG model (see Fig. 1). Different image entities are involved in image segmentation, such as regions, edges, vertices, etc., and their relationships are heterogeneous. We capture these relationships in the multiscale CG model through different types of links. There are some natural causalities between the image entities. First, two adjacent regions intersect to form an edge. If these regions have different labels, they form (cause) a boundary between them. Second, multiple edges intersect to form a vertex. Third, the region labels at the fine layer induce the region labels at the coarse layer. We use directed links to capture these causalities. Besides the causalities, there are other useful contextual relationships between image entities such as the spatial relationships, and we use undirected links to model them.

Fig. 1. Multiscale segmentation model consists of the intra-layer directed links to capture the hierarchical causalities and the inter-layer directed links pointing from the fine layer to the next coarse layer. In addition, the intra-layer undirected links capture the spatial correlations between region labels.

2) Model Parametrization: Model parametrization consists of parameterizing the links in the CG model and deriving the represented joint probability distribution (JPD). A CG model consists of both directed links and undirected links. We can parameterize the relationships represented by these links using either potential functions or conditional probabilities. In general, undirected links are parameterized by local potential functions, while directed links are parameterized by local conditional probabilities. However, there are more complex cases that need to be specifically considered during parametrization. We use the example in Fig. 2 for illustration.

Fig. 2. Example of a CG model and its parametrization.

In this model, we use local conditional probabilities to parameterize the directed links. For example, the relationship between a node and its parents is parameterized by the conditional probability of that node given its parents. On the other hand, we use pairwise potentials to parameterize the undirected links. For example, the relationship between two undirectedly connected nodes is parameterized by a pairwise potential function. Other directed links and undirected links can be similarly parameterized. Some links are more complex to parameterize because the child node of a directed link is associated with undirected links as well. In such cases, we can group the undirectedly connected nodes together and use a conditional probability to parameterize the links between this group and its parents. Similar conditional probabilities are used for the other composite groups in the example.

Given the CG structure and its parametrization, we need to derive the JPD of all random variables. We use a method, analogous to the one used in [17], to derive the JPD. The main idea is to first create a directed master graph and then create undirected component subgraphs for some terms in the JPD of the master graph. The component subgraphs are a coarse partition of the variables in the CG, where the set of subgraphs induced by the partition are maximally locally connected undirected subgraphs. The master graph is a directed graph whose nodes are component subgraphs (or singleton nodes) and whose directed arcs connect one component subgraph to another if a variable in the former has a child in the latter in the original graph. The JPD of the CG model can be finally derived from the JPD of the master graph and those of the component subgraphs. Overall, there are three steps to derive the JPD. We still use the example in Fig. 2 to illustrate these steps.

3) Step 1: We create a directed master graph, whose nodes result from maximally grouping subsets of undirectedly connected nodes in the original graph. We call these grouped nodes composite nodes. In Fig. 2, each maximal set of nodes connected by undirected links is grouped into one composite node. The created master graph is illustrated by Fig. 3.

Fig. 3. Directed master graph for the CG model in Fig. 2.
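To make Step 1 concrete, the following small Python sketch (using networkx; the toy graph and node names are purely illustrative, not taken from the paper) groups maximally undirected-connected nodes into composite nodes and links the groups with directed arcs:

import networkx as nx

def build_master_graph(directed_edges, undirected_edges):
    # Composite nodes: maximal sets of nodes connected by undirected links.
    ug = nx.Graph()
    ug.add_edges_from(undirected_edges)
    dg = nx.DiGraph()
    dg.add_edges_from(directed_edges)
    ug.add_nodes_from(dg.nodes)  # nodes touched only by directed links stay singletons
    components = [frozenset(c) for c in nx.connected_components(ug)]
    comp_of = {v: c for c in components for v in c}
    # Master graph: one node per composite node, with a directed arc from one
    # composite node to another if some member of the first has a child in the second.
    master = nx.DiGraph()
    master.add_nodes_from(components)
    for u, v in directed_edges:
        if comp_of[u] != comp_of[v]:
            master.add_edge(comp_of[u], comp_of[v])
    return master

# Toy example (hypothetical node names, loosely in the spirit of Fig. 2).
directed = [("a", "c"), ("b", "c"), ("c", "f"), ("d", "g"), ("e", "g")]
undirected = [("c", "d"), ("d", "e"), ("f", "g")]
print(list(build_master_graph(directed, undirected).edges()))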

We then derive the JPD of the master graph based on the Markov property in a directed graphical model (e.g., a BN). For the master graph in Fig. 3, its JPD is factored as in (1). We can further simplify the third, fourth, and fifth terms based on the conditional independence in the original graph, as ascertained by the global Markov property (see [7]). The JPD of the master graph is further simplified as in (2).

4) Step 2: We create a subgraph for each term in the JPD of the master graph that corresponds to a composite node. This term corresponds to either the joint probability or the local conditional probability of the composite node. The subgraph is an undirected graphical model. For the joint probability term in (2), we construct an undirected subgraph, as shown in Fig. 4(a). For the other conditional probability terms of composite nodes in (2), we construct conditional networks, as shown in Fig. 4(b) and (c). Here, the conditional network is an undirected graph with shaded nodes representing the conditional variables. Please note that the relationships between conditional variables and nonconditional variables are undirected even though the links in the original graph may be directed, and hence these links are generally parameterized by potential functions. Such a representation and parametrization also applies to the links between the other conditional and nonconditional node pairs in Fig. 4.

Fig. 4. Subgraph for each term of the composite node in the JPD of the master graph for the example in Fig. 2.

We then derive the JPD in each subgraph. For the undirected subgraphs in Fig. 4, their JPDs can be derived based on the Hammersley-Clifford theorem [18] as a product of potential functions normalized by a local normalization function, as in (3). The local normalization functions can be calculated by marginalization; an example is given in (4), where the potential functions carry the parameters of the corresponding pairwise potentials. By marginalization, the normalization function of the joint subgraph becomes a function of only its potential function parameters, while the normalization functions of the conditional subgraphs become functions of the conditional variables and the related parameters of the potential functions.

5) Step 3: Finally, we can derive the JPD of the original CG model by substituting the JPDs of the subgraphs into the JPD of the master graph. The JPD is factored as in (5), where the remaining factors account for the local normalization. Apparently, the JPD of the CG model is factored as the product of local potential functions and local conditional probabilities normalized by the global partition function. Please note that the global partition function here is only a function of the parameters of some potential functions. This is different from ignoring the local normalization functions and then doing the global normalization only through the global partition function, which would be calculated by marginalizing the product over all random variables. In the latter case, the global partition function couples all of the parameters of the potential functions and is much more difficult to calculate.

6) General CG Model: Now, we can present a general formulation of the proposed CG model. As illustrated in the above example, the JPD of all random variables in our CG model has a factored formulation. Specifically, it is the product of conditional probabilities and potential functions normalized by the (local and global) partition functions.
We assume our CG model follows the fundamental property of a CG, i.e., it complies with the global Markov property (see [7]) that ascertains the conditional independence among random variables based on the graphical model structure. Since it is a CG model, it also follows the global acyclicity property, i.e., there are no directed cycles in the graph. Under these assumptions, without loss of generality and considering up to pairwise potential functions, the JPD of our CG model can be generally formulated as in (6).
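The display form of (6) was lost in extraction; based on the description above and the term-by-term explanation that follows, a plausible reconstruction (with symbol names chosen here for illustration rather than taken from the original) is

P(\mathbf{x}) \;=\; \frac{1}{Z(\boldsymbol{\theta}_{\psi})}\,
\prod_{(i,j)} \psi_{ij}\big(x_i, x_j \mid \theta_{ij}\big)\,
\prod_{k} f_k\big(\mathbf{x}_{S_k}, \boldsymbol{\theta}_{S_k}\big)\,
\prod_{m} P\big(x_m \mid \mathrm{pa}(x_m)\big)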

where the pairwise potential function models the interaction between adjacent nodes and is governed by the parameters of the corresponding pairwise potential. The factor function results from the local normalization in the undirected subgraphs, and it is a function of a subset of the variables (i.e., the conditional variables) and a subset of the potential function parameters. The last term is the local conditional probability of a node given its parent nodes. It should be mentioned that the extended CG model is not limited to pairwise potential functions. Actually, in the human activity recognition application (cf. Section IV), we have used triplet potential functions (i.e., triplet cliques) as well.

The nodes in our CG models can be grouped into three types. The first type involves nodes associated with only potential functions. The second type involves nodes associated with the local normalization factors, i.e., the conditional variables. The third type involves nodes associated with only a local conditional probability; this type of node is usually the child of directed links and is not connected to any undirected links. It should be noticed that some nodes could be associated with multiple terms in the above general formulation. For a specific problem, it is also possible that not all of the aforementioned terms exist, and the JPD therefore might have a simpler formulation in such cases. Finally, our following discussion of parameter learning and inference and the current implementation of the general CG model in (6) assume discrete random variables.

B. Parameter Learning

So far, we have introduced principled methods to construct the proposed CG model and derive its JPD. The next important issue is to learn the model parameters. Parameter learning in undirected PGMs or directed PGMs has been separately studied in many previous works [5], [19], [20]. Specifically, parameter learning in BNs can usually be simplified as learning in a local graphical structure. Parameter learning in MRFs is more difficult since the partition function (or its derivative) is usually difficult to calculate because it requires marginalization over all random variables. Many approaches have been proposed to address this difficulty [19]-[22]. They basically simplify the optimization objective function by approximation and avoid exact calculation of the partition function or its derivative. Frey et al. [23] have given a detailed comparison of several learning and inference methods for PGMs. Despite much work on parameter learning for undirected PGMs and directed PGMs, there are very few studies addressing parameter learning in a hybrid graphical model. Lauritzen has shown how to derive the maximum likelihood estimation (MLE) in a CG [7]. Buntine discussed how to approximate the global partition function and showed an example of learning a CG [24]. The work in [15] introduces an approach for parameter learning based on minimizing the variational free energy. This variational approach requires a factorial approximation of the true posterior distribution of hidden variables given the observed variables. Another work [25] presents a learning method for an FG that requires a special canonical parametrization of the JPD. In the following, we will present a principled parameter learning method for the proposed CG model.
We will show that, for a generally structured CG model whose JPD is factored as the product of local conditional probability distributions (CPDs) and potential functions, we can learn the model parameters by combining an analytical learning and a numerical learning based on contrastive divergence (CD) using Gibbs sampling.

In the CG model formulated by (6), the conditional probability distributions and the parameters of the potential functions are the model parameters that should be learned. An important property of this model is that the global partition function is not a function of the CPDs; it is only a function of some potential function parameters. We will leverage this property for parameter learning. We perform parameter learning using MLE. Assume we have a set of i.i.d. training data, where each sample records the values of all random variables. The MLE aims at maximizing the log-likelihood of the parameters, as in (7), where the parameter set includes all CPDs and potential function parameters. Below, we show the detailed parameter learning approach assuming discrete random variables. The CPDs associated with discrete random variables become conditional probability tables (CPTs); each CPT entry is the probability of a random variable being in a particular state given a particular joint state of its parents. The log-likelihood in (7) can be rewritten as in (8), where the number of times each such configuration occurs can be directly counted from the training data. We shall note that the partition-function term can be a function of some potential function parameters. The log-likelihood is thus split into two parts. The first part is a function of the parameters of the potential functions. The second part is a function of the CPTs. When we maximize the log-likelihood w.r.t. the CPTs, only the second part in (8) matters. On the other hand, when we maximize w.r.t. the parameters of the potential functions, only the first part in (8) has influence, but we need to consider both the potential functions and the log partition function.

1) Learning Conditional Probability Distributions: Parameter learning in the proposed CG model consists of two parts. First, the log-likelihood can be maximized with respect to the CPT parameters, i.e., by solving (9), where the constraint that the CPT entries for each parent configuration sum to one comes from the definition of a conditional probability.
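Since the display equations (7)-(10) did not survive extraction, the following LaTeX sketch reconstructs their likely content from the surrounding description; the notation is ours, with training samples indexed by t and CPT entries written as \theta_{ijk} = P(x_i = k \mid \mathrm{pa}(x_i) = j):

\mathcal{L}(\Theta) \;=\; \sum_{t=1}^{T}\log P\big(\mathbf{x}^{(t)}\mid\Theta\big)
\;=\; \sum_{t=1}^{T}\Big[\sum_{(i,j)}\log\psi_{ij}\big(x_i^{(t)},x_j^{(t)}\big)+\sum_{k}\log f_k\big(\mathbf{x}^{(t)}_{S_k}\big)\Big]-T\log Z(\boldsymbol{\theta}_\psi)
\;+\; \sum_{i,j,k} N_{ijk}\log\theta_{ijk}

where the first group of terms is the potential-function part of (8) and the last sum is the CPT part. Maximizing the CPT part subject to \sum_{k}\theta_{ijk}=1, which is the constrained problem (9), gives the closed-form counting solution \theta^{*}_{ijk} = N_{ijk}\big/\sum_{k'}N_{ijk'}, the analytical estimate referred to in (10).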

Maximizing (9) leads to the analytical solution of the optimal CPT parameters given in (10), where the count in the denominator is the total number of times the corresponding parent configuration occurs in the training data.

2) Learning Parameters of Potential Functions: The log-likelihood should also be maximized with respect to the parameters of the potential functions. There are two cases for learning the potential function parameters.

3) Case 1: In this case, the global partition function is a function of only a few potential function parameters and can be analytically calculated. In the previous example of (5), it is only a function of a small number of potential function parameters. In such special cases, we can simply use exact MLE to learn the parameters since the partition function and its derivative can be easily calculated. The partial derivative of the log-likelihood w.r.t. the potential function parameters is given in (11); both the partition function and its derivative can be analytically calculated, so we can use exact MLE to learn the parameters.

4) Case 2: In this case, the global partition function is too complex and is a function of many potential function parameters, making it difficult to calculate analytically, very much like the partition function difficulty in learning undirected PGMs. We resort to an approximate learning approach, i.e., contrastive divergence (CD) learning [19], [26], to alleviate this difficulty. Instead of maximizing the exact log-likelihood, CD learning minimizes an alternative objective function, i.e., the contrastive divergence, which is the difference between two Kullback-Leibler (KL) divergences, as in (12). It involves the empirical distribution represented by the training data, the distribution over the reconstruction of the sampled data after a small number of full-step Markov chain Monte Carlo (MCMC) sampling steps via Gibbs sampling, and the true distribution represented by the model in (6). Please note that (12) assumes we already know all CPT parameters during the sampling. We can use the gradient descent approach to minimize the contrastive divergence and solve for the optimal potential function parameters. Ignoring the detailed derivation, we can calculate the derivative of the contrastive divergence as in (13), where the expectations are taken w.r.t. the distributions indicated above. In this equation, we only need to calculate the expectation of the derivative w.r.t. the distributions represented by samples (either the training data or the reconstructed samples obtained through efficient Gibbs sampling). We calculate this expectation by substituting the values of the random variables from the samples and then normalizing the derivatives w.r.t. the total number of samples. CD learning finally updates the potential function parameters as in (14), using a learning rate to control the update step.

We shall note that learning the potential function parameters requires the knowledge of the CPTs. Hence, the model parameters are learned jointly. Specifically, when we perform Gibbs sampling for a node, we need its conditional probability given all of the other nodes, whose computation requires the CPTs that have been learned by the analytical learning. In general, we can calculate this conditional probability as in (15), shown at the bottom of the page, by normalizing over all possible states of the node; the children and parents of the node enter through the corresponding conditional probabilities and potential functions. Parameter learning in the CG model is summarized in Algorithms 1 and 2.
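As an illustration of the CD update in (13) and (14), the following Python sketch assumes log-linear pairwise potentials of the form psi(x_i, x_j) = exp(w . phi(x_i, x_j)); the function names, the feature map phi, and the Gibbs reconstruction routine are placeholders standing in for problem-specific components, not the paper's implementation.

import numpy as np

def cd_update(w, data_states, gibbs_reconstruct, phi, eta=0.01, k=1):
    # Expected sufficient statistics under the empirical (training) distribution.
    data_stat = np.mean([phi(s) for s in data_states], axis=0)
    # Expected sufficient statistics under the k-step Gibbs reconstructions,
    # each started from a training sample (cf. the inner loop of Algorithm 1).
    recon_stat = np.mean([phi(gibbs_reconstruct(s, k)) for s in data_states], axis=0)
    # For log-linear potentials the CD learning rule reduces to this difference of
    # expectations; stepping along it with learning rate eta realizes the update (14).
    return w + eta * (data_stat - recon_stat)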
Algorithm 1 Parameter Learning in the Extended CG Model
Input: a set of complete training data.
Step 1: Learn the CPDs by counting the joint configurations of parent nodes and a child node, as in (10).
Step 2: Randomly initialize the potential function parameters to certain values.
for each learning iteration (up to the maximum iteration) do
  for each training sample do
    Initialize MCMC sampling from the training data.

    for each full-step sampling pass do
      Perform one full-step Gibbs sampling based on the conditional probability (15) to update the states of the random variables. This process is summarized in Algorithm 2.
    end for
    Extract the last joint states of the random variables as one sample to represent the sample distribution.
  end for
  Use the extracted samples and the training data to update the potential function parameters according to (14).
end for
Output: all CPDs and the final estimated parameters of the potential functions.

Algorithm 2 One Full-Step Gibbs Sampling
Step 1: Initialize the random variables to certain values. For example, these values can come from a training sample or the previously sampled data.
Step 2: Do the Gibbs sampling process.
for each of the unobserved random variables do
  (1) Compute its conditional probability given the current states of all other variables, as in (15).
  (2) Set the variable to a state drawn according to that probability.
end for

We can roughly estimate the computation required for parameter learning. Learning the CPDs takes a single counting pass over the training data for the nodes whose CPDs should be learned, and its time consumption can often be ignored. For learning the parameters of the potential functions, the required computation can be estimated as follows. In each learning iteration and for each training sample, a small number of full-step Gibbs sampling passes (typically one or two) is run over all of the unobserved variables involved in the potential functions, and each sampling step requires evaluating (15), which takes a bounded number of arithmetic operations. The total computation for learning the potential function parameters is therefore proportional to the number of iterations, the number of training samples, the number of sampling passes, the number of unobserved variables, and the cost of evaluating (15); it dominates the entire parameter learning process. Finally, CD learning avoids running the MCMC sampling to equilibrium, which is typically very time consuming. In addition, empirical studies [27] have shown that CD learning typically converges well and the estimated parameters are close to the exact MLE results. CD learning has therefore been applied in many other works [19], [26] as well due to its efficiency and good empirical performance.

C. Probabilistic Inference

Given the CG model and its learned parameters, we perform probabilistic inference to solve the problem. The CG model consists of both directed links and undirected links, and to the best of our knowledge, direct inference in such a model is very difficult since existing inference methods for either directed or undirected PGMs may not be applicable. To solve this problem, we propose to convert the CG model into an FG representation [16] so that principled inference methods for FGs can be employed.

Fig. 5. FG representation of the CG model in Fig. 2.

1) FG Representation of the CG Model: An FG is a bipartite graph that expresses a global function factored as the product of local functions over a set of variables. The FG consists of two types of nodes: the variable nodes and the factor nodes. Following the convention, each variable node is represented as a circle and each factor node is represented as a filled square. Assuming a global function defined on a set of variables is factored as a product of local functions, each defined over a subset of the variables,
we can construct an FG to represent it by adding variable nodes that correspond to the variables and factor nodes that correspond to the local functions. Undirected links are then added to connect a factor node with its arguments. Our CG model represents a JPD that is factored as the product of potential functions and conditional probabilities. Given this factored JPD, we can easily convert the CG model into an FG representation. For example, we convert the model in Fig. 2 into the FG shown in Fig. 5 based on the derived JPD in (5). Each square in the FG corresponds to a factor that is a local function of its associated variables. For the general model represented by (6), the factors include the pairwise potentials, the factor functions, and the conditional probabilities.

2) Probabilistic Inference in FG: Given the FG representation, we can leverage principled inference methods developed for FGs to perform inference. There are two major approaches to performing probabilistic inference in FGs. First, the sum-product algorithm can be used to efficiently calculate various marginal probabilities for either a single variable or a subset of variables [8], [16]. The sum-product algorithm calculates the marginal probability of a variable by summing the joint probability distribution over all other variables. Given the marginal probability of each variable, the optimal state of this variable can be found by using the maximum posterior marginal (MPM) criterion [28], i.e., by choosing the state with the largest marginal probability. Second, the max-product algorithm [8] can be used to find a setting of all variables that corresponds to the largest joint probability. Given some observed variables, we can find the optimal states of all other unobserved variables by maximizing their joint posterior probability. We refer readers to [8] for more detailed discussions of the sum-product and max-product algorithms.
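To make the two inference criteria concrete, the following sketch computes, by brute-force enumeration over a toy discrete model, the quantities that the sum-product and max-product algorithms obtain efficiently: the posterior marginal of each variable (for the MPM criterion) and the jointly most probable configuration (the MPE solution). The factor scopes and tables here are illustrative.

import itertools
import numpy as np

def enumerate_inference(factors, card):
    # factors: list of (scope, table), scope a tuple of variable indices and
    # table an array indexed by the states of those variables; card: states per variable.
    n = len(card)
    joint = np.zeros(card)
    for states in itertools.product(*[range(c) for c in card]):
        p = 1.0
        for scope, table in factors:
            p *= table[tuple(states[v] for v in scope)]
        joint[states] = p
    joint /= joint.sum()  # global normalization
    # Posterior marginals and MPM states (what sum-product provides efficiently).
    marginals = [joint.sum(axis=tuple(j for j in range(n) if j != i)) for i in range(n)]
    mpm = [int(np.argmax(m)) for m in marginals]
    # Most probable joint configuration (what max-product provides efficiently).
    mpe = np.unravel_index(np.argmax(joint), joint.shape)
    return marginals, mpm, mpe

# Toy chain-like example with three binary variables and illustrative factors.
f01 = np.array([[2.0, 1.0], [1.0, 2.0]])  # pairwise potential on (x0, x1)
f12 = np.array([[1.0, 3.0], [3.0, 1.0]])  # pairwise potential on (x1, x2)
print(enumerate_inference([((0, 1), f01), ((1, 2), f12)], card=(2, 2, 2)))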

Besides the max-product algorithm, there are other algorithms [29], [30] that can also find the most probable explanation (MPE) solution given certain evidence. In general, the computational complexity of inference depends on the specific method used and the type of inference to be performed. For example, Hutter et al. [30] have compared several methods for MPE inference and showed their computational differences. For the sum-product algorithm, we roughly estimate the required computation as follows. The message-passing process runs for a number of iterations until it converges, i.e., until the beliefs of all nodes no longer change. Two directional messages must be calculated for each link between a variable and a factor in the FG, and the cost of these two messages is bounded by the maximal number of factors linked to a variable, the maximal number of variables linked to a factor, and the maximal number of states of a variable. The total computation is therefore proportional to the number of iterations, the number of links, and this per-link message cost.

IV. APPLICATION TO HUMAN ACTIVITY RECOGNITION

To demonstrate the relevance of the proposed CG model for different image and video analysis applications, we applied it to two real-world problems: human activity recognition and 2-D image segmentation. The objectives of these applications are mainly to demonstrate the ability of the extended CG model to take into account different types of objects and their heterogeneous relationships for solving these challenging problems. These experiments, however, are not aimed at demonstrating the advantages of the proposed models over state-of-the-art methods in the human activity recognition and image segmentation domains. Such a comparative study, though important, is beyond the scope of this paper.

We first applied the CG model to the human activity recognition problem. Recognizing complex activities involving interactions among multiple subjects is challenging due to both the large variations of visual observations and the complex semantics of the human activity. To alleviate these difficulties, it is essential to have an activity model that can explicitly capture and model heterogeneous relationships among elements of an activity and between different subjects at different levels of abstraction in both the space and time domains. HMMs [31], [32] and dynamic Bayesian networks (DBNs) [33], [34], although widely used in human activity recognition, are not well suited to effectively capture these heterogeneous relationships. In this section, we apply the proposed CG model and its dynamic extension, i.e., the dynamic CG, for modeling and recognizing complex human activities.

A. Model Construction

For this study, we are interested in recognizing activities involving interactions between two human subjects. Specific activities include shaking hands, talking while standing, chasing, boxing, and wrestling. We first introduce a static CG for modeling the spatial relationships among elements of such activities. The static model is subsequently extended to capture their dynamic dependencies.

Fig. 6. CG model for multisubject activity recognition: (a) structure of the static CG model (or prior model) and (b) structure of the transition model.
Fig. 6(a) shows a static CG for activity modeling, where the human activity is abstracted at four levels: the activity, the individual actions of the subjects, the states of the subjects, and the image observations. One node denotes the activity, two nodes represent the actions of the two subjects, and further nodes denote the states of shape, appearance, and motion of the subjects. The shaded nodes are, respectively, the observations of shape, appearance, and motion. Relationships among these nodes are heterogeneous. Some relationships are asymmetrically causal, while others mutually affect each other. The causal relationships include: 1) the complex multisubject activity induces the individual basic actions; such relationships are captured by the directed links between the activity node and the action nodes; 2) an individual action of a subject leads to the specific shape, appearance, and motion of the subject; this type of relationship is naturally captured by the directed links between the action node and the states of a subject; and 3) the basic states of the subjects generate their observations; such relationships are captured by the directed links between the states and their corresponding measurements. In addition, relationships that represent mutual interactions can be captured by undirected links among the states of the subjects. They include: 1) the interactions between the actions of multiple subjects, which are captured by the undirected links among the action nodes of different subjects, and 2) the relationships among the shape, appearance, and motion of one subject under a specific action. Moreover, to capture the dynamic aspect of an activity, the static CG model is further extended to a dynamic CG (see Fig. 6), which can be represented by a two-slice CG to reflect the dynamic evolution of an activity. The evolution is captured by the directed temporal links between corresponding nodes in consecutive time slices.
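The structure just described can be written down directly as the two link sets that the extended CG needs. The node names below are illustrative shorthand, not the paper's notation, and fully connecting the three state nodes of each subject is one reading of the description; the resulting lists could be fed to a routine like the build_master_graph sketch above.

subjects = (1, 2)
state_types = ("shape", "appearance", "motion")

# Directed (causal) links: activity -> actions, action -> states, state -> observation.
directed_links = []
for k in subjects:
    directed_links.append(("activity", f"action{k}"))
    for s in state_types:
        directed_links.append((f"action{k}", f"{s}{k}"))
        directed_links.append((f"{s}{k}", f"obs_{s}{k}"))

# Undirected (mutual) links: between the two subjects' actions, and among the
# shape/appearance/motion states of each subject under its action.
undirected_links = [("action1", "action2")]
for k in subjects:
    states = [f"{s}{k}" for s in state_types]
    undirected_links += [(a, b) for i, a in enumerate(states) for b in states[i + 1:]]

# The dynamic CG additionally adds directed temporal links between corresponding
# nodes of consecutive time slices (e.g., from "action1" at t to "action1" at t+1).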

B. Joint Probability Distribution

To facilitate the discussion on the joint probability factorization, learning, and inference, the dynamic CG model can be broken down into two parts: the prior model and the transition model, as shown in Fig. 6. We use the theory introduced in Section III-A2 to derive the JPDs for both the prior model and the transition model. Ignoring the detailed derivation, the JPD of the transition model can be factored as in (16), where the variables are those of two consecutive time slices. Please note that we use triplet potential functions for this activity modeling. The local normalization functions are calculated as in (17). Similarly, the JPD of the prior model can be factored as in (18), where the variables are those of the first time slice and the remaining factors are the local normalization functions. We learn the CPT parameters by counting the frequency of joint configurations of a child node and its parent nodes in the training samples. We also learn the potential function parameters as in the first case explained in Section III-B2. After parameter learning, we use the sum-product algorithm to perform inference and estimate the marginal probability distributions of the activity node and the action nodes in every time slice. We finally find the optimal activity and action states by maximizing the marginal probabilities.

C. Experimental Results

We evaluate our activity model on the task of recognizing five complex activities in daily life: shaking hands (SH), talking (TK), chasing (CH), boxing (BX), and wrestling (WR). These activities are conducted by two interacting subjects who perform five basic individual actions: standing, running, making a fist, clinching, and reaching out. The activity dataset consists of 15 video sequences. The five complex activities are sequentially performed in each sequence, so there are 15 samples for each activity. To obtain the observations for the subject states, we first perform motion detection to obtain their silhouettes. The shape of a subject is then measured by the width and height of the bounding box, the filling ratio (the area of the silhouette w.r.t. the area of the bounding box), and the moments of the motion silhouette. To compute the shape feature, the current approach assumes the silhouette is available from motion detection, for which the video data we used have a relatively static background. The measurement of the motion state is the global velocity of the subject. The appearance is captured by histogram of oriented gradients (HOG) features and histogram of optical flow (HOF) features. Before learning the activity models, we first cluster the observations to obtain the labels for all subject states. The numbers of shape, motion, and appearance states are set to 3, 2, and 5, respectively, through experiments. Fig. 7 shows the influence of the number of states on the activity recognition performance. We can see that the performance is not sensitive to the number of shape or motion states, but when the number of appearance states is small, the performance drops. When studying the influence of the number of shape states, we fix the numbers of states for motion and appearance at 2 and 5, respectively. A similar strategy is applied to the sensitivity analysis on the number of states for motion and appearance. The proposed CG activity model is compared with two DBNs and one static CG model.
The first DBN [Fig. 8(a)] is constructed by removing all undirected links in the dynamic CG model, so the interactions between the subjects and the dependencies among the shape, motion, and appearance states are totally ignored. The second DBN [Fig. 8(b)] keeps all directed links, but replaces the undirected links with directed links learned by a constrained hill-climbing algorithm (the other parts of the model are fixed during the structure learning). Thus, the relationships between the two subjects or among the shape, motion, and appearance states, although not causal, are represented by directed links. In addition, to demonstrate the importance of modeling the temporal relationships, we also perform experiments using a static CG model, which has exactly the same structure as the prior model [Fig. 6(a)].

Fig. 7. Sensitivity analysis on the number of states for shape, motion, and appearance of the subject.

Fig. 8. Structures of the two DBN models for comparative experiments. (a) DBN1. (b) DBN2.

TABLE I COMPARISON OF THE DYNAMIC CG MODEL WITH TWO DBNS AND A STATIC CG MODEL FOR HUMAN ACTIVITY RECOGNITION

TABLE II CONFUSION TABLES OF FOUR ACTIVITY MODELS. (A) DYNAMIC CG MODEL. (B) STATIC CG MODEL. (C) DBN1. (D) DBN2

In the experiments, we use fivefold cross validation to evaluate the proposed CG activity model. Through online inference, we obtain the activity label and the action label of each subject at each frame of the testing sequences. In order to evaluate the recognition accuracy for a video sequence, we vote on each activity using the activity label of each frame and assign the activity class that receives the highest vote. Table I compares the recognition performance of the dynamic CG model with the two DBNs and the static CG model. We find that modeling the heterogeneous relationships in human activity with the dynamic CG model significantly improves the recognition accuracy at both the activity level and the action level. Compared with DBN1, which completely ignores the interaction between the two subjects and the dependencies among the states, our dynamic CG model achieves 21.3% higher recognition accuracy at the activity level and 18.7% higher accuracy at the action level. On the other hand, if we capture these dependencies with directed links learned from data, the recognition rate is 12% better than ignoring these dependencies. However, approximately representing these mutual interactions with directed links makes the recognition rate 9.3% worse than the dynamic CG model at the activity level and 14% worse on average at the action level. This result shows the importance of modeling the different relationships in human activity recognition with appropriate link types, as well as the capability and flexibility of our CG model for this task. The static CG model, which ignores the temporal relationships in human activity, performs 5.3% worse than the dynamic CG model at the activity level and 6% worse at the action level. However, it still has higher recognition rates than both DBNs. The detailed activity recognition results of the four models are summarized in Table II. We can observe that the dynamic CG model achieves almost perfect recognition results except for three misclassifications between boxing and wrestling, which have quite similar attributes in some examples. In comparison, DBN1 has more misclassifications between these two activities, and besides, there are even several misclassifications between shaking hands and boxing and between chasing and wrestling, despite their differences in appearance and motion. DBN2 also has a few misclassifications between shaking hands and boxing. These tables demonstrate that the CG model not only yields improved overall recognition accuracy but also leads to improved discrimination among individual activities.

V. APPLICATION TO IMAGE SEGMENTATION

Besides human activity recognition, we also apply the proposed CG model to 2-D image segmentation problems. Specifically, we deal with a bi-layer segmentation problem, which segments the image into the foreground and background.
We use the CG model to capture heterogeneous relationships among multiple image entities, including regions, edges, and junctions, for effective image segmentation. Given an image, it is first oversegmented into a set of smaller regions (i.e., superpixels). A region of pixels with the same label in the oversegmentation forms a superpixel. Multiple (more than two) superpixels with different labels intersect at a junction. Adjacent junctions are connected by edge segments. Our CG model captures the heterogeneous relationships among these superpixels, edges, and junctions. The model involves the labels of the superpixels and the superpixel features extracted from the image, the labels of the

edges and their measurements, and the labels of the junctions and their measurements. The label variables of the superpixels, edges, and junctions are all binary random variables: a superpixel label takes the value foreground or background; an edge label indicates whether the edge is on the object boundary or not; and a junction label indicates whether the junction is a corner on the boundary or not. We extract average color features in each superpixel as its region measurement and calculate the average gradient magnitude of an edge as its measurement. We also calculate the measurement of a junction according to the response of the Harris corner detector. This measurement is discretized into binary values (0 or 1) by a fixed threshold (1000) that is empirically determined.

There are various kinds of relationships among these image entities. We use Fig. 9 to explain them. First, adjacent superpixels with different labels can result in a boundary edge between them. For example, if two adjacent superpixels have different labels, they form a boundary between them. Besides, multiple edges with different states result in a specific type of junction. For example, if two of the edges meeting at a junction are boundary edges while the others are not, those two edges form a corner on the object boundary. In addition, each image entity produces its own image measurements. These relationships represent the causalities between different entities and can be readily modeled by directed links. On the other hand, there are spatial correlations between adjacent superpixels, which can be modeled by undirected links. Our segmentation model captures all of these heterogeneous relationships and therefore forms a CG model, as illustrated by Fig. 9(c).

Fig. 9. CG model for image segmentation. (a) Initially oversegmented image. (b) Part of the initial segmentation of image regions. (c) Part of the CG model for image segmentation that corresponds to (b).

Given the constructed CG model, image segmentation is formulated as a problem of finding the most probable labels of superpixels and edges. The JPD of all random variables in the model is factored as in (19), where the pairwise potentials are defined over spatially neighboring superpixels and the unary potential is defined, for simplicity, with a three-layer perceptron classifier. The pairwise potentials are parameterized by a weight vector and involve a component-wise absolute value operator, and the whole product is normalized by the global partition function. The above image segmentation model is a special case of the general CG model in (6). Given the training data, we apply the approach described in Section III-B1 to learn the CPTs and the approach for Case 2 described in Section III-B2 to learn the weight vector associated with the pairwise potentials. After parameter learning, we further convert the model into an FG representation and perform probabilistic inference in the FG to find the MPE solution, i.e., (20). In the MPE solution, the superpixels with foreground labels form the foreground segmentation.

Fig. 10 shows several typical image segmentation results on the Weizmann horse dataset [35]. The model successfully segmented the foreground objects (i.e., horses) in these images. We quantitatively evaluate our segmentation results and roughly compare them with some results produced by other approaches [36]-[38], as well as with the results produced by using our directed or undirected models alone. These results are summarized in Table III(a).
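The display form of (19) is also missing; as one reading consistent with the description above and with the general model (6), the segmentation JPD would take roughly the following shape (notation ours: y for superpixel labels, e for edge labels, v for junction labels, f for superpixel features, and M for the corresponding measurements):

P(\mathbf{y},\mathbf{e},\mathbf{v},\mathbf{M}) \;=\; \frac{1}{Z(\mathbf{w})}\,
\prod_{i}\phi\big(y_i \mid \mathbf{f}_i\big)\,
\prod_{(i,j)\in\mathcal{N}}\psi\big(y_i,y_j \mid \mathbf{w}\big)\,
\prod_{k} P\big(e_k \mid y_{i_k},y_{j_k}\big)\,
\prod_{l} P\big(v_l \mid \mathbf{e}_{(l)}\big)\,
\prod_{k} P\big(M^{e}_{k} \mid e_k\big)\,
\prod_{l} P\big(M^{v}_{l} \mid v_l\big)

with the unary term provided by the three-layer perceptron and the pairwise term depending on the weight vector through a component-wise absolute value operator.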
For the overall labeling accuracies, our results are comparable to (or better than) the results produced by other related works. In Table III(a), we also show the performance using a CRF model alone and using a BN model alone. The CRF model has a structure corresponding to the undirected part of our CG model. Its unary potentials and pairwise potentials are defined similarly to those in (19). The BN model has a structure corresponding to the directed part of our CG model. In addition, the region nodes are individually linked with their measurements. Comparing the performance of the different PGMs, the CG model outperforms both our CRF model alone and our BN model alone. Since we discretize the vertex measurement by thresholding, it is also interesting to see how the threshold value influences the overall performance of our model. We varied the threshold value within a large range and redid the experiments on the Weizmann horse images. Table III(b) summarizes the quantitative results. We found that within a large range of threshold values (from 200 to 6000), the average performance only changed by about 0.5%. These results show that the CG model is not very sensitive to this discretization process.

Fig. 10. Some examples of typical segmentation results. The first row shows the original images. The second row shows the corresponding segmentation.

TABLE III QUANTITATIVE EXPERIMENTAL RESULTS FOR 2-D IMAGE SEGMENTATION. (A) THE QUANTITATIVE RESULTS OF OUR CG MODEL AND SEVERAL RELATED WORKS FOR SEGMENTING THE WEIZMANN HORSE IMAGES. THE AVERAGE PERCENTAGE OF CORRECTLY LABELED PIXELS, I.E., OVERALL LABELING ACCURACY, IS REPORTED. (B) THE OVERALL LABELING ACCURACIES FOR THE WEIZMANN HORSE IMAGES WHILE WE VARIED THE THRESHOLD VALUE FOR DISCRETIZING THE VERTEX MEASUREMENT

VI. CONCLUSION

In this paper, we propose an extended CG model that allows very general topology and introduce principled methods for learning and inference in this model. We systematically study

several important issues of the proposed CG model, including its model construction, parametrization, derivation of the represented JPD, and, most importantly, joint parameter learning and inference for this model. To demonstrate the capability of this extended CG model, we apply it to two challenging image and video analysis tasks: human activity recognition and image segmentation. Extended CG models are constructed to capture useful heterogeneous relationships among multiple entities for solving these problems. Our experiments show that the CG models outperform conventional undirected PGMs or directed PGMs. This demonstrates the applicability of the proposed CG model to different image and video analysis problems as well as its potential benefits over standard directed or undirected PGMs in improving classification and recognition performance.

The benefits of CGs over directed or undirected PGMs can be studied in terms of modeling accuracy and computational efficiency. For modeling accuracy, the proposed CG model is, in principle, superior to both directed and undirected PGMs. Both applications in this work demonstrate that the proposed CG model outperforms the models based on either pure undirected PGMs or pure directed PGMs. We speculate that the main reason is that the proposed CG model can more correctly capture the complex and heterogeneous relationships in these applications. In contrast, the undirected or directed PGMs can only approximately model these heterogeneous relationships, resulting in their inferior performance compared to the CG model. This performance inferiority is especially apparent for the activity recognition problem (Table I) because DBN1 ignores some relationships among the activity entities and DBN2 uses inappropriate links to represent those relationships, and therefore both models can only approximately model the relationships in human activity modeling. However, the exact benefits, and the extent of the benefits, of the CG model over a directed or undirected PGM for a particular application are still hard to ascertain. They usually depend on the specific relationship that each link captures as well as the interactions among the links. Further research will be needed to systematically study the benefits of the CG model over directed or undirected PGMs. In terms of computational efficiency, compared to undirected PGMs, CG models should be computationally more efficient in both learning and inference because of the presence of directed parts in the general CG models. The directed links separate the entire graph into smaller subsets of undirected graphs. This factorizes the JPD as the product of simpler components that only require local normalization. This factorization and local normalization significantly simplify the learning and inference in CGs.
Finally, through this paper we introduce a general CG model to the computer vision and image processing community and demonstrate its utility for several image and video analysis tasks. It is our hope that the research community will further investigate the potential of such a framework, improve it, and apply it to more image and video analysis applications.

REFERENCES

[1] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-6, no. 6, pp. –, Jun.
[2] C. Bouman and B. Liu, "Multiple resolution segmentation of textured images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 2, pp. –, Feb.
[3] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. Int. Conf. Mach. Learning, 2001, pp. –.
[4] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan-Kaufmann.
[5] R. E. Neapolitan, Learning Bayesian Networks, 1st ed. Upper Saddle River, NJ: Prentice-Hall.
[6] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. –, Feb.
[7] S. L. Lauritzen, Graphical Models. Oxford University Press.
[8] C. M. Bishop, Pattern Recognition and Machine Learning. Berlin, Germany: Springer.
[9] F. Liu, D. Xu, C. Yuan, and W. Kerwin, "Image segmentation based on Bayesian network-Markov random field model and its application on in vivo plaque composition," in Proc. IEEE Int. Symp. Biom. Imaging: Nano to Macro, 2006, pp. –.
[10] L. Zhang and Q. Ji, "Image segmentation with a unified graphical model," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 8, pp. –, Aug.
[11] V. Murino, C. S. Regazzoni, and G. Vernazza, "Distributed propagation of a-priori constraints in a Bayesian network of Markov random fields," Proc. Inst. Electr. Eng. Commun., Speech Vis., vol. 140, no. 1, pp. –, 1993.

[12] A. Chardin and P. Pérez, "Unsupervised image classification with a hierarchical EM algorithm," in Proc. Int. Conf. Comput. Vis., 1999, pp. –.
[13] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. –.
[14] S. Osindero and G. E. Hinton, "Modeling image patches with a directed hierarchy of Markov random fields," in Proc. Adv. Neural Inf. Process. Syst., 2008, vol. 20.
[15] G. E. Hinton, S. Osindero, and K. Bao, "Learning causally linked Markov random fields," in Proc. 10th Int. Workshop Artif. Intell. Statistics, 2005, pp. –.
[16] F. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. –, Feb.
[17] W. L. Buntine, "Chain graphs for learning," in Proc. Conf. Uncertainty Artif. Intell., 1995, pp. –.
[18] J. Hammersley and P. Clifford, Markov Fields on Finite Graphs and Lattices. Oxford, U.K.: Oxford Univ.
[19] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Comput., vol. 14, no. 8, pp. –.
[20] C. Sutton and A. McCallum, "Piecewise training for undirected models," in Proc. 21st Conf. Uncertainty Artif. Intell., 2005, pp. –.
[21] J. Besag, "Efficiency of pseudolikelihood estimation for simple Gaussian fields," Biometrika, vol. 64, no. 3, pp. –.
[22] D. J. C. MacKay, J. S. Yedidia, W. T. Freeman, and Y. Weiss, "A conversation about the Bethe free energy and sum-product," Cambridge Univ., Cambridge, U.K., Tech. Rep. MERL TR.
[23] B. J. Frey and N. Jojic, "A comparison of algorithms for inference and learning in probabilistic graphical models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 9, pp. –, Sep.
[24] W. L. Buntine, "Operations for learning with graphical models," J. Artif. Intell. Res., vol. 2, pp. –.
[25] P. Abbeel, D. Koller, and A. Y. Ng, "Learning factor graphs in polynomial time and sample complexity," J. Mach. Learning Res., vol. 7, pp. –.
[26] M. Welling, "Learning in Markov random fields with contrastive free energies," in Proc. 10th Int. Workshop Artif. Intell. Stat., 2005, pp. –.
[27] M. A. Carreira-Perpinan and G. E. Hinton, "On contrastive divergence learning," in Proc. 10th Int. Workshop Artif. Intell. Stat., 2005, pp. –.
[28] J. Marroquin, S. Mitter, and T. Poggio, "Probabilistic solution of ill-posed problems in computational vision," J. Amer. Stat. Assoc., vol. 82, no. 397, pp. –.
[29] J. Park, "Using weighted MAX-SAT engines to solve MPE," in Proc. 18th Nat. Conf. Artif. Intell., 2002, pp. –.
[30] F. Hutter, H. H. Hoos, and T. Stutzle, "Efficient stochastic local search for MPE solving," in Proc. Int. Joint Conf. Artif. Intell., 2005, pp. –.
[31] J. Yamato, J. Ohaya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1992, pp. –.
[32] T. Xiang and S. Gong, "Beyond tracking: Modelling activity and understanding behavior," Int. J. Comput. Vis., vol. 67, no. 1, pp. –.
[33] B. Laxton, J. Lim, and D. Kriegman, "Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. –.
[34] J. Wu, A. Osuntogun, T. Choudhury, M. Philipose, and J. Rehg, "A scalable approach to activity recognition based on object use," in Proc. IEEE Int. Conf. Comput. Vis., 2007, pp. –.
[35] E. Borenstein, E. Sharon, and S. Ullman, "Combining top-down and bottom-up segmentation," in Proc. CVPR Workshop Perceptual Org. Comput. Vis., 2004, pp. –.
[36] T. Cour and J. Shi, "Recognizing objects by piecing together the segmentation puzzle," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. –.
[37] J. Winn and N. Jojic, "LOCUS: Learning object classes with unsupervised segmentation," in Proc. IEEE Int. Conf. Comput. Vis., 2005, pp. –.
[38] X. Ren, C. C. Fowlkes, and J. Malik, "Cue integration in figure/ground labeling," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. –.

Lei Zhang (M'09) received the Ph.D. degree from Rensselaer Polytechnic Institute (RPI), Troy, NY. He is currently with the UtopiaCompression Corporation, Los Angeles, CA. His research areas include machine learning, computer vision, pattern recognition, and image processing. He has designed probabilistic graphical models for solving a variety of problems, including image segmentation, human body tracking, facial expression recognition, human activity recognition, medical image processing, and multimodal sensor fusion. He has authored or coauthored over 22 papers in international journals, conferences, and book chapters across different domains, and he serves as a reviewer for several computer vision and image processing journals. Dr. Zhang is a member of Sigma Xi.

Zhi Zeng received the B.S. degree in electronic engineering from Fudan University, Shanghai, China, in 2003, the M.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 2006, and the M.Eng. degree in electrical engineering from Rensselaer Polytechnic Institute, Troy, NY, in 2009, where he is currently working toward the Ph.D. degree. His research interests include machine learning, pattern recognition, computer vision, and operations research.

Qiang Ji (SM'04) received the Ph.D. degree in electrical engineering from the University of Washington, Seattle. He is currently a Professor with the Electrical, Computer, and Systems Engineering Department, Rensselaer Polytechnic Institute (RPI), Troy, NY. He recently served as a Program Director with the National Science Foundation, where he managed computer vision and machine learning programs. He has also held teaching and research positions with the Beckman Institute at the University of Illinois at Urbana-Champaign, the Robotics Institute at Carnegie Mellon University, the Department of Computer Science at the University of Nevada, and the U.S. Air Force Research Laboratory. He currently serves as the Director of the Intelligent Systems Laboratory (ISL) at RPI. His research interests are in computer vision, probabilistic graphical models, information fusion, and their applications in various fields. He has authored or coauthored over 150 papers in journals and conferences. His research has been supported by major governmental agencies, including NSF, NIH, DARPA, ONR, ARO, and AFOSR, as well as by major companies, including Honda and Boeing. He is an associate editor for several related IEEE and international journals, and he has served as chair, technical area chair, and program committee member for numerous international conferences and workshops.
