1 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST Distributed Clustering Using Wireless Sensor Networks Pedro A. Forero, Student Member, IEEE, Alfonso Cano, Member, IEEE, and Georgios B. Giannakis, Fellow, IEEE Abstract Clustering spatially distributed data is well motivated and especially challenging when communication to a central processing unit is discouraged, e.g., due to power constraints. Distributed clustering schemes are developed in this paper for both deterministic and probabilistic approaches to unsupervised learning. The centralized problem is solved in a distributed fashion by recasting it to a set of smaller local clustering problems with consensus constraints on the cluster parameters. The resulting iterative schemes do not exchange local data among nodes, and rely only on single-hop communications. Performance of the novel algorithms is illustrated with simulated tests on synthetic and real sensor data. Surprisingly, these tests reveal that the distributed algorithms can exhibit improved robustness to initialization than their centralized counterparts. Index Terms Clustering methods, distributed algorithms, expectation maximization (EM) algorithms, iterative methods, wireless sensor networks. I. INTRODUCTION T HE development of small, low-cost, intelligent sensors with communication capabilities has prompted the emergence of wireless sensor networks (WSNs) in applications including environmental monitoring, surveillance, tracking, and inference tasks in bio-informatics [20], [25], [26]. When using a WSN as an exploratory infrastructure, it is often desired to infer hidden structures in distributed data collected by the sensors. With each sensor having available a set of unlabeled observations drawn from a known number of classes, the goal of the present paper is to design local clustering rules that perform at least as well as global ones, which rely on all observations being centrally available. Because low-cost sensors must operate under stringent power constraints, transmitting all observations to a central location may be infeasible. This motivates Manuscript received June 01, 2010; revised November 15, 2010; accepted January 25, Date of publication February 14, 2011; date of current version July 20, This work was supported in part by the National Science Foundation (NSF) under Grants CCF and CON and also in part through collaborative participation in the Communications and Networks Consortium sponsored by the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement DAAD The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies of the Army Research Laboratory or the U.S. Government. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Anna Scaglione. The authors are with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN USA ( forer002@umn.edu; alfonso@umn.edu; georgios@umn.edu). Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /JSTSP looking for in-network clustering algorithms requiring information exchanges among single-hop neighbors only. 
Focus is placed on partitional (as opposed to hierarchical) clustering algorithms, which yield a single partitioning of the data described by a fixed number of parameters [30]. With these parameters being less than the available data, partitional clustering can afford parsimonious distributed implementations of deterministic and probabilistic approaches. A popular centralized deterministic partitional clustering approach is offered by the K-means algorithm, which features simple, and fast-convergent iterations [19]. Alternatively, clustering can be viewed as the byproduct of a density estimation problem by introducing a parametric probabilistic model governing the data generation; e.g., a Gaussian mixture model (GMM) [9, Ch. 10]. Density estimation problems are of further interest in the clustering context, because they provide extra information in the form of confidence on the data-to-cluster assignment. When the sought density is described by a finite number of parameters, a popular scheme for estimating them using the maximum-likelihood (ML) approach is the centralized expectation maximization (EM) algorithm. The EM algorithm has well-documented merits because it is computationally affordable, and offers convergence guarantees [7]. Parallel and distributed implementations of the K-means (DKM) and EM (DEM) algorithms have risen most often because of the need to deal with large data sets. However, most existing schemes are agnostic to the network communication constraints [8], [22], [31]. In the WSN context, various probabilistic approaches have been reported leading to: an incremental (I-) DEM scheme [23]; a gossip-based scheme [18]; a scheme based on consensus averaging [14]; a scheme based on junction trees and related topologies [29]; and a scheme based on the alternating direction method of multipliers [12]. Except for [12] and [29], all these works are confined to parameter estimation when the data probability density function (pdf) is modeled as a finite mixture of Gaussian density functions, a case local estimators are available in closed form. In addition, [23] and [29] are confined to specific communication network topologies (loops or trees). This paper presents and analyzes novel distributed algorithms for clustering observations collected by spatially distributed resource-aware sensors, which exchange only sufficient information with their one-hop neighbors. Viewing first the data as deterministic, a distributed version of the centralized K-means algorithm is developed. In par with the centralized K-means algorithm, the novel DKM scheme iterates over the variables of a consensus-based decentralized version of the global classification cost. Subsequently, viewing the data as random draws from a probabilistic model, the underlying data pdf is modeled /$ IEEE

2 708 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 as a finite mixture of (not necessarily Gaussian) pdfs. Then, the clustering problem reduces to a distributed parameter estimation problem, for which a distributed version of the EM algorithm is introduced. Using the variational formulation of [21], the resulting DEM algorithm relies on alternating iterations over the mixture parameters, and the unknown cluster labels. Compared to [18], [23], and [14], the novel algorithm does not require closed-form expressions for local (i.e., per sensor) estimators in terms of sufficient statistics. Compared to [11] and [12], the novel DKM and DEM algorithms can afford reduced communication overhead and are provably convergent. Interestingly, numerical tests reveal that these algorithms can be less sensitive to initialization, while achieving improved clustering results relative to their centralized counterparts with random initializations. The remainder of this paper is organized as follows. Section II states the problem, and reviews briefly the centralized K-means and EM algorithms to establish context and notation. The corresponding distributed versions are developed in Sections III and IV. Section V presents numerical tests. The paper concludes with final remarks in Section VI. Notational conventions are as follows: upper (lower) bold face letters are used for matrices (column vectors); denotes matrix and vector transposition; ( ) the -entry of a matrix ( -entry of a vector); set cardinality; a set of variables with appropriate indexing; stands for a diagonal matrix with on its diagonal; for a block-diagonal matrix with on its diagonal; ( ) for a 1 vector of ones (zeros); for a identity matrix; for the Euclidean norm; for the trace operation; and for the multivariate Gaussian density function of vector with mean, and covariance matrix. II. PRELIMINARIES AND PROBLEM STATEMENT Consider a WSN with nodes, node is allowed to communicate only with its one-hop neighbors. Communication links between neighboring nodes are assumed symmetric; thus, if, then. The WSN is modeled via an undirected graph with vertex set, and the edge set comprising the connections between pairs of nodes. Graph is assumed connected, meaning that data at any node can become available to any other node generally through a multi-hop path of. The WSN is deployed to gather data, and perform an unsupervised classification task. Every node collects a set of observations, denotes the th observation at sensor. Each observation is assumed belonging to a class-set with, contains the indices of all possible classes, and denotes the total number of classes present. Although is assumed known a priori, the methodology presented in this paper can be complemented readily with model order selection criteria to estimate along the lines of, e.g., [17] or [10]. The goal is to assign each to a cluster based on a proper criterion chosen to quantify similarity among observations. Hard clustering assigns to a unique class, while soft clustering yields a probability-like score that belongs to class. Soft clustering is preferred whenever the boundaries between classes are not well defined; e.g., when neighboring classes overlap considerably. In a centralized setting, all observations are available at the same location for clustering. Centralized clustering schemes, outlined next, will be used to benchmark performance of the novel distributed algorithms presented in the ensuing section. A. 
Deterministic Partitional Clustering In addition to the similarity assessing criterion, deterministic partitional clustering (DPC) entails prototypical elements (a.k.a. cluster centroids) per class in order to avoid comparisons between every pair of observations. Let denote the prototype element for class, and the membership coefficient of to class. The clustering problem amounts to specifying the family of clusters with centroids such that the sum of squared-errors is minimized; that is denotes the convex set of constraints on all membership coefficients, and is a tuning parameter. If, the cost in (1) becomes linear with respect to (w.r.t.), and with fixed, (1) then is a linear program for which the optimal coefficients result in hard assignments, and yield clusters. On the other hand, if, then the optimal coefficients generally result in soft membership assignments, which turn out to be expressible in closed form. In this case, the optimal clusters are obtained as. Problem (1) is a well-known combinatorial problem with NP-hard complexity in the number of observations [6]. The K-means algorithm offers a low-complexity, iterative, suboptimum solver that minimizes in two steps per iteration the cost in (1) w.r.t. (S1) with fixed, and (S2) with fixed [19]. Convergence is guaranteed at least to a local minimum. Both steps require availability of global information centrally: (S1) requires knowledge of the global cluster centroids after each iteration; and, (S2) requires knowledge of which data belong to each cluster. Hence, the centralized K-means algorithm is not amenable to distributed implementation in its current form. B. Probabilistic Partitional Clustering Probabilistic partitional clustering (PPC) views clustering as a follow-up of estimating a mixture density, formed by class-conditional pdfs. Once the class-conditional pdfs are estimated, each node can construct a probabilistic rule for clustering data, e.g., using the maximum a posteriori (MAP), or, the Neyman Pearson criterion [9]. In PPC, the data are viewed as realizations of independent, identically distributed (i.i.d.) random vectors, and likewise are their labels. Consider the 1 random, binary-valued, hard-label vector indicating the class the random vector is drawn from; (1)

3 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 709 and and their corresponding realizations. Label vectors take values from the set, denotes the canonical vector with one in its th entry and zeros else. If belongs to class, then and,. Data are drawn by first selecting class at random with probability, and then drawing a vector according to the class-conditional density, collects unknown deterministic parameters of the pdf. A datum and its class label, i.e., the complete data, are jointly described by the pdf only one factor in (2) has exponent equal to 1 while all other factors have exponents equal to 0, and thus do not affect the product. Nodes do not know the class each belongs to; hence, the only observable datum is the realization. Let collect parameters and. The pdf of is a mixture of pdfs obtained by marginalizing (2) with respect to as (2) With available and denoting expectation w.r.t., the M-step updates as which is used in the E-step of the ensuing iteration. Iterative passes of the E- and M-steps proceed until the condition is satisfied for a prescribed tolerance. Notice that at iteration, the E- and M-steps require knowledge of the global parameter estimate ; thus, the EM algorithm is not amenable to a distributed implementation in its current form. III. DISTRIBUTED ALGORITHMS FOR DPC In this section, distributed algorithms are developed for DPC, and their convergence is analyzed. With the goal of minimizing (1) in a distributed fashion, per sensor define local prototype vectors, and formulate the distributed clustering problem as (7) (8) s.t. The ML estimate of function (3) is found by maximizing the likelihood. The maximizer of cannot be generally found in closed form since the (log-) likelihood in (4) is a nonlinear function of. To overcome this hurdle, the EM approach to ML estimation hinges on the idea that if the class labels for each were known, it would be easier to maximize the likelihood in (4). Specifically, the EM approach relies on the pdf of the complete data, whose loglikelihood is given by the set contains all observations with their corresponding labels. The EM algorithm starts with an initial guess to obtain an estimate per class. Given at iteration index, the E-step estimates class labels as, the last equality holds because is a Bernoulli random variable. By using Bayes rule on, and noticing that is conditionally independent of all other, with and, given, the E-step yields (4) (5) (6) the auxiliary variables will allow (8) to be solved in a distributed fashion, and the consensus constraints and guarantee that problems (8) and (1) are equivalent as formalized by the following lemma proved in Appendix A. Lemma 1: (Equivalence of (8) with (1)) For every, let denote a feasible solution of (8). If is connected, then problems (8) and (1) are equivalent, i.e.,, is a feasible solution of (1). Lemma 1 ensures agreement on the cluster centroids with common index across all nodes. The union of local clusters,or when performing soft clustering, forms, which constitutes a cluster equivalent to one that would be found if all observations were centrally available. Unfortunately, the computational complexity involved in solving (8) via exhaustive search remains exponential, which motivates pursuing low-complexity distributed solvers. A. Distributed K-Means Algorithm Consider what can be termed surrogate augmented Lagrangian of (8), given by (9)
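For concreteness, the centralized E- and M-steps reviewed above (cf. (6) and (7)) can be sketched for a finite Gaussian mixture as follows. This is a minimal NumPy illustration only: function and variable names (e.g., em_gmm, resp) are not the paper's notation, and the small diagonal loading added to the covariances is purely for numerical safety.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate Gaussian density evaluated at the rows of X."""
    p = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff)
    return np.exp(expo) / np.sqrt(((2 * np.pi) ** p) * np.linalg.det(cov))

def em_gmm(X, K, n_iters=100, seed=0):
    """Centralized EM for a K-component Gaussian mixture (E-step / M-step)."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    pi = np.full(K, 1.0 / K)                                  # mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(p) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: posterior class probabilities ("responsibilities")
        resp = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: ML updates of the mixture parameters given the responsibilities
        Nk = resp.sum(axis=0)
        pi = Nk / N
        mu = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(p)
    return pi, mu, Sigma
```

The distributed formulation in (8) and (25) replicates such parameters per node and couples the replicas through consensus constraints on single-hop neighbors, which is what the surrogate augmented Lagrangian below handles.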

4 710 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 and denote Lagrange multipliers corresponding to the constraints and, respectively, and is a positive scalar. The constraints in have not been included in (9), hence the name surrogate. Further, the quadratic terms and augment the ordinary definition of the Lagrangian. They render it strictly convex w.r.t. each variable with all other variables fixed, and, through the scalar, allow one to control the convergence properties of the resulting iterative algorithm (see Section V). In the same spirit of centralized clustering algorithms, consider minimizing cyclically one set of variables at a time w.r.t., and with all other variables fixed, followed by a gradient ascent step over the multipliers. With denoting iteration index, the resulting iterates are given by (10a) (10b) (10c) (10d) (10e) Due to the augmented terms, update (10c) entails minimizing which is linear-quadratic w.r.t., and thus can be obtained in closed form. With and initialized to zero, the five iterates in (10a) (10e) can be reduced to the following three (see Appendix B for the proof) (11a) (11b) (11c) denotes a local (per-node) aggregate Lagrange multiplier and. Solving (11a) with amounts to assigning observation to cluster if with.for, solving the constrained convex optimization problem in (11a), yields the membership coefficient updates in closed form as (12) The novel DKM clustering scheme is summarized as Algorithm 1. At iteration, each node randomly initializes its local cluster centroids, while Lagrange multipliers are initialized to. Per iteration, each node locally assigns membership coefficients to its observations via (11a). Next, each node updates its local cluster centroids via (11b), and broadcasts to its neighboring nodes. Subsequently, each node updates its local aggregate Lagrange multipliers via (11c), thus completing iteration. Algorithm 1 requires each node to wait for the information from all its neighbors, before updating in (11c); thus, it is a synchronous algorithm. At the startup phase, DKM requires no central information made available to the nodes other than agreeing a priori on and, which incurs minimal communication overhead. During the operational phase, DKM broadcasts vectors of length 1 per iteration per node. Every node stores the vectors representing the local cluster centroids, and the local Lagrange multipliers ; along with an matrix for the membership variables. Algorithm 1 Distributed K-means (DKM) Require: Node initializes and. 1: for do 2: for all do 3: Compute via (11a). 4: Compute via (11b). 5: end for 6: for all do 7: Broadcast to all neighbors. 8: end for 9: for all do 10: Compute via (11c). 11: end for 12: end for Remark 1: (Distributed vs. centralized -means) Per iteration, consider updating (11b) and (11c) times before updating again in (11a). For fixed, (11b) and (11c) solve a distributed least-squares problem using the alternating direction method of multipliers as in [32] and [11]. As, the vector converges to its centralized weighted average solution for the cluster centroids (13)
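Since the displayed iterations (11a)-(11c) are not reproduced in this transcription, the following per-node sketch should be read as a generic consensus-ADMM instantiation of the DKM description in Algorithm 1, not as the paper's exact expressions: each node assigns its own observations to the nearest local centroid, updates its centroids in closed form using its aggregate multipliers and the centroids broadcast by its single-hop neighbors, and then performs a multiplier ascent step. The placement of the penalty coefficient eta and the constants are illustrative assumptions.

```python
import numpy as np

def dkm_hard(X_local, adj, K, eta=1.0, n_iters=200, seed=0):
    """Schematic distributed hard K-means over a connected WSN graph.

    X_local: list of (N_j, p) observation arrays, one per node (N_j >= K assumed).
    adj: list of neighbor index lists (symmetric links, connected graph assumed).
    eta: consensus penalty coefficient; constants follow a generic consensus-ADMM
    recipe and are not necessarily those of the paper's (11a)-(11c).
    """
    rng = np.random.default_rng(seed)
    J, p = len(X_local), X_local[0].shape[1]
    m = [X_local[j][rng.choice(len(X_local[j]), K, replace=False)].astype(float)
         for j in range(J)]                                   # local centroids
    v = [np.zeros((K, p)) for _ in range(J)]                  # aggregate multipliers
    for _ in range(n_iters):
        m_prev = [mj.copy() for mj in m]                      # centroids broadcast last iteration
        labels = []
        for j in range(J):
            # local hard memberships: nearest local centroid
            d2 = ((X_local[j][:, None, :] - m_prev[j][None, :, :]) ** 2).sum(axis=2)
            lab = d2.argmin(axis=1)
            labels.append(lab)
            # local centroid update (closed form of a linear-quadratic subproblem)
            for k in range(K):
                idx = (lab == k)
                Sx = X_local[j][idx].sum(axis=0) if idx.any() else np.zeros(p)
                nbr = sum(m_prev[j][k] + m_prev[i][k] for i in adj[j])
                m[j][k] = (Sx - v[j][k] + eta * nbr) / (idx.sum() + 2.0 * eta * len(adj[j]))
        for j in range(J):
            # multiplier ascent after receiving neighbors' updated centroids
            for k in range(K):
                v[j][k] += eta * sum(m[j][k] - m[i][k] for i in adj[j])
    return m, labels
```

Only the K local centroid vectors are exchanged per iteration per node, never the raw observations, which is the communication pattern the text describes.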

5 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 711 which is in turn the closed-form expression for (S2) in the equivalent centralized K-means algorithm. Different from DKM, the centralized solution reached in (13) satisfies,. In this case, the DKM inherits the convergence properties of the centralized K-means algorithm. However, (13) does not need to hold for any finite, and a different approach is necessary to prove convergence of the DKM Algorithm 1 as is shown in the ensuing section. B. Convergence Analysis Standard approaches to analyze convergence of centralized clustering algorithms capitalize on the fact that the cost is nonincreasing after each iteration; see, e.g., [3], [27]. For the iterations in (11), this holds true for updates (11a) and (11b), but not for (11c). For this reason, a different approach is pursued in this section to establish stability of (11). In particular, it is shown that (11b) (11c) are bounded-input bounded-output (BIBO) stable iterations regardless of the cluster assignment update in (11a), with the output iterates arbitrarily close to their fixed point. For the special case of the hard K-means cost, this suffices to guarantee convergence at least to a local minimum of (1). Iterations (11b) and (11c) are first-order linear difference equations with time-varying coefficients that depend on. To analyze their convergence, consider the weighted average, and the super-vectors (14) (15) (16) Let denote the graph Laplacian matrix, with denoting the degree matrix, and the adjacency matrix [13, Ch. 8]. Matrix describes the connectivity pattern of the WSN. If, then, which implies by the link symmetry. Using these definitions, it turns out that (11b) and (11c) can be compactly re-written as Consider further the diagonal matrix, (17) (18) (19) collects all per-node cluster cardinalities for class. Next, let, and denote the iterates at a fixed point of the iterations (17), (18), and (11a). At this fixed point (18) yields. Using this fact in (17) evaluated at the fixed point, yields (20) and are defined as in (16) and (19), respectively, with substituted by. Based on (14) (20), the following lemma is proved in Appendix C. Lemma 2: (Characterization of fixed points) The fixed points of iterations (11a), (17) and (18), namely,, and, are the Karush Kuhn Tucker (KKT) points of (1). Since (1) has multiple KKT points,,, and are not unique (although they always exist). Because iterations are deterministic, each fixed point of (17), (18), and (11a) is uniquely specified by the initialization point; that is, different initialization points generally lead to different fixed points but the same initialization point will always converge to the same fixed point. To proceed, define the error sequences,, and the corresponding supervectors and. From (17) (20), it follows that (21) (22). Since, the definition of implies that can be bounded as, for some finite constant. From (21) (22), Appendix D establishes the following result. Proposition 1: (Stability of the -means iterations) For sufficiently large, there exists a such that for any the distance between iterate in (17) and its fixed point is bounded; i.e., for some finite constant, it holds that (23) Proposition 1 asserts that remains bounded regardless of. For a sufficiently large, the bound on becomes arbitrarily small. Furthermore, for the special case of the K-means algorithm, the following stronger result holds (see Appendix E for the proof). 
Corollary 1: (Convergence of hard K-means) For sufficiently large, the distributed clustering algorithm (11) with membership exponent converges to a local minimum of problem (1). Intuitively, for a sufficiently large, prototypes are so close to that remains fixed. This, along with the fact that every fixed point of the centralized K-means algorithm is a KKT point of (1), suffices to prove Corollary 1. Remark 2: (On the selection of ) A sufficiently large value of is needed for Proposition 1 and Corollary 1 to hold. However, numerical tests in Section V will illustrate that large values of can affect the convergence speed of DKM and may lead iterations to KKT points of higher cost. The proof of Proposition 1 is based on the stability analysis in [28, Sec. C6], which is an existence result. Finding constructively the minimum that guarantees convergence depends on the WSN topology and the set of observations, but goes beyond the scope of this paper. Nonetheless, simulated tests in Section V suggest that in practice relatively small values of suffice for the hard-DKM to converge, and for the soft-DKM to refrain from noticeable hovering-type behavior.
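The stability analysis above is phrased in terms of the graph Laplacian (degree matrix minus adjacency matrix), and the simulations in Section V report the WSN's algebraic connectivity, i.e., the second-smallest Laplacian eigenvalue, which is strictly positive if and only if the graph is connected. A quick NumPy check of both (names are illustrative):

```python
import numpy as np

def laplacian_and_connectivity(A):
    """Graph Laplacian L = D - A and algebraic connectivity (second-smallest eigenvalue).

    A: symmetric 0/1 adjacency matrix of the WSN. The network is connected iff the
    returned algebraic connectivity is strictly positive (the standing assumption here).
    """
    D = np.diag(A.sum(axis=1))
    L = D - A
    eigvals = np.sort(np.linalg.eigvalsh(L))
    return L, eigvals[1]

# Example: a 4-node ring
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
L, lam2 = laplacian_and_connectivity(A)
print(lam2 > 0)  # True: the ring is connected
```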

6 712 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 Remark 3: (Node failures) The convergence claims in Proposition 1 and Corollary 1 still hold when a node fails, provided that the resultant network remains connected. In this case, the fixed points are those of a centralized cost function entailing all but the observations of the failed node. If node failure(s) lead to a disconnected network (it is a graph cut-vertex), then the algorithm would remain operational in each subnetwork. The fixed points of each subnetwork would be those of a centralized problem using the observations from all nodes in that subnetwork. IV. DISTRIBUTED ALGORITHMS FOR PPC In this section, the centralized PPC formulation of Section II-B is first re-cast in a variational form, and subsequently solved in a distributed fashion. Let denote the pdf of the 0 1 valued random label vector, denote expectation w.r.t., and the entropy of. Using this notation, the EM algorithm can be viewed as an alternating functional-parameter optimization solver that cyclically maximizes [21] (26) is a positive tunable scalar, and ( ) denotes the inner product (induced norm) over the space. As in the previous section, (26) will be minimized cyclically w.r.t.,,,, and. Then, gradient ascent updates on and will be performed. Consider the quantities,,,, and that are given by (cf. Appendix B) (27a) (24) first w.r.t. over the space of pdfs factorized as, with the parameters fixed; and next w.r.t. for the fixed pdf. At each iteration, the steps in the EM algorithm can be viewed as (see [21] for details): (E-step) find ; (M-step) find. To perform the (E-step) local information is needed, and all sensors must know from the previous (M-step). Similar to Section III, local parameter estimates comprising local and are introduced to allow distributing the objective function in (24). In this framework, the distributed PPC solves the following optimization problem: (25) the auxiliary variables and will play a role similar to that of in (8). Let Lagrange multipliers and, with 1, 2, correspond to the constraints,,, and, respectively. Also, let denote an inner product space such that,. The augmented Lagrangian associated with (25) is (27b) (27c) (27d) (27e) denotes the set of constraints on all parameters. Finding in (27b) requires solving a convex optimization problem which can be efficiently accomplished via, e.g., interior point methods. If the pdfs are log-concave w.r.t., then update (27c) is also a convex optimization problem. The DEM scheme described by iterations (27a) (27e) is summarized as Algorithm 2. DEM begins at iteration when each node randomly initializes its local parameter estimates, ; and its local aggregate Lagrange multipliers as, and. At iteration, every node updates its class label estimates via (27a). Then, every node updates its local estimates and via (27b) and (27c), respectively. Per node, the updated estimates and are broadcast to all nodes. Finally, every node updates its local Lagrange multipliers and via (27d) and (27e). Similar to Algorithm 1, Algorithm 2 requires each node to wait for

7 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 713 information all its neighbors before updating (11c); thus, it is also a synchronous algorithm. Algorithm 2 Distributed EM (DEM) from in Require: Node initializes,,, and. 1: for do 2: for all do 3: Compute via (27a). 4: Compute via (27b). 5: Compute via (27c). 6: end for 7: for all do 8: Broadcast and to all neighbors. 9: end for 10: for all do 11: Compute via (27d). 12: Compute via (27e). 13: end for 14: end for Remark 4: (Distributed vs. centralized EM) As in Remark 1, consider fixing, and cyclically computing (27b)-(27e) times before updating again. Because the alternating direction method of multipliers is provably convergent [2], iterates converge as to the centralized M-step solution in (7). In this case, DEM also inherits convergence guarantees from its centralized counterpart. Notice that when dealing with Gaussian distributions, DEM reduces for to the algorithm in [14]. The merits of DEM relative to [23] and [18] can be summarized as follows: DEM does not require finding a path traversing across all sensors. Incremental alternatives tacitly assume such a so-termed Hamiltonian cycle to ensure passing information once per node, and thus minimize communication overhead. However, Hamiltonian cycles do not always exist, and when they do, finding them is an NP hard task [24]. Because the alternating direction method of multipliers that DEM relies on is known to be resilient to quantization errors and additive noise affecting inter-sensor communications [32], unlike [23] and [18], the DEM is robust to imperfect links. DEM only requires one-hop connectivity among sensors, and thus keeps communication overhead per sensor at an affordable level even when the number of nodes grows. Without requiring estimators to be available in closed form, DEM applies to general mixture models provided that their class-conditional pdfs are differentiable and (log-) concave. A. Gaussian PDFs Without Consensus on The mixture parameters characterize the percentage of observations coming from a specific class at node.ina distributed setting, it is possible that the proportion in which data are mixed varies across nodes. This motivates enforcing consensus only on the parameters across the network, while allowing the to vary across nodes, thereby enabling each node to construct local clustering rules that best reflect the per-node observations available from different classes. Let, and are unknown. Each is fully characterized by, denotes the interior of the cone of positive semi-definite matrices. The space is endowed with the inner product, for any, and. The inner product in is given by, and the corresponding induced norm is,, denotes the Frobenious norm. If the coefficients are allowed to vary across nodes, then (27a) (27e) can be re-written as (28a) (28b) (28c) (28d) (28e) (28f) ; is a positive tuning scalar, and, are aggregate Lagrange multipliers pertaining to the consensus constraints and, respectively. Since no consensus constraints on the are present, the Lagrange multipliers are not included in the augmented Lagrangian (26), and (27b) can be solved in closed form yielding (28b). Note also that (28d) entails

8 714 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 solving a convex optimization problem per node. The first-order optimality condition for (28d) is (29), is a matrix of all zeros,, and Lemma 3: (Characterization of fixed points) The fixed points of iterations (27a) and (27c), namely and, are convergence points of the centralized EM iterations (6) and (7). As in Section III-C, consider the error vectors, and, which satisfy (31) For, (28d) can be solved in closed form. For, interior point methods or direct solution of the matrix (29) can be used to carry out the minimization in (28d)[4], [15]. B. Convergence Analysis Analyzing the convergence of centralized EM typically relies on the objective function in (25) being non-increasing as iterations proceed; see, e.g., [7]. This holds true for updates (27a), (27b), and (27c) too, but not for updates (27d) and (27e). Similar to Section III-C, it is proved here that iterations (27a) (27e) come arbitrarily close to a fixed point of the DEM algorithm, which in turn is a fixed point of the centralized EM. To this end, the following assumptions are necessary. AS1: Class-conditional pdf is log-concave and twice-differentiable w.r.t.,. Notice that AS1 holds true for various members of the widely used family of exponential distributions. In the ensuing derivations only convergence of in (27c) will be established. Convergence of from iteration (27b) follows similarly after combining (27b) with (27c), and considering as the new optimization variable (details are omitted). Under AS1, the first-order optimality conditions for (27c) yields (30) denotes the gradient of w.r.t. evaluated at ; and in deriving (30), was assumed per iteration to fall in the interior of the subspace, which holds, e.g., when in (28d) is a strictly positive-definite matrix. At a fixed point, (27e) implies, which upon substituting into (30) yields,. These points are indeed convergence points of the centralized EM as shown by the following lemma proved in Appendix F. Vectorizing (or equivalently assuming is a vector) and applying the mean-value theorem on each row of, it is shown in Appendix G that (32) matrix depends on,, and. Substituting (32) into (31), letting, and following steps similar to those used in Section III-C, the iterates can now be written as and (33) (34),, and is a new (bounded) error vector that depends on and. Matrix is clearly nonsingular for all. Using steps similar to those in the previous section, Appendix G establishes the following result. Proposition 2: (Stability of the DEM iterations) For sufficiently large, there exists a such that for any, the distance between iterates in (33) and its fixed point is bounded; i.e., (35) for some finite constant. For a sufficiently large, the iterates in (33) come arbitrarily close to a fixed point of the iterations in (27a) (27e), which by Lemma 3 are fixed points of the centralized EM. Similar to the DKM algorithm (cf. Remark 2), finding the minimum that guarantees stability is a challenging task, aggravated by the presence of different scalar, vector and matrix updates. However, numerical tests in Section V indicate that when choosing different values of for each set of optimization parameters, convergence can be guaranteed with relatively small step size values.
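Because the displayed updates (27a)-(27e) and (28a)-(28f) are not reproduced in this transcription, the following reduced sketch only conveys the structure of a DEM-style loop for Gaussian components: a local E-step, a local M-step in which the component means carry an ADMM-style consensus penalty, and a multiplier ascent step. Unlike the paper's algorithm, the covariances and mixing weights are kept as purely local updates here, and the constants, penalty placement, and names are illustrative assumptions.

```python
import numpy as np

def _gauss(X, m, S):
    # multivariate Gaussian density evaluated at the rows of X
    d = X - m
    inv = np.linalg.inv(S)
    e = -0.5 * np.einsum('ni,ij,nj->n', d, inv, d)
    return np.exp(e) / np.sqrt(((2 * np.pi) ** X.shape[1]) * np.linalg.det(S))

def dem_gaussian_sketch(X_local, adj, K, eta=1.0, n_iters=300, seed=0):
    """Simplified distributed-EM loop: consensus enforced on the means only.

    X_local: list of (N_j, p) data arrays (one per node); adj: neighbor index lists.
    Mixing weights and covariances are purely local M-step updates in this sketch,
    so nodes are not forced to agree on them -- a reduced version of DEM.
    """
    rng = np.random.default_rng(seed)
    J, p = len(X_local), X_local[0].shape[1]
    mu = [X_local[j][rng.choice(len(X_local[j]), K, replace=False)].astype(float) for j in range(J)]
    lam = [np.zeros((K, p)) for _ in range(J)]     # aggregate multipliers (mean-consensus constraints)
    pi = [np.full(K, 1.0 / K) for _ in range(J)]
    Sig = [[np.cov(X_local[j].T) + 1e-6 * np.eye(p) for _ in range(K)] for j in range(J)]
    for _ in range(n_iters):
        mu_prev = [m.copy() for m in mu]           # means broadcast at the previous iteration
        for j in range(J):
            Xj = X_local[j]
            # local E-step: responsibilities under the current local parameters
            resp = np.stack([pi[j][k] * _gauss(Xj, mu[j][k], Sig[j][k]) for k in range(K)], axis=1)
            resp /= resp.sum(axis=1, keepdims=True)
            Nk = resp.sum(axis=0)
            pi[j] = Nk / len(Xj)                   # local mixing weights (no consensus enforced)
            for k in range(K):
                # M-step for the mean with an ADMM-style consensus penalty (closed form)
                inv_S = np.linalg.inv(Sig[j][k])
                nbr = sum(mu_prev[j][k] + mu_prev[i][k] for i in adj[j])
                A = Nk[k] * inv_S + 2.0 * eta * len(adj[j]) * np.eye(p)
                b = inv_S @ (resp[:, k] @ Xj) - lam[j][k] + eta * nbr
                mu[j][k] = np.linalg.solve(A, b)
                d = Xj - mu[j][k]
                Sig[j][k] = (resp[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(p)
        for j in range(J):
            # multiplier ascent after receiving the neighbors' updated means
            for k in range(K):
                lam[j][k] += eta * sum(mu[j][k] - mu[i][k] for i in adj[j])
    return pi, mu, Sig
```

As in the DKM sketch, only parameter estimates (never raw data) travel over single-hop links, and the loop is synchronous: every node waits for its neighbors' broadcasts before the multiplier step.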

9 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 715 V. NUMERICAL RESULTS In this section, the performance of the proposed distributed clustering algorithms is evaluated via numerical simulations on both synthetic and real data sets. A. Distributed K-Means Consider a randomly generated WSN with nodes. The network is connected with algebraic connectivity , and average degree per sensor. Data are generated at random from classes, with vectors from each class generated from a two-dimensional Gaussian distribution with common covariance, and corresponding means:,,,,,,,,,,,,,,,,, and. Every node has available a set of observations with 30 observations per class. Each observation is drawn at random from class according to the density,. Thus, each sensor has observations available. The centralized K-means algorithm is simulated to benchmark performance. Letting, the figure of merit for both schemes is [cf. (8)] (36) Fig. 1. DKM for various values of and K = 18(top). DKM for various values of K and =10(bottom). with for hard clustering, and for soft clustering. In all tests, the cluster centroids in (11b) are randomly initialized per sensor, while the Lagrange multipliers in (11c) are initialized to. The centralized K-means algorithm is initialized using the initialization of a sensor in the WSN chosen at random. 1) Distributed Hard K-Means: Fig. 1 (top) depicts the evolution of the hard K-means cost with for different values of and the same initialization. Fig. 2 shows the minimum, mean, and variance of the cost in (36) obtained after iterations (the algorithm converges to the final solution after iterations in most cases) for 100 Monte Carlo runs. This test reveals that most often small values of lead to a lower value of the cost than large values of. Also larger values of lead to increased sensitivity to initialization. For the centralized K-means algorithm with the results obtained were: (min.) ; (mean) ; and (std. dev.) 5,904. Interestingly, the DKM algorithm outperforms the centralized K-means algorithm reaching both a lower minimum and a lower average cost. This could be intuitively explained by: 1) the weighted augmentation in (9), which allows DKM to more easily avoid local minima; and 2) the dual update in (11c), which offers an extra degree of freedom, compared to the centralized K-means, by tuning. The behavior of the DKM algorithm w.r.t. the initial guess for was tested for fixed. Fig. 1 (right) and Table I summarize the results obtained after iterations. Compared to the centralized K-means, DKM consistently finds im- Fig. 2. Error bars comparison between centralized K-means and DKM for K =18. TABLE I PERFORMANCE COMPARISON BETWEEN CENTRALIZED K-MEANS AND DKM FOR =10 proved clustering results regardless of the initialization. This fact is even more evident when. As mentioned in Section III, DKM can mimic the iterations of the centralized K-means if iterations (11b) and (11c) are performed cyclically a number of times for fixed memberships in (11a). Fig. 3 shows the performance of this ap-

10 716 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 Fig. 3. DKM for different values of. Fig. 5. Error bars comparison between centralized soft K-means and soft-dkm for K =18. TABLE II PERFORMANCE METRICS FOR SOFT-DKM, =4AND t =800 Fig. 4. Soft-DKM for various values of and K =18(top). Soft-DKM for various values of K and =4(bottom). proach for various values of [note that corresponds to the DKM algorithm in (11a) (11c)]. As expected, the average performance of DKM degrades for larger values of approaching that of the centralized K-means. 2) Distributed Soft K-Means: Fig. 4 (top) and Fig. 5 show the performance of the distributed soft K-means (soft-dkm) for, after iterations, various values of and the same initialization. For the centralized soft K-means the performance metrics obtained were: (min.) ; (mean) ; and (std. dev.) 537. As in DKM, the choice of affects the convergence of the algorithm. For small values of, soft-dkm achieves lower costs than its centralized counterpart. The performance of the soft-dkm algorithm for various values of with fixed is depicted in Fig. 4 (bottom), and summarized in Table II. As seen, the distributed algorithm outperforms its centralized counterpart for small values of. B. DEM Algorithm In this section, the performance of DEM is tested and compared with the one of I-DEM in [23]. Consider a randomly generated WSN with nodes. The network is connected with algebraic connectivity , and average degree 3.80 per sensor. Observations are generated at random from classes, each class is modeled as a two-dimensional Gaussian distribution with covariance matrices given by,,,, and per class; and corresponding mean vectors given by,,,,, and. Each node has available a total of observations. The proportion of observations per class per node is determined by the local mixture parameters. For nodes 1, 2, 3, 5, 6, 7, 9, 10, the mixture parameters are. For nodes 4, 8, the mixture parameters are,, and ; and,,,, and. The figure of merit for the DEM algorithm is the negative log-likelihood of the complete data given by [cf. (5)] (37). As in the previous tests, iterates are randomly initialized and Lagrange multipliers are initialized to zero. It was found empirically that improved convergence speed could be obtained by using different values of for each of the parameters sought. Consequently, the individual penalty parameters are set to,, and,,, and are the penalty parameters associated with,, and, respectively. These values are chosen to balance out consensus optimization variables taking values

11 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 717 Fig. 8. Error bars comparison between centralized EM and DEM with consensus on for K =6. TABLE III PERFORMANCE METRICS FOR DEM WITH CONSENSUS ON AND =5 Fig. 6. DEM with consensus on for various values of and K =6(top). DEM with consensus on for various values of K and fixed =5(bottom). Fig. 7. Level set curves of the GMM fitted by DEM, with =4and K =6, at different nodes after 400 iterations with consensus on. that greatly differ in magnitude. Otherwise, the iterates with the largest absolute magnitudes will be impacted more by the consensus constraints hindering the convergence rate of the DEM algorithm. 1) DEM With Consensus on : Fig. 6 (top) and Fig. 8 show the performance of the DEM with for different values of. Update (27b) is found numerically via interior point methods. Since the pdfs are Gaussian, update (27c) is carried out by solving directly the quadratic matrix equation (29). The centralized EM algorithm is used as a benchmark. The performance metrics obtained for the centralized EM algorithm are: (min.) ; (mean) ; and (std. dev.) 452. Fig. 7 depicts the level curves for the GMM, and the local data point for nodes 1, 4, 8 with and after 500 iterations. Note that DEM correctly identifies the presence of the Gaussian components not present in the local observations at nodes and. The performance of DEM for various values of with fixed is shown in Fig. 6 (bottom) and Table III. As in the K-means iterations, smaller values of lead to faster convergence to a local minimum with lower cost, comparable with the centralized solution. Larger values of ensure that nodes achieve similar values per node faster, but also cause iterations to be trapped at less favorable local minima. 2) DEM Without Consensus on : In this test, the case of free mixture parameters per node is explored. As benchmarks, the centralized EM algorithm and the I-DEM in [23] are also implemented. For the centralized EM algorithm with the performance metrics obtained are those of the previous subsection. For the incremental EM algorithm with and after 80 cycles through the network, the performance metrics, averaged over 100 Monte Carlo runs, obtained are: (min.) ; (mean) ; and (std. dev.) Fig. 9 (top) depicts the average evolution of the DEM algorithm after 100 Monte Carlo runs with for different values of, and. Clearly, not having to consent on allows nodes to adapt their parameter estimates to the locally available observations. As in the previous cases, the choice of considerably impacts the algorithm s performance. In particular, smaller values of reach consensus in less iterations, and achieve results comparable to the centralized and incremental ones. Fig. 11 shows the performance metrics for DEM after iterations for various values of. The freedom of the mixture coefficients per node allows nodes to adapt their parameter estimates to the local behavior of the observations. This translates to larger values for the global

12 718 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 TABLE IV PERFORMANCE METRICS FOR DEM WITHOUT CONSENSUS ON AND =5 Fig. 10. Level set curves of the GMM fitted by DEM, with =2and K = 6, at different nodes for after 400 iterations and no consensus on the mixture coefficients. Fig. 9. DEM for various values of and K =6(top). DEM for various values of K and fixed =5(bottom). log-likelihood of the data. Fig. 9 (bottom) shows the average evolution of DEM for various values of. The performance metrics for the centralized EM, the DEM, and the I-DEM algorithms are shown in Table IV for various values of at. Note that the centralized EM algorithm automatically forces all with common to coincide across nodes. Fig. 10 depicts the level curves for the GMM, and the local data point for nodes 1, 4, 8 with 2 and 6, after 400 iterations. C. Clustering of Oceanographic Data Environmental monitoring is a potentially important application of WSNs. Such an example is one involving WSNs for oceanographic monitoring, in which the cost of computation per node is lower than the cost of accessing each nodes observations [1]. This makes the option of centralized processing less attractive, thus motivating decentralized processing. This section tests the proposed decentralized schemes on real data collected by multiple underwater sensors in the Mediterranean coast of Spain, available at the World Oceanic Database (WOD) [5], with the goal of identifying regions sharing common physical characteristics. A total of 5720 feature vectors were selected, Fig. 11. Error bars for DEM without consensus on for K =6. each with entries the temperature ( ) and salinity (psu) levels ( ). The measurements were normalized to have zero mean and unit variance. The data were grouped into blocks of measurements each. The algebraic connectivity of the WSN is and the average degree per node is 4.9. Fig. 12 (left) shows the performance of 25 Monte Carlo runs for the hard-dkm algorithm with different values. The best average convergence rate was obtained for yielding a performance attaining the average centralized performance after 300 iterations. Tests with different values of and are also included in Fig. 12 (left) for comparison. Notice that for and hard-dkm hovers around a point without converging.
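In this experiment each observation is a (temperature, salinity) pair normalized to zero mean and unit variance before clustering, and the distributed and centralized schemes are compared through the network-wide sum of squared errors in (36). A small sketch of both steps (function names are illustrative, not the paper's notation):

```python
import numpy as np

def zscore(F):
    """Normalize feature columns (e.g., temperature and salinity) to zero mean, unit variance."""
    return (F - F.mean(axis=0)) / F.std(axis=0)

def networkwide_sse(X_local, labels_local, centroids_local):
    """Hard-clustering figure of merit: the sum over nodes and observations of the
    squared distance to the assigned local centroid (cf. the cost used to compare
    DKM against centralized K-means)."""
    cost = 0.0
    for Xj, labj, Mj in zip(X_local, labels_local, centroids_local):
        cost += ((Xj - Mj[labj]) ** 2).sum()
    return cost
```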

13 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 719 Fig. 13. Performance of DEM on the WOD data set using a WSN with J = 20 nodes for various values of K (top). Level sets and clustering results of the GMM fitted by DEM for the WOD data set with K = 3 (bottom) at t = 4000 iterations. Fig. 12. Average performance of hard-dkm on the WOD data set using a WSN with J =20nodes for various values of and K (top). Clustering results with K =3and =5(bottom) at t = 400 iterations. Choosing a larger value of guarantees convergence of the algorithm to a unique solution. The clustering results of hard-dkm at iterations for and are depicted in Fig. 12 (bottom). Similarly, the DEM algorithm with and parameters and, was used to cluster the WOD data set. The DEM performance using different values of, shown in Fig. 13 (top), approaches the centralized one after a few thousand iterations. Fig. 13 (bottom) shows the GMM level sets obtained by the DEM fit, and the clustering results using the MAP rule for and iterations. VI. CONCLUSION This paper developed distributed algorithms for partitional clustering by capitalizing on a consensus-based formulation and parallel optimization tools. The novel algorithms are well suited for applications network-wide observations cannot become available to individual nodes due e.g., to stringent power constraints. Also, the algorithms guarantee homogeneous usage of power across nodes, thus increasing their battery lifetime. Both deterministic and probabilistic partitional clustering approaches were addressed. The DKM algorithm uses spatially distributed sets of observations to obtain a unified clustering rule across nodes. Both hard and soft versions were explored. The DEM uses local sets of observations to estimate the parameters defining a mixture of pdfs from which observations were drawn. Parameter estimates can be found for any family of log-concave, twice-differentiable, parametric pdfs even if their estimators cannot be written in closed form. Scenarios consensus on the mixture parameters across nodes is not enforced were also investigated. In this case, the global behavior of the observations determines the values of class parameters, while the local proportion in which observations appear per node determines the local mixture parameters. Convergence was analyzed in all cases, establishing that the iterates come arbitrarily close to a fixed point of the algorithm, this fixed point being a KKT point of the centralized cost function. For the special case of the hard K-means algorithm, this guarantees convergence at least to a local minimum. The proofs relied on stability analysis of linear and non-linear time varying systems. Numerical tests show that the novel algorithms can exhibit resilience to initialization since on average they find better local minima more often than their centralized counterparts. APPENDIX A PROOF OF LEMMA 1 It is shown that for every, the equality constraints are equivalent to for any feasible solution of (8). It follows readily that the constraints in (8) imply

14 720 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST Consider two arbitrary nodes. Since the network is connected, there exists a path of length at least one, connecting nodes and. Because for, it follows that. Since, are arbitrary, the latter implies. Since is also arbitrary, the result of the lemma follows. APPENDIX B DERIVATION OF (11a) (11c) The goal is to show that iterations (10a) (10e) reduce to (11a) (11c). First, note that (10a) yields (11a) since the first term of the Lagrangian (9) is the only one that depends on. Iteration (10c) requires solving an unconstrained minimization problem w.r.t. over a linear-quadratic cost function; thus, it admits a closed-form solution, given by APPENDIX C PROOF OF LEMMA 2 At a fixed point of (11a), (17) and (18), it holds that, is a consensus solution reached by the DKM algorithm, and [cf. (20)]. Left-multiplying the latter by and, it follows that, yields, but since (C.41) Since the are not affected by the consensus constraints in (8), in (C.41) coincides with a KKT optimal point of (1). Substituting (B.38) into (10d) and (10e) yields (B.38) (B.39a) APPENDIX D PROOF OF PROPOSITION 1 With denoting matrix pseudo-inverse, rewrite the iterations (21) and (22), as in the following lemma. Lemma 4: If and, then can be written in terms of an auxiliary variable as (D.42) (D.43) (B.39b) Let and be initialized to zero at every node; i.e.,. From (B.39a) (B.39b), it then holds that. Likewise, if, then by induction. Upon defining, iteration (11c) follows directly from (B.39a). Finally, notice that iteration (10b) solves an unconstrained quadratic optimization problem. The first-order optimality conditions for (10b) yields, with Proof: Use (21) and (22) to recognize that first-order difference equation (D.44) (D.45) (D.46) obeys the (D.47) (B.40) is defined in (D.46), and (D.48). Since and, then again by induction. Using the definition of and the fact that, iteration (11b) follows from (B.40). Consider now an auxiliary variable, and the system of equations in (D.42) and (D.43). From the definition of in (D.45), it follows that (D.49)

15 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 721 Also, from the definition of, and by exploiting the fact that is a symmetric matrix to write, it follows that (D.50) Next, it is shown by induction that if, then iterations (D.42) and (D.43) are equivalent to (D.47). For iteration, it follows readily from (D.42) that. Substituting back into (D.43) and using (D.49) and (D.50) yields, as desired. Next, suppose that at iteration, (D.42) and (D.43) are equivalent to (D.47). Consider iteration and substitute into, to obtain (D.51) the second equality follows from (D.49) and (D.50), and the third equality is due to (D.43). Note that (D.51) corresponds to (D.47) evaluated at. Hence, the system of equations in (D.42) and (D.43) is equivalent to (D.47),. Since is a bounded matrix, stability analysis of iterations (21) and (22) can be deduced from that of (D.42). To this end, consider the homogeneous linear time-varying system. If the latter is exponentially stable, then (D.42) is BIBO stable. Exponential stability of is ensured if [28, Sec. C6]: 1) is slowly time-varying; i.e.,, for small enough; 2) all eigenvalues are inside the unit circle, i.e., ; and 3). Condition 3) is readily satisfied since the entries of remain finite for all. Conditions 1) and 2) are satisfied by the updates, as shown in the following two lemmas. Lemma 5: For any fixed, there exists a sufficiently large such that (D.52) Using the Frobenius matrix norm and properties of Kronecker products, it follows that (D.55) is the norm of the matrix in (D.54) that does not depend on. In addition, recalling that,, yields (D.56) Combining (D.55) and (D.56) in (D.53) yields. Note that decreases with, and so approaches zero as. In other words, for any fixed one can always find a sufficiently large so that (D.52) holds. Lemma 6: The spectral radius of, denoted,is strictly smaller than unity. Proof: By the properties of Kronecker products, the eigenvalues of correspond to the eigenvalues of, each with algebraic multiplicity ; hence,. Matrix can be written as (D.57). For notational simplicity consider dropping the iteration index.if denotes an eigenvalue for with corresponding eigenvector, it clearly holds that. Splitting into and with appropriate dimensions, the latter gives rise to the following system of equations [cf. (D.57)]: (D.58) (D.59) Invoking properties of the matrix pseudo-inverse to manipulate (D.58), yields, which implies that for some. Solving for, and substituting the result back into (D.59) leads to Proof: Since, it clearly holds that (D.60) Matrix can be decomposed as [cf. (D.44)] (D.53) which is valid for all. Re-organizing terms in (D.60) and left-multiplying by, denotes Hermitian transposition, yields the quadratic equation (D.54) (D.61)

16 722 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 Conditions on the values of solution of (D.61), given by can be inferred through the, (D.62) First, consider the case, which yields complex conjugate values for. Next, we prove that for this case. With, it follows that ; hence, it suffices to check whether. This follows readily by noticing that for every, since, which is a positive definite matrix. Next, consider, which yields two distinct real values for. Define the auxiliary variable, which by the definition of equals (D.63) Since both and are positive definite, it follows that. Moreover, consider matrix, and employ Gershgorin s circle theorem [16, Ch. 6] to deduce that it is positive definite for all. Hence, it follows that. The solution of (D.61) in terms of is. Consider the term and note that (D.64) Recalling the fact that (D.61) is valid for all values of, it follows from (D.64) that. Likewise, it follows readily that the smallest possible value for is. Next, consider the case and. From (D.58), it follows that. Upon substituting this value into (D.58) with, one arrives at (D.65) We claim that there is no satisfying (D.65). Arguing by contradiction, suppose that there is such a. Let,, and recall that has rank. Then using the eigen-decomposition of, (D.65) can be written as (D.66) is a diagonal matrix with the eigenvalues of on its diagonal, and is an orthogonal matrix whose th column is the eigenvector of corresponding to. Assume without loss of generality that, and. It then follows from (D.66) that implying that, which is a contradiction since. Hence, (D.65) does not have a solution, and for this case matrix cannot have as an eigenvalue. Finally, consider the case, and. Then, and must satisfy ; thus, with, implying that is an eigen-vector of.we claim that the only possible value for is. Indeed, if, then it must be true that which implies that (D.67) (D.68) Taking the Hermitian transpose on both sides of (D.68) yields, which contradicts the assumption. Hence, this case leads to an eigenvector of zeros, which is not allowed by the definition of eigenvectors [16, Ch. 1]. Therefore, matrix cannot have as an eigenvalue either; which completes the proof of the lemma. After proving in Lemma 5 that, for arbitrarily small, and in Lemma 6 that has all its eigenvalues inside the unit circle, the stability result for slowly time varying linear systems applies readily [28, Sec. C6] and establishes the validity of (23). APPENDIX E PROOF OF COROLLARY 1 When, the membership assignment in (11a) is such that if, with, and otherwise. At a fixed point of Algorithm 1, it thus holds that (E.69) Using the triangle inequality, the first term of (E.69) can be bounded as (E.70) the last inequality follows from Proposition 1. Using (E.70) in (E.69), one arrives at Adding now again the triangle inequality yields (E.71) to (E.71), and using (E.72) (E.73) (E.74)

17 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 723 For sufficiently large, we thus have, keeping membership assignments unchanged; i.e., for all. As a consequence, iterations (D.43) (D.42) become time-invariant ( ), and error-free ( ). Invoking again Lemma 6, we find,or equivalently (E.75) Lemma 2 established that any fixed point of (11a) (11c) is a KKT point of the centralized K-means in (1). Since for the hard K-means algorithm, any KKT point is a local minimum [27], it follows that the convergence point of the distributed hard K-means algorithm is a local minimum as well. APPENDIX F PROOF OF LEMMA 3 At any fixed point of the centralized EM algorithm, denoted by, and, it holds that [cf. (5) and (7)] (F.76a) (F.76b) Clearly, any fixed point of (27a) satisfies (F.76a) as well. Next, we prove that any fixed point in (27c) satisfies (F.76b). At a fixed point and for every, iteration (27c) yields APPENDIX G PROOF OF PROPOSITION 2 Letting, iterations (33) (34) can be written as (G.79) is defined as in (D.44), and. Next, we will show that iteration (G.79) is BIBO stable. Similar to Appendix D, the stability claim holds because satisfies the following. 1), for small enough. 2) All eigenvalues lie inside the unit circle. Arguing as in Lemma 5, 1) follows readily. To establish 2), apply the mean-value theorem to the rows of to arrive at (32), is a diagonal matrix with diagonal entries evaluated at. Under AS1, is a positive definite matrix implying that is positive definite as well. In order to show that the spectral radius, note that there exists a such that is positive definite. Using in lieu of, and mimicking the steps followed in the proof of Lemma 6, yields the desired result. Since, for arbitrarily small, and has all its eigenvalues inside the unit circle, the stability result in [28, Sec. C6] readily applies, and (35) holds. (F.77) Also, at a fixed point of (27e), we have, implying. Using this fact the set of problems in (F.77) can equivalently be solved as (F.78) the equality constraints are enforced at the consensus solution reached by DEM. Note that the quadratic term in the cost of (F.77) vanishes at the minimum. Introducing the equality constraints into the objective function by defining and ; and noting that by construction, it follows that (F.78) is equivalent to (F.76b). A similar reasoning applies to the iterates in (27b), and completes the proof of the lemma. REFERENCES [1] C. Albaladejo, P. Sánchez, A. Iborra, F. Soto, J. A. López, and R. Torres, Wireless sensor networks for oceanographic monitoring: a systematic review, Sensors, vol. 10, no. 7, pp , Jul [2] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Nashua, NH: Athena Scientific, [3] J. C. Bezdek, R. J. Hathaway, M. J. Sabin, and W. T. Tucker, Convergence theory for fuzzy c-means: Counterexamples and repairs, IEEE Trans. Syst., Man, Cybern., vol. SMC-17, no. 5, pp , Sep./Oct [4] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, MA: Cambridge Univ. Press, [5] T. P. Boyer, J. I. Antonov, H. E. Garcia, D. R. Johnson, R. A. Locarnini, A. V. Mishonov, M. T. Pitcher, O. K. Baranova, and I. V. Smolyar, World Ocean Database 2005, S. Levitus, Ed. Washington, D.C.: U.S. Government Printing Office, 2006, vol. 60, NOAA Atlas NESDIS, p [6] S. Dasgupta and Y. Freund, Random projection trees for vector quantization, IEEE Trans. Inf. Theory, vol. 55, no. 7, pp , Jul [7] A. P. Dempster, N. M. Laird, and D. B. 
REFERENCES

[1] C. Albaladejo, P. Sánchez, A. Iborra, F. Soto, J. A. López, and R. Torres, "Wireless sensor networks for oceanographic monitoring: A systematic review," Sensors, vol. 10, no. 7, Jul. 2010.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Nashua, NH: Athena Scientific, 1997.
[3] J. C. Bezdek, R. J. Hathaway, M. J. Sabin, and W. T. Tucker, "Convergence theory for fuzzy c-means: Counterexamples and repairs," IEEE Trans. Syst., Man, Cybern., vol. SMC-17, no. 5, Sep./Oct. 1987.
[4] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[5] T. P. Boyer, J. I. Antonov, H. E. Garcia, D. R. Johnson, R. A. Locarnini, A. V. Mishonov, M. T. Pitcher, O. K. Baranova, and I. V. Smolyar, World Ocean Database 2005, S. Levitus, Ed. Washington, D.C.: U.S. Government Printing Office, 2006, vol. 60, NOAA Atlas NESDIS.
[6] S. Dasgupta and Y. Freund, "Random projection trees for vector quantization," IEEE Trans. Inf. Theory, vol. 55, no. 7, Jul. 2009.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. Ser. B (Methodological), vol. 39, pp. 1-38, 1977.
[8] I. Dhillon and D. Modha, "A data-clustering algorithm on distributed memory multiprocessors," in Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence.
[9] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001.
[10] P. A. Forero, A. Cano, and G. B. Giannakis, "Distributed feature-based modulation classification using wireless sensor networks," in Proc. IEEE Military Commun. Conf., San Diego, CA, Nov. 2008, pp. 1-7.

[11] P. A. Forero, A. Cano, and G. B. Giannakis, "Consensus-based k-means algorithm for distributed learning using wireless sensor networks," in Proc. Workshop Sens., Signal, Inf. Process., Sedona, AZ, May 11-14, 2008.
[12] P. A. Forero, A. Cano, and G. B. Giannakis, "Consensus-based distributed expectation-maximization algorithm for density estimation and classification using wireless sensor networks," in Proc. Int. Conf. Acoust., Speech, Signal Process., Las Vegas, NV, Mar. 30-Apr. 4, 2008.
[13] C. Godsil and G. Royle, Algebraic Graph Theory. New York: Springer, 2001.
[14] D. Gu, "Distributed EM algorithm for Gaussian mixtures in sensor networks," IEEE Trans. Neural Netw., vol. 19, no. 7, Jul. 2008.
[15] N. J. Higham and H. Kim, "Numerical analysis of a quadratic matrix equation," IMA J. Numer. Anal., vol. 20, 2000.
[16] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge, U.K.: Cambridge Univ. Press, 1985.
[17] R. E. Kass and L. Wasserman, J. Amer. Stat. Assoc., vol. 90, no. 431, Sep. 1995.
[18] W. Kowalczyk and N. Vlassis, "Newscast EM," in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2005.
[19] S. P. Lloyd, "Least-squares quantization in PCM," IEEE Trans. Inf. Theory, vol. IT-28, no. 2, Mar. 1982.
[20] C. E. Lopes, F. D. Linhares, M. M. Santos, and L. B. Ruiz, "A multi-tier, multimodal wireless sensor network for environmental monitoring," in Proc. 4th Int. Conf. Ubiquitous Intell. Comput., Hong Kong, China, Jul. 2007.
[21] R. Neal and G. E. Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants," in Learning in Graphical Models, M. I. Jordan, Ed. Norwell, MA: Kluwer, 1998.
[22] M. K. Ng, "K-means-type algorithms on distributed memory computer," Int. J. High Speed Comput.
[23] R. D. Nowak, "Distributed EM algorithms for density estimation and clustering in sensor networks," IEEE Trans. Signal Process., vol. 51, no. 8, Aug. 2003.
[24] C. H. Papadimitriou, Computational Complexity. Reading, MA: Addison-Wesley, 1994.
[25] M. Piórkowski and M. Grossglauser, "Constrained tracking on a road network," in Proc. 3rd Eur. Workshop on Wireless Sensor Networks, Berlin/Heidelberg, Germany, Feb. 2006.
[26] L. Schwiebert, K. S. Gupta, and J. Weinmann, "Research challenges in wireless networks of biomedical sensors," in Proc. 7th Annu. Int. Conf. Mobile Comput. Netw., Rome, Italy, Jul. 2001.
[27] S. Z. Selim and M. A. Ismail, "K-means-type algorithms: A generalized convergence theorem and characterization of local optimality," IEEE Trans. Pattern Anal. Mach. Intell., vol. 6, no. 1, Jan. 1984.
[28] V. Solo and X. Kong, Adaptive Signal Processing Algorithms: Stability and Performance, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1995.
[29] J. Wolfe, A. Haghighi, and D. Klein, "Fully distributed EM for very large datasets," in Proc. 25th Int. Conf. Mach. Learn., Helsinki, Finland, Jul. 6-8, 2008.
[30] R. Xu and D. Wunsch II, "Survey of clustering algorithms," IEEE Trans. Neural Netw., vol. 16, no. 3, May 2005.
[31] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proc. 6th World Congr. Intell. Control Autom., Dalian, China, Jun. 2006, vol. 2.
[32] H. Zhu, G. B. Giannakis, and A. Cano, "Distributed in-network channel decoding," IEEE Trans. Signal Process., vol. 57, Oct. 2009.

Pedro A. Forero (S'07) received the Diploma in electronics engineering from Pontificia Universidad Javeriana, Bogota, Colombia, in 2003 and the M.Sc. degree in electrical engineering from Loyola Marymount University, Los Angeles, CA. He is currently working towards the Ph.D. degree with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis. His research interests include statistical signal processing, machine learning, and sensor networks. His current research focuses on distributed and robust algorithms for unsupervised learning. Mr. Forero was a recipient of the Science, Mathematics, and Research for Transformation (SMART) fellowship.

Alfonso Cano (M'07) received the electrical engineering degree and the Ph.D. degree with honors in telecommunications engineering from the Universidad Carlos III de Madrid, Madrid, Spain, in 2002 and 2006, respectively. He was previously with the Department of Signal Theory and Communications, Universidad Rey Juan Carlos, Madrid, Spain. Since 2007, he has been with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, where he is a post-doctoral researcher and lecturer. His general research interests lie in the areas of signal processing, communications, and networking.

Georgios B. Giannakis (F'97) received the Diploma in electrical engineering from the National Technical University of Athens, Athens, Greece, in 1981, and the M.Sc. degree in electrical engineering in 1983, the M.Sc. degree in mathematics in 1986, and the Ph.D. degree in electrical engineering in 1986, all from the University of Southern California (USC), Los Angeles. Since 1999, he has been a Professor with the University of Minnesota, Minneapolis, where he now holds an ADC Chair in Wireless Telecommunications in the Electrical and Computer Engineering Department and serves as Director of the Digital Technology Center. His general interests span the areas of communications, networking, and statistical signal processing, subjects on which he has published more than 300 journal papers, 500 conference papers, two edited books, and two research monographs. His current research focuses on compressive sensing, cognitive radios, network coding, cross-layer designs, mobile ad hoc networks, wireless sensor networks, and social networks. He is the (co-)inventor of 20 issued patents. Prof. Giannakis is the (co-)recipient of seven paper awards from the IEEE Signal Processing (SP) and Communications Societies, including the G. Marconi Prize Paper Award in Wireless Communications. He also received Technical Achievement Awards from the SP Society (2000) and from EURASIP (2005), a Young Faculty Teaching Award, and the G. W. Taylor Award for Distinguished Research from the University of Minnesota. He is a Fellow of EURASIP and has served the IEEE in a number of posts, including that of a Distinguished Lecturer for the IEEE SP Society.
